Two weeks ago, I looked at Ben and said "What we obviously should build next is field-based AI-Assist." Transcribing forms or "spreadsheet-like" ledgers and rolls is not as much fun as transcribing letters or field books, so anything that makes the process easier might appeal to volunteers. And because the text is more "data" than "text", it should suffer less from seductively plausible AI hallucinations than narrative material.
Then Ben suggested "what if we ran pages through two different models and compared the results? We could highlight places where two models differed in their interpretation of the text on the page, and ask volunteers to be the tie breakers."
In AI terms this is called "consensus-based validation," although usually folks throw a lot of models at the task and forget that humans might be better tie breakers.
Lucky for us, the perfect project arrived during these conversations. Nick Zmijewski at the Industrial Archives & Library emailed me, "We have a bunch of these engineer designs and they all have pretty standard data, written in clear draftsman hand, in a box on the bottom right. Is there a way we could automate this?"
AI experiments are fun, so we did two. The first was with our colleague Mike Cooper-Stachowsky, who ran our three sample cards through the open-source model Qwen-VL 72b. I ran the same cards through Gemini 2.0 Flash. Then we entered the results of each run into FromThePage as if they were manual transcriptions, so we could use FromThePage's "versions" tab for a page to see the differences between the models.
Here's what we learned:
The results from both models were really good. Not perfect, but surprisingly high quality.


Even with the surprising quality, our transcriptions differed quite a bit between models. If we were to flag the pages with any differences for human review, we'd end up flagging every single page. This suggests that projects that want to automate more should simplify the task by reducing the number of fields collected. For example, if "scale" weren't an important piece of metadata to collect, leaving it out would increase the consensus of the two models, since that was a field they often disagreed on.
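The comparison step can be sketched in a few lines of Python. This is a hypothetical illustration, not FromThePage's actual implementation: the field names and values are invented, and each run is assumed to already be parsed into a field-to-value dictionary.

```python
# Hypothetical sketch: given two models' field-level transcriptions of the
# same card, flag every field where they disagree so a human volunteer can
# act as the tie breaker. Field names and values below are invented.

def fields_needing_review(run_a: dict, run_b: dict) -> dict:
    """Return {field: (value_a, value_b)} for each field the runs disagree on."""
    flagged = {}
    for field in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(field, ""), run_b.get(field, "")
        # Normalize whitespace and case so trivial differences aren't flagged.
        if " ".join(a.split()).lower() != " ".join(b.split()).lower():
            flagged[field] = (a, b)
    return flagged

qwen_run = {"drawing_no": "A-1042", "date": "3/14/52", "scale": "1 in = 10 ft"}
gemini_run = {"drawing_no": "A-1042", "date": "3/14/52", "scale": "1 in = 40 ft"}
print(fields_needing_review(qwen_run, gemini_run))  # only "scale" disagrees
```

Dropping a noisy field like "scale" from the dictionaries before comparing is exactly the simplification described above: fewer fields means fewer disagreements to flag.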

FromThePage "versions" tab showing differences between the two models
For this experiment, the Gemini model was better than the open-source model, filling in more fields and producing fewer character errors.
Sometimes, the results differed between different runs of the same image against the same model. The differences tended to be in the harder-to-read fields like names, so I thought this was a good measure of "uncertainty": if the same "brain" (LLM) interprets the same text differently from one moment to the next, wouldn't that mean those fields are more difficult? That suggests this sort of model "voting" might work even with just one model.
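The single-model voting idea can be sketched the same way. Again this is a hypothetical illustration with invented values: run the same image through the model several times, take the majority value for each field, and flag any field where the runs disagreed as a candidate for human review.

```python
from collections import Counter

# Hypothetical sketch of single-model "voting": several runs of the same
# image, majority value per field, and any disagreement at all marks the
# field as uncertain. All field names and values are invented.

def vote_on_runs(runs: list[dict]) -> tuple[dict, set]:
    consensus, uncertain = {}, set()
    fields = {field for run in runs for field in run}
    for field in fields:
        counts = Counter(run.get(field, "") for run in runs)
        value, votes = counts.most_common(1)[0]
        consensus[field] = value
        if votes < len(runs):  # not unanimous -> send to a human
            uncertain.add(field)
    return consensus, uncertain

runs = [
    {"name": "J. Maurer", "date": "3/14/52"},
    {"name": "J. Mauer", "date": "3/14/52"},
    {"name": "J. Maurer", "date": "3/14/52"},
]
consensus, uncertain = vote_on_runs(runs)
print(consensus)   # majority values
print(uncertain)   # {"name"} -- the hard-to-read field
```

Here the unanimous "date" field would pass straight through, while the name, which the model read differently across runs, would be routed to a volunteer.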
We’re excited about this approach of machines and humans collaborating, with machine-generated results pointing humans to the parts of projects that most need judgement and interpretation.
Have a similar project you’d like us to experiment with? Let us know!
