How Do You Know Whether AI Is "Good Enough"?

Yesterday we deployed two new features to help you evaluate results from Gemini 3 (and eventually other models) against human-transcribed or human-corrected text.

First, we’ve developed a comparison screen that shows the differences between an AI-generated page transcription and human-created ground truth:

Next, we calculate statistics, again comparing the AI draft against the human ground truth.
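For readers who want to reproduce this kind of statistic on their own exports, here is a minimal sketch of character error rate (CER) as it is commonly defined: the edit distance between the AI draft and the ground truth, divided by the length of the ground truth. The function names are illustrative, and this is an assumption about the general technique, not a description of how FromThePage computes its numbers.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character from hyp
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, ai_draft: str) -> float:
    """Edit distance divided by ground-truth length (hypothetical helper)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ai_draft) / len(ground_truth)


# One inserted apostrophe against a 16-character ground truth.
print(character_error_rate("Recd your letter", "Rec'd your letter"))
```

Because the denominator is the human text, a draft that is much longer than the ground truth can score above 100%.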

This gives you data, which we think is a good start, but how do you decide whether the results are good enough? For that we look again to Mark Humphries, who recently published a very thorough analysis of Gemini 3’s quality. He describes three levels of quality in terms of character error rate, along with how reliable the text is and how much human intervention it needs at each level. (As someone who runs a human-centered software platform, I am relieved there is still a place for us!)

“A 3% error rate equates to about 3-4 errors per sentence, making the document a first draft at best but also fundamentally untrustworthy. An error rate of 1% means around one error per sentence, readable but still in need of significant and close proof reading. At 0.5%, a document becomes both usable and trustworthy with around 1-2 characters wrong on each page. If one planned to publish such a document, careful proof reading would still be necessary, but it would be more akin to copy-editing than re-interpretation.”
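To see how an error rate turns into a count of errors, multiply the rate by the number of characters in the passage. The sketch below does that arithmetic for the three rates quoted above. The 120-character sentence length is our own illustrative assumption, not a figure from the analysis, and the function name is hypothetical.

```python
def expected_errors(error_rate: float, n_chars: int) -> float:
    """Expected number of wrong characters in a passage of n_chars."""
    return error_rate * n_chars


# Assumed sentence length, for illustration only.
SENTENCE_CHARS = 120

for rate in (0.03, 0.01, 0.005):
    errors = expected_errors(rate, SENTENCE_CHARS)
    print(f"{rate:.1%} error rate: about {errors:.1f} errors per sentence")
```

Your own material may have much longer or shorter sentences and pages, so it is worth running the numbers with lengths taken from your collection.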

It’s also interesting to look at the types of errors. Many of the differences in the screenshot above (from a letter in the Hagley Museum Archives) are not errors so much as differences in transcription choices: capital letters instead of lower case (or vice versa), or how many dashes are used to represent a strikethrough.
We do know we’ll need to iterate on our prompt so Gemini’s output more closely matches our default transcription conventions. We’re also going to add projects’ custom transcription conventions to the default prompt and see how that goes.
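One way to separate real errors from differences in convention is to normalize both texts before comparing them. The sketch below lowercases the text and collapses runs of dashes, the two kinds of difference mentioned above. The specific rules and function names are assumptions for illustration. They are not FromThePage’s transcription conventions or comparison logic.

```python
import re


def normalize(text: str) -> str:
    """Flatten convention-level differences (illustrative rules only)."""
    text = text.lower()                       # ignore capitalization choices
    text = re.sub(r"-{2,}", "-", text)        # treat any run of dashes as one
    text = re.sub(r"\s+", " ", text).strip()  # ignore spacing differences
    return text


def differs_only_by_convention(ground_truth: str, ai_draft: str) -> bool:
    """True when the texts differ as written but match once normalized."""
    return (ground_truth != ai_draft
            and normalize(ground_truth) == normalize(ai_draft))


print(differs_only_by_convention("Dear Sir ---- I write", "dear sir - I write"))
```

Computing an error rate on both the raw and the normalized text gives a rough sense of how much of the gap comes from convention and how much is genuine misreading.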

The most reassuring thing in Mark’s analysis – reinforced by the results we’re seeing in the Gemini transcriptions and comparisons in FromThePage – is that Gemini doesn’t make things up:

“The most remarkable thing, though, is that Gemini is so often able to push past the ruts created in training that want to steer it towards correcting historical spelling errors and capitalizations. Most of the time – 99% in fact – it succeeds.
Hallucinations were entirely absent. By hallucinations, I mean insertions or replacements that are not derived from the text.”

And that, more than anything else, is why we think this is a viable solution for archives.

Want to dive deeper?

Here are two public projects with Gemini transcription and human ground truth:

Wood Family Letters from the Hagley Library

Various inventions by Wilber Moore Stilwell and Gladys Ferree Stilwell from the University of South Dakota

Join our webinar on our Gemini 3 integration next Thursday, December 11th.
Start your own 200-page trial and click the “Generate AI Drafts” button as you import your own material.