Improving OCR using FromThePage

This is a response to the recently published "A Research Agenda for Historical and Multilingual Optical Character Recognition" by David A. Smith and Ryan Cordell, with the support of The Andrew W. Mellon Foundation. The report analyzes current challenges faced by humanities researchers using OCR text and outlines important avenues for research to improve OCR quality. In many places, the report calls for transcription systems and crowdsourcing experts. Since we have run crowdsourced transcription systems for more than a decade, we have a lot of ideas on how transcription systems could be used as part of the solution. Our ideas are specific to FromThePage, our platform, but many could apply to other tools. We thought the best way to be part of the conversation was to share our ideas publicly. If you're working on solving the problems outlined in the report, we'd be very interested in collaborating with you.

The report has nine recommendations; our thoughts will be organized accordingly after introducing FromThePage.

About FromThePage

Before we dive into potential implementations, an introduction to FromThePage is in order. FromThePage is a collaborative transcription system used by libraries, archives, museums and academic researchers. It is open source software (available under the Affero GPL 3.0 license) and is deployed at the University of Texas, Fordham University, Northwestern University, and Carleton University among others. We also offer FromThePage as a software-as-a-service. For $1000 to $5000 per year, we host projects for organizations and individuals on shared servers that we run and maintain; this shared infrastructure approach means that projects can be up and running in hours without technical expertise.

While FromThePage is most popular as a crowdsourced transcription platform, it is also in use as an OCR correction platform. Our current OCR ingestion integrations are from the Internet Archive (e.g. the LatAm project) and ContentDM (e.g. Indianapolis Public Library). Our multilingual support means we host projects in a wide variety of languages: Old French, Spanish, Malay (Jawi), Arabic, Nahuatl, Mixtec, Urdu, and Dutch. That support includes right-to-left scripts (top-to-bottom support is coming soon). We also support collaborative translation of transcribed texts.

FromThePage has been in ongoing development for thirteen years; our first deployment was for San Diego Museum of Natural History in 2010. FromThePage.com, the software-as-a-service solution, has been in use since 2011. Our approach to sustainability is a combination of SaaS subscriptions and "sponsored development" -- we collaborate with institutions and individuals to build needed features. Recent examples include right-to-left script support sponsored by the British Library and field-based transcription and ContentDM integration sponsored by the Council of State Archivists. While both the SaaS subscriptions and feature development are often funded by grants, the shared cost approach means that no one grant program, institution, or granting organization carries a large burden for software development. All new features are available to all users of FromThePage.

OCR Improvement Ideas

Recommendation #1: Improve Statistical analysis of OCR output

"Inferred" OCR quality statistics can be approached through entirely computational models, but we suspect that a human identifying and correcting a small portion of an OCR'd text -- 5-10 exemplar pages or 1000 lines -- could provide better input into OCR quality. A scholar considering 3 or 4 different corpora might be motivated to transcribe those exemplar pages for each set of materials; a comparison of the before-and-after versions of the text would yield an informed statistic of how good or bad the OCR for each corpus is. Since the scholar’s corrections provide the gold standard, quality statistics can be calculated even for texts with non-standard orthography, like early modern printed works or multilingual texts. FromThePage currently keeps page transcription versions in our database and presents a “diff” view to end users, but it would need to be modified to count corrections and calculate error rates in the uncorrected OCR.

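As a rough illustration of the before-and-after comparison described above, the sketch below computes a character error rate (edit distance divided by the length of the corrected text) for a set of exemplar pages. It is not FromThePage code: the function names and the sample strings are hypothetical, and a real implementation would read the uncorrected and corrected versions from the page version history.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Errors in the raw OCR per character of the human-corrected text."""
    if not corrected_text:
        raise ValueError("corrected text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


def corpus_error_rate(pages: Iterable[Tuple[str, str]]) -> float:
    """Pooled error rate over (ocr_text, corrected_text) exemplar pages."""
    total_errors = 0
    total_chars = 0
    for ocr_text, corrected_text in pages:
        total_errors += edit_distance(ocr_text, corrected_text)
        total_chars += len(corrected_text)
    if total_chars == 0:
        raise ValueError("no corrected text supplied")
    return total_errors / total_chars


if __name__ == "__main__":
    # Hypothetical exemplar line: two substitutions in 19 characters.
    print(character_error_rate("Tbe quick hrown fox", "The quick brown fox"))
```

Because the corrected text is the reference, the same calculation applies unchanged to non-standard orthography; a word-level rate could be computed the same way over token lists instead of characters.
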
To take this idea even further, if a researcher spent the time to correct some number of pages of a low quality OCR text, those corrections could be used to retrain an OCR engine as each page is completed. The retrained model could be applied in batch to a similar corpus. Better yet, the retrained model could be applied to all the subsequent pages in the corpus. The result would be a virtuous cycle of OCR text that continually improves as it is corrected, needing fewer corrections the further the editor works through the text. This emergent model of OCR correction, retraining, and application could be offered as a standalone service accepting contributions from many editing platforms, or could be integrated directly into an editing platform like FromThePage. The people correcting the OCR in this model are strongly motivated, both by the immediate improvement in the text they are correcting and by the ability to improve the OCR text of their particular corpus.
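
The cycle described above can be sketched as a simple loop. Everything here is an assumption made for illustration: `recognize`, `ask_editor` and `retrain` are hypothetical callables standing in for an OCR engine, a human correction step and a training job, and no particular engine or FromThePage API is implied.

```python
from typing import Any, Callable, List, Sequence, Tuple


def correct_with_retraining(
    pages: Sequence[str],
    model: Any,
    recognize: Callable[[Any, str], str],    # hypothetical: (model, page) -> OCR text
    ask_editor: Callable[[str, str], str],   # hypothetical: (page, draft) -> corrected text
    retrain: Callable[[Any, List[Tuple[str, str]]], Any],  # hypothetical training step
) -> Tuple[List[str], Any]:
    """Correct pages in order, retraining after each completed page."""
    ground_truth: List[Tuple[str, str]] = []
    corrected: List[str] = []
    for page in pages:
        draft = recognize(model, page)    # OCR with the current model
        final = ask_editor(page, draft)   # human corrects the draft
        corrected.append(final)
        ground_truth.append((page, final))
        # Later pages are recognized with a model trained on all corrections so far.
        model = retrain(model, ground_truth)
    return corrected, model
```

In practice retraining after every page may be too expensive, so a real service might retrain every few pages or on a schedule; the shape of the loop stays the same.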

Recommendation #2: Formulate standards for annotation and evaluation of document layout

We don't see that FromThePage is a good match for this recommendation, but we would encourage others to look into the work being done by the IIIF Newspapers Community Group for open standards to support this task.

Recommendation #3: Exploit existing digital editions for training and test data

We think that there are a lot of possibilities here. We already have digital edition projects that start with OCR correction, like the James Malcolm Rymer Collection. We also have crowdsourced transcription of typewritten text in projects like the Papers of Julian Bond. These transcription/edition projects -- and the more straightforward OCR correction crowdsourcing projects -- currently produce edition text as outputs, but the act of correction (and the creation of human-edited text) can be leveraged to produce training data for OCR engines, or even trained models, as a separate export.

Recommendation #4: Develop a reusable contribution system for OCR ground truth

This recommendation is where FromThePage is already in use, and can easily continue to play this role. In 2018, the British Library identified a need for an Arabic manuscript ground truth dataset to improve HTR models. They used FromThePage to crowdsource the transcription of 85 pages of Arabic Scientific Manuscripts.

Volunteers have also expressed a preference for transcription rather than OCR correction.

Recommendation #5: Develop model adaptation and search for comparable training sets

(no comment)

Recommendation #6: Train and test OCR on linguistically diverse texts

(no comment beyond FromThePage's aforementioned support for a wide variety of languages.)

Recommendation #7: Convene OCR Institutes in critical research areas

Both crowdsourcing for the creation of ground truth data sets and OCR correction take motivated groups of people, be they scholars, students, or the public. Building that community and engaging its members across similar projects that they are motivated to work on is not a trivial undertaking, but it would be a key task of domain-centric OCR institutes. We often refer to the medievalists as "early adopters" in the digital humanities, and we saw this sort of community build as the Parker Library at Corpus Christi College, Cambridge put their Anglo-Saxon manuscripts online for transcription and reached out in a variety of ways to their scholarly community.

Hosting all the projects on a central platform like FromThePage that facilitates different types of projects -- public and private, transcription or correction -- and makes it easy to find a project of interest and share it would make sense. It would also give the community manager (a required role for any institute) a centralized way to coordinate and communicate with volunteers/contributors.

Recommendation #8: Create an OCR assessment toolkit for cultural heritage institutions

(no comment)

Recommendation #9: Establish an "OCR Service Bureau"

Using FromThePage for OCR correction and ground truth dataset creation could save taxpayer money in the creation of such a service bureau.

IIIF

Although it isn't mentioned in the OCR report, we believe IIIF is the right solution for moving page images and text through data pipelines.

We have thought quite a bit about how you attach text (in many formats -- plaintext, HTML, TEI, optimized for search or analytics) to page images in IIIF. The FromThePage API provides one example of how you can do this; our approach is shared by Jeffrey Witt’s Sentences Commentary Text Archive. We would encourage any system implementers for OCR improvement to use IIIF to transport page images and text together. IIIF allows you to target specific regions of an image (say a line) with a specific body (say the transcription of it) in a web annotation, and also to link full page images to external resources like ALTO files.
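
To make the region-plus-body idea concrete, here is a minimal sketch of a web annotation that attaches a line transcription to a rectangle on a canvas. It assumes IIIF Presentation API 3 conventions (the "supplementing" motivation and an `xywh` fragment on the target); the URLs, coordinates and function names are hypothetical, and this is not the output format of the FromThePage API.

```python
import json
from typing import Dict, Iterable, List


def line_annotation(annotation_id: str, canvas_id: str,
                    x: int, y: int, w: int, h: int, text: str) -> Dict:
    """One annotation: a transcription body targeting a region of a canvas."""
    return {
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "supplementing",  # text derived from the canvas content
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        # Media-fragment selector: left, top, width, height in canvas pixels.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


def annotation_page(page_id: str, annotations: Iterable[Dict]) -> Dict:
    """Wrap annotations in an AnnotationPage that a canvas can reference."""
    items: List[Dict] = list(annotations)
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": page_id,
        "type": "AnnotationPage",
        "items": items,
    }


if __name__ == "__main__":
    # Hypothetical identifiers and coordinates, for illustration only.
    canvas = "https://example.org/iiif/book1/canvas/p1"
    anno = line_annotation("https://example.org/iiif/book1/anno/p1-l1",
                           canvas, 120, 340, 1800, 95, "Dear Sir,")
    page = annotation_page("https://example.org/iiif/book1/annopage/p1", [anno])
    print(json.dumps(page, indent=2))
```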

A recent discussion with a scholar about working with OCR'd text of early international law broadsheets from the Bibliothèque nationale de France made us realize that exposing the original OCR format (e.g. ALTO) as a seeAlso link in the IIIF manifest would give scholars the ability to go back to the original OCR bounding boxes and perform their own transformations specific to their unique texts and projects. In other words, until we "solve" the problems with OCR of historical texts, exposing raw OCR files is one way to increase the number of approaches to processing and improving OCR output.
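
As a sketch of how a scholar's script might pick up those links, the function below walks a manifest that has already been loaded as a Python dictionary and collects one ALTO URL per canvas. It assumes a Presentation API 3 layout (canvases under `items`, each with an optional `seeAlso` list) and assumes that ALTO resources can be recognized by the word "alto" in their `profile`; institutions may label these links differently, so that test is a placeholder.

```python
from typing import Dict


def find_alto_links(manifest: Dict) -> Dict[str, str]:
    """Map canvas id -> URL of the first seeAlso entry that looks like ALTO."""
    links: Dict[str, str] = {}
    for canvas in manifest.get("items", []):
        for ref in canvas.get("seeAlso", []):
            profile = str(ref.get("profile", "")).lower()
            if "alto" in profile:  # assumption: the profile URI mentions ALTO
                links[canvas["id"]] = ref["id"]
                break
    return links


if __name__ == "__main__":
    # Hypothetical manifest fragment, for illustration only.
    manifest = {
        "type": "Manifest",
        "items": [
            {
                "id": "https://example.org/iiif/canvas/p1",
                "type": "Canvas",
                "seeAlso": [
                    {
                        "id": "https://example.org/ocr/p1.xml",
                        "type": "Dataset",
                        "format": "application/xml",
                        "profile": "http://www.loc.gov/standards/alto/",
                    }
                ],
            },
            {"id": "https://example.org/iiif/canvas/p2", "type": "Canvas"},
        ],
    }
    print(find_alto_links(manifest))
```

Each URL found this way can then be fetched and parsed with whatever tooling suits the project, which is the point of exposing the raw OCR rather than only a derived plaintext.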

We'd suggest looking at the following resources for thinking about IIIF in the context of OCR improvement and text processing pipelines:

Jason Ronallo and his team at North Carolina State University Libraries built Ocracoke, an OCR pipeline built on IIIF.
"Round Trip to Paradise" -- presentation on how John Howard at University College Dublin is using FromThePage's IIIF API to "roundtrip" from Fedora into FromThePage and back.
IIIF runs on community groups, and each group has different domain-specific knowledge about IIIF. For the sorts of work we are talking about here, we'd look into the IIIF Newspapers group, the IIIF Archives group, and the IIIF Text Granularity group.