
FromThePage Blog

Crowdsourcing, transcription and indexing for libraries and archives


Crowdsourcing OCR Correction vs. Manuscript Transcription

February 8, 2018 by Ben Brumfield

(This is a repost of a reply to a query about crowdsourced OCR correction on the Museums Computer Group list.  You can find the original here.)

Many people have pointed to some excellent crowdsourcing platforms, but I wanted to weigh in on OCR correction as a crowdsourcing task distinct from manuscript transcription.  Some excellent platforms (including some of those suggested) won't be able to handle OCR correction without substantial programming effort.

I've been adding OCR correction functionality to our own open-source crowdsourcing platform FromThePage over the last three years, and have found that it's best to be very careful about answering "does this platform do OCR correction?" with a simple "yes" or "no".  The problem is that OCR correction has far more complex integration needs than manuscript transcription, and different communities of practice have radically different implicit assumptions about technical formats.

Any crowdsourcing system will need to ingest not only images and metadata, but also the raw OCR text you hope to correct.  It will also need to associate that text with each page, so that both can be presented to users for correction.  Finally, it will need to export the text in a usable format, and the options for that format may be constrained by the affordances of the crowdsourcing interface.

Let me give a few examples.  In one case, we were asked to handle crowdsourced OCR correction of books hosted on the Internet Archive for a scholarly editing project.  Until recently, the Internet Archive ran a white-label version of ABBYY FineReader against scanned books and made DjVu XML files of the raw OCR text available.  It was straightforward for us to read the DjVu file, correlate each PAGE element of its XML with the page images we present to users, and convert its text to the wiki-text format our users edit.  Once corrected, our existing export formats (at the time, TEI and XHTML) were what the editors needed for their publication.
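The ingest step above is straightforward enough to sketch.  The element names below (OBJECT, PARAM, LINE, WORD) follow the DjVu XML dialect the Internet Archive produced; the inline sample is an illustrative stand-in for a real (far larger) file, so treat the exact structure as an assumption to check against your own data.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an Internet Archive DjVu XML file: one OBJECT per
# page image, with word-level OCR text nested under HIDDENTEXT.
SAMPLE = """\
<DjVuXML><BODY>
  <OBJECT>
    <PARAM name="PAGE" value="page0001.djvu"/>
    <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
      <LINE><WORD>Dear</WORD><WORD>Reader,</WORD></LINE>
    </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
  </OBJECT>
</BODY></DjVuXML>"""

def pages_from_djvu_xml(xml_text):
    """Yield (page_name, text) pairs, one per OBJECT (i.e. per page)."""
    root = ET.fromstring(xml_text)
    for obj in root.iter("OBJECT"):
        # The page's identity rides in a PARAM child named "PAGE".
        name = next((p.get("value") for p in obj.iter("PARAM")
                     if p.get("name") == "PAGE"), None)
        # Rebuild plain text line by line from the word elements.
        lines = [" ".join(w.text or "" for w in line.iter("WORD"))
                 for line in obj.iter("LINE")]
        yield name, "\n".join(lines)

for name, text in pages_from_djvu_xml(SAMPLE):
    print(name, "->", text)  # page0001.djvu -> Dear Reader,
```

Each (page name, text) pair can then be matched to the corresponding page image and loaded as the starting transcription for correction.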

In another case, we were approached by a project that had page images and raw OCR text in a Word document.  Because the OCR process they'd used had merged the text for all pages into a single file, there was no way to correlate the raw text with the images.  In the end, the project had to run a manuscript transcription project in which staffers cut and pasted the text from their document into each page as the initial transcription.

In a third case--one I was working on before I started this email--we were asked to support OCR correction for the CONTENTdm digital library system.  The CONTENTdm API supports reading and writing a plaintext transcript per page, so we're able to do the integration in a totally automated fashion, reading metadata, images, and OCR text from the institution's CONTENTdm installation, presenting them to users in our interface, and updating the institution's records in situ when the correction is done.
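The read side of an integration like that hinges on being able to address each page record by URL.  As a hedged sketch: the dmwebservices query shape below (dmGetItemInfo/&lt;alias&gt;/&lt;pointer&gt;/json) mirrors CONTENTdm's published web API, but the base URL, collection alias, and transcript field are all assumptions to verify against your own installation -- this is not FromThePage's actual integration code.

```python
def item_info_url(base, alias, pointer):
    """Build a dmGetItemInfo query for one item (page) record.

    base    -- hypothetical CONTENTdm server, e.g. https://cdm.example.edu
    alias   -- collection alias within that installation
    pointer -- numeric record identifier for the page
    """
    return f"{base}/dmwebservices/index.php?q=dmGetItemInfo/{alias}/{pointer}/json"

url = item_info_url("https://cdm.example.edu", "letters", 42)
print(url)
```

In practice you would fetch that URL with an HTTP client, pull the transcript field out of the JSON, present it alongside the page image, and push the corrected text back through the institution's update mechanism once correction is done.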

If we were presented with ALTO files to correct tomorrow, we'd need to write code to ingest them into the platform, even though we already support a couple of other OCR formats.  If we also had to produce ALTO from the corrected text (rather than plaintext, TEI, or HTML) it would be a challenge, since ALTO files have bounding boxes for each word, and these are lost during our page-at-a-time correction process.  (It's not impossible, just difficult.)
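To see why ALTO export is the hard direction, it helps to look at what ALTO actually stores: every word is a String element carrying its own bounding box.  The fragment below is a minimal illustration (namespace declarations omitted; real ALTO files require one), not output from any particular OCR engine.

```python
import xml.etree.ElementTree as ET

# Minimal ALTO-style fragment: each word is a String element with its
# own coordinates (HPOS/VPOS/WIDTH/HEIGHT).
ALTO = """\
<alto><Layout><Page><PrintSpace><TextBlock><TextLine>
  <String CONTENT="0ld" HPOS="10" VPOS="40" WIDTH="30" HEIGHT="12"/>
  <String CONTENT="Letters" HPOS="46" VPOS="40" WIDTH="70" HEIGHT="12"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""

root = ET.fromstring(ALTO)
words = [(s.get("CONTENT"), (int(s.get("HPOS")), int(s.get("VPOS"))))
         for s in root.iter("String")]
print(words)

# Ingesting is easy: join the CONTENT values into page text.  But after a
# volunteer corrects the page as one free-form blob ("0ld" -> "Old",
# perhaps splitting or merging words), there is no recorded mapping from
# corrected words back to these boxes, so emitting valid ALTO again means
# realigning text to coordinates -- difficult, though not impossible.
corrected = "Old Letters"
```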

To sum up, you really have to ask of any crowdsourcing platform,

  • How does my data (page images, presentational/structural metadata, and raw OCR text) get ingested into the platform, especially given the formats it is currently in?
  • How does my data (corrected transcripts) get back out of the platform in a format I can use?

I'm always happy to chat with people about manuscript transcription or OCR correction projects and platforms.  You can contact me via email at benwbrum@fromthepage.com.


