I was so impressed by Nicholas Woodward's presentation at TCDL this year that I asked him if I could share "Crowdsourcing + Machine Learning: Building an Application to Convert Scanned Documents to Text" on this blog.
Hi. My name is Nicholas Woodward, and I am a Software Developer for the University of Texas Libraries. Ben Brumfield has been so kind as to offer me an opportunity to write a guest post on his blog about my approach for transcribing large scanned document collections that combines crowdsourcing and computer vision. I presented my application at the Texas Conference on Digital Libraries on May 7th, 2013, and the slides from the presentation are available on TCDL’s website. This purpose of this post is to introduce my approach along with a test collection and preliminary results. I’ll conclude with a discussion on potential avenues for future work.
Before we delve into algorithms for computer vision and what-not, I’d first like to say a word about the collection used in this project and why I think it’s important to look for new ways to complement crowdsourcing transcription. The Guatemalan National Police Historical Archive (or AHPN, in Spanish) contains the records of the Guatemalan National Police from 1882-2005. It is estimated that AHPN contains more than 80 million pages of documents (8,000 linear meters) such as handwritten journals and ledgers, birth certificate and marriage license forms, identification cards and typewritten letters. To date, the AHPN staff have processed and digitized approximately 14 million pages of the collection, and they are publicly available in a digital repository that was developed by UT Libraries.
While unique for its size, AHPN is representative of an increasingly common problem in the humanities and social sciences. The nature of the original documents precludes any economical OCR solution on the scanned images (See below), and the immense size of the collection makes page-by-page transcription highly impractical, even when using a crowdsourcing approach. Additionally, the collection does not contain sufficient metadata to support browsing via commonly used traits, such as titles or authors of documents.
These characteristics of AHPN informed my idea to develop a different method for transcribing a large scanned document collection that draws on the work of popular crowdsourcing transcription tools such as FromThePage, Scripto and Transcribe Bentham. These tools allow users to transcribe individual records of a collection, maintaining quality control largely through redundancy and error checking.
The only drawback from this approach is that the results of crowdsourcing are only applicable to the document being transcribed. In contrast, my approach looks to break up documents into individual words with the idea that though no two documents are exactly alike they are likely to contain similar words. And across an entire corpus, particularly very large ones such as AHPN, words are likely to appear many times. Consequently, if users transcribe the words of one document, then I can use image matching algorithms to find other images of the same words and apply the crowdsourced transcription to the new images. The process looks like this:
- Segment scanned documents into words
- Crowdsource the transcription of a fraction of those words
- Use image matching to pair images of transcribed images with unknown images
- In the case of a match, associate the crowdsourced text with the document containing the unknown image
Step 1 attempts to segment images like so.
The key point here is that due to a host of factors (smudges, speckles, light text, poor typewriters, etc.) the segmentation will not be 100% accurate. This is OK because the goal is really only to get as many words as possible, understanding that if we can successfully capture the type of terms that users of the collection will typically use to search, i.e. dates or names of people, places, and organizations, then the online version of AHPN will become much more useful than it is now. Here are a few typical examples of documents from AHPN with the segmentation algorithm I developed using OpenCV libraries.
The result is a folder of small images containing individual words.
The second step is to crowdsource the transcription of as many of these words as possible. I developed an online application using CakePHP and MySQL that allows users to select documents and then transcribe the words they see.
The crowdsourcing output from the testing phase of the project looked like this.
Because there were so few images per word, I abandoned the original plans to create a multi-class SVM classifier that would find matching words in bulk. Instead, I drew on OpenCV again to develop a workflow for image matching that consisted of six steps.
- Compare the width and height of each image
- Compare the histograms of each image
- Extract ‘interest points’ of each image using SURF algorithm
- Match interest points between them
- Calculate average distance between interest points
- If the average is below a certain threshold, consider the images a match
The first step is fairly intuitive. Longer images likely contain words with more letters than shorter images, and a different number of letters indicates they are clearly not the same word. This step is relatively basic, but it turns out that in many, many cases it obviates the subsequent more computationally intensive work.
The second step is also relatively straightforward. Image histograms are just the counts of pixels of different hues in an image. In a typical color image, a histogram could be the number of pixels of each color. In this case the scanned documents are grayscale TIFF images, and calculating histograms is simply a matter of adding up all the black and white pixels separately (See below). Two images of different words may contain the same number of letters but the letters are different shapes and so their ratio of black to white pixels will also differ. Note: this does not solve the problem of words with similar letters in a different order, e.g. “la” and “al”. This case is handled in the next step.
Steps 3 through 6 represent the most computationally intensive part of the entire matching process, and they are based on the Speeded Up Robust Features (SURF) implementation in OpenCV. The basic idea is to use the SURF algorithm to find the “interest points” of an image and then measure the distance between points in two separate images. Interest points are curves, blobs and points where two lines meet. Going this route is both a strength and weakness to my approach. I’ll explain. Even if we don’t speak Spanish, it’s relatively easy to infer that the images below contain the same word, just in a different format. Whether they’re in all caps or all lowercase or the first letter is capitalized, the word is the same to a human.
But not to a computer.
And we see this when we use SURF to calculate the interest points of each image.
The final step to matching images involves computing the distance between interest points and finding the matches below a certain distance. In this case, I used the ratio test to eliminate poor interest point matches and then considered to images the same if their average distance was below .20. The example below attempts to match the word “placas” (plates) with several other words, and finds a match with another image of the word placas, despite the poor quality of the second image.
The point here is that we will need to match each of these images separately, even though they contain the same word. This means we’ll need more images to train a classifier (or image match), and we will have to find every possible way that a word (WORD, Word, word, etc.) appears in a corpus and crowdsource the transcription of at least a few of them before we classify unknown words. Ergo more time and work to finish.
But here is how it’s also a strength of the approach in at least two ways. The images above are Spanish words, but to a computer they’re just objects. They could really be in any language and the same basic workflow would apply. The only adjustments would be to crowdsourcers who understood the language (helpful for determining the word when it’s hard to make out and context helps) and tweaks to the algorithm parameters in steps 3-6. Second, while the format of the word is not necessarily important for keyword searches, it is relevant to many computation research methods such as document clustering, classification and finding named entities.
The output of this approach on the test collection from AHPN is as follows:
There are several implications from these results. First, shorter images, i.e. one or two letters, basically bring us back to the same issues of OCR. In the case of AHPN the quality of the scanned documents varies, and in many instances there are just not enough interest points to differentiate between ‘a’, ‘o’ and ‘e’. So my approach may only be suitable for longer words. This may not be too serious of an issue since in most cases short words are also stopwords, which are generally not needed for either keyword searches or computational research. Second, without a doubt, I will need to do the crowdsourcing on a much larger scale. 2.33 images per word is not enough training data to create an image classifier capable of finding other images of that word. And there is also a greater need for transcribing output because this approach requires images of each word in all of its forms, along with the transcribed text.
The future steps of the project, then, must include an avenue for acquiring more crowdsourced transcription. I think the best approach for this will be to refactor the CakePHP code into modules for Omeka, Drupal and other popular open source CMSs. Similar to the crowdsourcing tools mentioned above, Scripto, FromThePage, etc., I’ll need to enlist the users of particular collections who may have the most vested interest in seeing them become more accessible and functional. Another important component will be to continue development on the algorithms for image segmentation and matching. I am interested in looking at how to segment handwritten text, especially cursive text where it is particularly difficult to determine the spaces between words. Additionally, the image matching detailed above may be useful going forward because it is not particularly memory intensive and the work can be divided into separate tasks for a distributed computing environment, maybe something along the lines of SETI@Home. But with more output from crowdsourcing it would be good to incorporate either a Bag of Words or SVM classifier approach to process more words at a time. So there’s definitely plenty to do. Stay tuned!