I spent the last three days at the iDigBio Augmenting OCR Hackathon working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels. While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post them.
This was my first hackathon--a condition which seemed to be the rule among the participants--and I was really impressed with it. The iDigBio folks defined a clear set of goals (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm) a couple of months beforehand, and organized five weekly videoconferences before the event. Most important of all, the participants were encouraged to prepare a 10-minute lightning talk on their efforts and preliminary results. (See below for the transcript of my talk; see the notes document for descriptions of all the talks.)
In my opinion, these preliminary talks were critical to the success of the project. Their preliminary nature relaxed the pressure on participants, so we were able to experiment beyond the target of the hackathon (as I did with my handwriting detection digression, a related but unscorable effort). On the other hand, they did provide enough impetus to get many of us looking at the data, working with the tools, and thinking about approaches. This meant that even before the hackathon started, many of us were familiar enough with the materials to have a real 'meeting of the minds' experience during the pre-event supper: "Did you just say 'the contrast difference between the print and the label is higher than the difference between the label and the background'? We ran into that too, and here's what we did..."
The experience was a real education in OCR for me, and I feel like I picked up techniques I can apply directly to projects I've discussed with clients and potential clients. In particular, I got a real appreciation for how interrelated image preparation, OCR, and parsing are. One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do. Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR. My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR. A simple 1-2-3 workflow just isn't sufficient!
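To make the per-field cleanup idea concrete, here is a minimal sketch in Python. It is not the participant's actual library: the field names, the `FIELD_RULES` table, the `clean_field` helper, and the patterns themselves are all my own assumptions, chosen only to show why a coordinate and a personal name want opposite corrections for the same OCR confusion.

```python
import re

# Hypothetical per-field correction rules (illustrative patterns only,
# not taken from any hackathon participant's code).
FIELD_RULES = {
    # In a coordinate, a letter O or l is almost certainly a misread digit.
    "coordinate": [
        (re.compile(r"[Oo]"), "0"),
        (re.compile(r"[lI|]"), "1"),
        (re.compile(r"\s+"), " "),
    ],
    # In a personal name, a digit between letters is almost certainly
    # a misread letter, so the correction runs the other way.
    "person_name": [
        (re.compile(r"(?<=[A-Za-z])0(?=[a-z])"), "o"),
        (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
        (re.compile(r"\s+"), " "),
    ],
}


def clean_field(kind, text):
    """Apply the correction rules for one kind of field, in order."""
    for pattern, replacement in FIELD_RULES.get(kind, []):
        text = pattern.sub(replacement, text)
    return text.strip()


print(clean_field("coordinate", "4O.l2 N"))       # 40.12 N
print(clean_field("person_name", "J0hn Wi1son"))  # John Wilson
```

Unknown field kinds pass through with only whitespace stripped, which seems like the safer default. A real rule set would presumably be much larger and tuned against actual OCR output for each collection.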
iDigBio itself is an NSF-funded attempt to advance digitization practices for natural history collections, combining disciplinary "thematic collection networks" with methodologically focused working groups on topics like georeferencing, crowdsourcing, and OCR. Aware that they're not the only people digitizing things, they have been reaching out beyond the natural sciences to the library and information science community at the iConference this year. This rejection of "not invented here" siloing was a big part of the hackathon, and I hope that more people from outside the natural sciences will get involved.