FromThePage Blog

about crowdsourcing, manuscript transcription, digital humanities and digital documentary editions

iDigBio Augmenting OCR Hackathon

February 15, 2013 By Ben Brumfield

I spent the last three days at the iDigBio Augmenting OCR Hackathon working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels.  While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post them.

This was my first hackathon--a condition which seemed to be the rule among the participants--and I was really impressed with it.  The iDigBio folks defined a clear goal (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm) a couple of months beforehand, and organized five weekly videoconferences before the event.  Most important of all, the participants were encouraged to prepare a 10-minute lightning talk on their efforts and preliminary results.  (See below for the transcript of my talk, and see the notes document for descriptions of all the talks.)

In my opinion, these preliminary talks were critical to the success of the project.  The preliminary nature relaxed pressure on participants, so we were able to experiment beyond the target of the hackathon (as I did with my handwriting detection digression, a related, but un-scorable effort).  On the other hand, they did provide enough impetus to get many of us looking at the data, working with the tools, and thinking about approaches.  This meant that even before the hackathon started, many of us were familiar enough with the materials to have a real 'meeting of the minds' experience during the pre-event supper:  "Did you just say 'the contrast difference between the print and the label is higher than the difference between the label and the background'?  We ran into that too, and here's what we did..."

The experience was a real education in OCR for me, and I feel like I picked up techniques I can apply directly to projects I've discussed with clients and potential clients.  In particular, I got a real appreciation for how interrelated image preparation, OCR, and parsing are.  One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do.  Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR.  My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR.  A simple 1-2-3 workflow just isn't sufficient!
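To make the per-field idea concrete, here is a minimal sketch in Python of what field-specific cleanup might look like.  It is not the participant's actual library: the field names, the rules, and the `clean_field` helper are all assumptions I made up for illustration.  The point is only that the same OCR confusion (say, a zero versus a letter O) gets corrected in opposite directions depending on which field the text came from.

```python
import re

# Hypothetical correction rules, keyed by field type.  These are
# illustrative guesses at common OCR confusions, not the rules used
# by any hackathon participant.
FIELD_RULES = {
    # Coordinates are mostly digits, so letters that look like digits
    # are turned into digits.  Hemisphere letters (N, S, E, W) are
    # not matched by these patterns and pass through unchanged.
    "coordinate": [
        (re.compile(r"[Oo]"), "0"),
        (re.compile(r"[lI|]"), "1"),
    ],
    # Names are mostly letters, so a digit sandwiched between letters
    # is turned back into the letter it resembles.
    "person_name": [
        (re.compile(r"(?<=[A-Za-z])0(?=[A-Za-z])"), "o"),
        (re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])"), "l"),
    ],
    # Catalog numbers: close up stray whitespace between digits.
    "catalog_number": [
        (re.compile(r"(?<=\d)\s+(?=\d)"), ""),
    ],
}


def clean_field(field_type: str, text: str) -> str:
    """Apply the rules for one field type; unknown types pass through."""
    for pattern, replacement in FIELD_RULES.get(field_type, []):
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(clean_field("coordinate", "4O°l5'N"))
    print(clean_field("person_name", "R0bert Wi1son"))
    print(clean_field("catalog_number", "NY 0123 456"))
```

Notice that running the coordinate rules over a personal name would do real damage (every "o" would become a zero), which is exactly why one shared cleanup pass is not enough.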

iDigBio itself is an NSF-funded attempt to advance digitization practices for natural history collections, combining disciplinary "thematic collection networks" and methodologically focused working groups on topics like georeferencing, crowdsourcing, and OCR.  Aware that they're not the only people digitizing things, they have been reaching out beyond the natural sciences to the library and information science community at the iConference this year.  This rejection of "not invented here" siloing was a big part of the hackathon, and I hope that more people from outside the natural sciences will get involved.

Filed Under: hackathon, ocr
