This is my fourth and final post about the iDigBio Augmenting OCR Hackathon. Prior posts covered the hackathon itself, my presentation on preliminary results, and my results improving the OCR on entomology specimens. The other participants are slowly adding their results to the hackathon wiki, which I recommend checking back on (their efforts were much more … [Read more...] about Detecting Handwriting in OCR Text
This project attempted to improve the quality of OCR applied to difficult entomology images[*] by cropping labels from the images to run through OCR separately. To identify labels on the image to crop, an initial, 'naive' pass of OCR was made over the whole image, generating both A) a set of rectangles on the image defined as word bounding boxes by the OCR engine, … [Read more...] about Results of the "Ocrocrop" Approach to Improving OCR
I spent the last three days at the iDigBio Augmenting OCR Hackathon working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels. While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post … [Read more...] about iDigBio Augmenting OCR Hackathon
This is a transcript of my talk at the iDigBio Augmenting OCR Hackathon, presenting preliminary results of my efforts before the event. For my preliminary work, I tried to improve the inputs to our OCR process by looking at the outputs of a naive OCR. One of the first things that we can do to improve the quality of our inputs to OCR is to not feed them handwriting. To … [Read more...] about Improving OCR Inputs from OCR Outputs?