Tyler Dukes has written a concise introduction to the issues with handwritten material and a lovely review of FromThePage at ReportersLab:
Even when physical documents are converted into digital format, subtle inconsistencies in handwriting prove too much for optical character recognition software. The best computer scientists have been able to do is apply various machine learning techniques, but most of these require a lot of training data — accurate transcriptions deciphered by humans and fed into an algorithm.
“Fundamentally, I don’t think that we’re going to see effective OCR for freeform cursive any time soon,” Brumfield said. “The big successes so far with machine recognition have been in domains in which there’s a really constrained [set of] possibilities for what is written down.”
That means entries like numbers. Dates. Zip codes. Get beyond that, and you’re out of luck.
I don’t know much about the world of investigative journalism, but it wouldn’t surprise me if it holds as many intriguing parallels and new challenges as I’ve discovered among natural science collections. Handwriting might still be the most interdisciplinary technology.