At a Glance: The Differences Between FromThePage and Transkribus
- FromThePage is a crowdsourcing platform for transcribing handwritten and typewritten manuscript materials. It harnesses the skills of human transcribers to produce transcripts, so it can handle materials that are more complicated and require more skill and judgment to transcribe.
- Transkribus is software that uses artificial intelligence and machine learning to create transcripts automatically. Training its AI for your particular project requires a corpus of manual transcription. The Transkribus site says roughly 50 pages of manual transcription are needed, though the required sample can be smaller or larger depending on the project.
- Transkribus works best on large collections in very clear, consistent handwriting that require less human expertise. FromThePage is a great choice for complex documents like ledgers and logs, which may have inconsistencies and diverse formats that vary across a collection.
- Layout detection is hard for any transcription platform—automated or not. FromThePage depends on human encoding of textual elements while Transkribus relies on AI detection. As such, FromThePage may help with these more complex materials that have challenging or inconsistent layouts.
- FromThePage supports the creation of descriptive metadata, so its built-in flexibility can benefit more than just transcription.
- FromThePage supports public engagement and interaction with materials by crowdsourcing to individuals beyond institutional walls.
FromThePage software exists to facilitate the transcription of manuscripts and other materials via careful, human crowdsourcing. FromThePage started when co-founder Ben Brumfield received his great-great-great grandmother’s diaries and wanted to collaboratively transcribe them with family members. As a result, FromThePage has always prided itself on being a way to harness the power of human transcription skills and to create meaningful experiences with, and access to, manuscripts.
Transkribus is a software transcription platform that features Handwritten Text Recognition (HTR), a technology that uses Artificial Intelligence (AI) to automate the transcription of handwritten materials. Transkribus, developed in 2020 by the University of Innsbruck in collaboration with READ-COOP, allows users to train models on handwritten collection materials; those models can then be used to automate transcription of the collection. Training models and automating transcription through Transkribus can sometimes save time on projects. But this pathway to transcription is not a one-size-fits-all solution, and there are times when automated transcription may not be the best choice for your projects.
Recently on Twitter, Ben Brumfield discussed some of the benefits of Transkribus and FromThePage, and when one might be a better solution than the other. FromThePage does not rely on automation, but rather on the work of skilled volunteers and collaborators who transcribe documents and even create metadata for materials. FromThePage is a great choice for projects that need many human eyes and a high level of expertise, like complex handwritten documents, ledgers and logs in table format (even those with skewed or uneven columns, rows, and table structures), and materials that aren’t consistent in format throughout the work. FromThePage can also be used for collections of typewritten materials.
In contrast, Transkribus, as Ben points out, can work for large projects that are simple and more homogeneous in nature, like typed documents, or projects that require little human mark-up or correction. In the conversation on Twitter, Ben and Tobias Hodel discussed how HTR is promising as a norm for some materials. Tobias says that “For large (rather unspecific textual entities) HTR will become the norm, while scholars focus on highly interesting parts/texts.”
Tables are another consideration in deciding what transcription tools can benefit your project. Transkribus can handle column structures, but is currently not strong at handling tables, as Quinn Dombrowski points out. Tobias adds that using Transkribus for tables requires work outside of Transkribus and careful scanning and document positioning. Quinn adds that “Layout is hard. The difference in quality I've seen between different OCR programs these days often has more to do with layout detection than character recognition.”
Transkribus can only recognize the text in a table and transcribe it as unstructured text, while one of FromThePage’s useful features is the ability to index complex tables with rows and columns. Frederik Elwert added to the conversation that “Table recognition is mandatory if we want to digitize historical datasets *as data*, not as unstructured text.” You can have transcribers fill out columns with exactly what they see, or take a more “accurate” approach, for example by having transcribers select standardized state abbreviations from a drop-down menu or comply with standard date formats. Or perhaps you’re only collecting information from a particular column of a table. FromThePage is flexible enough to fit your project’s needs and provides comprehensive and accurate transcriptions.
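To make the difference between the two approaches concrete, here is a minimal Python sketch of what a single indexed table row might look like when the verbatim, type-what-you-see value is kept alongside a standardized one. It is an illustration only, not how FromThePage works internally: the function names, the lookup table, and the list of date formats are all hypothetical, and a real project would use its own controlled vocabularies.

```python
from datetime import datetime

# Hypothetical controlled vocabulary: verbatim spellings -> standard abbreviations.
STATE_ABBREVIATIONS = {
    "ohio": "OH",
    "texas": "TX",
    "tex.": "TX",
    "virginia": "VA",
    "va.": "VA",
}

# Hypothetical set of date layouts a project might expect in its ledgers.
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%d %B %Y", "%m/%d/%Y")


def normalize_state(verbatim):
    """Return the standard abbreviation, or None if the spelling is unknown."""
    return STATE_ABBREVIATIONS.get(verbatim.strip().lower())


def normalize_date(verbatim):
    """Return an ISO 8601 date string, or None if no expected format matches."""
    text = verbatim.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def index_row(row):
    """Keep what the transcriber saw and add the standardized values next to it."""
    return {
        "state_verbatim": row["state"],
        "state": normalize_state(row["state"]),
        "date_verbatim": row["date"],
        "date": normalize_date(row["date"]),
    }


indexed = index_row({"state": "Tex.", "date": "March 4, 1861"})
print(indexed)
```

Values that do not match the vocabulary or any expected date format come back as `None`, which flags them for a human to review rather than guessing at a standardized form.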
In addition to crowdsourcing transcription, FromThePage also supports metadata generation, so it can benefit many parts of a project workflow. Some projects use FromThePage internally to generate standardized metadata using forms that correspond with each uploaded manuscript image (Read about how Ohio University Libraries is doing this). HTR can only work with the text it sees on the page. Metadata extends beyond the words on the page and, as such, really needs the expertise of a human to be successful. Crowdsourcing can be a great way to improve your project’s metadata in accordance with your institutional standards.
As a crowdsourcing platform, another tenet of FromThePage is open support of collaboration and public engagement with materials through transcription. Allowing anyone to access and transcribe materials supports public engagement and increases accessibility. This engagement also supports lifelong learning by offering educational transcription experiences to individuals from various backgrounds. Projects on FromThePage can also be kept private to an institution if that is a better fit for project needs.
Overall, when it comes to tables, metadata needs, and other transcription challenges, you have to evaluate the needs and goals of your project. Do you need unstructured text, or is a column/table format with structured, indexed data an important and necessary component? As Ben Brumfield wrote in an earlier blog post, you have to consider and decide on the quality you are looking for in a project’s output. Is your project more about fidelity to the document text (type-what-you-see), or usability (for example, accurate indexing that follows controlled vocabularies to support data needs)? You have to think about your workflow and your project’s end goals. What do you hope to generate? Where will the generated transcription data end up, and which systems does the data need to be compatible with? Is this a good project for public engagement?
Some projects can potentially benefit from a workflow that combines FromThePage and Transkribus. Transkribus is automated, but requires manual, human transcription at the front end of a project before AI can be deployed to automate transcription. This is where FromThePage can come in. Rather than having one person or a small team transcribe materials, FromThePage can be used to crowdsource the initial training batch of manually transcribed materials. This output can be downloaded and used to train Transkribus models.
This blog post covers two transcription technologies that are developing very rapidly, in two fields—crowdsourcing and Handwritten Text Recognition—that are also rapidly growing and changing. This post compares the two software projects as of March 2022; however, both are undergoing regular development with frequent updates. Please check with the FromThePage and Transkribus teams for the current state of both projects. The chart below provides an overview of the strengths of both software projects and can help you assess which tool can best help you reach your transcription goals. If you are interested in FromThePage or have any questions, please reach out to us at support@FromThePage.com.
| Feature | FromThePage | Transkribus |
| --- | --- | --- |
| Uses human expertise for all transcriptions | ✅ | 🚫 |
| Platform supports crowdsourcing and collaborative transcription | ✅ | 🚫 |
| Uses artificial intelligence, machine learning, and automated processes | 🚫 | ✅ |
| Can be part of an artificial intelligence, machine learning, or other automated processes workflow | ✅ | ✅ |
| Handles transcription of unique and complex handwritten documents | ✅ | 🚫 |
| Supports accurate transcription and indexing of tables and ledgers | ✅ | 🚫 |
| Supports simple column structures | ✅ | ✅ |
| Supports transcription of simple handwritten cursive, handwritten non-cursive, or typed documents | ✅ | ✅ |
| Supports complex and inconsistent document layouts | ✅ | 🚫 |
| Supports crowdsourced metadata creation | ✅ | 🚫 |
| Supports and encourages public engagement with materials | ✅ | 🚫 |