This work will fund important enhancements to FromThePage.
While FromThePage supports transcribing text in any Unicode-supported script, the application interface itself has only been in English. We have found that this limits contributions from non-native English speakers. For instance, a Mixtec transcription project received many more contextual notes from transcribers once it was emphasized that notes could be left in Spanish. Even the labels on fields and buttons biased users’ interactions in ways that privileged English.
The grant provides funds for the translation of the FromThePage user interface to both Spanish and Portuguese, which will allow transcribers in Latin America to use FromThePage in their own languages. The technical work to support this involves extracting all the application text into message catalogs. Once this infrastructure is in place, the open-source community can contribute translations of the interface into other languages.
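As a rough illustration of what extracting interface text into message catalogs means, here is a minimal Python sketch. It is not FromThePage's actual code (FromThePage is a Ruby on Rails application); the keys, strings, and the `translate` helper are hypothetical. Each interface string is stored under a key, one catalog per language, and a lookup falls back to English when a translation is missing.

```python
# Hypothetical message catalogs: the keys and translations are illustrative only,
# not taken from the FromThePage code base.
CATALOGS = {
    "en": {"transcribe.save": "Save Changes", "transcribe.note": "Leave a note"},
    "es": {"transcribe.save": "Guardar cambios", "transcribe.note": "Dejar una nota"},
    "pt": {"transcribe.save": "Salvar alterações"},  # "transcribe.note" not yet translated
}

DEFAULT_LOCALE = "en"


def translate(key: str, locale: str) -> str:
    """Return the interface string for `key` in `locale`.

    Falls back to the default (English) catalog, then to the key itself,
    so a missing translation never breaks the page.
    """
    for candidate in (locale, DEFAULT_LOCALE):
        text = CATALOGS.get(candidate, {}).get(key)
        if text is not None:
            return text
    return key


if __name__ == "__main__":
    print(translate("transcribe.save", "es"))
    print(translate("transcribe.note", "pt"))  # falls back to the English string
```

Because the application code refers only to keys, adding a new language in this arrangement means adding a new catalog rather than changing the application itself.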
New Export Formats
Another major part of this grant will fund enhancements to FromThePage’s export formats to make crowd contributions more easily usable in scholarly publications and library systems. The FromThePage project owner community will be surveyed to determine export format priorities, but we anticipate that these formats may include:
It’s interesting to look back at 2010 to see what’s happened to the projects I thought were major developments:
FamilySearch Indexing remains the largest crowdsourcing transcription project. In the 2010s, they upgraded their technology from a desktop-based indexing program to a web-based indexing tool. Overall numbers are hard to find, but by 2013, volunteers were indexing half a million names per day.
In 2010, the Zooniverse citizen science platform had just launched its first transcription project, Old Weather. This was a success in multiple ways:
The Old Weather project completed its initial transcription goals, proving that a platform and volunteer base which started with image classification tasks could transition into text-heavy tasks. Old Weather is now on its third phase of data gathering.
Zooniverse was able to use the lessons learned in this project to launch new transcription projects like Measuring the ANZACs, AnnoTate, and Shakespeare’s World. They were eventually able to include this type of crowdsourcing task in their Zooniverse Project Builder toolset.
The Scribe software Zooniverse developed for Old Weather was released as open source and, after a collaboration with NYPL Labs, became the go-to tool for structured data transcription projects in the middle part of the decade, adopted by University of California-Davis, Yale Digital Humanities Lab, and others.
The North American Bird Phenology Program at the Patuxent Wildlife Research Center was apparently completed in 2016, transcribing nearly a million birdwatcher observation cards. Unfortunately, its digital presence seems to have been removed with the retirement of the project, so its fantastic archive of project newsletters is no longer available to serve as a model.
Our own open-source FromThePage software has been deployed at university libraries and botanical gardens around the world and is used on materials ranging from Arabic scientific manuscripts to Aztec codices. Statistics are hard to come by for a distributed project, but the FromThePage.com site currently shows 455,846 transcribed pages, which does not include the 100,000+ pages for projects removed after completion.
Transcribe Bentham has continued, reaching 22,740 pages transcribed in December 2019. The team continues to produce cutting-edge research about crowdsourcing and ways that crowdsourced transcription can provide ground-truth data for machine learning projects.
Developments in the 2010s
What are the major developments during the 2010s?
In 2010, crowdsourcing was considered an experiment. Now, it’s a standard part of library infrastructure: Seven state archives run transcription and indexing projects on FromThePage, and some (like the Library of Virginia) run additional projects on other platforms simultaneously. The British Library, Newberry, Smithsonian Institution, and Europeana all run crowdsourcing initiatives. Crowdsourcing platforms integrate with content management systems (like FromThePage with CONTENTdm or Madoc with Omeka S) and crowdsourcing capabilities are built directly into digital library systems like Goobi. Library schools are assigning projects on crowdsourcing to future archivists and librarians.
Perhaps the best evidence for the maturity of the methodology is this announcement by Kate Zwaard, Director of Digital Strategy at the Library of Congress, writing in the December 2019 LC Labs newsletter:
This time of transition of seasons also brings a new beginning for By the People. It is graduating from pilot to program and moving, along with program community managers, to a new home in the Digital Content Management Section of the Library. LC Labs incubated and nurtured By the People as an experiment in engaging users with our collections. Success, though, means it must move as it transitions to a permanent place in the Library.
The By the People crowdsourcing platform had started as an experiment within LC Labs, was built up under that umbrella by able veterans of previous projects, and is now simply Library of Congress infrastructure.
One of the ways the field matures is through personnel moving organizations, bringing their expertise to new institutions with different source materials and teams. We’ve seen a number of early pioneers in crowdsourcing find positions in institutions that hired them for that experience. Those institutions’ ability to commit resources to crowdsourcing projects and the crowdsourcing veterans’ discoveries as they work in different environments or with different materials have combined to push the methodology beyond the fundamentals.
This may seem a bit vague, since I don’t want to talk about particular colleagues’ careers by name, but it’s one of the things that suggests a vibrant future for crowdsourcing.
The rise of cloud platforms
In 2010, institutions that wanted to run crowdsourcing projects either had to develop their own software from scratch or install one of only a handful of tools on their own servers. At the end of the decade, cloud hosting exists for most types of tasks, making crowdsourcing infrastructure available for institutions of all sizes and even individual editors:
FromThePage.com supports free-text and structured text transcription with plans for individuals, small organizations, and large institutions.
SciFabric runs projects using the PyBossa framework, with small and large organization plans.
Kindex supports free-text and structured transcription for individual genealogists.
HJournals supports transcription and indexing linked to FamilySearch trees for genealogists.
Emerging consensus on “free labor” and ethics
I’m happy to see that the most exuberant aspiration (and fear) about crowdsourcing has largely disappeared from our conversations with institutions. I’m talking about the idea that scholarly editors or professional staff at libraries and archives can be replaced by a crowd of volunteers who will do the same work for free. Decision-makers seem to understand that crowdsourced tasks are different in nature from most professional work and that crowdsourcing projects cannot succeed without guidance, support, and intervention by staff.
Practitioners also continue to discuss ethics in our work. Current questions focus on balancing credit and privacy for volunteers, appropriate access to culturally sensitive material, and concern about immersing volunteers in archives of violence. One really encouraging development has been the use of crowdsourcing projects to advance communities’ understanding of their history, as with the Sewanee Project on Slavery, Race and Reconciliation or the Julian Bond Transcribathon.
The first transcribathon, and the first use of the word, was hosted by the Folger Shakespeare Library in 2014. Since then, transcribathons have become a popular way to kick off projects and get volunteers over the hurdle of their first transcriptions. Transcribathons are also an opportunity for collaboration: the Frederick Douglass Day transcribathons have partnered with the Colored Conventions Project, University of Texas-Austin and the New Orleans Jazz Museum worked together on Spanish and French colonial documents, and Kansas State hosted a satellite transcribathon for the Library of Congress’s suffragists’ letters. Some transcribathons last just an afternoon; others, like WeDigBio, span four days. The Julian Bond transcribathons have become annual events, while the Library of Virginia hosts a transcribathon every month for local volunteers.
Directions for the 2020s
What’s ahead in the next decade?
Quality control methodologies remain a hot topic, as wiki-like platforms like FromThePage experiment with assigned review and double-keying platforms like Zooniverse experiment with collaborative approaches. This remains a challenge not because volunteers don’t produce good work, but because quality control methods require projects to carefully balance volunteer effort, staff labor, suitability to the material, and usefulness of results.
I think we’ll also see more progress on tools for source material that is currently poorly served. Free-text tools like FromThePage now support structured data (i.e., form-based) transcription and structured transcription tools like Zooniverse can support free text, so there has been some progress on this front during the last few years. Despite that progress, tabular documents like ledgers or census records remain hard to transcribe in a scalable way. Audio transcription seems like an obvious next step for many platforms; the Smithsonian Transcription Center has already begun with TC Sound. Linking transcribed text to linked data resources will require new user interfaces and complex data flows. Finally, while OCR correction seems like it should be a solved problem (and is for Trove), it continues to present massive challenges in layout analysis for newspapers and volunteer motivation for everything else.
The Transcribe Bentham team has led the way on integrating crowdsourced transcription with handwritten text recognition as part of the READ project, and the Transkribus HTR platform has built a crowdsourcing component into their software. That’s solid progress towards integrating AI techniques with crowdsourcing, but we can expect a lot more flux this decade as the boundaries shift between the kinds of tasks computers do well and those which only humans can do. One of the biggest challenges is to find ways to use machine learning to make humans more productive without replacing or demotivating them, if experience with OCR is any indication.
First, tell us about your documents.
Diary entries were written by Alfred Doten every day of his life from the day he left home in Plymouth, Massachusetts, in 1849 at age 19 to seek gold in California until the day he died in 1903 in Carson City, Nevada. The 79 volumes of his diaries document the social history of the West in the pivotal early years of settlement. He was especially candid about his private life. In the 1960s the diaries were transcribed and edited and published in an abridged format, incorporating about half of the content. The first phase of our project was to make the earlier publication, which is out of print, available and searchable online. Our long-term intent is to publish the ENTIRE contents of Doten’s diaries on the web, with enhancements. The website will contain everything that was part of the diaries, including photographs, theater playbills and other ephemera, and the many newspaper clippings that were pasted and tucked into his diaries.
We are depending on volunteers to complete the transcription of all of the diaries and some of the newspaper clippings. Of the 79 volumes, 55 have been completed, but we are looking for additional volunteers to finish the rest of them. The documents that are currently available on From the Page are the imperfect transcripts that were used for the earlier publication. They need to be digitized (retyped) and then those documents can be easily edited against the original handwritten diaries for the online edition. We have also used From the Page for the transcription of newspaper clippings that will be part of the project.
What are your goals for the project?
We hope to have the transcripts completed within a few months. They will provide the backbone for the web project. We are aiming for a scholarly edition that will be complete, accurate, and authoritative, following scholarly editorial standards. We anticipate that it will be an important primary resource for researchers, while at the same time it will be an educational resource for students and history enthusiasts and an accessible source of information for anyone looking for information on the time period and locations covered in the diaries.
How are you recruiting or finding volunteers/collaborators?
A few core participants have signed on to the larger project as they learned about it, and have stuck with it and accomplished a great deal during the past few years. But it is a big project, and everyone who is involved has other obligations and involvements, so it has become clear that we need more help. I would like to see it launched in my lifetime, and I’m not getting any younger. We have found some new collaborators recently on campus, and they have brought in some ideas to enrich the project with additional features, but we need some more “boots on the ground” to do the very important transcription work before we get to the frills. Crowdsourcing was something to try, to simplify the process of working with volunteers and to allow prospective volunteers to try out transcription and to take on as much as they could, “page by page.” The platform provides an infrastructure we didn’t want to have to create ourselves, allowing us to communicate with volunteers and to manage their work.
To find our “crowd,” we sent emails to everyone we knew who might be interested in volunteering, especially retirees, and asked them to spread the word. We also posted an appeal through our personal Facebook pages and shared it on our library Facebook page. We posted an announcement in a newsletter of a local historic preservation group, and we have plans to put another announcement in the “volunteer opportunities” section of a retired faculty newsletter. I will be speaking to a local history group in a few weeks. The response has been disappointing. Several people responded to the emails affirmatively, but have not followed through. But via email we found one volunteer who was interested in transcribing newspaper clippings, and he was persistent and productive, finishing 112 lengthy articles. He is currently taking a break for a few weeks to pursue a project of his own, but he promises to return. Another transcriber happened upon the project at From the Page and found it to her liking, but the timing is problematic for her.
It has proved to be a challenge to recruit volunteers, but we haven’t given up (yet). We came to realize that summer was not the best time to recruit volunteers, because so many people were busy with yards, outdoor projects, and traveling. Late fall and winter seem like a better time to try to entice people to spend more time with their computers.
Can you share your experience using FromThePage?
Despite our shortage of volunteers, I believe that FTP has the potential to further our project. It is convenient to be able to send a URL directly to the project, and it is a headquarters for the transcribing phase, with instructions and a means to ask questions and record comments. In the past we have sent entire volumes from the diaries to transcribers, but the smaller chunks we can make easily available on From the Page SHOULD allow us to attract more transcribers. And the company of some of the other projects on FTP helps give our project legitimacy and status.
We have learned from our first experiences that it was especially hard to attract transcribers for the original handwritten diary pages that we uploaded at the beginning. Nineteenth century handwriting is a challenge to read, and Doten’s handwriting is especially off-putting. It didn’t help that he used a pencil and sometimes the pages were smudgy. Our page images were high quality facsimiles, but it takes practice and experience to be able to decipher them. From the Page invites visitors to “start transcribing,” and we reconsidered the first impression we were providing. We replaced those handwritten pages with the typed transcripts, and even though we lost the “diary feel” in doing that, it seems to open up opportunities to a wider range of people, including students who may not be able to read any cursive writing.
How do FromThePage & crowdsourcing fit with more traditional documentary editing?
FTP offers a reviewing step, and we have utilized that feature, reviewing the transcriptions ourselves with the help of a student assistant. On almost every page we have caught at least one error. Our current approach, getting the earlier typed transcripts into digital format for editing, adds a more traditional step to the process, using trained and skilled editors to proofread and catch the errors from the original transcripts and any new errors that might have been introduced by volunteers.
What would you tell folks considering a similar project?
I would suggest considering the best season to recruit volunteers from your target pool, enticing them at the early stages with less challenging tasks, corresponding with them by email to offer thanks and encouragement, and designing your “works” to be a manageable size for an afternoon or evening of transcribing, to offer your volunteers a satisfying experience.
Anything else you’d like to tell us?
I can’t think of anything, but I’d be happy to answer follow-up questions. Thanks for this opportunity to spread the word!
An interview with James Perla, the Managing Director of the Citizen Justice Initiative at the University of Virginia’s Carter G. Woodson Institute for African American and African Studies, about their ongoing Julian Bond Papers Project and their recent Bond Transcribathon.
What are your goals for the project?
The Julian Bond Papers project seeks to create a documentary edition, both print and digital, of the Papers of Julian Bond, housed in the University of Virginia’s Special Collections Library.
Can you tell us about your documents?
The first stage of the project will focus primarily on Julian Bond’s speeches and public addresses; however, the full collection also contains other manuscripts: letters and correspondence, political posters, newsletter mailing campaign documents, and photographs created by Bond himself as well as members of the Student Nonviolent Coordinating Committee (SNCC).
How are you recruiting or finding volunteers/collaborators?
A large part of this project involves public engagement. We want to involve members of the general public in putting Bond’s papers online. The primary way in which the public has contributed to this project so far is through transcribe-a-thon events. Two years in a row, we hosted an event in which people transcribed materials from his collection in different locations around our university and the contiguous city of Charlottesville. The all-day event brought over 100 people each year and resulted in 1,000 or more transcribed pages.
Beyond this, it’s a major goal to involve students in all aspects of the process of creating a documentary edition: scanning, cataloging, transcribing, metadata entry, and curating project derivatives. To this end we have employed close to 10 students so far in the lifetime of the project. We believe this is critical because such exposure can show students the possibility for having a career in digital archiving, library studies, and/or public humanities.
Can you share your experience using FromThePage?
Using FromThePage has been helpful in involving members of the general public in the process. The platform has been intuitive and has allowed us to make progress with a critical aspect of the larger initiative. To date, we have over 3,300 pages transcribed. The main challenge of using the platform has been the inability for multiple people to edit a single document or for a user to see if another person is transcribing a document. Since we had multiple locations for our transcribe-a-thon event, we ended up duplicating documents for each location rather than attempting to coordinate which user was transcribing which document across the various locations.
How do FromThePage & crowdsourcing fit with more traditional documentary editing?
It’s a critical aspect of our mission to involve the general public and students in the broader process of creating a documentary edition. We see crowdsourcing and FromThePage as an important opportunity to both expose the public to the documents prior to the documentary editing as well as involve them in seeing some of the questions and issues that arise during a traditional documentary editing project. For example, during our most recent transcribe-a-thon, the “notes” function on FromThePage was extremely helpful as certain transcribers would track their progress and ask questions like: “this paragraph contains an insert that supersedes the existing text, so I did not transcribe the existing text.” These types of observations demonstrate the kinds of judgment calls that the editorial team will ultimately have to make.
What would you tell folks considering a similar project?
I would encourage people to focus attention on file management and tracking metadata. We use a shared spreadsheet to track item level information from the scanned documents so that we can ultimately link the From the Page transcriptions to a separate database. In order to do this, we must track the file names associated with items so that we can transfer data between the two sites.
Bonus link! Here’s a Julian Bond transcribathon video on Twitter:
The Library of Virginia houses over 124 million archival items, including state and local records, personal papers, microfilm, maps, photographs, and more. Many of these documents contain structured data, such as the World War I Questionnaires. These are four-page forms completed by returning soldiers or their surviving kin following the Great War. The 14,900 questionnaires were gathered from 1919 to 1921 by the Virginia War History Commission. Each covers the veteran’s personal information and war record in detail, with some open-ended questions about how the war affected their physical, mental, and spiritual states. Photographs were also requested, though not always included with the questionnaire. The questionnaires had already been digitized and indexed for a database, but we wanted to fully transcribe them to honor the 100-year anniversary of WWI. Full information about the collection can be found here.
What are your goals for the project?
The Library’s crowdsourced transcription site Making History: Transcribe has been running for four and a half years with nearly 70,000 pages transcribed. We’ve had great success transcribing manuscript materials, but have run into real difficulty when we come across forms. We don’t want to ask our volunteers to retype the form text each time, but capturing only the answers is equally troublesome since it doesn’t fully capture the meaning or text. The structured transcription developments on From the Page offered a chance to transcribe forms without losing the field text or asking our users to duplicate it for each page. Our goal is to capture all the data from the WWI Questionnaires, as well as other field-based documents, in order to increase access and searchability. Crowdsourcing has been immensely popular with our users, and we want to build on this type of interaction with ongoing projects and a variety of ways to contribute.
How are you recruiting or finding volunteers/collaborators?
We have a pretty steady stream of new and returning volunteers from the Making History: Transcribe project. HandsOn Greater Richmond provides volunteer opportunities throughout the metro area, and has been instrumental in connecting with volunteers. We host two transcribe-a-thons per month through their Volunteer Leader Program, bringing in twenty volunteers at a time to our computer classroom. We start each event by providing context for our crowdsourcing projects and demoing the sites, Transcribe and FromThePage. These events provide both training for those who wish to transcribe independently and community for those who volunteer with us for months or even years. Volunteer hours are awarded for school, organizational, and community service needs. Many of our volunteers are entirely remote and may never visit the Library of Virginia in Richmond, but they use our online resources, understand the added value of full-text, and are deeply engaged with the stories these documents tell. Social media helps spread the word about our crowdsourcing events and projects. We also conduct outreach to genealogical societies, high schools, lifelong learners and other groups interested in our crowdsourcing projects.
Can you share your experience using FromThePage?
Our experience using From the Page has been very positive. We chose an absolute beast of a form to attempt as our first structured data transcription set: 118 lines, some with 3 or 4 fields per line, with a total of over 200 fields. Ben and Sara worked with our team at the Library to upload all the images, make the form properly, and even incorporate new features we needed.
Our users have responded really well to the project. Many younger users, who may have less experience reading old handwriting, gravitate toward the more modern WWI Questionnaires. They easily understand the user interface of FromThePage. The only notable frustration occurs when the original veteran or family member did not fill out the questionnaire in a way that corresponds to the fields, but that’s human behavior!
You contributed code to FromThePage to make this project happen. Can you tell us a bit how that worked?
We spent a bit of time evaluating FromThePage’s field-based transcriptions and had a meeting to talk about what things we liked, and what things we thought could be improved to better meet our needs. Thanks to FromThePage being open source software, our Web Developer, Austin, was able to propose fixing some of the niggles we had.
Although he had a few years of programming experience and used other MVC frameworks, Austin realized he was going to have to spend some time learning Ruby on Rails before being able to start working on new features. Ruby on Rails Tutorial by Michael Hartl was a great way to get up and running. Soon after, Austin was able to start writing code for our most-needed features.
The documents we were transcribing were at least four pages, with 219 fields. At the time, FromThePage displayed all transcription fields for a document, no matter what page you were on. This created a problem: if you were on page two of a document, you’d have to scroll down to get to the transcription fields that were relevant to page two. Because the people transcribing documents are volunteers, it’s important for this experience to be as frictionless as possible. They’re devoting their time to us, and we don’t want to waste it or worse, frustrate them into giving up.
The solution we wrote involved adding a new section for each of the fields in the back-end system called Page Number that allows the creator to input the document page they want that particular field to be displayed on. Then on the transcribing page, if a user is on page 2 of the document they’ll only see fields that have a ‘2’ in the Page Number section in the back-end. No more having to scroll past transcription fields that aren’t relevant to the current document page!
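The filtering idea described above can be sketched in a few lines of Python. This is only an illustration, not the Ruby on Rails code that was merged into FromThePage: the `TranscriptionField` class and `fields_for_page` function are hypothetical names, and showing fields with no page number on every page is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranscriptionField:
    """One field on a transcription form (hypothetical model for illustration)."""
    label: str
    page_number: Optional[int] = None  # None means no page was assigned by the creator


def fields_for_page(fields: List[TranscriptionField], current_page: int) -> List[TranscriptionField]:
    """Return only the fields that belong on the page the transcriber is viewing.

    Assumption of this sketch: fields without a page number are shown on every
    page, so forms configured without page numbers still display all fields.
    """
    return [
        field for field in fields
        if field.page_number is None or field.page_number == current_page
    ]


if __name__ == "__main__":
    form = [
        TranscriptionField("Name in full", page_number=1),
        TranscriptionField("Rank", page_number=2),
        TranscriptionField("Additional remarks"),  # unassigned
    ]
    for field in fields_for_page(form, current_page=2):
        print(field.label)
```

Keeping the page assignment as data on each field, rather than hard-coding it in the page template, is what lets a project creator rearrange a long form without touching code.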
At this point we weren’t sure whether we were going to use our own local fork of FromThePage on one of our pages, or try to merge this new Page Number feature into FromThePage.com. We submitted a pull request on GitHub to see if it would merge without any errors, and after a few attempts it finally passed all the automated tests. Austin then had a meeting with Ben and Sara to talk about the new feature, and why we needed it. Soon after, the new feature was up on FromThePage.com!
We continued adding new features and submitting pull requests, and Ben and Sara were very helpful in evaluating our code and giving feedback. So far we’ve added the Page Number column to transcription fields and filtering of transcription fields by current document page, added styling for field labels on the transcribe page to help with readability, added a new “label” type of transcription field that has no actual input box associated with it, added a new “instruction” type of transcription field that is shown to transcribers to give them extra instructions, fixed a bug with capitalization of transcription field labels, and started initial work on adding unique IDs for exports, which Ben has since taken over.
Although some of our features were specific to our use-case, we hope most of them can also be useful to other organizations working with field-based transcription projects. It was great that Ben and Sara were able to take the time to understand why we were creating the features we were, and that they were so accommodating in merging our code in with FromThePage.com!
Anything else you’d like to tell us?
We look forward to continuing to work with users and new technology, such as FromThePage, to improve our collections and generate new interactions. Thank you for working with us on all the quirks of these WWI questionnaires!
Sonya Coleman, Digital Engagement + Social Media
Austin Carr, Web Developer
Liz Coelho is the Executive Associate for Projects at the Maryland State Archives. She kindly agreed to be interviewed by Sara Brumfield about the Archives’ use of FromThePage.
Please tell us about your project
The Maryland State Archives currently has one fielded-form project posted on FromThePage, a collection of marriage certificates from 1978. No index exists for these records–over 50,000 certificates–which makes finding any given record an act of manual labor as the paper copy must be searched for by hand. Add to that the difficulty of searching when the patron doesn’t remember the date of their marriage but desperately needs the certificate to apply for social security benefits, a passport, or a driver’s license renewal. Having a digitized, searchable index of these records will make an enormous difference in getting this important document into the hands of the people who need it to conduct the daily business of life.
How does your project differ from the manuscript-based projects on FromThePage?
The fielded-form transcription format on FromThePage has made this project possible. The Council of State Archivists organization worked with you and Ben on developing a fielded form in addition to the manuscript pane interface that was already in use. This allowed volunteers to quickly transcribe data that had been typed onto a standard form. Marriage certificates contain the same information on every record: the names of the parties, the date of the marriage, the county where the marriage took place and, most importantly, the certificate number. Once downloaded from FromThePage, the information entered into each field will comprise the data sets in a searchable database.
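To make the last step concrete, here is a minimal Python sketch of loading exported field data into a searchable database. It is an illustration under stated assumptions, not the Archives' actual workflow: the column names, the sample rows, and the `build_index` and `find_by_name` helpers are all invented for this example, and an in-memory SQLite database stands in for whatever system the Archives uses.

```python
import sqlite3

# Hypothetical exported rows: column names and values are invented for illustration.
ROWS = [
    {"certificate_number": "1978-00001", "party_one": "Jane Doe",
     "party_two": "John Roe", "marriage_date": "1978-03-04", "county": "Anne Arundel"},
    {"certificate_number": "1978-00002", "party_one": "Mary Major",
     "party_two": "Richard Miles", "marriage_date": "1978-06-17", "county": "Howard"},
]


def build_index(rows):
    """Load transcribed fields into an in-memory SQLite table, one row per certificate."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE certificates ("
        "certificate_number TEXT PRIMARY KEY, party_one TEXT, party_two TEXT, "
        "marriage_date TEXT, county TEXT)"
    )
    conn.executemany(
        "INSERT INTO certificates VALUES "
        "(:certificate_number, :party_one, :party_two, :marriage_date, :county)",
        rows,
    )
    return conn


def find_by_name(conn, name):
    """Return certificate numbers where either party's name contains `name`."""
    pattern = f"%{name}%"
    cursor = conn.execute(
        "SELECT certificate_number FROM certificates "
        "WHERE party_one LIKE ? OR party_two LIKE ? ORDER BY certificate_number",
        (pattern, pattern),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    index = build_index(ROWS)
    print(find_by_name(index, "Doe"))
```

A name search like this is what replaces the hand search through paper copies when a patron does not remember the date of the marriage.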
What are the advantages of the fielded form?
By adding a field-based option, FromThePage offered project owners a way to have a very large number of standardized records transcribed in a very cost-effective way. The Archives doesn’t have the staff to assign to a data-entry project of this magnitude, and I know from working as a freelance indexer that the fee-based cost of this project would have been about 50 cents per certificate if we’d had to contract the work out. There simply isn’t the funding for that kind of expense. The fielded form, along with the desire of the volunteer transcribers to get involved and help create something useful for the benefit of many, has made this project possible.
Who are your volunteers?
The volunteers I’ve talked with all seem to share a very professional, results-oriented attitude toward the project. They like the discipline of accurately transcribing a record and then moving forward to the next. Individual volunteers working on our project have transcribed hundreds, and sometimes thousands, of records. I’ve transcribed certificates, and I can definitely say that there’s a lot of satisfaction in watching the green percentage-completed line advance to 100%. When we first started the project we reached out to local genealogical and historical societies. Marriage certificates are not only important legal documents, they’re also rich sources for family research. They contain the bride and groom’s place of residence and place of birth, their prior marital status, where they were married, their age at the time of marriage. So family researchers have a stake in making these records more accessible and are a big part of our volunteer group. Archives staff members have also participated in the review process, the final step before a file is downloaded into our database and considered completed.
How does crowdsourcing fit within a state archives?
Crowdsourcing is actually a great fit for a state archives. Private individuals who use Archives’ resources often become subject-matter experts and actively look for ways to share their knowledge. On any given day in our public Search Room, one knowledgeable patron will lean over to offer advice and perspective to another patron. They come in on their own initiative and transcribe a colonial court ledger that interests them. They visit old graveyards and compile the information found on crumbling tombstones, and then donate their research to the Archives so others can benefit. They’re often as dedicated to preserving the historical record as professional archivists. So crowdsourcing is just an efficient way of reaching out to all these interested people and offering them a specific project to consider working on.
On March 8, 2018, I presented this talk at a conference at Michigan State called “Enslaved: People of the Historic Slave Trade”. The conference was live-streamed, so a video recording of all presentations is available on the MATRIX YouTube channel. My presentation starts at 1:04:55 and is embedded below, followed by slides and text.
It’s important to begin by explaining that this project is the result of two collaborations: one between myself and the primary Stagville Accounts researcher, Anna Agbe-Davies, who is an anthropologist and historical archaeologist at UNC Chapel Hill. The other collaboration is a broader, 2018 Mellon-NHPRC funded initiative to explore publication of historic financial records as digital editions in linked data formats. (This slide only lists members of the DEPCHA team involved directly in encoding Stagville accounts; the cooperative includes many other scholars and technologists.) I regret that my colleagues were not able to join me today, and–since I am a software engineer–that my methodology may be over-technical and my history may be naive.
Historic Stagville is a state-run historic site north of Durham, North Carolina, which highlights life on a tobacco plantation, focusing on the perspective of the enslaved community there. The site was owned by the Bennehan and Cameron families from the late 18th century for the next hundred years. Their property holdings on the eve of the Civil War included thirty thousand acres of land and nearly nine hundred enslaved human beings. You see here a rather unusual dwelling for enslaved people, one of four two-story, four-room timber-framed buildings still standing on the site.
The Cameron Family Papers are held at the Southern Historical Collection at UNC, and include substantial business records. Some of these record transactions between the plantation store and members of the community. Of particular interest to this conference is the “slave ledger”, a separate account book recording transactions between the Cameron store and enslaved customers.
When examined, the ledger provides rich details of a portion of the economic life of the customers recorded in it. For example, here is the 1810 account of “Walker’s Davie”, who buys half a dozen awl blades, then sugar and a shoe knife, then a pair of leather soles. He pays his account by half-soling one pair of shoes, then by making one pair of shoes for someone named Sam.
The enslaved account-holders are described in various ways: by patronym, ownership, occupation, nickname, and surname. Most–but not all–account-holders are male.
How do enslaved customers pay their accounts? For this we have to turn to quantitative methods: most commonly they pay by providing wood, cash or labor–often with other account-holders as intermediaries–but we also see other interesting things like manufactured goods–shoes and blacking–or perhaps capital in the form of a share in a canoe.
The Stagville slave ledger does not exist in isolation. The project has also encoded portions of the “white ledger”, daybooks recording the Cameron store’s transactions with white customers. Often these transactions take place through an agent, often a family member but sometimes a slave. Here we see James Haley buying a shoe knife, which Cameron sells to him at one shilling sixpence; earlier we saw Cameron selling a shoe knife to Walker’s Davie at two shillings. Was the third-again higher price due to the two-year difference between the transactions, the quality of the knife, or the status of the customer?
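As a quick check on the "third-again higher" figure, here is a small sketch of the arithmetic. It assumes the usual pre-decimal reckoning of 12 pence to the shilling; the function and variable names are mine, not anything from the ledger encoding.

```python
PENCE_PER_SHILLING = 12  # assumption: standard pre-decimal reckoning


def to_pence(shillings: int, pence: int = 0) -> int:
    """Convert a shillings-and-pence amount to pence."""
    return shillings * PENCE_PER_SHILLING + pence


haley_price = to_pence(1, 6)  # one shilling sixpence
davie_price = to_pence(2)     # two shillings

# Relative difference, measured against the lower (Haley) price.
markup = (davie_price - haley_price) / haley_price
print(f"{haley_price}d vs {davie_price}d: {markup:.0%} higher")
```

Sixpence on top of eighteen pence is exactly one third more, which is the comparison the question above turns on.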
Agbe-Davies encoded the ledger with the open-source transcription tool FromThePage, which is run by Brumfield Labs. She sponsored development of tabular encoding within that tool, allowing account records to be displayed and extracted to spreadsheets for analysis. However, the edition still existed in isolation, so that linking records from related sources (like the Washington Financial Papers) was not possible.
To create a shared, analytical database, members of the DEPCHA cooperative turned to GAMS, the Geisteswissenschaftliches Asset Management System developed at the University of Graz. This tool takes texts transcribed in TEI with a special set of analytical tags and adds them to a Fedora-based RDF database. GAMS specializes in historic financial accounts, and was originally developed for medieval and early modern European records.
The special tags are defined in the Bookkeeping Ontology, and can be applied to any TEI element. They encode most of the concepts encountered in financial records, especially the single-entry accounts we worked with during the DEPCHA period.
FromThePage is able to export transcripts in TEI, so as part of the DEPCHA cooperative, support for the Bookkeeping Ontology was added: you see these little hashtag-looking things in the ana attribute of several elements.
Once ingested into GAMS, the transactions are viewable outside the context of the manuscript page, so that entries showing both powder and shot can be viewed together, even though they appeared on the separate accounts of William Pettigrew and James Haley.
Because GAMS supports accounts from multiple sources, researchers can compare transactions involving gunpowder in Stagville accounts against those in the Laban Morey Wheaton daybooks or the Washington Financial Papers.
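For readers who haven't seen the slide, here is a rough sketch of what pulling those ana values back out of a TEI export might look like. The fragment is invented for illustration: the element choices and the specific values (#bk_entry, #bk_commodity, #bk_money) are my assumptions about the general shape, not a copy of the actual FromThePage export or the Bookkeeping Ontology vocabulary.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: tag names and ana values are illustrative only.
SAMPLE = f"""
<table xmlns="{TEI_NS}">
  <row ana="#bk_entry">
    <cell>1 <rs ana="#bk_commodity">shoe knife</rs></cell>
    <cell><measure ana="#bk_money" unit="shilling" quantity="2">2/</measure></cell>
  </row>
</table>
"""


def tagged_elements(xml_text):
    """Return (local tag name, ana value, text) for every element with an ana attribute."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for el in root.iter():
        ana = el.get("ana")
        if ana is not None:
            local_name = el.tag.split("}", 1)[-1]  # drop the namespace prefix
            text = " ".join("".join(el.itertext()).split())  # normalize whitespace
            found.append((local_name, ana, text))
    return found


for name, ana, text in tagged_elements(SAMPLE):
    print(f"<{name}> {ana}: {text}")
```

The point is only that the analytical layer rides along on ordinary TEI elements, so any XML tooling can separate "what the page says" from "what kind of bookkeeping fact this is".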
Visualization tools allow analysis of other commodities bought in the same transactions as gunpowder. (Unsurprisingly shot and gunflints show up, but so do pins and sugar.)
Our hope is that–once the encoding and conversion of the ledgers is completed–the project will shine new light on the lives of the enslaved community and their emancipated descendants. The DEPCHA team is interested in collaborating with other scholars editing historic financial records.
This is a response to the recently published “A Research Agenda for Historical and Multilingual Optical Character Recognition” by David A. Smith and Ryan Cordell, with the support of The Andrew W. Mellon Foundation. The report analyzes current challenges faced by humanities researchers using OCR text and outlines important avenues for research to improve OCR quality. In many places, the report calls for transcription systems and crowdsourcing experts. Since we have run crowdsourced transcription systems for more than a decade, we have a lot of ideas on how transcription systems could be used as part of the solution. Our ideas are specific to FromThePage, our platform, but many could apply to other tools. We thought the best way to be part of the conversation was to share our ideas publicly. If you’re working on solving the problems outlined in the report, we’d be very interested in collaborating with you.
The report has nine recommendations; our thoughts will be organized accordingly after introducing FromThePage.
Before we dive into potential implementations, an introduction to FromThePage is in order. FromThePage is a collaborative transcription system used by libraries, archives, museums and academic researchers. It is open source software (available under the Affero GPL 3.0 license) and is deployed at the University of Texas, Fordham University, Northwestern University, and Carleton University among others. We also offer FromThePage as a software-as-a-service. For $1000 to $5000 per year, we host projects for organizations and individuals on shared servers that we run and maintain; this shared infrastructure approach means that projects can be up and running in hours without technical expertise.
While FromThePage is most popular as a crowdsourced transcription platform, it is also in use as an OCR correction platform. Our current OCR ingestion integrations are from the Internet Archive (e.g. the LatAm project) or ContentDM (e.g. Indianapolis Public Library). Our multilingual support means we host projects in a wide variety of languages: Old French, Spanish, Malay (Jawi), Arabic, Nahuatl, Mixtec, Urdu, and Dutch. That support includes right-to-left script support (top-to-bottom is coming soon). We also support collaborative translation of transcribed texts.
FromThePage has been in ongoing development for thirteen years; our first deployment was for San Diego Museum of Natural History in 2010. FromThePage.com, the software-as-a-service solution, has been in use since 2011. Our approach to sustainability is a combination of SaaS subscriptions and “sponsored development” — we collaborate with institutions and individuals to build needed features. Recent examples include right-to-left script support sponsored by the British Library and field-based transcription and ContentDM integration sponsored by the Council of State Archivists. While both the SaaS subscriptions and feature development are often funded by grants, the shared cost approach means that no one grant program, institution, or granting organization carries a large burden for software development. All new features are available to all users of FromThePage.
OCR Improvement Ideas
Recommendation #1: Improve Statistical analysis of OCR output
“Inferred” OCR quality statistics can be approached through entirely computational models, but we suspect that a human identifying and correcting a small portion of an OCR’d text (5-10 exemplar pages or 1000 lines) could provide better input into OCR quality. A scholar considering 3 or 4 different corpora might be motivated to transcribe those exemplar pages for each set of materials; a comparison of before-and-after versions of the text could lead to an informed statistic of how good or bad the OCR for each was. Since the scholar’s corrections provide the gold standard, quality statistics can be calculated for texts with non-standard orthography like early modern printed work or multi-lingual texts. FromThePage currently keeps page transcription versions in our database and presents a “diff” view to end users, but would need to be modified to count corrections and calculate error rates in the uncorrected OCR.
To take this idea even further, if a researcher spent the time to correct some number of pages of a low quality OCR text, those corrections could be used to retrain an OCR engine as each page is completed. The retrained model could be applied in batch to a similar corpus. Better yet, the retrained model could be applied to all the subsequent pages in the corpus. The result would be a virtuous cycle of OCR text that continually improves as it is corrected, needing fewer corrections the further the editor works through the text. This emergent model of OCR correction retraining and application could be integrated into a standalone service accepting contributions from many editing platforms or could be integrated directly into an editing platform like FromThePage. The people correcting the OCR in this model would be highly motivated by the immediate improvement in the text they are correcting and by the ability to improve the OCR text of their particular corpus.
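To make the before-and-after comparison concrete, here is a minimal sketch of one common statistic, the character error rate: the edit distance between the raw OCR and the corrected text, divided by the length of the corrected text. This is not code from FromThePage; the function names and the sample strings are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Errors in the OCR per character of the human-corrected (gold standard) text."""
    if not corrected_text:
        raise ValueError("corrected text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


def corpus_error_rate(page_pairs) -> float:
    """Pooled error rate over (ocr_text, corrected_text) pairs for the exemplar pages."""
    errors = sum(edit_distance(ocr, gold) for ocr, gold in page_pairs)
    length = sum(len(gold) for _, gold in page_pairs)
    return errors / length


# Hypothetical exemplar "pages" (made-up strings, not real project data).
pages = [("Tbe quick hrown fox", "The quick brown fox"),
         ("jumps ovcr the lazy dog", "jumps over the lazy dog")]
print(f"estimated character error rate: {corpus_error_rate(pages):.1%}")
```

Because the gold standard is whatever the scholar typed, nothing here depends on a dictionary or a language model, which is why the approach should carry over to non-standard orthography.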
Recommendation #2: Formulate standards for annotation and evaluation of document layout
We don’t see that FromThePage is a good match for this recommendation, but we would encourage others to look into the work being done by the IIIF Newspapers Community Group for open standards to support this task.
Recommendation #3: Exploit existing digital editions for training and test data
We think that there are a lot of possibilities here. We already have digital edition projects that start with OCR correction, like the James Malcolm Rymer Collection. We also have crowdsourced transcription of typewritten text in projects like the Papers of Julian Bond. These transcription/edition projects–and the more straightforward OCR correction crowdsourcing projects–currently produce edition text as outputs, but the act of correction (and creation of human-edited text) can be leveraged to produce training data for OCR engines, or even trained models as a separate export.
Recommendation #4: Develop a reusable contribution system for OCR ground truth
This recommendation is where FromThePage is already in use, and can easily continue to play this role. In 2018, the British Library identified a need for an Arabic manuscript ground truth dataset to improve HTR models. They used FromThePage to crowdsource the transcription of 85 pages of Arabic Scientific Manuscripts.
Recommendation #5: Develop model adaptation and search for comparable training sets
Recommendation #6: Train and test OCR on linguistically diverse texts
(no comment beyond FromThePage’s aforementioned support for a wide variety of languages.)
Recommendation #7: Convene OCR Institutes in critical research areas
Both crowdsourcing for the creation of ground truth data sets and OCR correction take motivated groups of people, be they scholars, students, or the public. Building that community and using them across similar projects that they are motivated to work on is not a trivial undertaking, but would be a key task of domain-centric OCR institutes. We often refer to the medievalists as “early adopters” in the digital humanities, but we saw this sort of community build as the Parker Library at Corpus Christi College, Cambridge put their Anglo-Saxon manuscripts online for transcription and reached out in a variety of ways to their scholarly community.
Hosting all the projects on a central platform like FromThePage that facilitates different types of projects — public and private, transcription or correction — and makes it easy to find a project of interest and share it would make sense. It would also give the community manager (a required role for any institute) a centralized way to coordinate and communicate with volunteers/contributors.
Recommendation #8: Create an OCR assessment toolkit for cultural heritage institutions
Recommendation #9: Establish an “OCR Service Bureau”
Using FromThePage for OCR correction and ground truth dataset creation could save taxpayer money in the creation of such a service bureau.
Although it isn’t mentioned in the OCR report, we believe IIIF is the right solution to moving page images and text through data pipelines.
We have thought quite a bit about how you attach text (in many formats: plaintext, HTML, TEI, optimized for search or analytics) to page images in IIIF. The FromThePage API provides one example of how you can do this; our approach is shared by Jeffrey Witt’s Sentences Commentary Text Archive. We would encourage any system implementers for OCR improvement to use IIIF to transport page images and text together. The IIIF Presentation API allows you to target specific regions of an image (say a line) with a specific body (say the transcription of it) in a web annotation, and also to link full page images to external resources like ALTO files.
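As a rough illustration of the line-level case, here is a sketch that builds a web annotation whose target is a rectangular region of a canvas (using an xywh fragment) and whose body is the transcription of that line. The URLs, the helper name and the sample text are hypothetical, and this is not the exact structure the FromThePage API emits.

```python
import json


def line_annotation(annotation_id, canvas_id, x, y, w, h, text):
    """Build a IIIF Presentation 3 style annotation attaching text to a region of a canvas."""
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "supplementing",  # text transcribed from the image
        "body": {
            "type": "TextualBody",
            "value": text,
            "format": "text/plain",
        },
        # The xywh media fragment selects the rectangle containing the line.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


# Hypothetical identifiers and coordinates, for illustration only.
annotation = line_annotation(
    "https://example.org/iiif/book1/annotation/p1-l3",
    "https://example.org/iiif/book1/canvas/p1",
    120, 340, 900, 48,
    "To 1 shoe knife",
)
print(json.dumps(annotation, indent=2))
```

The same shape works whether the body is plaintext, HTML or a pointer to a TEI fragment; only the body's format (or a reference to an external resource) changes.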
A discussion with a scholar recently about working with OCR’d text of early international law broadsheets from the Bibliothèque nationale de France made us realize that exposing the original OCR format (i.e. ALTO) as a seeAlso link in the IIIF manifest would give scholars the ability to go back to the original OCR bounding boxes and perform their own transformations specific to their unique text and projects. In other words, until we “solve” the problems with OCR of historical texts, exposing raw OCR files is one way to increase the number of approaches to processing and improving OCR output.
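Here is a sketch of what that could look like from both sides: a canvas that advertises its ALTO file through seeAlso, and a small helper a scholar might use to collect those links from a manifest. Attaching seeAlso at the canvas level, the profile URI, and all of the URLs are my assumptions for the example; a real manifest may organize this differently.

```python
def canvas_with_alto(canvas_id, alto_url, width, height):
    """A minimal IIIF Presentation 3 style canvas that links to its raw ALTO OCR."""
    return {
        "id": canvas_id,
        "type": "Canvas",
        "width": width,
        "height": height,
        "seeAlso": [{
            "id": alto_url,
            "type": "Dataset",
            "format": "application/xml",
            "profile": "http://www.loc.gov/standards/alto/",  # assumed profile URI
            "label": {"en": ["ALTO XML (original OCR)"]},
        }],
    }


def alto_links(manifest):
    """Collect the ALTO URLs advertised by every canvas in a manifest."""
    links = []
    for canvas in manifest.get("items", []):
        for ref in canvas.get("seeAlso", []):
            if "alto" in ref.get("profile", "").lower():
                links.append(ref["id"])
    return links


# Hypothetical two-page manifest.
manifest = {
    "id": "https://example.org/iiif/broadsheet/manifest",
    "type": "Manifest",
    "items": [
        canvas_with_alto(f"https://example.org/iiif/broadsheet/canvas/p{n}",
                         f"https://example.org/ocr/broadsheet/p{n}.alto.xml",
                         2000, 3000)
        for n in (1, 2)
    ],
}
print(alto_links(manifest))
```

With the links exposed this way, a researcher can fetch the bounding boxes and run their own transformations without needing special access to the hosting system.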
We’d suggest looking at the following resources for thinking about IIIF in the context of OCR improvement and text processing pipelines:
Jason Ronallo and his team at North Carolina State University Libraries built Ocracoke, an OCR pipeline built on IIIF.
“Round Trip to Paradise” — presentation on how John Howard at University College Dublin is using FromThePage’s IIIF API to “roundtrip” from Fedora into FromThePage and back.
I’ve just returned from a Prosopography Hackathon at the University of Vienna, a three day long digital humanities event to “hack” databases of people and biography. After a short brainstorming session, I volunteered for “information extraction” (getting information out of texts), but my three-person team had dissolved by the afternoon of the first day. I feared I’d have to spend the rest of the hackathon helping with API specifications. I was rescued by Maxim Romanov, who commented at dinner that night “Do you know why no one wanted to do information extraction?” “No, tell me!” “Because the techniques you referred to work great on English, but look at all the languages represented by the prosopographies at this hackathon — Chinese, Arabic, Greek, Syriac, Georgian” (and usually old or ancient versions of those languages). There’s nothing more fun than a well-formulated problem, so I mulled over my ideas — applying machine learning techniques — with Maxim’s observation in mind. When we reconvened the next morning, I pitched my new, improved idea: could we build machine learning models for some of these esoteric languages?
Six people thought this sounded interesting (vindicated!), so I had a group; we spent an hour researching ideas (spaCy? DataTurks? Vector Spaces?) before settling on building named entity recognition using spaCy, first for Ancient Greek and then for Classical Arabic.
Working with convolutional neural networks (what’s under the hood of tools like spaCy and TensorFlow) requires twisting your brain in some new ways. My team at the hackathon thought “entity first”; we pulled a list of entities from a document and used it on our first try at building a model. Here’s the deal: machine learning is not about string matching! It’s about statistics: what is the likelihood that *this word* at *this spot* in *this sentence* is an entity? You can only do this with context. Our second try was with context. We were working with an Ancient Greek text provided by Rainer Simon, and the question then was “what context”? The text didn’t have punctuation we could use to separate sentences (which is what all the spaCy examples were based on), so we settled on using the “line” in the original text.
The training data ended up looking like this:
That is, a full line of the text, followed by a list of the tagged entities in the text and their type. The entities are identified by their character ranges in the text of the line (i.e. character 37-43 is Λίνδον) and the type of entity this is (LOC); in this case the type is location but it could also be person or some other entity types. (Our training data was from Recogito, a part of Pelagios, so it is mostly locations.)
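For anyone who wants to build data in this shape, here is a small sketch that produces a (text, annotations) pair with character offsets, in the style the spaCy training examples used at the time. The Greek line below is one I made up for illustration; it is not a line from the hackathon data set, and the helper name is hypothetical.

```python
def make_example(line, entities):
    """Pair a line of text with (start, end, label) character spans for its entities.

    `entities` is a list of (surface string, label) tuples; each surface string
    must occur in the line. The end offset is exclusive, so line[start:end]
    gives back the entity text.
    """
    spans = []
    for surface, label in entities:
        start = line.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} does not occur in the line")
        spans.append((start, start + len(surface), label))
    return (line, {"entities": spans})


# Invented sample line, not taken from the real training data.
line = "ἦλθον εἰς Λίνδον"
example = make_example(line, [("Λίνδον", "LOC")])

text, annotations = example
for start, end, label in annotations["entities"]:
    print(start, end, label, text[start:end])
```

Computing the offsets from the text, rather than typing them by hand, avoids off-by-one errors that would silently teach the model the wrong span.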
The training script we put together (in a nice handy Jupyter Notebook) was mostly cobbled together from the spaCy documentation, with enhancements as we thought through things; most of the actual coding was by Miguel Vieira from KCL. Here’s a basic outline of what it does:
loads the data
randomizes and splits our training data 90%/10% — the larger set for training and the smaller for testing.
creates a blank model to start with, with a default language. (In our case “el” for Greek — it’s modern Greek, but seemed to work. There’s also a “xx” language for no starting language. We’re not sure this was important — it would be interesting to test the ancient Greek model with “xx” to see if the results differed.)
labels the data in the way spaCy wants (I think if we had used the command-line spacy train command, our JSON data might have “just worked.”)
trains. The comment on the code here is “Loop over the training data and call nlp.update, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations, to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.”
tests. Take the 10% of the data we held back and see how well the model predicts entities; compare the prediction with the actual answer. There are two ways to measure the accuracy of the model: Did it successfully find an entity we knew about? Did it identify a part of a line as an entity when it shouldn’t have (false positives)? Our model successfully identified about 60% of the entities we had tagged, which is not bad! We calculate “precision” (the number of correct results divided by the sum of correct results and false positives): 69%, and “recall” (the number of correct results divided by the number of expected results): 75%. (More on precision and recall: https://en.wikipedia.org/wiki/Precision_and_recall)
Our next task was to apply the same strategy to classical Arabic texts. Our data set–place names in a classical Arabic biographical dictionary–was provided by Maxim Romanov. In this case, like the previous, we started with a list of entities that had already been pulled out of the text. We had to put them back in and decide what context to include. To make things easy (and fast) we decided to grab the context from around each entity (entities were placenames in the biographical entries): 5 words before and 5 words after. After training our model we realized this was a bad idea: our test was 100% accurate. Great, right? Well, remember what I said about machine learning being based on statistical probability? Our model was trained on a set that taught it that the 6th word of every snippet was an entity, then tested on text where the 6th word of every snippet was a location. Of course it got them all correct!
We ran out of time to correct the problem, but our next try would have included the entire dictionary entry as the context. Most entries were 5-10 lines long, but some were pages; would it make sense to keep or throw out the long ones? We don’t know. That’s just one example of the different possible “levers” that could be adjusted for a better result. Other levers include the amount of training data (the Greek text was only 600 lines long) and the number of iterations the model ran; we experimented with the latter on the Greek model and got the same results from 10 iterations as we got from 50. Another is batching vs. not batching. On the first day, Mattias Schloegl mentioned “active learning” with spaCy, that is, training on low-certainty input: if we could take the results that our model got right in testing but wasn’t particularly confident of, and add them to the training set, then our results might improve. We didn’t spend any time trying to understand the variables here; a deeper understanding might lead us to better settings.
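The split and the scoring steps in the outline above can be sketched without any machine learning library at all. This is not the notebook's code; the function names are mine, and the spans in the usage example are made up to show the arithmetic.

```python
import random


def split_data(examples, test_fraction=0.1, seed=0):
    """Shuffle the examples and hold back a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def precision_recall(predicted, expected):
    """Score predicted entity spans against the hand-tagged spans.

    precision = correct / (correct + false positives)
    recall    = correct / number of expected entities
    """
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


train, test = split_data(range(600))  # 600 stands in for the lines of text
print(len(train), len(test))

# Hypothetical spans for one held-back line: (start, end, label).
expected = {(10, 16, "LOC"), (20, 25, "LOC"), (30, 34, "LOC"), (40, 44, "LOC")}
predicted = {(0, 5, "LOC"), (10, 16, "LOC"), (20, 25, "LOC")}
precision, recall = precision_recall(predicted, expected)
print(f"precision {precision:.0%}, recall {recall:.0%}")
```

Note that a span only counts as correct here if its start, end and label all match exactly; a more forgiving scorer would give partial credit for overlapping spans.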
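The problem is easy to see once you build the snippets. Here is a toy sketch (placeholder tokens, not the Arabic data) that cuts a window of 5 words on either side of each entity and reports where the entity lands inside its window.

```python
def context_window(tokens, entity_index, width=5):
    """Return the tokens around an entity and the entity's position inside that window."""
    start = max(0, entity_index - width)
    window = tokens[start:entity_index + width + 1]
    return window, entity_index - start


# Placeholder "text": 200 dummy words with place names dropped in at arbitrary spots.
tokens = [f"w{i}" for i in range(200)]
entity_indexes = [17, 60, 133]
for i in entity_indexes:
    tokens[i] = f"PLACE{i}"

positions = [context_window(tokens, i)[1] for i in entity_indexes]
print(positions)  # prints [5, 5, 5]
```

Every snippet puts the entity at the same offset, so a model can score perfectly by learning the position alone and nothing about the words. Using the whole dictionary entry as context, as we planned for the next try, removes that shortcut.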
But the key take-away for the digital humanities is that we did it — we trained a named entity recognition model from Ancient Greek texts in less than 2 days. And given another day, we would have had one for classical Arabic. Because spaCy is about the statistics, not the text, it had no problem with ancient Greek or classical Arabic — and it probably wouldn’t with any other UTF-8 language.
The hackathon team consisted of:
So you have documents you want to read or transcribe–but they contain shorthand! What to do?
As coursework in shorthand has dropped off the curricula of high schools and secretarial schools have withered or transformed into business programs with an emphasis on word processing skills, knowledge of shorthand is not nearly as common as it once was. With fewer people skilled in shorthand, it may be difficult to find someone to “translate” any of the mysterious, squiggly lines you come across in 19th and 20th century documents. Rather than just write shorthand manuscripts and marginalia off as indecipherable, why not approach them as one would any other unfamiliar hand–and teach yourself! Let’s review some of your options.
Shorthand systems are based on either a stenographic approach, which uses simplified letter forms, or an alphabetic one, which relies on mere abbreviation (and is thus not considered a true shorthand by some purists). Stenographic shorthands are further broken down into geometric or script systems. Geometric shorthands make use of circular forms coupled with straight lines following very precise rules; examples are the British Pitman (also somewhat popular in North America), Boyd’s Syllabic, and Samuel Taylor’s Universal Stenography.
Script systems are based on the movements common to everyday handwriting; this system is more common in Germanic-language countries and Eastern Europe. Other hybrid stenographic shorthands developed in Japan and Italy. The most common American shorthand since the late 19th century has been Gregg; this stenographic system is a compromise between geometric and script, based as it is on ellipses. Pitman remains in use in the United Kingdom, but it has been largely superseded by the spelling-based Teeline system since the late 1960s; Gregg shorthand is what most 20th century North American secretaries learned and used, so we will focus on it.
Once you are ready to give it a try yourself, check out Bai Li’s post on his Lucky’s Notes blog, An introduction to Gregg Shorthand and an attempted English to shorthand converter. This user-friendly how-to post will walk you through the basic logic of Gregg shorthand, including letter forms, how the forms join up to make word outlines, how the outlines are further abbreviated, and the author’s attempt to create an automated shorthand translator.
Getting serious about shorthand
If you decide you’re ready to do a little more work to get up to speed (see what I did there?) for accurate transcription work, you’ll want to explore more formal learning tools. Websites you’ll want to refer to as you get serious include:
An exhaustive collection of resources, including descriptions of the various versions, revisions, and editions of the original Gregg manuals (first through fifth editions, 1888-1928; Anniversary; Simplified; Diamond Jubilee; Series 90; Centennial; and even German- and Irish-language editions), plus the full text of the 1929 Anniversary edition in HTML. The charts of letter forms are particularly helpful.
Yes, a message board! And it is currently in use, right through the date of this writing (February 2019)–with more than 4,000 users and sections for discussing beginner, intermediate, and advanced shorthand; drills; transcription tips; and more. This seems like a good spot to ask for real-time feedback on any troublesome transcription issues. This may also be a good place to identify which edition of the Gregg textbooks is best for your project, timeline, and learning style.
Prefer video learning? YouTube user “Shorthandly” has been busy the past few weeks uploading a full Gregg shorthand course. At 44 videos and counting, this series may be useful to the casual learner. After several introductory videos showing how to write each stroke properly, each video focuses on a particular letter form or shorthand convention for ten minutes or less.
The consensus regarding the best book for the beginner seems to be:
You’ll want to pair any manual you choose with an appropriate Gregg shorthand dictionary. This may be trickier, as none of these appear to be currently in print, so you may have to hunt around to find a used copy. This one is recommended to pair with the Simplified manual. Once you have familiarized yourself with the basics, a dictionary may also go some way toward helping you interpret the shorthand you’re trying to transcribe. The Gregg Shorthand website recommended above has helpfully scanned many of the out-of-print dictionaries and shares them in PDF format, along with many other handy reference documents. For instance, here is the dictionary that accompanies the 1929 Anniversary edition. The only problem here for the 21st century transcriber: obviously, entries are only word-to-outline! So you will still have to have a rudimentary understanding of shorthand letter forms and outlines to know where to begin searching.
Keep in mind that, while these resources may serve as guides as you begin to make sense of the marginalia you come across, truly mastering Gregg shorthand requires intensive study over an extended period; memorization of thousands of joined letter form outlines is necessary to be really proficient. And, as the Gregg system includes outlines for around 30,000 words, it is not exhaustive of English vocabulary; secretaries improvised their own outlines, as well. However, it is also important to note that reading shorthand is different from writing it. Unlike the early 20th century clerks mentioned in Hollier’s article, you’re unlikely to be competing in a timed shorthand writing contest. Your focus as a transcriber working with shorthand is necessarily more concerned with occasionally interpreting the hand. Though shorthand requires real study to become proficient, if a child can do it, so can you!
Crowd-sourcing shorthand solutions
Have any of our readers taught themselves shorthand outside a traditional course? How did you do it? Do you more regularly come across Gregg, Pitman, Teeline, or another system? What are your strategies for transcribing it? We would love to hear about how the transcription community is dealing with this issue as mastery of shorthand becomes an ever-rarer skill. Given how many retired volunteers are working on crowdsourcing projects, there could be plenty of potential volunteers out there who can read shorthand. We won’t know until someone tries a project. If you’d like to try, contact us.