Last June, Ben Brumfield of FromThePage and Austin Mast of Florida State University/iDigBio had a meaty discussion of error and quality in crowdsourced transcription. The discussion, which was moderated by Sara Brumfield, started with the dimensions of quality, compared multi-track approaches (i.e. many independent transcriptions of the same material, resolved by arbitration) with single-track approaches (i.e. one transcription per document, checked by statistical sampling and review), and discussed how to quantify quality and how early intervention can improve transcriber skill. You can sign up for future webinars here.
Read the transcript of the webinar below:
Sara Brumfield: Okay, great. Then let's get started. So I'm Sara Brumfield and I'm going to be moderating this discussion. We wanted to welcome you to our webinar on error and quality in crowdsourced transcription. This is a new format for us. We're structuring it as a conversation between Ben and Austin, both of whom I'll introduce in a minute, but many of you in the audience have experience in this field and we'd like to encourage you to participate in the discussion and to add your own experiences. So after we hit each topic, we'll be opening the floor for you to share your experiences or to ask questions, or just have thoughts or comments. We really would like this to be a conversation.
Sara Brumfield: Because of that, we're not quite sure how long it's going to take; part of that depends on how much y'all participate and how much conversation we have, and we don't really want to cut that off. It may run longer than an hour. If it does and you have other obligations, please feel free to drop off. We'll be sending out a recording of this afterwards and you're welcome to share it with folks who aren't here, or obviously to watch it yourself. I also wanted to mention that our July webinar in about a month is on teaching with primary sources using FromThePage. So if you or any of your colleagues would be interested, we'd really appreciate your help in spreading the word about that. I'll drop the link to it in the chat in just a minute. Today's conversation is between Austin Mast and Ben Brumfield. I'm just going to ask Austin to introduce himself.
Austin Mast: Sure, I'm happy to. Hi everybody. I am Austin Mast, I'm joining you from the Colorado Rockies today, but I have an appointment at Florida State University, as a professor in the Department of Biological Science. The fact that I'm wrapped up in my jacket is due to a cooking incident that I had earlier and I'm airing out the cabin, and it's quite cold at this elevation.
Austin Mast: I do a number of things. One of the things that I am responsible for at iDigBio is the domain of digitization, workforce development, and citizen science. iDigBio is an NSF-funded, for lack of a better word, center. We're in our 11th year. We're not a formal center, but we do have a large budget and we do a lot of community coordination on a lot of fronts, and we're focused on advancing digitization of biodiversity collections. I'm director of a biodiversity collection, the FSU herbarium. It's a collection of about 300,000 plant specimens collected over the last 150 years. I'm also on the board of the Citizen Science Association. I'm an officer in that organization, I serve as treasurer. Thanks.
Sara Brumfield: Good enough. It's interesting, Austin comes from the citizen science side of crowdsourcing, and he and Ben had the conversation that led to this webinar while working on the Collective Wisdom Handbook. They wanted to continue that conversation, so that's where the genesis of this webinar came from. Ben, would you like to introduce yourself quickly?
Ben Brumfield: Sure. I'm Ben Brumfield, creator of FromThePage, which is a crowdsourced transcription platform that Sara and I run together. We've been researching and following crowdsourced manuscript transcription for just over 17 years now, so hopefully we'll be able to talk through and condense some of the things that we've observed over that time.
Sara Brumfield: Great. Okay, so the first thing we wanted to talk about are dimensions of quality. Austin, would you kick us off and tell us about the different dimensions of quality?
Austin Mast: Sure, I'm happy to. I have a Google slide deck that I'll rely on if I can share my screen and I can get that going. Okay, you should be able to see that, I'll slip into slideshow mode. A lot of the content that I bring to this conversation was developed in discussion with Ben and others at a book sprint that he and I were involved in that produced this book. The book is on crowdsourcing and cultural heritage, and it combines the experience of a bunch of people. It was a team effort and it was a great team, led by Mia Ridge, Sam Blickhan and Meghan Ferriter. Ben and I were involved in many corners of this book, but we were particularly focused on data quality.
Austin Mast: This is a series of slides that really offer a window into my thinking on this. It could be that I'm falling short. Maybe some of you have some additional insights that I really should have included. This is not heavily citing prior work; it's just the representation of my thinking over the last 10 years that I've been involved in crowdsourcing biodiversity collections data creation. So we have the world or the universe and we are seeking to document it with data. I think of this as being an action carried out by agents, following protocols. If I could just pause for just a second, my wife just showed up and I have to explain the cooking incident that happened, so just 10 seconds.
Ben Brumfield: Excellent. We're going to have to find out more about this cooking incident.
Sara Brumfield: Yes, I think we're going to. I think we should find a link to the Collective Wisdom Handbook, which Austin just posted, but I'm going to see if I can find that and drop it into the chat.
Austin Mast: That's great, thank you. Sorry about that. The agent is implementing protocols to produce data about a subject. And sometimes the agent is human, sometimes the humans are helped by computers. For example, during the nineties and aughts I carried around a GPS unit and recorded where I was on the labels for my specimens. Sometimes the agent is a computer.
Austin Mast: That data, including the labels that I created in the nineties and the aughts, might not be in digital format, but many of us are seeking to digitize it, and some of that is being done by humans, some by computers, some by some combination of the two. In the end, agents implement protocols to produce information from data, and knowledge from information, et cetera. And this has some purpose. It could be discoverability of the resource if you're doing it for a cultural heritage collection. It could be for research, it could be for policy making, and the standards for that data will be different depending on the final use. I think data quality is determined by fitness for purpose, and those requirements should be articulated as data requirements at the beginning of the project.
Austin Mast: I just want to mention the value of reuse of data and point at the FAIR idea. FAIR stands for findable, accessible, interoperable and reusable, and this is often discussed in the context of reuse by computers. Data requirements might address some of the features I list here, and I'm just going to go through these. This is informed by the work that Ben and I did for the book, but also some additions on my part, in part in conversation with Rick Williams who's here at the Rocky Mountain Biological Lab.
Austin Mast: These attributes might look... Well, quality management steps will look different at different steps to maintain the features that I'll inventory for you, keeping these above some thresholds of quality. Accuracy is something that I think about as the degree to which data corresponds with generally agreed upon reality. Accuracy is something that will look different depending on the step that you're involved in. If I'm out there in the world estimating the number of Florida panthers in a park, I'll be more or less accurate. I might give you a number in a handwritten report, and then accuracy becomes whether or not the number was accurately represented in the database.
Austin Mast: Replicability is something that's very important in science. Replicability is the degree to which relevant features of the data can be replicated with an independent application of the protocol. It could be that some things are not replicable. It could be that I did my count of Florida panthers in 1992 and we have no way of going back in a time machine and determining whether or not I got the right number.
Austin Mast: Precision is the granularity of the data. This could often be described in terms of spatial or temporal scale. For example, if someone's recording that something flowered in spring, that's of a different precision than them recording that it flowered on May 29. Precision can increase along this path: we can later benefit from the aggregate data that's being produced to determine whether there are outliers in our aggregates, or we can be informed by the collector's activities in such a way that we can get more precise, looking back on their record of "flowered in the spring." It can also get more precise in a way that's inappropriate. It could be that someone is interpreting m. as meters rather than miles, as it was originally intended.
Austin Mast: Fidelity is the degree to which the agent follows a protocol. Sometimes this is described as accuracy, and I would suggest those two things are different. Completeness is the degree to which the agent applies the protocol to all in-scope content. It could be that we as a group, who are maybe not involved in transcription but involved in queuing up things for transcription, our actions might lead to less completeness in the data. For example, there could be image corruption, there could be a failure to use check sums when transferring data.
Austin Mast: Bias is the degree to which qualities of the output vary by agent. Oftentimes we can tackle this most effectively when we recognize that there's systematic bias, but this is the way I think about bias. Timeliness is the degree to which the data are up to date. It could be that names are no longer being used and they need to be updated; the names could be geographic place names. I saw a recent change of a place name... It was a mountain name, in Yosemite I think, so the earlier name would be out of date. But this also happens with taxonomic names and many other things.
Austin Mast: Orderliness is the degree to which the data conforms to relevant standards, and that's going to be especially important when aspiring to the recommendations of the FAIR movement. Auditability is the degree to which the data provenance is documented, that is the subject, the protocol, the person or agent, and the data are all mapped. Finally explanation is the degree to which interpretations are documented. There's a nice... I don't know, maybe some of you are involved in genealogy and there are nice guidelines for the genealogical proof standard, and that involves an explanation where logical conclusions are drawn, based on the evidence. That's the sort of thing that I'm talking about.
Austin Mast: That's how I think about data quality. In terms of error, which is the other topic of the conversation, error is really anything that diminishes aspects of data quality from what could be possible. But I'd like to add that not all errors matter: relative to the data requirements, some things might be fine to leave in your data set.
Austin Mast: I'm going to stop sharing. That's just an introduction to some of the facets that I've recognized for data quality, and some of that is relevant to the conversation today, that is we're focused on the center of that pathway at the top, we're focused most on transcription and the creation of digital data, about the records. But the context is important, I think, to recognize the entirety of the path in our conversation.
Ben Brumfield: Austin, you mentioned that accuracy and fidelity are sometimes called the same thing. I wondered if you could give us an example of where accuracy and fidelity are not the same thing?
Austin Mast: Let's think about this. In the case of transcription, since we're talking about transcription, accuracy would be whether or not the text string that has been produced for the field, is reflecting the text string that's written in the document. It could be handwritten text, it could be that the person is misinterpreting the text and that leads to inaccuracy. It could be that they're applying the protocol as best they understand it to be, and the protocol just does not... It could be that their fidelity to the protocol is perfect, but their accuracy is off because they misinterpreted an O for an A. Is that-
Ben Brumfield: Well, let me give you a potential example of how I think about this from the archives world, and see whether that actually matches your definition. There are a lot of times in which you'll have vital records about a person who's perhaps deceased and you will have one of their descendants come in and say, "Oh no, no, no, no, their name is wrong on this document, you need to change this," right? And maybe the name actually is wrong, right? Maybe the name actually was misspelled and the person coming in is correct, but that change would make the transcription or the index less faithful to what the document said. If the document says Toler is spelled with a W and someone comes in and says, "That's not how so-and-so spelled their name, they didn't have a W in there," that person may be right, but in order to be faithful to the document with the W spelling, any kind of index needs to have that W. Is that a fair distinction?
Austin Mast: I think so. There are those two steps that I showed. One is that there is accuracy that can be assigned to the first step, which is a handwritten document, and that accuracy could need correction later, because as you find there were... I see this in old censuses, they... Well maybe that's not the best example.
Sara Brumfield: Evan actually has a really great example from old census forms, if you want to jump in Evan, and share your example? Sorry for interrupting, but...
Evan Roberts: Hi there. Sorry. Yeah, I mean in the chat I just said, yeah, accuracy would be writing down something that was literally true to the characters on the page, but if you think about it in the context, and I think this comes up as well, I think about demographic things. Austin, you're probably thinking about biological things, maybe temperature is another example of where something would be clearly out of bounds, but you can often infer from other items on the page or your knowledge of the range for that variable, what it could be. That's, I guess how I think about it, I don't know if that's a helpful distinction. Yeah.
Ben Brumfield: Your point about being out of bounds reminds me of another example of this. I remember talking to the folks running the Old Weather project about 10 years ago. They were transcribing ships' logs, especially temperature, latitude, longitude and date readings from World War II, from the Royal Navy. And when they started mapping these, they found all these cases in which ships would jump from someplace in the Indian Ocean up into the Himalayas and then back again. Well, this is obviously wrong. Obviously the transcribers have written the wrong thing down. They went and looked at the records, and actually it's the officer of the watch who wrote down East instead of West, or North instead of South, or something like that. So the ship's not in the Himalayas, but if the document says that it is, what do you do then?
Sara Brumfield: I suspect that's different for every project, and where and when you're processing the data, and what you're doing with it. In their case they were mapping, I would correct it there because you want your maps to look right so people can see and interact with the data in real time as it's transcribed, but that's not true for everyone's projects. Does anyone else in the audience have a comment on these dimensions of quality and how you think about them?
Victoria Van Hyning: This is Victoria. I just have a brief parallel comment to that, or adding a little bit to the old weather example, which is that the volunteers and the science team work together to come up with some post processing to strip out those illogical things. It actually, I think originally came from a volunteer's desire to map the ships, that was actually separate from the research team's agenda. Just an interesting example of volunteer input.
Victoria Van Hyning: But one of the things that I think we struggle with, with strings of text that are letters rather than numbers, is that for that kind of correction there are endless permutations and it becomes very tricky. So I'm just raising that as a, I guess, interest in hearing whether anybody has tried to do any cute post-processing to clean up common errors that arise from the original documents? And whether anybody's ever worked with a data set or released a data set in which there's, as it were, the raw data and then the cleaned-up data, whether that went through an automated process of cleanup or was done by hand?
Austin Mast: I will say that we had an NSF grant that was funded through the RAPID program mechanism at NSF, to improve data for horseshoe bats and relatives. Horseshoe bats are the known reservoirs of the closest relative to SARS-CoV-2 out there in the world, and an important part of understanding the origins of the pandemic. What we did was we took about 90,000 records. These are collections that date back to the 1700s, curated by museums around the world. We considered the data in aggregate, so we were looking for outliers like what you're talking about. We went through and improved the data using something that the original transcribers or geocoders, or whoever, didn't have, which was all the other data. It's tough to know what to do beyond that, so I'd suggest that what we ended up with was the state of the possible at that point in time.
Austin Mast: A major issue that arose was, what do you do with that? You've gone through and you've assessed all of that data. We tried to make it very easy for the collections to ingest our data back into their systems, but these are often overworked curators who don't really have even the knowledge of how to do that in a sophisticated way. So it's a challenge for our community to ingest the outcomes of the use of the data back into our databases. In this case, it was use of the data that was very much focused on data cleaning. So you could think about data cleaning as happening at the point where you've got the data set completed, you've got all the transcriptions and you can look for outliers, but there will also be some much broader comparisons that are made if your data's useful. And that's another step at which you might begin to get those, I don't know, I like to think of them as annotations back into your system.
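To make the kind of aggregate outlier check Austin describes a little more concrete, here is a minimal sketch in Python. It is not the RAPID project's actual pipeline; the records, the species, and the simple degrees-from-the-median threshold are illustrative assumptions, and a real workflow would use great-circle distances and much richer context.

```python
# A minimal, hypothetical sketch of flagging coordinate outliers by
# comparing each record against the aggregate: records that sit far
# from the median location for the same species get flagged for
# expert review.
from statistics import median

# Hypothetical records: (record_id, species, latitude, longitude)
records = [
    ("r1", "Rhinolophus affinis", 23.1, 102.5),
    ("r2", "Rhinolophus affinis", 22.8, 103.0),
    ("r3", "Rhinolophus affinis", 23.4, 101.9),
    ("r4", "Rhinolophus affinis", -23.1, 102.5),  # sign flip: suspicious
]

def flag_outliers(records, threshold_degrees=5.0):
    """Flag records far from the per-species median coordinates."""
    by_species = {}
    for rec in records:
        by_species.setdefault(rec[1], []).append(rec)

    flagged = []
    for species, recs in by_species.items():
        med_lat = median(r[2] for r in recs)
        med_lon = median(r[3] for r in recs)
        for rec_id, _, lat, lon in recs:
            # Crude distance in degrees; a real pipeline would use
            # great-circle distance and consult other fields.
            if (abs(lat - med_lat) > threshold_degrees
                    or abs(lon - med_lon) > threshold_degrees):
                flagged.append((rec_id, species))
    return flagged

print(flag_outliers(records))  # [('r4', 'Rhinolophus affinis')]
```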
Sara Brumfield: That's a great way to think of it. That's definitely how we think of it in FromThePage where we link... Not all of our projects, but some of our projects link verbatim entity names, like place names, to canonical ones. So they can do some of that as they're doing usually a second pass, but then you have some standardization without actually changing the original transcript, it's just a link. I'm going to move us on to the next topic because we're almost half an hour in, so we've got to keep moving. Our next topic is about the causes of errors in crowdsourced transcription. Both Ben and Austin have some interesting analysis and stories here. Ben, why don't you start us off?
Ben Brumfield: Okay, I'm going to share my screen. One of my big pandemic projects was trying to figure out where the sources of error in transcription projects come from. Where do the errors come from? I did a lot of reading of textual scholars and textual critics who, for the last several centuries, have been trying to figure out where errors came from in different copies of the Bible or Homer, or the Arthurian legends, or the other kinds of things they were working with. They say, "Well, these two copies I have are different, why are they different? Which one is correct?" So they came up with a huge list of types of errors that copyists make, right? Scribal errors. I'm particularly influenced by Eugène Vinaver's Principles of Textual Emendation paper that he published in 1939, in which he classifies all the different errors he's observed into when they happen as the scribe is moving from the text that he's copying, that exemplar up at the top of this illustration, down to the text he's writing, the copy text. Right?
Ben Brumfield: He says, "Here are the things that happen as someone is writing," like their pen might slip. "Here are the things that might happen as someone is reading," like they might actually misread a word. "Then there are these other things that happen in the transitions of the eye between the place that they're writing and the place that they're reading from," and that's actually perhaps the biggest source of errors. I wanted to see if those kinds of errors applied to modern transcription projects, and John Dougan's team at the Missouri State Archives was kind enough to share the raw data from their 1970 death certificate eVolunteer portal. eVolunteer is this field-based system that shows a death certificate to at least two users, sometimes three, and they transcribe it independently, then the results are collated and resolved by the staff members, and John, feel free to come in and correct me if I'm wrong.
Ben Brumfield: But the point is you had multiple transcriptions of the same image and you could look at the places where they didn't match to evaluate the kinds of errors. I classified a thousand different errors that I found in the first several days' worth of transcription work on these kinds of documents. Here's what these documents look like, and you can imagine transcribing these into just an HTML web form. So what do we find? Misreading is something that definitely medieval scribes did, but this is something that happens a fair amount. This is where the user simply misreads what they see. This word is bland and I can tell that from context on other kinds of entries, but the person has transcribed it as the word blood. Similarly Welcenia instead of Helcenia. You can't blame people, this is hard stuff.
Ben Brumfield: Another example of an error is the same thing as a slip of the pen if you were a medieval scribe. This is a true typo, right? Very clearly this word is hull; the user has typed H-U-L instead of H-U-L-L with two Ls, right? Evet for Evert is very similar. There's no way you could misread these things, this is just somebody slipping up. A really interesting case of this, and this is something that I think would not have happened for manual copyists, is this kind of typo, Nekvub Hewek. Can you tell what happened here?
Ben Brumfield: The user has accidentally shifted their finger position on the keyboard over one to the left on their right hand, so those letters, anything they type with their right hand are going to be wrong. This is really interesting in my opinion because it is a place where the traditional textual scholarly classifications break down. No one would ever do this if they read the word Melvin Jewel and then were writing down the word Melvin Jewel, and it's an example of errors that touch typists can make, who aren't actually looking at the places that they're writing.
Ben Brumfield: Another example of that is what I call tab skip. This is a case in which somebody is just going through this form and tabbing from field to field and they get off by one. So they start filling in values that might be correct, but they're filled in in the wrong fields. There are other kinds of errors that aren't purely mechanical. One of those is interference, and that is when perhaps someone walks by and says a number while you were transcribing someone's age at death, right? And you just type in the number that you've heard. This could also be when you are reminded of something, right? It's when the mind wanders during this process, so lemons instead of Lemmons, obviously the person is thinking of a lemon. J as a letter instead of the name Jay is another example of this.
Ben Brumfield: A particular type of interference is hyper correction. This is when the user does not type what they see, but they type a more standardized form of what they see, right? So in the name Anna Bell, the person who typed it in, instead of writing Bell like a Bell you ring, wrote the much more common woman's name Belle with an E. Similarly, instead of Annettie, the user has written Annette, which is a more common English name.
Ben Brumfield: Back to examples of things that people have done for centuries. One of those is eye skip, and this is when your eye accidentally skips from the place where you are reading to another place, as your eye is moving back and forth from the place where you're typing. Here the father's name is supposed to be John W Bartlow, and the user has written Barton because they wrote John W and then they looked up again and they saw the county, which is right next to the father's name, also starts with a B-A-R, their eye picked up the wrong thing and they typed that down. The example below with Hale instead of Caudill is just perhaps a more egregious example, but it's the same thing. The user is looking back in the place where they're expecting it, their eye is off by a little bit and they type the wrong thing down.
Ben Brumfield: There are also simple failures to follow instructions. This is where the protocols that Austin was talking about, the way a project was supposed to run, are not being followed. These protocols, these Missouri eVolunteer projects say, if you're typing someone's name, don't add these comments that the clerk might have written, saying, "Oh, this person's deceased," right? Actually write down the name. Similarly, the instructions say don't include suffixes like Jr or Sr, but we have users who are doing that anyway.
Ben Brumfield: A rare kind of error in English-language documents, but one that appears more in languages that have grammatical gender, is the polar error: switching gender, or more generally switching something to its opposite, so you might read hot and accidentally type cold. Here is an exhibit; it often happens with changing forms of words from masculine to feminine, or back. Here we have the name Louis George Sally. I suspect what happened here is that the user transcribing this read Louis George Sally and had Sally in their head, which is a woman's name, and so wrote down Louise instead of Louis.
Ben Brumfield: So how do these appear in the thousand errors that I identified? It turns out that they're all pretty common. The hyper corrections, the true typos, these all appear a lot, with the exception of polar errors and tab skips. I think that there are, however, opportunities for those of us who are writing software or writing interfaces to focus on ways to improve these. One of the challenges that the Missouri example had was that the form layout that the users were typing into didn't quite match the form layout of the death certificates. So the natural tabbing through, doing the next thing, required the user to hunt around on the death certificates, and I believe that caused a lot of the eye skip problems and some of the interference problems.
Ben Brumfield: Yeah, so my conclusion is that the traditional textual-criticism classification of errors is actually pretty useful, because it lets us focus on the kinds of errors people make and maybe improve our processes, or maybe know what to look for when we're looking for these kinds of anomalies. I'm going to stop sharing.
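One practical payoff of classifying errors this way is that some of the purely mechanical types can be detected automatically. The sketch below is hypothetical, not code from FromThePage or the Missouri eVolunteer system; it checks whether a disputed value is explained by the keyboard-shift typo Ben describes, where the right hand drifts one key to the left and "Melvin Jewel" comes out as "Nekvub Hewek."

```python
# A hypothetical check for the keyboard-shift typo: if a mismatch
# between two transcriptions is explained by the right hand sitting
# one key to the left, it can be classified as a mechanical typo
# rather than a misreading.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
RIGHT_HAND = set("yuiophjklnm")

def right_hand_shifted_left(text):
    """Return text as it would look if every right-hand letter were
    typed one key to the left; other characters are unchanged."""
    out = []
    for ch in text.lower():
        if ch in RIGHT_HAND:
            row = next(r for r in QWERTY_ROWS if ch in r)
            out.append(row[row.index(ch) - 1])
        else:
            out.append(ch)
    return "".join(out)

def looks_like_keyboard_shift(expected, typed):
    """True when the typed value matches the expected value under a
    one-key leftward shift of the right hand."""
    return right_hand_shifted_left(expected) == typed.lower()

print(looks_like_keyboard_shift("Melvin Jewel", "Nekvub Hewek"))  # True
```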
Sara Brumfield: I think Austin, you had a story to tell about the early days of Notes from Nature that ties into understanding what types of errors people can make, or might be making.
Austin Mast: Yeah, I was just looking at the results of that. Back in, maybe 10 years ago, we started a project on Zooniverse, and we were in the testing phase and we had this error happen in the image handling of a set of images, maybe there were 5,000 total. They were transferred to a server and then from there, at some point the images were corrupted, everything above a certain file number became the same image. What happened was we had, in the end, 2,820 transcriptions of the exact same label, which you'd never do intentionally because it's wasting people's time, unless you're doing an experiment. And we were still in the testing phase, but we found it really interesting to look at how... Because that gave us a statistical distribution of people's responses to the exact same label.
Austin Mast: I'm not going to go into great detail about it, but we found it to be useful as we were designing the project. And as Ben was observing, there were predictable errors; some of these errors were much higher frequency than others. One of the things that we noted was that there were many more errors in spelling when the words were longer. So if you had a short word like... I think the name of the specimen was Sedum integrifolium, and Sedum was rarely misspelled. Integrifolium was often misspelled, relatively speaking, in part because it's not a familiar word for people, I think. I imagine some of these names, well, they're quite unfamiliar. So it was an interesting, what we might call in biology, a natural experiment: something that happens that you don't plan for, but you take advantage of it when it does happen.
Sara Brumfield: I'm curious if anyone else in the audience... Or did you have more to say, Austin? I'm sorry.
Austin Mast: Oh, Victoria asked if it was written up and it was not written up. I wrote it up in something of an internal report to the group that was developing the platform, but no, I didn't have time to publish it.
Sara Brumfield: I'm curious if anyone else on the call has done this sort of analysis of errors, or looked for patterns of errors in a transcription project?
Victoria Van Hyning: I will put a link to a paper that Sam Blickhan, who's one of the editors, leads on the... I always say Collective Wisdom, is that right?
Ben Brumfield: That's right.
Victoria Van Hyning: Yeah. Handbook, and she and I and some of our other Zooniverse collaborators wrote that a couple of years ago, and it gets into a smaller set of, I think just like 19 or 20 documents. But yeah, looking at some of those same issues.
Sara Brumfield: Anybody else have a comment or question on this section of our discussion?
John Dougan: Ben, were there errors that you were expecting with the data analysis, that you did not see?
Ben Brumfield: That's a really good question. The problem is I did this about a year ago, so I'm trying to remember what my preconceptions were. I was surprised by the number of typos, I thought that there would be a lot fewer typos. I was astonished by the number of errors that I would classify as interference, where just random things would show up or random corrections would show up. I was really pleased by how little hyper correction there was, I expected that to be a really, really common problem that would just outstrip everything else, and it turned out to be no different from problems like eye skip. So I will say the results were surprising, I was not expecting this even distribution... I frankly thought that 90% of the errors were going to be exactly the same kind, and it turns out everybody messes up differently.
Sara Brumfield: And unfortunately that makes it harder to write software to catch those errors, right, when we have more variation? Okay, our next topic is the two main approaches that we use to improve quality in crowdsourcing projects. I'm going to ask Austin to tell us about the multitrack transcription approach and then we'll have Ben talk about the single track transcription.
Austin Mast: Yeah, thanks. Multitrack is having multiple participants do the same thing and then comparing across their contributions. We do this on our project on Zooniverse, and Zooniverse more broadly does this. But we have code... I did a paper with Andréa Matsunaga and José Fortes back in 2016, in which we started to look at the convergence of these contributions, among many other things, to determine what the optimal number of responses was in this particular case. What we were exploring was whether or not we could make things much more efficient. At the time we were showing a transcription job to 10 people and we thought that's a lot. So we started to look at whether or not we could assess on the fly whether there was convergence on a particular answer, and in the end we found that there was often convergence for each of the fields that we were focused on at around three transcribers. So we started to just set it to three and asked for three people to transcribe each of the fields for this project.
Austin Mast: Since then, Julie Allen, who's at the University of Nevada, Reno, wrote some code that compares the three transcriptions for us, red flags the ones where there's disagreement, and also reconciles the three. This is quite easy when it's a dropdown, you can just do majority rule, but if it's something like habitat and you're taking habitat from a label, there can be quite a bit of variation in those three text responses. I'll drop some things in the chat, links to her GitHub for that code.
Austin Mast: We implemented it then in Biospex, which is a project that I'm involved in, that provides an interface for people to go through all of the red-flagged fields, and nothing more. If you have 98% agreement across all your fields and subjects, we're not going to revisit that as expert reviewers. But those 2% we want to look at, and we produced an interface that allows users to go through one at a time and look at those things. Julie's code uses some fuzzy matching to determine whether or not there's a threshold of agreement, and which of the three entries features qualities that are valued by the team, so I'll drop that in there.
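For readers who want to picture that reconciliation step, here is a minimal sketch of the general idea; it is not Julie Allen's actual code, and the field values and similarity threshold are assumptions. Majority rule settles controlled fields such as dropdowns, fuzzy matching (here, Python's difflib) measures agreement on free-text fields like habitat, and anything below the threshold is red-flagged for expert review, the roughly 2% that gets a second look.

```python
# A minimal sketch of multi-track reconciliation: majority vote where
# possible, fuzzy matching for free text, and a red flag when the
# three transcriptions disagree too much.
from collections import Counter
from difflib import SequenceMatcher

def reconcile(values, fuzzy_threshold=0.9):
    """Return (reconciled_value, needs_review) for one field
    transcribed by three (or more) volunteers."""
    counts = Counter(v.strip() for v in values)
    best, n = counts.most_common(1)[0]
    if n > len(values) / 2:           # clear majority, e.g. a dropdown
        return best, False

    # No majority: compare every pair of distinct free-text answers.
    distinct = list(counts)
    worst = min(
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for i, a in enumerate(distinct) for b in distinct[i + 1:]
    )
    # Keep the longest answer as a provisional pick; flag the field
    # for expert review if pairwise similarity falls below threshold.
    provisional = max(counts, key=len)
    return provisional, worst < fuzzy_threshold

# Hypothetical label data:
print(reconcile(["Riparian forest", "riparian forest ", "Riparian forest"]))
# ('Riparian forest', False)
print(reconcile(["Rocky slope", "Roadside", "Rocky slope, roadside"]))
# ('Rocky slope, roadside', True)
```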
Ben Brumfield: That's really interesting. I mean, I feel like one of the challenges the multitrack approach often has, is that, while it's great about reducing bias, you still have to end up with something usable. Right? You still have to say this is the actual last name of this person, among your 10 options. Or this is the actual habitat, and it sounds like you've got this quality review and consolidation tool to help you do that. Is that what I understand?
Austin Mast: Yeah, that's right, we've got that pretty much fixed. I could show you the interface in Biospex if you're interested, but I don't want to take time away from your description of single track.
Ben Brumfield: Yeah. Single-track methods are usually ones in which the same image (since any transcription project is working from images) is shown to a user who creates a transcription, and then the image and the transcription are shown to subsequent users. You don't have this blind double keying, multiple keying of the same image by independent people; you have a sequential collaboration as multiple people look at the image and look at the previous people's contributions, right? Look at the transcription and hopefully improve it.
Ben Brumfield: It's an interesting approach because it is oftentimes more appropriate for really, really large material. So if you're dealing with just a species name in a single field or if you're dealing with perhaps a habitat, it is fairly straightforward to do the kind of comparison that Austin is describing, to figure out whether or not two people's transcriptions match. But if you're dealing with an entire page full of text from a 19th century letter, it's a lot more difficult because you have all of these accidental differences that a computer would recognize, without perhaps any substantive differences. And then trying to determine whether the two transcriptions agree or not, is only one part of the problem. The other problem is, once you've decided they agree well enough, which one do you pick?
Ben Brumfield: One of the benefits about the single track approach is that you have usable data at every stage of the process. The potential problem though, is that that usable data might have been created by someone who didn't produce high quality data. Right? It might've been someone who's new to the project and so those often need to be reviewed. So there are projects that we run that have staff members review every volunteer contribution, and approve them. There are other projects that we run which have volunteers reviewing each other's material, and other people as well, right? The Smithsonian Transcription Center, essentially two different people have to review every page and agree and say, "Okay, this page transcript is good enough."
Ben Brumfield: Obviously this can be labor intensive, it's just labor intensive in a different way from showing the same image to multiple people and having them transcribe them. So it's a challenge. It's one thing that we find can work... We always have to have conversations with projects to talk about the importance of their data quality. If someone is working on colonial records and just trying to create a full text search index of them, they often have relatively low quality needs. If someone is working on marriage certificates from the 1980s as part of vital records that people currently need for death certificates or any kind of hospital treatment, those have really high needs for quality, and so you're willing to put more labor in.
Sara Brumfield: So this is a pretty labor intensive process no matter which way you go, either with your volunteers or with your staff, and I know some of our projects use volunteer reviewers and different formats. Can we talk about how to make that easier on volunteer reviewers or on project staff?
Ben Brumfield: Yeah. I'm going to talk a little bit about our most recent efforts that were funded by the Council of State Archivists to make single track review something that could be spot checked, to try to reduce the labor or direct the labor in that process. Unfortunately, it's going to require a little bit of screen sharing because I've got some screenshots from what we did in our application. And Sara, I'd encourage you to jump in here too if you like.
Ben Brumfield: Okay. In a single track system, every time someone changes a transcript, you get the side effect of data about the change. So hopefully they improve the transcript, but you can see the difference between the previous version and the version that you created. Here's an example, this is a FromThePage page version where you can see someone who has made a change, you see all of their changes in green and you see all the previous text in red with strikethroughs. Right.
Ben Brumfield: We have been doing this for years just as a way of making the editing process more transparent, but what we wanted to do was to be able to use that to figure out, statistically, what needs attention. Right? So the act of approving a transcript and getting that transcription to a point at which it's considered good enough by the project staff gives us information, because we know how many letters were changed during that review and approval process. We can use that, or attempt to use that, to find out information about the person who did the work, or information about the difficulty of the material. Right? This particular county clerk had terrible handwriting, whereas this other one was great, and so people may need to review the one with the bad handwriting a lot more than the one who used a typewriter, for example.
Ben Brumfield: What we tried to do is calculate a delta during the approval process, and that delta is between the last person who edited a page and the final state of the page at which it was approved. This is an example, we're actually using a Levenshtein distance algorithm, which is a little bit more sophisticated than this, but that gives us this idea of how good was the page and the state that the approver found it, compared to the state in which they left it? In order to make this useful, we need to roll it up and aggregate it. We aggregate it in two ways. We take all of the approval deltas for pages, and we aggregate them by the work that those are in. That lets us know, well this county clerk, everything that they did, users are really struggling with this guy's handwriting because all of our approvals are making lots of changes.
Ben Brumfield: We also roll that up by the last person who transcribed. So we can say, "If you see work that was last touched by this one person, here's what it's likely to be." Normally you'll find users who are very concerned with everything being correct, who will go through and review each other's work and make changes to get it into a great state. So you can say, "Wow, well anything that this person has worked on, that they were the last editor on, it's going to be great. It maybe needs a little spot checking, but it'll be fine." But sometimes you find users, and this is something that happened with the British Library Arabic scientific project documents because they were dealing with medieval manuscripts with old Arabic. After the project was almost finished, one new user came in and started normalizing all the spelling into modern Arabic spelling, right? So everything they touched, they were actually degrading the fidelity of. So okay, maybe that person needs a lot more work.
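Here is an illustrative sketch of that approval-delta bookkeeping; it is not FromThePage's production code, and the record structure and sample values are assumptions. Each approval records the edit distance between the page as the approver found it and the page as they left it, and those deltas are averaged by work and by last editor.

```python
# A minimal sketch of approval deltas: Levenshtein distance between
# the last contributor's text and the approved text, rolled up by
# work and by last editor to show what needs closer review.
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def approval_deltas(approvals):
    """approvals: dicts with 'work', 'last_editor', 'before' (text as
    the approver found it) and 'after' (text as approved)."""
    by_work, by_editor = defaultdict(list), defaultdict(list)
    for a in approvals:
        delta = levenshtein(a["before"], a["after"])
        by_work[a["work"]].append(delta)
        by_editor[a["last_editor"]].append(delta)
    avg = lambda deltas: sum(deltas) / len(deltas)
    return ({w: avg(d) for w, d in by_work.items()},
            {e: avg(d) for e, d in by_editor.items()})

# Hypothetical approvals from a sampling pass:
works, editors = approval_deltas([
    {"work": "Clerk A register", "last_editor": "heidi",
     "before": "John W Bartlow", "after": "John W Bartlow"},
    {"work": "Clerk B register", "last_editor": "newbie",
     "before": "John W Barton", "after": "John W Bartlow"},
])
print(works)    # {'Clerk A register': 0.0, 'Clerk B register': 2.0}
print(editors)  # {'heidi': 0.0, 'newbie': 2.0}
```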
Ben Brumfield: When we look at these, all of the ways that we evaluate this are specific to a given project. Just because someone does a great job on one project, they might not do a great job on a totally different project, or vice versa, and we do the same thing for works. The way that we populate all this information to get this evaluation process started is by doing a random statistical sampling of a representative set of pages. We present pages with their transcripts to approvers, stripping out information about who transcribed them, because we don't want to bias the approval process. So we let a user go in, make their fixes and approve the page, and then we show them another page out of this quality sampling. Over time this gives us enough information that we can see how different users are doing or how different material is doing, and that lets people go in and review one particular user.
Ben Brumfield: This is a user, Heidi Marie; you can go in and see everything that she did. One of the things that we found is really important is the context, so if there were any notes or discussions about this, we want to print that out. Then the approver can either review these individual works, these individual pages, or they can approve them all in bulk. Right? Hopefully if you're starting with 10,000 pages needing review, you can review a subset of perhaps 500 pages and get enough information to make targeted decisions about what further review needs to be done and what doesn't. That's the statistical sampling approach that we've adopted most recently; it's currently being used by some of our partners through CoSA. We hope it works. I'm really curious if other people have done these kinds of quality spot checking approaches, and how they did it, and how they worked, too.
Sara Brumfield: Doesn't sound like really anyone has. The reason we built this... Oh, go ahead, please.
John Dougan: I should probably turn my video on, I'm sorry. This is John with the Missouri project. We actually assign arbitrators for our third pass, so if the first two passes are not correct. We use a scoring system as well. What's interesting, and I'm not sure of the code on it, is when you get assigned to be a third pass, you immediately get more difficult work because of the first two people... And your score plummets whenever you get that third pass, so it's just really interesting, that quality of work. But that's really the only thing that I would say that might add to this discussion, is that we use a similar kind of system, and we have noticed... Or a related system, it more falls into Austin's area, but we've noticed that on that third pass, the statistics drop pretty precipitously because they're getting the rejected work of everyone else.
Ben Brumfield: John, can users see their own scores? Because this came up in a discussion I had on Twitter about this kind of feature.
John Dougan: Absolutely not.
Ben Brumfield: Okay, that's great.
John Dougan: Yeah, although you were accurate in your previous slide, showing me to have an extremely poor score for data entry, and Christina would be better than me, but yes, absolutely, they don't. And then the other thing, and you've used this example from us before too, is that whenever we have a specific problem with a person doing data entry, we send them a generic email, or generic-sounding email, but we only send it to them. And we monitor problems constantly; when we notice a recurring problem... And Leanna Twinty, our volunteer coordinator, is on the call as well, she sends them a nicely worded email, saying, "Hey, we've noticed this problem going on," but it oftentimes only goes to them, or if two people are making the same mistake, it only goes to those people.
Sara Brumfield: Yeah, which is a great segue into the next thing we wanted to talk about a little bit, which is intervention, right? Now that you have all of this data and these different approaches to identifying quality, and when you can identify quality, are there ways people are doing intervention? And how do we improve transcriber quality? And when do we improve it? So John's example is one that we love to share because its transcribers are volunteers and we want to respect them and their sense of significance, and not make them feel bad. Because if you make them feel bad, they will go away, right? So how do you manage this quality improvement process in a gentle way?
Ina Schäfer: Maybe I can say something. I am the community manager for our project, so it is my turn to send those emails, and I can say it is sometimes really hard to find the right words to not insult the people, but to tell them, "Well, maybe you should, a bit, maybe." But we do it another way: we send it to all of them, so that if they chat with each other, it will not come up as, "I didn't get this mail, did you?"
Ina Schäfer: So I go the way of saying, okay, I send a mail to all of them and say, "We recognize sometimes there are these mistakes, maybe check it." And if there are really hard mistakes, or people where we see many mistakes, I will send a mail to another volunteer who is good in this section. We have text from the 1600s to the 1900s, so I have to check who's good in which century and say, "Well, maybe can you have an eye on these pages and check them?" So I know there's a good one to check them. And I would say this is working really well at the moment. The project has been running just a year now, but I had a, yeah, good response on that, I think.
Sara Brumfield: Great, great. I know when we were building this quality review and sampling feature in FromThePage, one of the things, the first thing we built, was actually a way of seeing brand new transcribers on a project. Because brand new transcribers are still learning, and so if you can look at their work and review it early, and see their first three pages, and see how they're doing and if they're improving, you can intervene with one of these gentle correction methods earlier in the process. And hopefully that flows through all of the rest of their contributions.
Ben Brumfield: One of the interesting things about Ina's concern about volunteers talking to each other, is that your project Ina, is very different from many of these other projects because it's a single city and it's not even in a very large city. And you've got people coming in, in person to the archives, working on this. It had not occurred to me that that might require a very different approach than John's situation in Missouri, or ones like that.
Ina Schäfer: Yes, but they also meet in Zoom because they are working from home, and sometimes we are not in those meetings because they are managing them themselves. Afterwards I once heard, "Well, they got that mail and I didn't, but why?" Okay, okay, and I went back to sending them all to all of them. So I think it could be a problem even for those projects which are just online, in many cities, if the volunteers meet each other in meetings and discuss things.
Sara Brumfield: And that may be virtual places too, like forums or Slack channels, which we know some projects also have. Yeah, I think that's interesting. Are there other intervention strategies that people use? Anyone else on the call?
Austin Mast: I will note that the Zooniverse has a very nice forum in which people can talk about subjects and ask questions of each other. I'll go into that forum and look at the project that I'm involved in maybe once every three weeks, and just go through and answer questions as a researcher. That's an opportunity for people to talk amongst themselves, and we have some expert participants who moderate those forums so that they guide people in the right direction.
Sara Brumfield: FromThePage does comments on the actual page that people are transcribing or working on, and those comments get surfaced in an activity feed for the entire project that anyone can see. Then they're also surfaced in a nightly email to the project owners, and hopefully that's a way to prompt project owners to intervene and answer those questions early, so people feel listened to, but also are corrected and helped earlier in the process.
Sara Brumfield: Okay. Are there things that we haven't addressed in this whole conversation that people wanted to hear about or talk about? Do we have any thoughts on what the next steps of this journey are... Actually, just this conversation has given me a lot of ideas on ways that we can improve crowdsourced transcription quality, but how else can we reduce staff time? How do we improve quality? I mean, time and quality, right? They're a balance, so can we raise one without raising the other? So what do people think are the next things to go explore and to experiment with?
Victoria Van Hyning: Sara, you know I have a lot of feelings about this, so I will jump in and say I think that more conversations like this are helpful, and more papers or gray literature, whatever, that share information about how people are trying to deal with quality. I think this is something where there hasn't been a huge amount of joined-up conversation across projects and platform types. We're not only dealing with these big questions about whether to use multiple consecutive transcribers with a reviewer, or multiple independent transcribers like the Zooniverse method, but there's also just the sheer variety of documents and transcription methods that Austin touched on at the beginning.
Victoria Van Hyning: Are you trying to get what exactly it says on the page? Are you trying to get data that's embedded in the pages with an indexing project? I think that as volunteers go from project to project within a given platform or across platforms, they're navigating all of those things, and I don't know actually how well we are helping them wayfind through that, so I think that that's a potential thing. And the other thing that I'll say, another thing I have very strong feelings about also. You mentioned earlier that one or more cultural heritage institutions were struggling to get the content back in and really delivering on one of those aspects of their data, and that's a huge problem. It is partly on the training of folks, but it's also on the vendors and the systems that people are using, and that's an area of particular interest to me, so hope to see more on that soon.
Sara Brumfield: I was thinking as Austin was talking earlier about this idea that we have the original crowdsourced data and then the post-processing that it goes through. I think there are a lot of really interesting techniques that you could apply to that post-processing, but maybe what we need to be doing is keeping each of those data sets, each stage of that pipeline, as a separate stage. You definitely want the original transcribed, exported data as the original, because I know with many digital humanities projects, at least, you're like, "Oh, they did this thing to the data and I don't like it. I'd like to go back to the original and do this better, newer methodology that we have."
Sara Brumfield: There are so many data portals that different states and universities are running, I think we have places to stick that now, even if the institutions don't have a way to pull it into their DAMS. That would be a smart thing to recommend to people who are running these projects, so I think we might start recommending it. The other thing that occurred to me thinking about this is maybe we need to capture our transcription conventions and package them with that data: "Here's what we told people to do," so that 10 years from now, when you're looking at that original data and trying to figure out what happened with it, you know. Those are my two takeaways from this, things we should think about doing. And they're not super technical, they're just process and approaches.
Victoria Van Hyning: I'll just say really quickly, if you're on this call and running a project, raise your hand if you've never changed your conventions mid flow.
Sara Brumfield: Well we actually-
Victoria Van Hyning: That's important to know.
Sara Brumfield: ... want to do that, right? We want you to keep improving them so it gets easier for your transcribers. But that is also a problem. We don't version those, and maybe we should.
Ben Brumfield: Maybe we should. We actually had a volunteer change our conventions at one point. It was a new project, we were just reusing some conventions from a previous project and she was the main person working on it. And about, I don't know, 30 pages in, I see this comment saying, "Look, this is the wrong way to do this. Here is what I'm going to do when I see unclear text or additions or deletions. And I've done the research and your instructions are wrong." And the thing is, she was right.
Sara Brumfield: Yeah, we've seen that multiple times where volunteers have been in dialogue with the project owners, because many of our volunteers have more experience doing transcription than the project owners do, right? They go from project to project, they do this because they love it and they are experts, right? And we have to respect that and learn to iterate and learn from them.
Jesse Karlsberg: I was just going to... Hi, this is Jesse Karlsberg from Emory where I know we do have a herbarium, I don't know if it's in any way ever been connected to Austin, to your work. I work on a couple of different projects and they all involve OCR and then OCR correction, rather than starting straightforwardly from transcription. And Austin's framework noted that agents could be computers or humans, or the like, and we've mostly been talking about human agents and then some computer programming, some software that deals with what human agents do. I have ideas in my mind, I guess, about the kinds of errors that OCR systems tend to make, which it would be really interesting to do an... And they're changing all the time I guess, and at a very rapid clip because of the shift from algorithmic to AI-driven OCR, and OCR advances are so rapid. But it'd be interesting to compare OCR errors or even errors in historical OCR, OCR that was performed three, four, five years ago, or 10 years ago, and see how that relates to the kinds of errors that human agents make.
Jesse Karlsberg: Then I also just wonder if there is anything particular to OCR correction as opposed to transcription, around errors. I have the same guess that Ben had, that hyper correction is going to be an issue in the project that I'm working on. And historical OCR, much less today, but historical OCR seems to hyper correct because it often uses dictionaries. But I also guess that when you eliminate... It's really, in certain ways it's not so different from... It's almost automatically a single track approach because you have an individual, a human agent working on top of what a computer agent has done. But yeah, I wonder whether the starting point being OCR, makes certain kinds of errors more likely than others. I don't know.
Austin Mast: Yeah.
John Dougan: There's an ongoing project that really illustrates that well with the 1950 census: a comparison between the National Archives work that was done with Amazon Web Services and a privately developed FamilySearch/Ancestry algorithm that has tremendously better accuracy, in the mid-ninety percent range really, for very, very complicated text. So if you all could get your hands on some of that data, I think you'd find something really interesting.
Austin Mast: I'll also comment on that. I saw a really novel use of OCR. Sometimes, as I'm sure you've found, you get garbled results from OCR, and Jason Best at the Botanical Research Institute of Texas was seeing this in his OCR results, and he ended up using it as a way to recognize the fingerprint of particular collectors' labels. So it was just meaningless stuff that he was getting out of OCR, but he could cluster it and he could double check his database by these clusters, which I thought was novel.
Sara Brumfield: That's fascinating. Very, very cool.
Jesse Karlsberg: That's lovely. And I see somebody linked the Viral Texts project; Ryan Cordell, who's one of the scholars involved in that, also has an essay that he's published that's all about the sorts of research questions you can ask using messy OCR data. And that's something I'm just super interested in, because I work on one project where we really do care about having a diplomatic transcript of the original and we're going to use OCR as a means to reach that. We have another project where it's totally out of scope, and we're also using musical images, and OMR, optical music recognition, lags far behind OCR and is a totally different problem in certain ways. But I know that there will be things like what Jason came up with, where there are things that we'll be able to learn, and I'm super... So thank you for that. That's so great. That's great to hear about.
Austin Mast: I'll also put in the chat a link to a new project that was funded last year by the National Science Foundation, that incorporates OCR and AI, and what is thought to be a more rapid digitization pipeline. It's called DigiLeap.
Sara Brumfield: Well, that sounds fascinating too. I think we're about done. Does anybody have any sort of final thoughts, or Ben, or Austin, would you like to wrap up what we've chatted about today? I'll let Austin find the DigiLeap link, so maybe Ben, do you want to do any closure?
Ben Brumfield: Well, thank you all for joining us, I really appreciate the conversation and all of your contributions. As Sara mentioned, this is an experimental format for us, so if you have pointers, if there are things you hated about it or something, send me an email with recommendations for the future and what you wished we'd done differently. Yeah, thank you all. As Sara mentioned, our next webinar in July is going to be on digital scholarship and teaching from primary sources with FromThePage, so I hope that those of you who are interested, that I'll see you there.
Sara Brumfield: Great. Thank you Ben and Austin in particular, we really appreciate what people bring when they come from different domains. Evan from demographics as well, having those three ideas floating around was really cool. We really appreciate everyone who came and participated in the discussion, and we hope to see you on later webinars. Thank you.
Austin Mast: Thanks so much for inviting me, and it was nice to see all of you. I'm having trouble finding that link. Maybe they don't have a web presence up yet, but it's something to watch for, it's called DigiLeap.
Victoria Van Hyning: I've created a shared notes doc, if anybody wants to add that and/or any other things that came up today, that would be great. And then Sara, I'll send you the link.
Sara Brumfield: You did drop it in the chat.
Victoria Van Hyning: It's in the chat, yeah.
Ben Brumfield: I just dropped it again. And Victoria, we will probably try to add the chat logs to that shared notes document.
Sara Brumfield: Very cool.
Victoria Van Hyning: All right. See you all.
Sara Brumfield: Thank you all.