Recently on Twitter, FromThePage’s Ben Brumfield discussed the potential connection between scholarly editions and artificial intelligence/machine learning. What would happen if scholarly editions, a type of text that has historically depended on human commentary and annotation, were produced without human intervention? Can this be a positive use of technology, or is something genuinely lost? Ben started the conversation by describing how artificial intelligence and machine learning transcription technologies are not yet refined, but can already be used to produce rudimentary indexes to books without any human intervention.
Scholarly editions are more than mechanical processes and technologies; they are complex publications that include scholarly contributions like annotation and commentary. In the Twitter conversation, Mike Cosgrave replied that a “scholarly edition implies more than just indexing and automatic named entity recognition”: the mechanics of identifying entities are only part of the work.
Beyond that scholarly apparatus, a scholarly edition has a distinct identity and value: to be an “edition” at all, it has to be edited, a task where AI technology is limited. Hugh Cayless replied to Ben’s thread with a focus on this genre of “scholarly edition,” asking, “Can it be an edition if it hasn’t been edited? I think it’s possible to get to the point of good digital transcriptions, but absent strong AI (or altering the definition of ‘edition’), I don’t see how you get to a real edition.” AI and machine-learning systems are trained on sets of existing data, and so they struggle to recognize anything they haven’t seen before in training. Hugh pointed out that humans have the unique ability to recognize things that are new to them, while algorithms mishandle or simply ignore outliers they haven’t encountered. AI projects can also replicate the bias of their training data, most infamously in criminal bail algorithms that recommended higher bail based on the defendant’s race. In a scholarly edition, this can mean real problems and information loss:
“[W]hat would be lost is anything that’s an outlier. Humans can recognize stuff they’ve never seen before. Algorithms do weird random shit with stuff they’ve never seen before. Or ignore it entirely. The results, as we well know, can be oppressive.”
Hugh added that outliers can be as simple as people who don’t capitalize their names, a person’s name that is indistinguishable from a place name, multilingual texts, or dialects that aren’t “standard.” This is one of the key problems with AI technologies: the way machines are trained produces a (so-called) “standard,” and as Hugh points out, anything that deviates from that standard, that is not the “norm,” may be handled incorrectly or dropped from the product altogether.
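To make this failure mode concrete, here is a minimal sketch (hypothetical, not part of the original conversation) of how an off-the-shelf named entity recognizer can stumble over exactly these outliers. It assumes spaCy with its small English model (en_core_web_sm) is installed; the specific results will vary by model and version, but the pattern is the worry Hugh describes: conventionally formatted text is handled confidently, while non-standard text may be mislabeled or silently dropped.

```python
# Rough illustration (hypothetical example, not from the thread): off-the-shelf NER
# on "standard" vs. non-standard text. Assumes spaCy is installed along with its
# small English model: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

samples = [
    "Letter from William Cody, written at North Platte.",  # conventional capitalization
    "letter from william cody, written at north platte.",  # lowercased "outlier" text
    "We stopped at Cody on the way to Jackson.",            # person name vs. place name
]

for text in samples:
    doc = nlp(text)
    # Entities the statistical model managed to recognize, if any
    print(text)
    print("  entities:", [(ent.text, ent.label_) for ent in doc.ents])

# A human transcriber recognizes "william cody" as a person in every variant;
# a model trained mostly on conventionally edited prose may miss or mislabel
# anything that deviates from that "standard."
```

None of this is a knock on any particular tool; it simply illustrates why material that deviates from the training “standard” is exactly where automation is weakest.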
Beyond the losses in the algorithm’s output that Hugh pointed out, the process of a human doing the work of transcription is an important learning interaction, a vital part of community and public education that leads to the research questions that inform scholarly editions. Ben shared that “Transcription does produce transcripts, but it also makes a real impact on the person doing the work. We learn about the text, the language, the writer, and the subject as we transcribe. Interacting with the handwriting or the paper tells us the education level of the author (or how many drinks of whisky Bill Cody had had when he wrote a particular letter).” This is particularly true for public crowdsourcing projects, where the “non-transcript by-product can be an important part of public education (as with the many projects grappling with institutional entanglement with slavery), and for scholarly projects the process can raise new research questions.”
Finally, the process of transcribing is not just educational; it is enjoyable and fun for the humans who take part in it. Optimizing technology to produce AI scholarly editions takes away the opportunity volunteer transcribers have to participate in something enjoyable and impactful. Ben added, “There's another reason I regard the machine-created edition with some dread -- I really enjoy transcribing, as do our volunteers.”
What do you think? What's the proper role for AI in scholarly editing?
Have documents that could benefit from transcription? Reserve a call with Ben and Sara.