When we transcribe handwritten text, we make mistakes.
We misread words with difficult letters; we accidentally modernize a word with archaic spelling; we skip a line.
Everyone does it, and even though we humans don’t make as many mistakes as computers do reading handwriting, that might be small comfort for people who are trying to do their best at a difficult task.
Is it possible to classify the kinds of mistakes we make when we transcribe?
In the crowdsourcing community, we have the raw data: names which have been transcribed differently by different people and then passed through arbitration in systems like FamilySearch Indexing or Zooniverse, or words which were transcribed one way in an early draft and then corrected in a later version in systems like FromThePage or Wikisource. We could analyze those differences to look for common patterns. The conclusions of such an analysis could help us make our software easier to use, aid reviewers, and even automate some kinds of quality control.
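As a rough illustration of the raw material involved, here is a minimal Python sketch (using only the standard library’s difflib) that surfaces word-level differences between an early draft and a corrected version of the same line. The sample text and variable names are invented, and deciding what kind of mistake each difference represents would still take human judgment or more careful heuristics.

```python
import difflib

# Hypothetical example: an early draft of a line and its corrected version.
# The sample text is invented for illustration.
draft = "Recieved of Mr. Graves 12 lb tobacco"
corrected = "Received of Mr. Graves 12 lb. tobacco"

# Compare the two versions word by word and report what changed.
draft_words = draft.split()
corrected_words = corrected.split()
matcher = difflib.SequenceMatcher(None, draft_words, corrected_words)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, draft_words[i1:i2], "->", corrected_words[j1:j2])
# Prints:
#   replace ['Recieved'] -> ['Received']
#   replace ['lb'] -> ['lb.']
```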
Such an analysis has never been published by people running web-based transcription platforms.
That said, 21st century crowdsourcing volunteers are not the first people to copy text from a handwritten exemplar to a new medium; typists, printers, clerks, and scribes have been doing nearly the same thing for thousands of years. Editors and other textual scholars in classics, biblical studies, and medieval studies have had to work with differences between copies of the same text, determining which variant might be an error and which might be original (or conjecturing an original if both variants look wrong). Over centuries, they’ve created an extensive literature analyzing scribal errors, classifying them and identifying probable causes.
Could we learn from them?
Martin Worthington is an Assyriologist who has asked the same question. In Principles of Akkadian Textual Criticism, he introduces the conclusions of textual scholars from other disciplines to his own field, describing each type of scribal error and looking for examples from Mesopotamian documents.
For the most part, he finds that the classifications apply accurately, even though scribes wrote on clay tablets instead of parchment or paper. Scribes were most likely to make these errors when working with unfamiliar kinds of texts or when they were sleepy, but errors also occurred when the original was damaged or the style of handwriting was unfamiliar.
Computers are not cuneiform, but I think that we all might be subject to the same kinds of forces, so let’s dive into Worthington’s framework. (I’ll paraphrase to use modern terms and examples.)
- Errors of letter similarity.
When a letter in the original looks like a different letter we expect to see, we may transcribe the wrong one. Examples might be ſ (long s) transcribed as f or an un-crossed t transcribed as l.
Worthington subdivides this classification:
- Mis-readings. Reading f in place of ſ within a word.
- True typos (Worthington’s lapsus styli) occur when we actually understand what we’re reading, but our fingers simply type the wrong key. Anyone who’s started typing with their hands offset by a column of keys has experienced this.
- Errors of word interpretation. When we encounter a word or an abbreviation sign that we haven’t seen before, we may interpret it incorrectly. Jeremiah White Graves uses a symbol like a raised, rounded w with a line over it to abbreviate pounds. Even in context, it’s easy for a newcomer to read this as a symbol for hundredweight or cents instead of some variation on lb.
- Interference by internal narration. When we read someone else’s writing, we carry their words in our head on the way to the keyboard. It’s easy for our internal narration of the text to change it to the wording, spelling, or punctuation we’d use instead of that used in the original. This happens even more easily if the original quotes a passage we’ve learned by heart, like a verse from the Bible or a passage from Shakespeare: we are very tempted to copy down the words we remember, rather than the words we see on the page.
- Eye-skip (saut du même au même). These occur when we finish transcribing one line, then start transcribing a line that isn’t the next one because of similarities between the line we anticipate transcribing and the line we see.
Perhaps our eye looks for a line that ends the same way our previous line ended, finds one further down the page, and proceeds with the following line, skipping the lines in between (homoeoteleuton).
Perhaps we know how the next line to transcribe should start, but our eye picks up a different line that starts the same way (homoearchon).
Perhaps the mistake occurs when we have transcribed half of a line, look back to the exemplar to continue it, and pick up the “same” next word in the wrong line (homoeomeson).
Regardless, the result is the loss of several lines of the original in the transcribed text. I suspect that this problem is more common when transcribing tabular records like census sheets or account books than it is when we work with letters or diaries.
- Word-skipping (lipography) is when we skip over a word or phrase we should transcribe.
- Haplography is when something is doubled in the original, but we only transcribe it once.
- Dittography is when we repeat a word or phrase that only occurs a single time in the original.
- Polar errors. These happen when the original says “hot” but a copyist writes “cold”, or replaces “big” with “small”. Apparently these are fairly common. A special case of polar errors, for languages (or words) that mark gender, is swapping gendered forms, e.g. “sorceress” with “sorcerer”.
- Errors of attraction. Worthington defines these as places where the spelling of one word is corrupted by other words near the word being transcribed. I find that most of my own typos are anticipatory. I start typing the next word before I’m done typing the first, so I suspect modern transcribers are prone to errors of attraction as much as ancient scribes were.
- Synonym substitution. These happen when we replace a word in the original source with a different word that means the same thing.
- Dialect normalization is the replacement of words written in the author’s dialect with forms (or spellings) in the transcriber’s dialect.
- Cut-and-paste errors occur when a transcriber saves effort by copying repeated text that actually varies in tense or spelling within the original.
- Hypercorrection occurs when our transcriptions “correct” errors we perceive in the original which were not actually errors.
My own experience as a transcriber convinces me that Worthington’s classification scheme is applicable to modern users of web-based transcription software as much as to Mesopotamian scribes working with clay tablets.
I’d love to see a research project analyze crowdsourced transcripts to see how frequently each kind of mistake happens, look for additional types of errors that might be unique to digital transcription, and grapple with patterns of transcript variation that are not really errors, but differences of interpretation.
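If such a project ever happens, I imagine the counting step might look something like the sketch below. It is purely hypothetical: it assumes we already have line-aligned pairs of original and transcribed text, the function name and heuristics are invented, and real transcripts would demand far more careful alignment and human judgment.

```python
import difflib
from collections import Counter

def classify_line_diffs(original_lines, transcribed_lines):
    """Roughly tally a few of Worthington's categories from line-aligned text.

    Both arguments are lists of lines; the categories and heuristics here are
    illustrative only, not a real classification scheme.
    """
    counts = Counter()
    matcher = difflib.SequenceMatcher(None, original_lines, transcribed_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Lines present in the original but missing from the transcript:
            # candidates for eye-skip or word-skipping.
            counts["possible eye-skip / omission"] += i2 - i1
        elif tag == "insert":
            # Lines that appear only in the transcript; if one duplicates the
            # line just before it, that looks like dittography.
            for offset, line in enumerate(transcribed_lines[j1:j2]):
                prev_index = j1 + offset - 1
                if prev_index >= 0 and line == transcribed_lines[prev_index]:
                    counts["possible dittography"] += 1
                else:
                    counts["unexplained addition"] += 1
        elif tag == "replace":
            # Changed lines need closer, word-level review (letter similarity,
            # synonym substitution, dialect normalization, and so on).
            counts["changed line (needs review)"] += max(i2 - i1, j2 - j1)
    return counts

# Invented example: the transcriber skipped a line and doubled another.
original = ["to be or not to be", "that is the question", "whether tis nobler"]
transcript = ["to be or not to be", "whether tis nobler", "whether tis nobler"]
print(classify_line_diffs(original, transcript))
# Counter({'possible eye-skip / omission': 1, 'possible dittography': 1})
```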
- Ben & Sara