• Skip to main content
  • Skip to secondary menu
  • Skip to primary sidebar

FromThePage Blog

Crowdsourcing, transcription and indexing for libraries and archives

  • Home
  • Interviews
  • crowdsourcing
  • how-to
  • Back to FromThePage
  • Collections

Feature: Data Maintenance Tools

April 26, 2008 By Ben Brumfield

With only two collections of documents, fewer than a hundred transcriptions, and only a half-dozen users who could be charitably described as "active", FromThePage is starting to strain under the weight of its data.

All of this has to do with subjects. These are the indexable words that provide navigation, analysis, and context to readers. They're working out pretty well, but frequency of use has highlighted some urgent features to be developed and intolerable bugs to be fixed:

  • We need a tool to combine subjects. Early in the transcription process, it was unclear to me whether "the Island" referred to Long Island, Virginia -- a nearby town with a post office and railroad station -- or someplace else. Increasing familiarity with the texts, however, has shown "the Island" to be definitely the same as "Long Island".

    The best interface for doing this sort of deduping is implemented by LibraryThing, which is so convenient that it has inspired the ad-hoc creation of a group of "combining" enthusiasts -- an astonishing development, since this is usually the worst of dull chores. A similar interface would invite the user viewing "the Island" to consider combining that subject with "Long Island". This requires an algorithm to suggest matches for combination, which is itself no trivial task.

  • We need another tool to review and delete orphans. As identification improves, we've been creating new subjects "Reese Smith" and linking previous references to "Reese" to that improved subject. This leaves the old, incomplete subject without any referents, but also without any way to prune it.

Autolink has become utterly essential to transcriptions, since it allows the scribe to identify the appropriate subject as well as the syntax to use for its link. Unfortunately, it has a few serious problems:

  • Autolink looks for substrings within the transcription without paying attention to word boundaries. This led to some odd autolink suggestions before this week, but since the addition of "ink" and "hen" to the subjects link text, autolink has started behaving egregiously. The word "then" is unhelpfully expanded to "t[[chickens|hen]]". I'll try to retain matches for display text regardless of inflectional markings, but it's time to move the matching algorithm to a more conservative heuristic.
  • Autolink also suggests matches from subjects that reside within different collections. It's wholly unhelpful for autolink to suggest a tenant farmer in rural Virginia within a context of UK soldiers in a WWI prison camp. This is a classical cross-talk bug, and needs fixing.

Filed Under: subject links Tagged With: features

Primary Sidebar

What’s Trending on The FromThePage Blog

  • Archives as an Antidote for ChatGPT
  • An Interview with Michael Lapides of the New Bedford…
  • How Do I Read Old Handwriting?
  • An Interview with Dr. Camille Westmont of Sewanee:…
  • Learn to Decipher Old Handwriting with Online and…
  • Spreadsheet Transcription in FromThePage

Recent Client Interviews

An Interview with Michael Lapides of the New Bedford Whaling Museum

An Interview with NC State University Libraries

An Interview with Richard Gilreath of the Texas State Library and Archives Commission

An Interview with Julanne Neal of the Queensland State Archives

An Interview with Andrea Meyer of East Hampton Public Library

Read More

artificial intelligence crowdsourcing features fromthepage projects handwriting history iiif indexing Indianapolis Indianapolis Children's Museum interview Jennifer Noffze machine learning metadata ocr paleography podcast Ryan White spreadsheet transcription transcription transcription software
Privacy Policy | Terms & Conditions | About Us | Contact Us

Copyright © 2023 · FromThePage.com