FromThePage Blog

Crowdsourcing, transcription and indexing for libraries and archives

Subject Spotter: Our NEH Grant for LLM Aided Entity Recognition and Identification

September 13, 2024 by Sara Brumfield

We’re excited to announce that with our collaborators from the Civil War and Reconstruction Governors of Mississippi (CWRGM) Project, Lindsey R. Peterson of the University of South Dakota and Elizabeth La Beaud from the Mississippi Digital Library, we’ve received a National Endowment for the Humanities Digital Humanities Advancement Grant for “Subject Spotter: Automation & Subject Tagging Historical Texts.”

Why now? 

We’ve been exploring and experimenting with LLMs – publicly, often in this newsletter – for about 18 months. We’ve learned that prompt writing is a skill that can be as tricky as programming. The grant carves out time for us to really experiment and to practice writing the right prompts for this kind of work.

We’re also excited that OpenAI has added structured output (JSON that follows a schema you supply) to its API responses. That’s going to make it a lot easier for programmers to get clear and usable results from prompts, and we’re hoping other LLM providers will follow suit.
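To make that concrete, here is a rough sketch of what a schema for entity matches might look like and how a reply could be parsed. It is an illustration only, not the project's code: the schema name `entity_matches`, the field names `verbatim` and `entity_id`, and the sample reply are all made up. The `response_format` hash follows the request shape for OpenAI's structured outputs as we understand it at the time of writing, so check the current API documentation before relying on it.

```ruby
require "json"

# Hypothetical schema: one page's worth of entity matches.
ENTITY_MATCH_SCHEMA = {
  type: "object",
  properties: {
    matches: {
      type: "array",
      items: {
        type: "object",
        properties: {
          verbatim:  { type: "string" }, # text exactly as it appears on the page
          entity_id: { type: "string" }  # identifier from a known-entity list
        },
        required: %w[verbatim entity_id],
        additionalProperties: false
      }
    }
  },
  required: %w[matches],
  additionalProperties: false
}.freeze

# Value for the request's response_format parameter (assumed shape, see above).
def response_format
  {
    type: "json_schema",
    json_schema: { name: "entity_matches", strict: true, schema: ENTITY_MATCH_SCHEMA }
  }
end

# Turn the model's JSON reply into plain Ruby hashes.
# fetch raises KeyError if the reply is missing a field we expect.
def parse_matches(raw_json)
  JSON.parse(raw_json).fetch("matches").map do |m|
    { verbatim: m.fetch("verbatim"), entity_id: m.fetch("entity_id") }
  end
end

# Made-up reply, standing in for a real API response.
sample_reply = '{"matches":[{"verbatim":"Gov. Smith","entity_id":"person-001"}]}'
puts parse_matches(sample_reply).inspect
```

The appeal is that the parsing step shrinks to a few lines: there is no need to pick JSON out of free-form prose or guess at field names.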

Why CWRGM?

Our application was strong, in large part, because of the five years the CWRGM team has put into marking and identifying entities by hand. That means we have 12,000 documents to use to fine-tune models and to evaluate the results of LLM identification. It also means we can constrain the entities we ask the LLM to match to ones we already know about. My gut says that a constrained, known world of entities should lead to higher-quality work by the LLM (just like humans!).
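Both ideas, constraining matches to known entities and scoring the LLM against hand-tagged pages, can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the evaluation method the project has settled on: the method names `constrain` and `score`, the hash keys, and the sample data are hypothetical, and precision and recall are just one reasonable way to compare the two sets.

```ruby
require "set"

# Drop any suggestion whose entity_id is not in the known-entity list.
def constrain(suggestions, known_ids)
  suggestions.select { |s| known_ids.include?(s[:entity_id]) }
end

# Compare LLM suggestions with hand-tagged matches for the same page.
# A suggestion counts as correct only if both the verbatim text and
# the entity_id agree with a hand-tagged match.
def score(suggestions, hand_tagged)
  predicted = suggestions.map { |s| [s[:verbatim], s[:entity_id]] }.to_set
  expected  = hand_tagged.map { |s| [s[:verbatim], s[:entity_id]] }.to_set
  correct   = (predicted & expected).size
  {
    precision: predicted.empty? ? 0.0 : correct.to_f / predicted.size,
    recall:    expected.empty?  ? 0.0 : correct.to_f / expected.size
  }
end

# Made-up example data.
known_ids   = Set["person-001", "person-002"]
suggestions = [
  { verbatim: "Gov. Smith", entity_id: "person-001" },
  { verbatim: "Jackson",    entity_id: "place-999" } # not a known entity
]
hand_tagged = [
  { verbatim: "Gov. Smith", entity_id: "person-001" },
  { verbatim: "Mrs. Jones", entity_id: "person-002" }
]

kept = constrain(suggestions, known_ids)
puts score(kept, hand_tagged).inspect
# prints {:precision=>1.0, :recall=>0.5} (newer Rubies show it as {precision: 1.0, recall: 0.5})
```

In this toy case the filter removes the one suggestion that points outside the known list, so precision is perfect, while the hand-tagged "Mrs. Jones" that the model missed pulls recall down to one half.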

The other advantage of an active digital documentary edition project for this work is that they have a staff of professional historians who will take their existing workflows and adapt them to an “AI Assist” model of entity recognition and identification. The software will mark entities and suggest matches; graduate research assistants (GRAs) and editors will approve or reject those matches (hopefully in a faster and more fun way). Their active collaboration means we’ll build better software – not just in “fitness for purpose” (the matching!), but also in how easy it is to review the matches.

What exactly are we building?

Two things: 

  1. A software library (a Ruby gem) that will take a page of text and a list of entities, send them to an LLM, and return the page with marked and matched entities. 
  2. An interface in FromThePage that takes that output and presents it in a user-friendly way for humans to approve, reject, or choose from a list of potential entities.
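For readers who think in code, the first item might be sketched roughly like this. Everything here is an assumption made for illustration, not the gem's actual design: the class name `SubjectSpotterSketch`, the `Entity` struct, the prompt wording, and the JSON reply format are all hypothetical. The LLM is passed in as any object that responds to `call(prompt)`, which lets a stub stand in for a real service, and the output uses wiki-link style `[[Canonical Name|verbatim text]]` markup, which we assume here for marked entities.

```ruby
require "json"

# Hypothetical known entity: an identifier plus a canonical name.
Entity = Struct.new(:id, :name)

class SubjectSpotterSketch
  # llm: any object responding to call(prompt) and returning a JSON string
  # of the form {"matches":[{"verbatim":"...","entity_id":"..."}]}.
  def initialize(llm)
    @llm = llm
  end

  # Returns the page text with matched entities wrapped in wiki-link markup.
  def tag_page(page_text, entities)
    by_id   = entities.to_h { |e| [e.id, e] }
    reply   = @llm.call(build_prompt(page_text, entities))
    matches = JSON.parse(reply).fetch("matches")

    matches.reduce(page_text) do |text, m|
      entity   = by_id[m["entity_id"]]
      verbatim = m["verbatim"]
      # Skip matches to unknown entities or to text that is not on the page.
      next text unless entity && text.include?(verbatim)
      # Naive: marks only the first occurrence and does not guard against
      # matching inside markup added by an earlier substitution.
      text.sub(verbatim) { "[[#{entity.name}|#{verbatim}]]" }
    end
  end

  private

  def build_prompt(page_text, entities)
    list = entities.map { |e| "#{e.id}: #{e.name}" }.join("\n")
    "Match names in the text to these known entities.\n#{list}\n\nText:\n#{page_text}"
  end
end

# A stub in place of a real LLM call, returning a made-up reply.
stub_llm = lambda do |_prompt|
  '{"matches":[{"verbatim":"Gov. Smith","entity_id":"person-001"},' \
  '{"verbatim":"Jackson","entity_id":"place-001"}]}'
end

entities = [
  Entity.new("person-001", "John Smith"),
  Entity.new("place-001", "Jackson, Mississippi")
]

spotter = SubjectSpotterSketch.new(stub_llm)
puts spotter.tag_page("Gov. Smith wrote to Jackson.", entities)
# prints: [[John Smith|Gov. Smith]] wrote to [[Jackson, Mississippi|Jackson]].
```

Keeping the LLM behind a single `call` method is one simple way to swap services in and out. The second item, the review interface, would then work from the list of matches rather than from the final marked-up text, so that a person can approve or reject each suggestion.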

The library will interface with OpenAI as well as other LLM providers. LLMs are a moving target, so we’re hoping the framework we build in the library will be extensible as new models and services emerge.

I’ve just focused on entity identification in this newsletter, but we’ll also be working on document-level subject heading matching (and evaluating it with the Mississippi Digital Library). It uses the same strategies and software, just with different “entities” (subject headings in this case) and entire documents instead of single pages.

If you’re interested in following our technical work, the best way is to follow Ben on Mastodon, where we’ll discuss designs and progress as we work. We’ll also mention occasional updates and webinars in this newsletter.
