Subject Spotter: Our NEH Grant for LLM-Aided Entity Recognition and Identification

We’re excited to announce that with our collaborators from the Civil War and Reconstruction Governors of Mississippi (CWRGM) Project, Lindsey R. Peterson of the University of South Dakota and Elizabeth La Beaud from the Mississippi Digital Library, we’ve received a National Endowment for the Humanities Digital Humanities Advancement Grant for “Subject Spotter: Automation & Subject Tagging Historical Texts.”

Why now?

We’ve been exploring and experimenting with LLMs – publicly, often in this newsletter – for about 18 months. We’ve learned that prompt writing is a skill that can be as tricky as programming. The grant carves out time for us to experiment with and refine prompts for this kind of work.

We’re also excited that OpenAI has introduced structured data output (JSON) for API queries. That’s going to make it a lot easier for programmers to get clear and usable results from prompts, and we’re hoping other LLM providers will follow suit.

Why CWRGM?

Our application was strong, in large part, because of the five years the CWRGM team has put into marking and identifying entities by hand. That means we have 12,000 documents to use to fine-tune models and to evaluate the results of LLM identification. It also means we can constrain the entities we ask the LLM to match to ones we already know about. My gut says that a constrained, known world of entities should lead to higher-quality work by the LLM (just like humans!).
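To make the evaluation idea concrete, here is a minimal sketch of scoring LLM matches against hand-tagged data. It assumes each match can be reduced to a (mention text, entity id) pair and uses precision and recall as the measures; the method name, the pair format, and the choice of metrics are all assumptions for illustration, not the project's settled evaluation plan.

```ruby
require "set"

# Hypothetical helper: compare predicted [mention, entity_id] pairs against
# the pairs the editors tagged by hand for the same document.
def precision_recall(predicted, gold)
  predicted_set = predicted.to_set
  gold_set = gold.to_set
  correct = (predicted_set & gold_set).size

  # Precision: share of the LLM's matches that the editors agree with.
  precision = predicted_set.empty? ? 0.0 : correct.to_f / predicted_set.size
  # Recall: share of the hand-tagged matches that the LLM found.
  recall = gold_set.empty? ? 0.0 : correct.to_f / gold_set.size

  { precision: precision, recall: recall }
end
```

A pair counts as correct only when both the mention and the chosen entity agree with the hand tagging, so a correctly spotted name matched to the wrong person still counts against the model.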

The other advantage of an active digital documentary edition project for this work is that they have a staff of professional historians who will take their existing workflows and adapt them to an “AI Assist” model of entity recognition and identification. The software will mark entities and suggest matches; GRAs and editors will approve or reject those matches (hopefully in a faster and more fun way). Their active collaboration means we’ll build better software – not just for “fitness for purpose” (the matching!), but for the ease of use in reviewing the matches.

What exactly are we building?

Two things:

A software library (a Ruby gem) that will take a page of text and a list of entities, send them to an LLM, and return pages with marked and matched entities.
An interface in FromThePage that takes that output and presents it in a user-friendly way for humans to approve, reject, or choose from a list of potential entities.
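As a rough sketch of the library half of that plan, the Ruby below builds a prompt from a page and a list of known entities, parses a JSON reply, and marks the matched mentions. Everything here is hypothetical: the Entity struct, the method names, the JSON shape, and the [[id|text]] markup are placeholders for illustration, not the gem's actual design, and the call to the LLM itself is left out.

```ruby
require "json"

# Hypothetical record for a known entity (not an actual FromThePage class).
Entity = Struct.new(:id, :name)

# Build a prompt that lists the known entities and then the page text.
def build_prompt(page_text, entities)
  entity_lines = entities.map { |e| "#{e.id}: #{e.name}" }.join("\n")
  <<~PROMPT
    Find mentions of the entities below in the page text.
    Reply with JSON shaped like {"matches": [{"text": "...", "entity_id": "..."}]}.

    Entities:
    #{entity_lines}

    Page:
    #{page_text}
  PROMPT
end

# Parse the model's JSON reply, keeping only matches that point at a known
# entity and whose mention text actually appears on the page.
def parse_matches(raw_json, page_text, entities)
  known_ids = entities.map(&:id)
  JSON.parse(raw_json).fetch("matches", []).select do |m|
    text = m["text"].to_s
    !text.empty? && known_ids.include?(m["entity_id"]) && page_text.include?(text)
  end
end

# Wrap the first occurrence of each matched mention in placeholder markup.
def mark_entities(page_text, matches)
  matches.reduce(page_text) do |text, m|
    text.sub(m["text"]) { "[[#{m['entity_id']}|#{m['text']}]]" }
  end
end
```

Filtering the reply against the known entity ids is one way to keep the model inside the constrained list; a fuller version would also need to handle repeated mentions and overlapping matches, and would pass the surviving matches to reviewers rather than accepting them outright.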

The library will interface with OpenAI, but also with other LLMs. LLMs are a moving target, so we’re hoping the framework we build in the library will be extensible as new models and services emerge.
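One common way to get that kind of extensibility is an adapter layer: each provider sits behind the same small interface, and new providers are registered by name. The sketch below assumes that approach; the class names and the complete method are invented for illustration, and the canned adapter stands in for a real provider so the example runs without any network calls.

```ruby
# Hypothetical base class: every provider adapter answers #complete(prompt)
# with the raw text of the model's reply.
class LLMAdapter
  def complete(prompt)
    raise NotImplementedError, "#{self.class} must implement #complete"
  end
end

# Stand-in adapter that returns a fixed reply, useful for tests and demos.
class CannedAdapter < LLMAdapter
  def initialize(reply)
    @reply = reply
  end

  def complete(_prompt)
    @reply
  end
end

# Registry mapping provider names to adapter classes, so supporting a new
# model or service means registering one more class.
class AdapterRegistry
  def initialize
    @adapters = {}
  end

  def register(name, klass)
    @adapters[name.to_sym] = klass
  end

  def build(name, *args)
    klass = @adapters.fetch(name.to_sym) do
      raise ArgumentError, "unknown provider: #{name}"
    end
    klass.new(*args)
  end
end

registry = AdapterRegistry.new
registry.register(:canned, CannedAdapter)
adapter = registry.build(:canned, '{"matches": []}')
puts adapter.complete("any prompt")
```

The rest of the library would only ever call complete, so swapping OpenAI for another service would not touch the prompt-building or match-parsing code.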

I’ve focused on entity identification in this newsletter, but we’ll also be working on document-level subject heading matching (and evaluating it with the Mississippi Digital Library). It uses the same strategies and software, just with different “entities” (subject headings in this case) matched against entire documents rather than single pages.

If you’re interested in following our technical work, the best way is to follow Ben on Mastodon. We’ll discuss designs and progress there as we work. We’ll also mention occasional updates and webinars in this newsletter.