Ben attended a recent presentation by Will Hicks from the University of North Texas Libraries/Portal to Texas History at the Texas Conference on Digital Libraries. Will discussed the practical applications of AI in transcribing audio recordings, enabling improved accessibility and search capabilities for A/V materials. He demonstrated videos with the sound removed to emphasize the importance of audio in A/V materials, highlighting their limited utility without the ability to hear or read the accompanying content.
The Portal to Texas History houses an impressive collection of 116,664 A/V objects, totaling over 500 hours of content. Captioning and subtitles emerged as crucial elements for comprehensive access to A/V materials. Effective captions include speaker identification, while subtitles often lack this information and face synchronization issues with the video.
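To make the captioning point concrete, here is a minimal sketch of how timed text with speaker identification can be written as WebVTT, a common caption format for web video. The function names and the `(start, end, speaker, text)` tuple layout are assumptions made for illustration; they are not taken from the presentation.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues) -> str:
    """Build a WebVTT document from (start, end, speaker, text) tuples.

    The tuple layout is a hypothetical convention for this sketch.
    """
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        # The cue timing line is what keeps the text in sync with the video.
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        # <v Name> is WebVTT's voice tag for identifying who is speaking.
        lines.append(f"<v {speaker}>{text}" if speaker else text)
        lines.append("")
    return "\n".join(lines)
```

Calling `to_webvtt([(0.0, 2.5, "Interviewer", "Hello")])` yields a single cue with a timing line and a speaker-tagged text line.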
After experimenting with various vendor-driven approaches, the UNT Libraries turned to Whisper AI, developed by OpenAI. Whisper AI can be downloaded from GitHub and run on your own hardware, or used as an API service. The tool comes with pre-trained models that cover many languages, including English and Spanish. Users select the model they want; the larger models are more accurate but require more robust computing resources.
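For readers curious what running it locally looks like, here is a minimal sketch using the open-source `openai-whisper` Python package. It assumes the package and `ffmpeg` are installed. The helper names `pick_model` and `transcribe` and the file name in the usage note are hypothetical, and this is not necessarily how the UNT team ran it.

```python
def pick_model(size: str = "base", english_only: bool = False) -> str:
    """Return a Whisper model name for the requested size.

    English-only checkpoints carry a ".en" suffix and are published for the
    tiny, base, small, and medium sizes; the large models are multilingual only.
    """
    if english_only and size in {"tiny", "base", "small", "medium"}:
        return f"{size}.en"
    return size


def transcribe(path: str, size: str = "base", language=None,
               english_only: bool = False) -> str:
    """Transcribe one audio or video file and return the plain text."""
    # Imported here so pick_model can be used without the package installed.
    import whisper  # pip install openai-whisper (also needs ffmpeg on the PATH)

    model = whisper.load_model(pick_model(size, english_only))
    # language=None lets Whisper detect the spoken language on its own.
    result = model.transcribe(path, language=language)
    return result["text"]
```

A call such as `transcribe("interview.mp3", size="small", language="es")` would load the small multilingual model and transcribe a Spanish recording; larger sizes trade speed and memory for accuracy.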
Whisper AI was trained on 680,000 hours of audio data and shows a typical bias toward widely spoken languages. Accuracy varied across languages, with Spanish, Italian, English, Portuguese, and German performing best, while Armenian, Swahili, and Maori exhibited lower accuracy levels.
Whisper AI excelled in handling names, homophones, and currencies. However, it occasionally produced hallucinated text during extended periods of silence, requiring post-processing by the UNT team to address these inaccuracies. Another interesting observation was that long stretches of instrumental music were consistently labeled as "Pomp and Circumstance," likely due to the prominence of graduation ceremonies in the training data.
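One plausible form of that post-processing, sketched here as an assumption rather than as the UNT team's actual method, is to drop segments that Whisper itself flags as probably silent. Each segment in the open-source package's output includes `no_speech_prob` and `avg_logprob` values. The function name and the threshold values below are illustrative.

```python
def drop_probable_silence(segments, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Return only the segments that are unlikely to be hallucinated over silence.

    A segment is dropped when Whisper rates it as probably non-speech AND
    its average token log-probability is low. Thresholds are illustrative.
    """
    kept = []
    for seg in segments:
        looks_silent = seg["no_speech_prob"] > max_no_speech_prob
        low_confidence = seg["avg_logprob"] < min_avg_logprob
        if looks_silent and low_confidence:
            continue  # likely text invented during a quiet stretch
        kept.append(seg)
    return kept
```

Requiring both signals is a deliberately conservative choice: a confidently transcribed segment is kept even if the silence detector is unsure, so real speech is less likely to be discarded. A human review pass would still be sensible for anything published as a caption.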
Whisper AI has some limitations. Translation works in one direction only, into English. Bilingual transcription was inconsistent, with Whisper occasionally losing track of whether it was transcribing or translating. Swear words were handled inconsistently: sometimes modified, sometimes transcribed verbatim. Transcribing vocal music posed challenges and required creative interpretation.
Will shared a case in which they paid a vendor $222 to transcribe a 51-minute Spanish-language video; translation would have cost between $500 and $1,200. The Whisper version, which was free since they ran it themselves, provided a reasonable transcription.
Currently, there is no user-friendly interface for Whisper AI, and all interaction happens through code or the command line. Even with these limitations, we’re excited about the possibilities that Whisper AI and other tools offer in enhancing accessibility and search capabilities for your A/V collections.
Are you experimenting with AI for archival resources, audio or otherwise? We’d love to hear about it.