Experimenting with Bible Audio Captions

Try the Bible Audio Captions Tool →

What is This Tool?

The Bible Audio Captions tool is an experiment in creating synchronized, word-level captions for Bible audio recordings. Think of it like karaoke-style highlighting, where each word lights up as it’s spoken in the audio.

I used the Berean Standard Bible (BSB) text from berean.bible, which is in the public domain, along with audio recordings narrated by Barry Hays, available at audiobible.org.

The Technical Challenge

The goal was to automatically generate precise timestamps for every word in the Bible audio. This is called “forced alignment” - matching known text to audio recordings to determine exactly when each word is spoken.
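To make the target output concrete, here is a minimal sketch of what word-level timestamps could look like and how a player might find the word to highlight at a given playback time. The `WordTiming` structure, the `active_word_index` helper, and the timing values are all illustrative assumptions, not the tool's actual data format.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def active_word_index(timings, t):
    """Return the index of the word being spoken at time t, or None."""
    starts = [w.start for w in timings]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < timings[i].end:
        return i
    return None  # before the first word, or in a pause between words


# Made-up timings for the opening words of Genesis 1:1 (illustration only).
timings = [
    WordTiming("In", 0.00, 0.18),
    WordTiming("the", 0.18, 0.30),
    WordTiming("beginning", 0.30, 0.95),
    WordTiming("God", 1.10, 1.45),
]
```

A player would call `active_word_index` with the current playback position on every tick and highlight the returned word. A result of `None` means no word is being spoken at that moment.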

I experimented with two different technologies to accomplish this, using Python scripts to automate the processing of multiple Bible chapters.

Technology 1: Whisper.cpp

What is Whisper.cpp?
Whisper is an automatic speech recognition (ASR) system from OpenAI. Whisper.cpp is a C++ implementation of this model.

How I used it:
Whisper transcribes audio by listening to it and converting speech to text. I then attempted to align Whisper’s transcription with the actual Bible verses.
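One way this alignment step could be done is with Python's standard `difflib` module: match the recognizer's words against the script and copy timestamps only where the two agree. This is a sketch under assumptions, not necessarily the approach I used. It assumes the recognizer output has already been parsed into `(word, start, end)` tuples, and the names `normalize` and `transfer_timestamps` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and strip punctuation so 'Havilah,' compares as 'havilah'."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def transfer_timestamps(script_words, asr_words):
    """Copy recognizer timestamps onto the script words where they match.

    asr_words is a list of (word, start, end) tuples. Script words with
    no matching recognizer word keep None for both times.
    """
    a = [normalize(w) for w in script_words]
    b = [normalize(w) for w, _, _ in asr_words]
    result = [(w, None, None) for w in script_words]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = asr_words[block.b + k]
            result[block.a + k] = (script_words[block.a + k], start, end)
    return result


# Made-up recognizer output that mishears an uncommon name.
script = ["the", "land", "of", "Havilah,"]
asr = [("the", 0.0, 0.2), ("land", 0.2, 0.6), ("of", 0.6, 0.7),
       ("have", 0.7, 0.9), ("a", 0.9, 1.0), ("law", 1.0, 1.3)]
aligned = transfer_timestamps(script, asr)
```

In this example the first three words receive timestamps, while "Havilah," is left with `None` because nothing in the recognizer output matches it. Gaps like this are what make the approach unreliable when the transcription diverges from the script.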

Challenges encountered:
The main problem with Whisper for this use case was hallucination. Whisper would sometimes generate text that didn’t match what was actually spoken in the audio. In some cases, it would hallucinate entire sentences that weren’t in the recording at all. When reading the Bible, precision matters - I need the exact words from the BSB text.

I tried using Whisper’s prompt feature to instruct it to adhere to the Biblical script, but this approach didn’t solve the alignment issues. The mismatch between Whisper’s output and the actual Bible text made reliable word-level timestamps impossible for many chapters.

Technology 2: Montreal Forced Aligner (MFA)

What is MFA?
Montreal Forced Aligner is a specialized tool designed specifically for forced alignment. Unlike Whisper, which transcribes audio, MFA takes known text and audio, then finds where each word occurs in the recording.

How I used it:
I provided MFA with the exact Bible verses and the corresponding audio files. MFA then generated precise timestamps for each word by matching the text to the audio.
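The preparation step can be sketched roughly as follows. MFA reads a corpus directory in which each audio file has a transcript file with the same base name, and it is run with `mfa align` followed by the corpus, dictionary, acoustic model, and output directory. The helper names, the file name, and the `english_us_arpa` model names below are assumptions for illustration. The command is only built here, not executed.

```python
from pathlib import Path


def chapter_text(verses):
    """Join verses into one line, collapsing runs of whitespace."""
    return " ".join(" ".join(v.split()) for v in verses)


def write_transcript(corpus_dir, audio_name, verses):
    """Write the chapter text next to its audio file as a .lab transcript."""
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    lab_path = corpus / (Path(audio_name).stem + ".lab")
    lab_path.write_text(chapter_text(verses) + "\n", encoding="utf-8")
    return lab_path


def build_align_command(corpus_dir, output_dir,
                        dictionary="english_us_arpa",      # assumed model name
                        acoustic_model="english_us_arpa"):  # assumed model name
    """Build (but do not run) an `mfa align` command line."""
    return ["mfa", "align", str(corpus_dir), dictionary,
            acoustic_model, str(output_dir)]
```

For a hypothetical `genesis_002.wav`, `write_transcript("corpus", "genesis_002.wav", verses)` would create `corpus/genesis_002.lab`, and the list from `build_align_command("corpus", "aligned")` could be passed to `subprocess.run` once MFA and the chosen models are installed.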

Advantages:
MFA had a significant advantage: it always used the exact words from my script. For example, in Genesis 2, uncommon names like “Havilah” and “Hiddekel” were spelled correctly because MFA adhered strictly to the provided text, whereas Whisper.cpp struggled with uncommon names.

Challenges encountered:
Despite better overall accuracy, MFA had a critical weakness: when it made a mistake on one word, everything after that word would be out of sync.

For example, in Genesis 10 (a chapter full of names that the MFA model I used was not familiar with), MFA aligned everything correctly up to and including verse 21. At verse 22, however, it made an error, and every subsequent word was misaligned. This cascading effect meant that a single mistake could ruin the timestamps for the rest of the chapter.
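A possible way to catch this kind of failure automatically is to scan the output for runs of words with implausible durations, on the assumption that a derailed alignment tends to squeeze or stretch the words that follow. This is a heuristic sketch only: I have not confirmed that drift always shows up this way, and the function name and thresholds are arbitrary assumptions.

```python
def first_suspicious_word(timings, min_dur=0.03, max_dur=2.0, run_length=3):
    """Return the index where a run of implausible durations starts, or None.

    timings is a list of (word, start, end) tuples in seconds. A word is
    implausible if it is shorter than min_dur or longer than max_dur.
    The thresholds are guesses and would need tuning on real output.
    """
    run = 0
    for i, (_, start, end) in enumerate(timings):
        duration = end - start
        if duration < min_dur or duration > max_dur:
            run += 1
            if run == run_length:
                return i - run_length + 1
        else:
            run = 0  # a plausible word breaks the run
    return None
```

Chapters where this returns an index could be flagged for manual review rather than published with timestamps that may be wrong from that point on.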

Results and Conclusion

After testing both tools on multiple Bible chapters, I got mixed results. Some chapters aligned perfectly, while others had significant synchronization issues. Alignment quality is not yet consistent enough for a production-ready tool.

Key takeaways:

Whisper.cpp: Good for cases where the script for an audio file is unknown and some errors are allowed. But when the script is known, it sometimes struggles to adhere to it. Hallucinations make it unreliable for my use case.
MFA: Excellent at following the script precisely but prone to cascading alignment errors that can’t easily be corrected.

For now, this project remains an experiment. The technology shows promise, but automatic forced alignment for entire Bible chapters needs more refinement before it can provide a consistently reliable transcript.

The current tool uses MFA-generated timestamps, which generally provide better accuracy than Whisper.cpp for this use case. However, users should be aware that text-audio synchronization may drift in some chapters due to the cascading alignment issues described above, causing words to be highlighted at the wrong time relative to the audio.