Using Whisper to transcribe audio

Whisper (the new speech-to-text model from OpenAI) seems like a good option for transcribing audio. Here I’m using it to transcribe the audio from the Japanese dub of For All Mankind (dub subtitles often don’t match the audio). There’s no sound or video of the show because of DRM, but you should get the idea. It’s useful for going back and checking a word you didn’t understand, for example. I’m using [this implementation](https://github.com/ggerganov/whisper.cpp) of Whisper.

https://imgur.com/a/567o5JP

3 comments
  1. Whisper has been out for quite a while at this point, so unless they’ve made some sort of stealth update to their models, their largest model is still only about 85% accurate for Japanese. If your listening isn’t good enough to catch the other 15% without subtitles, you probably shouldn’t be using this. If your listening is good enough to catch that 15%, you should probably just be watching without subtitles. This is honestly pretty niche. I used it myself for a couple of shows and then never used it again because it just wasn’t worth the extra effort.

  2. I’ve been doing this for months and it’s awesome for generating subtitles for things that don’t have any! The only difference is that I use Whisper + Whisper-WebUI to generate a .srt file (I don’t own the necessary hardware to transcribe it live), then load it into a player like Memento, which lets you hover over words and easily create Anki cards. The only thing that sucks is that it’s kinda inaccurate with the sub timings (and sometimes gets the words wrong), but it’s vastly better than YouTube auto-generated subs.

  3. I did a few tests with this thing when it first came out and had really mixed results. At least at the time, any sort of speech at native speed with more than one speaker talking “naturally” (e.g. sometimes talking over each other, interrupting, bouncing off what the other person just said) seemed to really confuse it, and it would just start outputting wildly incorrect things after a while.

    It might be better for professionally recorded audio, but I’d still only use it to generate the “rough draft” subtitle to go over manually and fix transcription issues rather than something to learn off of.
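For anyone who wants to try the .srt workflow from the comments above, here is a minimal Python sketch of writing transcribed segments out as a .srt file, with an optional offset for when the timings are consistently early or late. It assumes you already have segments with start/end times in seconds and text; the `Segment` class and the `to_srt` and `srt_timestamp` helpers are hypothetical names made up for this example, not part of Whisper, whisper.cpp or Whisper-WebUI.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed chunk (hypothetical structure): times in seconds plus text."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    total_ms = max(0, round(seconds * 1000))  # clamp so a negative offset never goes below zero
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments, offset: float = 0.0) -> str:
    """Build .srt text from segments, shifting every cue by `offset` seconds."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg.start + offset)
        end = srt_timestamp(seg.end + offset)
        blocks.append(f"{index}\n{start} --> {end}\n{seg.text.strip()}\n")
    # Each block ends with a newline, so joining on "\n" leaves one blank line between cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    draft = [Segment(0.0, 1.5, "こんにちは"), Segment(2.0, 3.0, "はい")]
    # Shift everything 0.5 s later and save as UTF-8 so Japanese text survives.
    with open("draft.srt", "w", encoding="utf-8") as f:
        f.write(to_srt(draft, offset=0.5))
```

A single constant offset only helps when the subs drift by the same amount throughout. Cues that are off by varying amounts still need to be fixed by hand, which fits the “rough draft” approach described above.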
