I wanted to watch a video about robots with my 5-year-old son. One problem: the video was in English, which he doesn’t understand (yet). I could have translated it on the fly as we watched, but since it was evening and he was already in bed, I had time to devise a technical solution.
How hard can it be to automate this? This thought wouldn’t have come up if it weren’t for the current AI boom. Seemingly every day, a new AI-powered tool emerges. These tools make tasks that were once considered difficult suddenly feel feasible. I would just have to tie these tools together to achieve what I wanted. And even the tying together I wouldn’t have to do: GPT-4 could do that for me.
I used whisper.cpp to turn the video’s audio into a JSON file with all the speech text and timing info. The tinydiarize model (ggml-small.en-tdrz.bin) gave me better results than the default models. Each segment in its output is a full speaker turn, whereas the default models output more granular sentence fragments. The latter might be better for generating subtitles, but having whole speaker turns was convenient for the next step.
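To give an idea of what working with that output looks like, here is a minimal sketch of parsing it into speaker turns. The sample JSON below is illustrative: it follows the general shape whisper.cpp’s JSON output uses (a `transcription` array with millisecond offsets and text), but exact field names can vary between versions.

```python
import json

# Illustrative sample in the shape of whisper.cpp's JSON output:
# a "transcription" array whose entries carry millisecond offsets
# and the recognized text. Field names may differ across versions.
sample = json.loads("""
{
  "transcription": [
    {"offsets": {"from": 0, "to": 4200},
     "text": " Welcome to the robot factory."},
    {"offsets": {"from": 4200, "to": 9100},
     "text": " Here, the arms are assembled first."}
  ]
}
""")

def speaker_turns(doc):
    """Yield (start_seconds, end_seconds, text) for each speaker turn."""
    for seg in doc["transcription"]:
        start = seg["offsets"]["from"] / 1000.0
        end = seg["offsets"]["to"] / 1000.0
        yield start, end, seg["text"].strip()

for start, end, text in speaker_turns(sample):
    print(f"[{start:6.2f} -> {end:6.2f}] {text}")
```

Because each entry is already a whole speaker turn, each one maps cleanly onto a single translation request in the next step.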
I used GPT-4 to translate each speaker turn from English to Dutch. Each speaker turn was sent as a separate request to the OpenAI Chat Completions API. I instructed GPT-4 to keep each translation roughly the same length as the original so it would fit the video’s timing, though I’m not convinced this made a significant difference. A more sophisticated approach would be to translate the complete transcription in one go (or at least in big chunks), which would give GPT-4 more context. Another option would be to use a specialized translation API (such as DeepL), but then you couldn’t instruct it to return translations of similar length.
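The per-turn request could be built along these lines. This is a sketch, not my exact code: the system prompt wording is illustrative, and sending the payload (an authenticated POST, via the openai package or plain urllib) is left out to keep the example self-contained.

```python
import json

# Endpoint for the OpenAI Chat Completions API.
API_URL = "https://api.openai.com/v1/chat/completions"

def build_translation_request(turn_text):
    """Build a Chat Completions payload asking for a Dutch translation
    of one speaker turn, nudging the model to keep a similar length.
    The system prompt here is illustrative, not the original one."""
    return {
        "model": "gpt-4",
        "messages": [
            {
                "role": "system",
                "content": (
                    "You translate English to Dutch. Keep each translation "
                    "roughly the same length as the input so it fits the "
                    "original speech timing."
                ),
            },
            {"role": "user", "content": turn_text},
        ],
        "temperature": 0.3,
    }

payload = build_translation_request("Welcome to the robot factory.")
print(json.dumps(payload, indent=2))
```

One request per turn keeps things simple, but as noted above, batching several turns into one request would give the model more surrounding context at the cost of a slightly trickier prompt.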