How to transcribe long audios fast with open source (colab included) - Tiny struggles

2024-10-15 11:00:06

In this post I will give you code that you can run yourself in Colab (or on your own machine with a GPU) to transcribe long audio files very quickly, in many languages.

Let’s pick this one as an example - it’s a massive, almost 3-hour-long podcast by Huberman. It’s knowledge-packed!

Sure, there are services for it. One that looked promising to me is listen411: it charges 0.06 USD per minute of audio summarization plus 1 USD per file. Interesting!

A podcast like the one I was interested in would cost me about 13 USD, because it’s a very long one. I probably listen to 5 podcasts per week or more, so let’s say 20 per month. Ugh, that could be up to 260 USD a month if they were all this long. The costs add up fast.

If you go one level down, there are also APIs from cloud providers, which could be a good alternative. I checked the costs of transcription in Google and AWS: 0.024 USD or 0.016 USD per minute respectively. Not bad, but that’s still about 1 to 1.5 USD per hour of audio, so roughly 3 to 4 USD for a 3-hour podcast.
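To make the comparison concrete, here is a small sketch of the arithmetic using the prices quoted above. The 180-minute episode length is my assumption (the podcast is only described as "almost 3 hours"), and the function names are made up for illustration, so the totals will differ a bit from the rounded figures in the text.

```python
# Prices quoted in the text above, in USD.
SERVICE_PER_MINUTE = 0.06   # paid service: per minute of audio...
SERVICE_PER_FILE = 1.00     # ...plus a flat fee per file
CLOUD_PER_MINUTE = {"google": 0.024, "aws": 0.016}  # cloud transcription APIs

# Assumptions for illustration only.
EPISODE_MINUTES = 180       # "almost 3 hours", rounded up
EPISODES_PER_MONTH = 20     # 5 podcasts per week


def service_cost(minutes: float) -> float:
    """Cost of one file with the paid service (per-minute rate + flat fee)."""
    return minutes * SERVICE_PER_MINUTE + SERVICE_PER_FILE


def cloud_cost(minutes: float, provider: str) -> float:
    """Cost of one file with a cloud transcription API (per-minute rate only)."""
    return minutes * CLOUD_PER_MINUTE[provider]


per_episode = {"service": service_cost(EPISODE_MINUTES)}
for provider in CLOUD_PER_MINUTE:
    per_episode[provider] = cloud_cost(EPISODE_MINUTES, provider)

for name, cost in per_episode.items():
    monthly = cost * EPISODES_PER_MONTH
    print(f"{name}: {cost:.2f} USD per episode, {monthly:.2f} USD per month")
```

Changing `EPISODE_MINUTES` and `EPISODES_PER_MONTH` to match your own listening habits shows quickly whether a paid option is worth it for you.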

So I wondered, can I do better myself? Based on my knowledge of the current state of ML, audio transcription is pretty much a solved problem, and there are excellent models publicly available for free. With open source libraries and Colab (easily accessible GPUs) I could build a DIY solution that would be much cheaper.