How to Improve Speech-to-Text Accuracy

Submitted by Style Pass on 2024-04-29 22:30:04

Maintaining Speech-to-Text (STT) accuracy in custom domains is a challenge. Automatic Speech Recognition (ASR) engines that achieve a low Word Error Rate (WER) on general speech often struggle in specialized contexts such as medical transcription, sales coaching, and customer service. Bringing the WER back down takes engineering effort. The level of investment depends on the strategy and ranges from hours to months of R&D. Below is a pragmatic strategy (flow) that starts with the low-cost tasks.
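Since every step in this flow is judged by its effect on WER, it helps to measure WER the same way before and after each change. The following is a minimal sketch, assuming the common definition of WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name and the lowercase, whitespace-based normalization are illustrative assumptions, not part of any particular engine's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Assumption: text is normalized by lowercasing and splitting on whitespace.
    Real evaluations usually also normalize punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the patient has tacky cardia" against the reference "the patient has tachycardia" gives one substitution and one insertion over four reference words, so the function returns 0.5.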

Every Speech-to-Text engine understands a large but limited set of words known as the Lexicon. A word or phrase that is not in the Lexicon is called Out-of-Vocabulary (OOV). When the Lexicon doesn't contain an uttered word, the engine transcribes it as one or more similar-sounding words or phrases instead, which increases the WER.
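A quick way to find candidate OOVs is to compare domain text, such as reference transcripts or a product glossary, against the engine's Lexicon. The sketch below assumes the Lexicon is available as a plain set of lowercase words, which may not be true for a third-party engine. The function name and the simple regex tokenizer are hypothetical and only for illustration.

```python
import re


def find_oov_words(text: str, lexicon: set[str]) -> list[str]:
    """Return the unique words in `text` that are missing from `lexicon`.

    Assumption: `lexicon` contains lowercase words. Words are returned in
    order of first appearance so the most prominent gaps are seen first.
    """
    # Simplistic tokenizer: runs of letters and apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()
    oov_words = []
    for word in words:
        if word not in lexicon and word not in seen:
            seen.add(word)
            oov_words.append(word)
    return oov_words
```

With a hypothetical lexicon of {"the", "patient", "has"}, the text "The patient has tachycardia" yields ["tachycardia"], which is the kind of word worth adding to the engine first.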

If you have an in-house STT engine, add the OOVs to your Language Model (LM) pipeline. If you use a third-party vendor, ask whether they offer an API or other facility for adding OOVs. For example, developers can add custom words (OOVs) to the Picovoice Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text engines via Picovoice Console.