NVIDIA Speech and Translation AI Models Set Records for Speed and Accuracy

submitted by

Style Pass

2024-04-18 01:30:06

Speech and translation AI models developed at NVIDIA are pushing the boundaries of performance and innovation. The NVIDIA Parakeet automatic speech recognition (ASR) family of models and the NVIDIA Canary multilingual, multitask ASR and translation model currently top the Hugging Face Open ASR Leaderboard. In addition, a multilingual P-Flow-based text-to-speech (TTS) model won the LIMMITS ’24 challenge by synthesizing a speaker’s voice into seven languages using a short audio clip.

This post details how several of these best-in-the-world models are breaking new ground in speech and translation AI, from speech recognition to custom voice creation.

The NVIDIA Parakeet family of models includes Parakeet CTC 1.1B, Parakeet CTC 0.6B, Parakeet RNNT 1.1B, Parakeet RNNT 0.6B, and Parakeet-TDT 1.1B. These models provide robust English speech transcription, with options that let customers trade off accuracy and speed to suit their applications. The models come in two sizes: 0.6 billion and 1.1 billion parameters.

The effectiveness of the Parakeet CTC and RNNT models lies in end-to-end training that pairs the Fast Conformer (FC) encoder with either a recurrent neural network transducer (RNNT) decoder or a connectionist temporal classification (CTC) decoder. For more details, see Investigating End-to-End ASR Architectures for Long Form Audio Transcription and Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition.
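
To give a feel for what a CTC decoder does with the encoder's per-frame outputs, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the Parakeet or NeMo implementation: the function name, the toy character vocabulary, and the choice of index 0 as the blank symbol are all assumptions made for illustration, and production models typically decode subword tokens rather than single characters.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol in this toy vocabulary


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank_id: int = BLANK_ID,
) -> str:
    """Greedy CTC decoding: take the best token per frame,
    merge consecutive repeats, then drop blanks."""
    # Step 1: pick the highest-scoring token index for every audio frame.
    best_ids: List[int] = [
        max(range(len(row)), key=row.__getitem__) for row in frame_scores
    ]

    # Step 2: collapse runs of the same index and skip the blank symbol.
    tokens: List[str] = []
    prev = None
    for idx in best_ids:
        if idx != prev and idx != blank_id:
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)


# Hypothetical scores for six frames over the vocabulary ["_", "c", "a", "t"].
vocab = ["_", "c", "a", "t"]
frames = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.20, 0.70, 0.05, 0.05],  # c (repeat, merged)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.60, 0.20],  # a (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # t
]
print(ctc_greedy_decode(frames, vocab))  # prints "cat"
```

The blank symbol is what lets CTC emit a genuinely doubled letter: a frame sequence such as a, blank, a decodes to "aa", whereas a, a collapses to a single "a".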