HuBERT: Self-supervised representation learning for speech recognition, generation, and compression

The north star for many AI research programs has been to build systems that continuously learn to recognize and understand speech simply by listening and interacting with others, much as babies learn their first language. This requires not only analyzing the words someone speaks but also many other cues in how those words are delivered, e.g., speaker identity, emotion, hesitation, and interruptions. Furthermore, to understand a situation as completely as a person would, the AI system must distinguish and interpret sounds that overlap with the speech signal, e.g., laughter, coughing, lip smacking, passing vehicles, or birds chirping.

To open the door to modeling this kind of rich lexical and nonlexical information in audio, we are releasing HuBERT, our new approach to learning self-supervised speech representations. HuBERT matches or surpasses state-of-the-art approaches to speech representation learning across recognition, generation, and compression tasks.

To do this, HuBERT uses an offline k-means clustering step and learns the structure of spoken input by predicting the correct cluster for masked audio segments. The model progressively improves its learned discrete representations by alternating between clustering and prediction steps.
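
To make that loop concrete, here is a minimal sketch of HuBERT-style masked cluster prediction in PyTorch. It is an illustration, not the released implementation: the frame features, the tiny Transformer encoder, and every hyperparameter below (NUM_CLUSTERS, FEAT_DIM, the masking probability) are placeholder assumptions chosen for brevity.

```python
# A minimal, illustrative sketch of HuBERT-style masked cluster prediction
# in PyTorch. This is NOT the released implementation: the frame features,
# model size, and all hyperparameters below are placeholder assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_CLUSTERS = 100  # k for the offline k-means step (assumed value)
FEAT_DIM = 39       # e.g., MFCC-like frame features (assumed value)


def make_pseudo_labels(features):
    """Offline step: cluster acoustic frames into discrete pseudo-labels.

    features: (num_frames, FEAT_DIM) numpy array -> (num_frames,) cluster ids
    """
    return KMeans(n_clusters=NUM_CLUSTERS, n_init=10).fit(features).labels_


def random_mask(batch, time, p=0.08):
    # Mask each frame independently with probability p. (HuBERT masks
    # contiguous spans; independent frames keep this sketch short.)
    return torch.rand(batch, time) < p


class ToyHubert(nn.Module):
    """A tiny Transformer encoder that predicts each frame's cluster id."""

    def __init__(self):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.randn(FEAT_DIM))  # learned [MASK]
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=FEAT_DIM, nhead=3,
                                       batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(FEAT_DIM, NUM_CLUSTERS)

    def forward(self, x, mask):  # x: (B, T, FEAT_DIM), mask: (B, T) bool
        # Replace masked frames with the learned mask embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(x))  # (B, T, NUM_CLUSTERS) logits


def masked_prediction_loss(model, x, labels, mask):
    # Cross-entropy computed only over the masked frames, which forces the
    # model to infer hidden content from the surrounding context.
    logits = model(x, mask)
    return nn.functional.cross_entropy(logits[mask], labels[mask])


# Toy usage with random stand-in "audio" features.
feats = torch.randn(4, 50, FEAT_DIM)  # (batch, frames, feature dim)
labels = torch.from_numpy(
    make_pseudo_labels(feats.reshape(-1, FEAT_DIM).numpy())
).long().reshape(4, 50)
mask = random_mask(4, 50)
loss = masked_prediction_loss(ToyHubert(), feats, labels, mask)
loss.backward()
```

In the alternation described above, a subsequent round would re-run k-means on features extracted from an intermediate layer of the partially trained encoder to obtain better pseudo-labels, then repeat the masked-prediction step.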
