Stuttering, a speech disorder with strong genetic underpinnings that affects ~1% of the population, poses unique challenges for automatic speech recognition (ASR) systems.
For the roughly 80 million people who stutter (PWS) like me, interactions with early speech assistants like Alexa and Siri were often frustrating, punctuated by the all-too-familiar refrain: “I didn’t quite catch that, can you say that again?”
In college, I spent a couple of years building mobile apps for speech therapy, well before LLMs proliferated. And while issues affecting 1 in 100 users might feel like an edge case, they are a daily reality for PWS.
Today, a full decade after building those speech therapy apps, I work at ElevenLabs, a leading AI audio research and deployment company.
And with all the intervening progress in AI speech tech, I decided to evaluate how today’s leading AI models understand and generate stuttering.
Stuttering comes in three primary forms: repetitions (“my my my”), blocks (“m…..”), and prolongations (“mmmmm”). These disfluencies disrupt the normal flow of speech, and each presents its own challenges for systems that interpret a speaker’s output.
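To make the distinction concrete, here is a minimal sketch of how the three disfluency types might be represented as test cases when evaluating an ASR system. The `DisfluencyExample` structure and the example utterances are my own illustrative assumptions, not a published benchmark or any particular model’s API.

```python
# A minimal sketch: representing the three stuttering disfluency types
# as labeled evaluation examples. All names and utterances here are
# hypothetical, chosen only to illustrate the categories above.
from dataclasses import dataclass

@dataclass
class DisfluencyExample:
    kind: str      # "repetition", "block", or "prolongation"
    text: str      # orthographic rendering of the disfluent utterance
    intended: str  # what the speaker meant to say

EXAMPLES = [
    DisfluencyExample("repetition",   "my my my name is Alex",  "my name is Alex"),
    DisfluencyExample("block",        "m..... my name is Alex", "my name is Alex"),
    DisfluencyExample("prolongation", "mmmmy name is Alex",     "my name is Alex"),
]

if __name__ == "__main__":
    # An ideal ASR transcript would recover `intended` from audio of `text`.
    for ex in EXAMPLES:
        print(f"{ex.kind:>12}: {ex.text!r} -> intended: {ex.intended!r}")
```

The point of pairing each disfluent rendering with its intended sentence is that it mirrors what a good ASR system should do: recognize the message despite the repetition, block, or prolongation, rather than echoing or rejecting it.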