Towards Fair and Inclusive Speech Recognition for Stuttering
Current speech recognition tools perform poorly for people who stutter. A primary cause is the lack of representative, diverse stuttered speech data for developing ASR models. To address this gap, we co-created the first stuttered speech dataset in Mandarin Chinese with a grassroots community of Chinese-speaking people who stutter. Read our paper here.
Around 1 in every 100 people stutter.
They often experience stigma and discrimination in social, romantic, educational, and professional settings.

Automatic Speech Recognition (ASR) systems perform poorly for people who stutter (PWS).
ASR technologies are prevalent in our communication ecosystem. Speech interaction is particularly important for devices that have small or no screens.




As stuttering severity increases, ASR error rates for consumer systems like conversational telephone speech agents also increase.
A primary cause of poor performance is the lack of accessible, representative, and authentic stuttered speech data when developing ASR models.
The StammerTalk dataset addresses the gap in stuttered speech data.
AImpower.org partnered with StammerTalk (口吃说), an online community of Chinese-speaking people who stutter, in a community-led data collection effort. The result is the first and largest corpus of stuttered speech in Mandarin Chinese.
The StammerTalk dataset captures a wide spectrum of stuttering frequency and patterns across 72 PWS in two recording scenarios, providing a much more authentic and comprehensive representation of stuttered speech for ASR models.

Two recording scenarios, voice command dictation and unscripted conversation, yield data that resembles real-world speech product use cases.
We audited two open-source ASR models with the StammerTalk dataset to benchmark performance on Chinese stuttered speech.
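
For readers who want to try a similar audit, here is a minimal sketch of running a single clip through the two kinds of open-source models discussed below (Whisper and wav2vec 2.0). It is not our exact evaluation pipeline: the checkpoint names, file paths, and the use of the openai-whisper and Hugging Face transformers libraries are assumptions for illustration.

```python
# Minimal sketch (not the paper's pipeline): transcribe one clip with each model family.
import torch
import torchaudio
import whisper  # the open-source openai-whisper package
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

AUDIO = "clip.wav"  # placeholder path to a 16 kHz mono recording

# --- Whisper ---
whisper_model = whisper.load_model("small")             # checkpoint size is an assumption
whisper_text = whisper_model.transcribe(AUDIO, language="zh")["text"]
print("Whisper:", whisper_text)

# --- wav2vec 2.0 (CTC) ---
W2V_ID = "path-or-hub-id-of-a-mandarin-wav2vec2-ctc-checkpoint"  # placeholder
processor = Wav2Vec2Processor.from_pretrained(W2V_ID)
w2v_model = Wav2Vec2ForCTC.from_pretrained(W2V_ID)

waveform, sr = torchaudio.load(AUDIO)
if sr != 16000:                                          # wav2vec 2.0 expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = w2v_model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print("wav2vec 2.0:", processor.batch_decode(pred_ids)[0])
```

The resulting hypotheses can then be scored against the two types of reference transcription described next.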

We tested two types of transcription:
1. Semantic: removes word repetitions and interjections
2. Literal: keeps stuttered utterances verbatim


Error rates we measured: Character Error Rate (CER), broken down into substitution (SUB), insertion (INS), and deletion (DEL) errors.
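
CER is the character-level edit distance between a model's hypothesis and a reference transcription, normalized by the length of the reference: CER = (SUB + INS + DEL) / number of reference characters. The sketch below is our own illustration rather than the paper's scoring code; it counts each error type with a standard Levenshtein alignment and shows how the same hypothesis scores against a literal versus a semantic reference, using made-up example sentences.

```python
# Illustrative CER computation with a substitution/insertion/deletion breakdown.
def cer_breakdown(reference: str, hypothesis: str) -> dict:
    """Character-level Levenshtein alignment: CER = (SUB + INS + DEL) / len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_cost, subs, inserts, deletes) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, 0, i)                       # delete all i reference chars
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, j, 0)                       # insert all j hypothesis chars
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                diag = dp[i - 1][j - 1]               # exact match, no cost
            else:
                c, s, ins, d = dp[i - 1][j - 1]
                diag = (c + 1, s + 1, ins, d)         # substitution
            c, s, ins, d = dp[i - 1][j]
            up = (c + 1, s, ins, d + 1)               # deletion (reference char dropped)
            c, s, ins, d = dp[i][j - 1]
            left = (c + 1, s, ins + 1, d)             # insertion (extra hypothesis char)
            dp[i][j] = min(diag, up, left)            # keep the cheapest alignment
    cost, subs, inserts, deletes = dp[n][m]
    return {"CER": cost / max(n, 1), "SUB": subs, "INS": inserts, "DEL": deletes}


# Made-up example: one ASR hypothesis scored against both reference styles.
hypothesis = "我想打电话给妈妈"            # what an ASR model might return
literal = "我我我想嗯打电话给妈妈"          # keeps repetitions and the interjection
semantic = "我想打电话给妈妈"              # repetitions and interjections removed

print(cer_breakdown(literal, hypothesis))   # {'CER': ~0.27, 'SUB': 0, 'INS': 0, 'DEL': 3}
print(cer_breakdown(semantic, hypothesis))  # {'CER': 0.0, 'SUB': 0, 'INS': 0, 'DEL': 0}
```

Against the literal reference, the errors surface as deletions, since the repeated characters and the filler are "smoothed" away, which matches the Whisper behavior described below.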
For both models, there are more errors as stuttering severity increases.

The Whisper model tends to “smooth” transcriptions by deleting words with low semantic value.
The wav2vec 2.0 model performs 1.5-2x worse than Whisper, making more substitution errors.
Further analysis showed that CERs are higher for voice command dictation than for unscripted conversation (see the table below). For severe stuttering, the average dictation CER (56.6%) is more than twice that for mild stuttering (25.7%). Voice command dictation is common in speech interfaces and ASR-mediated interactions, so these higher error rates can create accessibility barriers and psychological harms for PWS.
Adequate and authentic representation of the disability communities in AI data remains a challenge.
Despite the unprecedented scale of the StammerTalk dataset, stuttered speech remains immensely underrepresented in ASR. The models we tested performed worse for PWS, highlighting a major shortcoming in these systems. To close these performance gaps in ASR technologies, we need to create datasets that reflect diverse speech patterns such as stuttering.
| CER | Mild Stuttering | Moderate Stuttering | Severe Stuttering |
| --- | --- | --- | --- |
| Conversation | 17.7% | 20.7% | 31.0% |
| Dictation | 25.7% | 32.8% | 56.6% |
Interested?
Partner with us to ensure that speech recognition technology is inclusive for all.

Researchers & Scholars
We’re eager to exchange ideas with and learn from people who are studying fair AI data practices. Learn more about our data [here].

Developers
Interested in building inclusive speech AI models for your applications? Request access to our dataset.

Speech-Language Pathologists
If you are an SLP and are interested in using this data for educational, research, or clinical purposes, we want to hear about your use case.

Contact Us!
Check out our other recent work on our blog page. If you’re interested in joining us, please reach out at partnership@aimpower.org

