Our AIES 2025 Best Paper Award: Community-Centered AI Data Governance for Stuttered Speech
December 22, 2025
Our work “Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset” was published at the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2025), a leading interdisciplinary conference that brings together researchers, practitioners, and policymakers to examine the social, ethical, and political impacts of AI systems.
This paper is a collaborative work between AImpower.org and the grassroots Chinese stuttering community StammerTalk (口吃说), co-authored by Jingjin Li, Qisheng Li, and Shaomei Wu (AImpower.org) and Rong Gong and Lezhi Wang (StammerTalk).
What’s this paper about?
Many speech technologies today are trained and optimized for so-called “typical” speech. As a result, automatic speech recognition (ASR) systems often work poorly for people with diverse ways of speaking—such as older adults, people with speech and hearing disabilities, second-language speakers, and African Americans. These systems may interrupt speakers, misrecognize their words, or produce far more errors than they do for others.
As ASR becomes a routine part of everyday life—showing up in smart speakers, car navigation, voice messages, and automated phone systems—these limitations create real barriers. For many people, they don’t just make technology harder to use; they can also cause frustration, embarrassment, stress, and long-term social and economic disadvantages. People who stutter are among those most affected by these failures. Yet their voices, speech patterns, and lived experiences are rarely represented in how speech AI is designed, trained, or evaluated.
To address this gap, our work examines the technical and social value of a large-scale, community-created Mandarin Chinese stuttered speech dataset, and explores how fine-tuning state-of-the-art ASR models with this data can meaningfully reduce fluency bias.
The dataset was created by StammerTalk (口吃说): two community volunteers, who themselves stutter, recorded participants over video calls. The recordings include two types of speech: unscripted conversations between the volunteer and the participant, and spoken dictation of 200 common voice commands. In total, 70 adults who stutter took part; together with the two volunteers, the dataset comprises 48.8 hours of speech from 72 speakers.
Benchmarking ASR Models: Today’s speech AI struggles with stuttered speech
We evaluated a state-of-the-art automatic speech recognition (ASR) model (Whisper-large-v2) on Mandarin speech from people who stutter. When judged against verbatim transcripts—including repetitions and hesitations—the model struggled badly. The most striking issue wasn’t substitutions or insertions.
It was deletions. The model routinely removed repeated words and phrases, effectively rewriting speech to sound fluent. These deletions worsened as stuttering severity increased.
For speakers with severe stuttering, nearly half of the transcript content was incorrect.
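To make the evaluation concrete, below is a minimal sketch of how character error rate (CER) and deletion rate (DEL) can be scored against a verbatim reference. It is illustrative only: the example strings are made up, not drawn from the StammerTalk dataset, and the alignment is a plain Levenshtein computation rather than the exact tooling used in the paper. The key point is that DEL counts reference characters the model simply dropped, which is where disfluencies disappear.

```python
# Minimal sketch: character error rate (CER) and deletion rate (DEL)
# against a verbatim reference. Illustrative only; the example strings
# below are made up and are not drawn from the StammerTalk dataset.

def align_counts(ref: str, hyp: str):
    """Character-level Levenshtein alignment.

    Returns (substitutions, deletions, insertions) on the cheapest path
    that turns the reference into the hypothesis.
    """
    m, n = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, i, 0)   # drop all remaining reference characters
    for j in range(1, n + 1):
        dp[0][j] = (j, 0, 0, j)   # insert all hypothesis characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = ref[i - 1] == hyp[j - 1]
            cands = [
                (dp[i - 1][j - 1][0] + (0 if match else 1),
                 dp[i - 1][j - 1], (0 if match else 1, 0, 0)),   # match / substitution
                (dp[i - 1][j][0] + 1, dp[i - 1][j], (0, 1, 0)),  # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1], (0, 0, 1)),  # insertion
            ]
            cost, prev, (s, d, ins) = min(cands, key=lambda c: c[0])
            dp[i][j] = (cost, prev[1] + s, prev[2] + d, prev[3] + ins)
    _, subs, dels, ins = dp[m][n]
    return subs, dels, ins

# A verbatim reference keeps the stuttered repetition; a "fluent-ified"
# hypothesis silently drops it, so every error here is a deletion.
reference = "我我我想去北京"    # verbatim transcript (7 characters)
hypothesis = "我想去北京"       # ASR output with the repetitions removed

subs, dels, ins = align_counts(reference, hypothesis)
cer = (subs + dels + ins) / len(reference)
del_rate = dels / len(reference)
print(f"CER = {cer:.1%}, DEL = {del_rate:.1%}")   # both ≈ 28.6% in this toy case
```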

Promising Fine-Tuning Results with Community-Created Data
We fine-tuned Whisper using the StammerTalk dataset—nearly 50 hours of Mandarin stuttered speech created through a grassroots community effort—and trained the model specifically on literal, verbatim transcriptions. Our goal was not to fix stuttering, but to help the system listen better.
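For readers interested in the mechanics, here is a minimal sketch of the standard Hugging Face recipe for fine-tuning Whisper on verbatim labels. It is illustrative rather than a reproduction of the paper’s training setup: the `load_stammertalk_split` helper, the `verbatim_text` field, and the hyperparameters are assumptions. The essential design choice is that the training labels preserve every repetition and hesitation, so the model is rewarded for transcribing speech as it was actually spoken rather than smoothing it into fluency.

```python
# Minimal sketch of fine-tuning Whisper on verbatim transcripts with the
# Hugging Face `transformers` library. This follows the generic public recipe,
# not the paper's exact setup; `load_stammertalk_split`, the field names, and
# the hyperparameters below are hypothetical placeholders.
from transformers import (WhisperForConditionalGeneration, WhisperProcessor,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v2", language="zh", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

def prepare(example):
    # 16 kHz audio -> log-Mel input features; verbatim text -> label ids.
    # Crucially, the transcript keeps every repetition and hesitation.
    example["input_features"] = processor.feature_extractor(
        example["audio"]["array"], sampling_rate=16000).input_features[0]
    example["labels"] = processor.tokenizer(example["verbatim_text"]).input_ids
    return example

class Collator:
    def __call__(self, features):
        batch = processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt")
        labels = processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
        # Padded label positions are set to -100 so the loss ignores them.
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100)
        # Drop the leading start-of-transcript token; the model re-adds it
        # when shifting labels to build decoder inputs.
        if (batch["labels"][:, 0] == model.config.decoder_start_token_id).all():
            batch["labels"] = batch["labels"][:, 1:]
        return batch

# train_ds / eval_ds: datasets of {"audio", "verbatim_text"} examples, e.g.
# built from the StammerTalk recordings (hypothetical loading helper).
train_ds = load_stammertalk_split("train").map(prepare)
eval_ds = load_stammertalk_split("dev").map(prepare)

args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v2-stammertalk",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(model=model, args=args, data_collator=Collator(),
                         train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```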

Our results show that after fine-tuning, the character error rate (CER) and deletion rate (DEL) drop sharply at every severity level. Instead of rewriting stuttered speech to sound fluent, the fine-tuned model learns to represent it more faithfully:
- Mild stuttering
  - CER drops from 16.34% → 5.8%
  - DEL drops from 11.95% → 1.22%
- Moderate stuttering
  - CER drops from 21.72% → 9.03%
  - DEL drops from 15.77% → 1.27%
- Severe stuttering
  - CER drops from 49.24% → 20.46%
  - DEL drops from 26.56% → 2.29%
With a relatively small but carefully designed dataset—built with the community, not just about them—speech AI can become more accurate, more inclusive, and more respectful.
The Social Value: More Than a Dataset
While the technical results are promising, the social value of this work is equally important. A content analysis of the conversations in the dataset reveals the lived experience of stuttering in China.
For many people who stutter in China, stuttering is not only a communication difference—it is deeply shaped by social stigma, misunderstanding, and pressure to appear fluent. In the interviews and conversations captured in this dataset, participants described their stuttering being seen as “defective,” feeling shame, avoiding speaking situations, and hiding their stutter to fit social expectations. These experiences affect education, mental health, and career opportunities, not just how people interact with technology.
Against this backdrop, the StammerTalk dataset represents something rare: a space where stuttered speech is treated as valid, meaningful, and worth preserving.
Participants also reported using a range of ASR products in their daily lives. For example, WeChat voice messaging is commonly used to send text messages via voice input and to convert received voice messages into text, while Xiaomi’s “Xiao Ai” assistant is used for smart home control and casual conversation. ASR is widely adopted in China because of its advantages, such as simplifying Chinese typing and improving efficiency. Yet people who stutter (PWS) face distinct challenges with these products, including recognition errors, difficulty with time-limited input, and heightened self-consciousness. These usability barriers keep PWS from leveraging such tools effectively, placing them at a disadvantage in technology use.
For more details, check out our full paper here.
Looking Ahead: Future Work
This work demonstrates that community-created data can meaningfully improve speech AI—both technically and socially. It is, however, only a first step. While the fine-tuning results are promising, they also highlight how much more needs to be done to make speech AI truly inclusive.
First, we plan to expand the dataset—both in scale and diversity. This includes collecting more stuttered speech across different regions, ages, genders, and speaking contexts, as well as capturing longitudinal changes in how people speak and experience stuttering over time.
Second, our future technical work will explore model generalization and deployment. While fine-tuning improves performance on stuttered speech, we want to understand how these gains translate to real-world applications such as voice messaging, transcription tools, and live speech interfaces. This includes examining trade-offs between speed, accuracy, and user experience—especially in time-limited or high-pressure speaking situations.
Acknowledgments
We extend our heartfelt gratitude to the StammerTalk community for entrusting us with their data. We also thank the Authentic User Experience Lab at the University of California, Santa Cruz, especially Professor Norman Makoto Su, for stimulating discussions and collaboration on the future development of this work. This work is supported by NSF Award #2427710 and the Patrick J. McGovern Foundation.


