Community-led AI Data Collection & Stewardship

As AI feeds on big data, the inadequate and often biased representation of marginalized groups in AI datasets is one of the root causes of AI bias and discrimination. Commercial and institutional efforts to include marginalized communities in AI datasets are not rewarding or empowering: community contributors are often treated merely as “data subjects”, with little agency over which data about them is used and shared, or how.

In response, we are exploring a community-led AI data model in which marginalized groups create and manage their own data for AI use. We have been piloting this model with StammerTalk, a grassroots community of people who stutter, to create one of the largest stuttered speech datasets for fair and inclusive speech recognition.

As their technical partner, AImpower.org supported StammerTalk in creating and managing the very first large-scale stuttered speech dataset in Mandarin Chinese. With 50 hours of conversational and command-reading speech from 70 people who stutter, this dataset has provided tremendous technical and social value by advancing the development of stuttering-friendly ASR models and raising public awareness of the struggles and demands of the stuttering community. Our analysis of the process and the resulting dataset demonstrated that this data model not only produced high-quality stuttered speech data for AI, but also profoundly empowered the community, building new capacities, social ties, and a stronger collective identity.

We are pushing this work forward with the global stuttering community by collecting stuttered speech in more languages, co-designing and co-developing inclusive speech AI user experiences, and creating a power-sharing, community-led data stewardship model for data from and about marginalized populations.

In the long term, our vision is not only to create unique and useful datasets for individual communities one by one, but also to build the socio-technical toolkits for the community-led AI data model, so that other marginalized, low-resourced communities can take charge of their own data and data-driven AI experiences.

Learn more about our data practices, learnings, ML techniques, and dataset in the "Publications" and "Datasets, Tools, Guidelines" sections below.

Publications

Govern With, Not For: Understanding the Stuttering Community’s Preferences and Goals for Speech AI Data Governance in the US and China. Jingjin Li, Peiyao Liu, Rebecca Lietz, Ningjing Tang, Norman Makoto Su, and Shaomei Wu. In Proc. of AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES ’25). 2025. [preprint] 🥇 Best Paper Award 🥇

Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset. Jingjin Li, Qisheng Li, Rong Gong, Lezhi Wang, and Shaomei Wu. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT ’25). Pages 2768 – 2783. doi:10.1145/3715275.3732179. 2025. [preprint]

AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection. Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li. In Proceedings of the InterSpeech Conference. 2024. [preprint]

“I Want to Publicize My Stutter”: Community-led Collection and Curation of Chinese Stuttered Speech Data. Qisheng Li, Shaomei Wu. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW). 2024. [preprint]

Datasets, Tools, Guidelines

Mandarin Stuttered Speech Dataset @ HuggingFace

Grants

We thank the Patrick J. McGovern Foundation for the support that allows us to pursue this work.
