As AI feeds on big data, the inadequate and often biased representation of marginalized groups in AI datasets is one of the root causes of AI bias and discrimination. Commercial and institutional efforts to include marginalized communities in AI datasets are neither rewarding nor empowering: community contributors are often treated merely as “data subjects”, with little agency over what data about them is used and shared, or how.

In response, we are exploring a community-led AI data model in which marginalized groups create and manage their own data for AI use. We have been piloting this model with StammerTalk, a grassroots community of people who stutter, to create one of the largest stuttered speech datasets for fair and inclusive speech recognition.

As their technical partner, AImpower.org supported StammerTalk in creating and managing the very first large-scale stuttered speech dataset in Mandarin Chinese. With 50 hours of conversational and command-reading speech from 70 people who stutter, this dataset provides tremendous technical and social value by advancing the development of stuttering-friendly ASR models and raising public awareness of the struggles and demands of the stuttering community. Our analysis of the process and the resulting dataset demonstrated that this data model not only produced high-quality stuttered speech data for AI, but also profoundly empowered the community, building new capacities, social ties, and a stronger collective identity.

We are pushing this work forward with the global stuttering community by collecting stuttered speech in more languages, co-designing and co-developing inclusive speech AI user experiences, and creating a power-sharing, community-led data stewardship model for data from and about marginalized populations.

In the long term, our vision is not only to create unique and useful datasets for individual communities one by one, but also to build socio-technical toolkits for the community-led AI data model, so that other marginalized, low-resourced communities can take charge of their own data and data-driven AI experiences.

Learn more about our data practices, learnings, ML techniques, and dataset in the Publications and Resources sections below.


NEW!
Explore our Mandarin stuttered speech dataset and ASR benchmarking!

Publications

Datasets, Tools, Guidelines

Grants

We thank the Patrick J. McGovern Foundation for its support, which allows us to pursue this work.