Fair and Authentic Representation of Marginalized Communities in AI Data
The rise of AI technologies, from recommender systems to LLMs, is fueled by massive amounts of data about people and our world. Most of the data used for training AI models was scraped from the web or collected by companies from their users. While current AI data practices have created a host of privacy and copyright issues, one tangible and serious challenge for marginalized communities, such as Black people and people with disabilities, is that they are often under- and misrepresented in the data that shapes AI systems. As a result, they are not only unable to benefit from technological advances but, worse, are subject to algorithmic biases and harms.
The lack of fair and authentic representation of people with disabilities has been called out by scholars and activists as a key challenge and priority for today’s AI fairness efforts. The solution? We believe it lies in the hands of the communities themselves.
Here we introduce a grassroots, community-led effort by an online community of people who stutter to collect and curate one of the first and largest datasets of stuttered speech in Mandarin, with the aim of improving the inclusivity of the speech recognition models that power today’s speech interfaces and automatic phone menus.
Having partnered closely with the StammerTalk community on this project since day one, we have been consistently impressed by the community’s proactiveness, determination, and resourcefulness, and we have uncovered major advantages of community-led data practice over data scraping or commercial data brokerage.
By sharing our experience and findings from StammerTalk’s stuttered speech collection project below, we advocate for a new AI data paradigm of community data stewardship, especially for data from and about marginalized communities. Together, we aim to develop a practical guide and introduce new socio-technical infrastructure to support grassroots, community-led data collection and stewardship for communities that have historically faced underrepresentation in AI data.
StammerTalk (口吃说) is an online community of Chinese-speaking people who stutter, who convene through WeChat groups and biweekly online support groups. It currently has approximately 500 members around the world, primarily in mainland China. StammerTalk operates solely through its volunteers, especially a core team of 10 volunteers who self-organize virtual events and activities and support the community.
Stuttered Speech Data Collection Project
The idea of creating a stuttered speech corpus for better automatic speech recognition (ASR) came up in a conversation between StammerTalk and AImpower in late 2022. We set the goal of collecting 100 hours of stuttered speech recorded from 100 individuals who stutter. The StammerTalk team would recruit people who stutter to participate in recording sessions, and each session would generate about one hour of speech data: half free-form conversation and half command recitation.
The project was led by Rong Gong, one of the founders of StammerTalk, together with Lezhi Wang, a long-time and active member of the StammerTalk community. The data collection process itself was kicked off in January 2023, and over 60 hours of speech data has been collected as of September 2023.
Important preparatory work for the data collection included:
- Identify partners: To gather technical, operational, and legal resources and support for the project, Rong and the StammerTalk team prepared a project pitch and established partnerships with AImpower.org (technical advisory), AIShell (speech annotation service), and Michigan State University (stuttering research advisory).
- Build data collection and annotation guidelines: Rong led the work to establish the protocols and guidelines for collecting and annotating stuttered speech, with the support from Lezhi (StammerTalk), Jia (StammerTalk), Shaomei (AImpower.org), and Xin Li (AIShell).
- Establish legal framework: Collecting and curating personal, biometric data from community members across the globe requires extensive legal advice and expertise in data and privacy law, IP law, and international law. Through AImpower.org, StammerTalk acquired legal guidance and contractual support for this project from Cooley LLP, a renowned global law firm. Also, since the community (represented by StammerTalk and AImpower) was in the driver’s seat, we were able to draft a participation process and agreement that offered greater transparency and rights to data contributors than what is provided in commercial data collection by companies and data vendors.
- Recruit data contributors: Rong posted the first recruitment poster on StammerTalk’s public WeChat account in January 2023, detailing the goals and structure of the data collection. The recruitment was met with enthusiasm from the community: over 40 people who stutter signed up within a few days. A second batch of recruitment was conducted in July 2023.
Data Collection
After reviewing and signing the participation agreement, each interested data contributor was scheduled for a roughly 60-minute video-conferencing session with a community data collector, either Rong or Lezhi, over Zoom or Tencent Meeting. Each session had the following structure:
- Preparation (5 mins): Participant orientation and consent collection.
- [Recorded] Unscripted Spontaneous Conversation (30 mins): An unscripted conversation between the participant and the community data collector, centered on the participant’s life and stuttering experiences.
- [Recorded] Voice Command Recitation (30 mins): Participants read out loud a list of frequently used commands for speech interfaces.
The second and third parts of the session make up the final stuttered speech dataset. The recorded stuttered speech was then transcribed by professional annotators from AIShell, with stuttering events annotated.
Advantages of Community-led AI Data Collection
To understand the process, benefits, and challenges for community data stewardship, the AImpower.org team closely followed the progress of this project, and collected ethnographic data about this process through observations, interviews, and surveys with data collectors (Rong & Lezhi) and contributors.
Our data and analysis showed unique advantages of community-led AI data collection over commercial and mainstream data practices. We saw that community members were intrinsically motivated to participate in the data collection process, and they found the process highly enjoyable despite the significant amount of time and effort involved. Besides a useful technical output (i.e. the dataset), the data collection process also created deeper interpersonal connections and a sense of empowerment within the community, strengthening the community’s capacity and agency for self-advocacy.
Driven by love, not money
Contrary to what is typically seen in commercial or third-party-led data collection, most people who participated in StammerTalk’s community-led data collection cared relatively little about monetary incentives. Instead, they were driven by intrinsic goals such as making meaningful contributions to the community and connecting with other people who stutter.
Lezhi, one of the two community data collectors, shared her motivations for spending countless nights and weekends working on this project:
“I want to publicize my stutter… I want to empower myself through stuttering. I want to differentiate myself from others, from people who do not stutter. My longstanding involvement with the stuttering community gives me insights into the unique challenges faced by stutterers. These (insights) equip me well with ideas on leveraging technology to improve experiences of people who stutter, especially since current technologies often overlook their needs.” – Lezhi
Similarly, most data contributors participated because they recognized the value of this dataset to the stuttering community and wanted to contribute to and connect with the community. As shown in the figure below, when we surveyed 55 data contributors about their reasons for participating in the data collection, the top reasons were 1) “meaningfulness of this initiative”; 2) “contributions to the stuttering community”; 3) “support StammerTalk’s projects”; and 4) “opportunity to talk to StammerTalk team 1:1”. “Monetary compensation” was rated as the least important reason by more than half (29/55) of the data contributors.
Enjoyment, Knowledge, and Self Empowerment
While commercial data collection processes are often boring, repetitive, and tedious, participants in the StammerTalk data collection actually enjoyed the experience and found themselves gaining more knowledge about stuttering, deeper empathy from others, and a sense of empowerment in their identity as people who stutter (PWS).
According to our survey, 95% of the data contributors rated their experience in the data collection as “satisfying” or “very satisfying”. Factors behind this positive experience included the opportunity to make a valid contribution to the community; the relaxed and comfortable setup and social dynamics during data collection; the opportunity to speak to another person who stutters about one’s stuttering experience; and gaining new and deeper knowledge about stuttering.
The community data collectors played an important role in making data contributors feel comfortable and heard. While shared experiences with stuttering instantly brought the data collectors and the contributors psychologically closer, data contributors also acknowledged specific behaviors of Rong and Lezhi that made their experience satisfying and enjoyable, as shown in Fig. 2 below.
In fact, the data collection sessions were so fun and comfortable that many of the data contributors stuttered less frequently than they normally do. As a result, the data collectors needed to remind contributors to stutter voluntarily, or to simulate a stressful situation (e.g. a job interview), to increase the variety and frequency of stuttering in the dataset. This process prompted many of the data contributors to revisit their default relationship with stuttering, namely that stuttering is something to be avoided and concealed in one’s speech, and turned stuttering into something meaningful, desirable, and a unique quality of ourselves. For many, this shift in perspective was profoundly empowering. One data contributor shared the most important thing he gained from participating in the data collection:
“The courage to face my true self, and accept my stuttering behaviors.” – Data contributor
Current Challenges with Community-led AI Data Collection
Despite its substantial tangible and intangible impact on the stuttering community, this project was not without challenges.
First, collecting and annotating speech data at this scale is labor intensive, requiring serious commitment and donation of personal time and resources from community data collectors and contributors.
Before kicking off the data collection, Rong spent significant time and energy securing partners and resources, and training professional speech annotators, who do not stutter, to transcribe and annotate stuttered speech. Rong and Lezhi have also spent a few hours each week on data collection for the past 10 months. Both of them have full-time, demanding tech jobs, and had to carve time out of their personal and family lives during evenings and weekends to work on this project. Some necessary monetary costs were also incurred during the data collection (e.g. 100 RMB per data contributor for their time), and Rong often covered those costs out of his own pocket.
Second, the lack of adequate socio-technical infrastructure for community data stewardship creates liability risks and uncertainties for the community:
- Open-sourcing datasets: There are ongoing debates on how to manage and share AI datasets, especially datasets with personal information. While the community has an incentive to open-source this dataset to maximize its impact on speech AI, the speech data it contains carries unique characteristics of stuttered speech and would be hard to fully anonymize. It therefore remains an open question how to responsibly manage and govern the use of this dataset while weighing its scientific value against its privacy implications.
- Rigid Legal Frameworks: Existing data protection models are structured around traditional, distinct roles including data subjects (e.g. users and consumers), data controllers (typically companies and organizations), and data processors (usually data vendors or analytics providers). Those roles and assumptions break down in community data stewardship, as the community collectively takes on all of these roles. This becomes particularly challenging for grassroots, marginalized communities, whose membership and legal status are often fluid and informal.
- Geopolitical Complexities in Cross-Sector Collaboration on AI Data: The collaboration between StammerTalk and its diverse set of academic, industry, and nonprofit partners across China and the US added an extra layer of intricacy to the project, especially when working with personal data in a politically charged climate. The current tension between the US and China in technological innovations, especially AI technologies, has added liability costs and additional clearance steps for partner organizations to engage and support this project despite its clear value for the stuttering community.
- Navigating Cross-Border, International Data Laws: Because StammerTalk is a distributed, online community, the data collection process inevitably involved cross-border data transfers and international data protection laws. Navigating these regulations and compliance requirements could become a serious barrier for marginalized communities that want to undertake similar efforts.
Ingredients for Success
Despite these structural challenges, we have also identified some unique characteristics of the StammerTalk community and its data collection process that contributed to the project’s success, namely:
- Technical Expertise: Having in-house expertise or committed partners with technical know-how to ensure that the project has a solid technical vision and feasibility.
- Resourcefulness: Sourcing pivotal and necessary resources from community members and partners.
- Reputation and Trust: Cultivating trust and relationships within the community to set the foundation for participation, collaboration, and a good experience for everyone.
This project is still ongoing, and we expect to wrap up the data collection process by the end of 2023. Our initial analysis of the collected data showed evidence of the diversity and representativeness of stuttering speech patterns within this dataset, as well as promise for using the dataset to effectively tune existing ASR models for stuttered speech. Partnering with the StammerTalk team, we will conduct more technical research and analysis of the collected data, and benchmark popular ASR services to understand their performance disparities between stuttered and non-stuttered speech in Chinese.
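To make the idea of a "performance disparity" concrete, here is a minimal sketch of one way such a benchmark could be scored. It assumes character error rate (CER), a common metric for Mandarin ASR, as the measure; the project has not stated which metric it will use. The function names and the toy transcripts are hypothetical and invented purely for illustration; they are not drawn from the StammerTalk dataset.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(pairs):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    if chars == 0:
        raise ValueError("reference transcripts are empty")
    return edits / chars


def cer_gap(stuttered_pairs, fluent_pairs):
    """Disparity: how much higher the CER is on stuttered speech."""
    return cer(stuttered_pairs) - cer(fluent_pairs)


# Made-up (reference, ASR output) pairs, for illustration only.
toy_stuttered = [("打开音乐", "打打开音乐"), ("今天天气怎么样", "今天气怎么样")]
toy_fluent = [("打开音乐", "打开音乐"), ("今天天气怎么样", "今天天气怎么样")]
toy_gap = cer_gap(toy_stuttered, toy_fluent)
```

A positive gap means the recognizer makes more errors per character on stuttered speech than on non-stuttered speech for the same prompts. A real benchmark would also need to decide how annotated stuttering events (such as repetitions) are treated in the reference transcripts, which this sketch does not address.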
If you are part of a marginalized group that is interested in starting similar AI data initiatives, we are happy to share our experiences and resources.
If you are a researcher or a scholar studying fair AI data practice, we are eager to exchange ideas and learn from you.
If you are a researcher or a developer of speech AI technologies and are interested in learning more about this dataset, we are happy to start a conversation.
If you are a speech-language pathologist (SLP) and are interested in using this data for educational, research, or clinical purposes, we want to hear about your use case.
In sum, we are actively exploring this new data paradigm and would be eager to partner with anyone who is interested. Please reach out to email@example.com.
Join us in building a future where fair and empowering AI data practice isn’t just the exception but the norm!