Tech4ALL Digest, September 10
September 10, 2024
By Shaomei Wu
Given that we are now deep in the space of speech AI technology, I attended the InterSpeech 2024 conference for the first time last week on Kos Island, Greece. It was an eye-opening experience: I enjoyed meeting researchers and practitioners interested in fair speech AI in person – especially some of our collaborators, such as Rong Gong and Professor Mark Hasegawa-Johnson – learning about new speech model architectures and techniques, and sharing our work and perspectives with technical experts in this field.
Rong Gong, our community partner from StammerTalk, presented our collaborative work on a Mandarin stuttered speech dataset for fair ASR, and the presentation was extremely well received! We got lots of questions and interest from the audience, especially around how to access and build on top of this dataset. Yes, the dataset is open and available for research and the advancement of fair ASR! We are also exploring a community-led data governance model: you can request access here by describing your use case for the dataset; upon approval by the StammerTalk community, we will share the download information via email.
What I liked about the format of InterSpeech
As a first-timer, I found myself really appreciating these aspects of InterSpeech's format as a conference:
- Poster-centric: Unlike most academic conferences I have attended, InterSpeech really embraces poster sessions as a main forum for content presentation and intellectual exchange. At any time during the main conference, except during the keynotes, there were 6-8 parallel poster sessions happening at different locations, and they were PACKED! I observed way more people in the poster sessions than in the oral presentation sessions, which I found both interesting and very healthy. Instead of sitting through oral presentations of papers you are less (or not) interested in and asking one or two questions at the end of the ones you do care about, you can easily walk through the posters, locate the ones that interest you, and have an in-depth conversation with the presenter. That was exactly what happened, and perhaps what pulled the traffic into the poster sessions. Now I wish to see this format at more academic conferences.
- Survey talks: Several (though not all) oral presentation sessions I attended started with a survey talk giving an overview of the domain area, and I found those talks extremely informative and well delivered. They were often given by a senior researcher (most of them tenured faculty) experienced at delivering the content in an accessible, instructive way. I don’t know how those talks were solicited, but I really appreciate the conference’s effort in having them. They provided a rapid course on state-of-the-art knowledge in their domain areas and set the context for all the other presentations in the session, allowing me, a first-timer at this conference, to quickly catch up with the background and general directions in the problem space. I particularly enjoyed National Taiwan University Professor Hung-yi Lee‘s survey talk on the Development of Spoken Language Models and Northwestern University Professor Ann Bradlow‘s survey talk on Second Language (L2) Speech and Perception. I knew very little about either topic beforehand, but these talks got me really interested in both.
Paper recommendations
There was also a lot of interesting work at the intersection of speech disfluencies, accents, perception, and fair speech technologies. Here are some papers that I encountered (including our own, of course 🤓):
- The influence of L2 accent strength and different error types on personality trait ratings. Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis
- CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions. Mario Zusag, Laurin Wagner, Bernhard Thallinger
- On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition. Peter Mihajlik, Yan Meng, Mate S Kadar, Julian Linke, Barbara Schuppler, Katalin Mády
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation. Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin
- Towards measuring fairness in speech recognition: Fair-Speech dataset. Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer
- LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech. Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim
- What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech Synthesis. Nicolò Loddo, Francisca Pessanha, Almila Akdag
- Self-supervised Speech Representations Still Struggle with African American Vernacular English. Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen
- AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection. Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li
- YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection. Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli
Food for thought
While I generally enjoyed the conference, I could not help noticing how prevalent the language and thought patterns of the medical model of disability are in the community.
With “Speech and Health” being one of the five proposed themes for InterSpeech 2024, many sessions were dedicated to the understanding and processing of atypical speech. However, terms such as “disordered speech” and “pathological speech” were most often used in session names and paper titles, even for sessions that were mostly dedicated to stuttered speech – which has been increasingly recognized as a “speech diversity”, rather than a “speech disorder”, by the stuttering community and its allies. I really appreciated that Rong made a slide about this during his presentation, but was disappointed that nobody picked up on this point in the presentations that followed in the same session.
The top applications for technologies developed for “disordered speech”, as I heard at the conference, were often to diagnose (to support early intervention) and to remove/fix disfluencies at the speech level. Overall, I got the sense that the InterSpeech community views speech disfluencies as a problem to be gotten rid of for the benefit of both the speaker and the listener (including the AI), which made me feel unaccepted and disempowered as a person who stutters.

Example: a slide that says “Disfluency detection (and removal) can highly improve user experience with these apps.” But why can’t PWS (people who stutter) have a good user experience with their disfluencies?
A potentially related observation: while another main theme this year was “Human-Machine Interaction”, very little work actually involved direct testing or evaluation with human subjects (aka target users). Evaluations with static metrics over standard or specialized datasets were common and well respected, which felt a bit incomplete to me, coming from CHI/ASSETS land.