Tech4ALL Digest, September 10
September 10, 2024
By Shaomei Wu
Given that we are now deep in the space of speech AI technology, I attended the InterSpeech 2024 conference for the first time last week on Kos Island, Greece. It was an eye-opening experience: I enjoyed meeting researchers and practitioners interested in fair speech AI in person – especially some of our collaborators, such as Rong Gong and Professor Mark Hasegawa-Johnson – learning about new speech model architectures and techniques, and sharing our work and perspectives with technical experts in this field.
Rong Gong, our community partner from StammerTalk, presented our collaborative work on a Mandarin stuttered speech dataset for fair ASR, and the presentation was extremely well received! We got lots of questions and interest from the audience, especially around how to access and build on top of this dataset. Yes, the dataset is open and available for research and the advancement of fair ASR! We are also exploring a community-led data governance model: you can request access here by describing your use case for the dataset; upon approval by the StammerTalk community, we will share the download information via email.
What I liked about the format of InterSpeech
As a first-timer, I found myself really appreciating these aspects of InterSpeech's format as a conference:
- Poster-centric: Unlike most academic conferences I have attended, InterSpeech really embraces poster sessions as a main forum for content presentation and intellectual exchange. At any time during the main conference, except during the keynotes, there were 6-8 parallel poster sessions happening at different locations, and they were PACKED! I observed way more people in the poster sessions than in the oral presentation sessions, which I found both interesting and very healthy. Instead of sitting through oral presentations of papers you are less (or not) interested in and asking one or two questions at the end of the ones you do care about, you can easily walk through the posters, locate the ones that interest you, and have an in-depth conversation with the presenter. That was exactly what happened, and perhaps what pulled the traffic into the poster sessions. Now I wish to see this format at more academic conferences.
- Survey talks: Several (though not all) oral presentation sessions I attended started with a survey talk giving an overview of the domain area, and I found those talks extremely informative and well delivered. They were often given by a senior researcher (most of them tenured faculty) experienced at delivering the content in an accessible, instructive way. I don’t know how those talks were solicited, but I really appreciate the conference’s effort in having them. They provided a rapid course on state-of-the-art knowledge in their domain areas and set the context for all the other presentations in the session, allowing me, a first-timer at this conference, to quickly catch up with the background and general directions in the problem space. I particularly enjoyed National Taiwan University Professor Hung-yi Lee‘s survey talk on the Development of Spoken Language Models and Northwestern University Professor Ann Bradlow‘s survey talk on Second Language (L2) Speech and Perception. I knew very little about either topic beforehand, but these talks got me really interested in both.
Paper recommendations
There was also a lot of interesting work at the intersection of speech disfluencies, accents, perception, and fair speech technologies. Here are some papers that I encountered (including our own, of course 🤓):
- The influence of L2 accent strength and different error types on personality trait ratings. Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis
- CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions. Mario Zusag, Laurin Wagner, Bernhard Thallinger
- On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition. Peter Mihajlik, Yan Meng, Mate S Kadar, Julian Linke, Barbara Schuppler, Katalin Mády
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation. Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin
- Towards measuring fairness in speech recognition: Fair-Speech dataset. Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer
- LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech. Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim
- What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech Synthesis. Nicolò Loddo, Francisca Pessanha, Almila Akdag
- Self-supervised Speech Representations Still Struggle with African American Vernacular English. Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen
- AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection. Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li
- YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection. Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli
Food for thought
While I generally enjoyed the conference, I could not help noticing how prevalent the language and thought patterns of the medical model of disability are in the community.
With “Speech and Health” being one of the five proposed themes for InterSpeech 2024, many sessions were dedicated to the understanding and processing of atypical speech. However, terms such as “disordered speech” and “pathological speech” were most often used in session names and paper titles, even for sessions that were mostly dedicated to stuttered speech – which has been increasingly recognized as a “speech diversity”, rather than a “speech disorder”, by the stuttering community and its allies. I really appreciated that Rong made a slide about this during his presentation, but was disappointed that nobody picked up on this point in the presentations that followed in the same session.
The top applications for technologies developed for “disordered speech”, as I heard at the conference, were often to diagnose (to support early intervention) and to remove/fix disfluencies at the speech level. Overall, I got the sense that the InterSpeech community views speech disfluencies as a problem to be gotten rid of for the benefit of both the speaker and the listener (including the AI), which made me feel unaccepted and disempowered as a person who stutters.

Example: a slide that says “Disfluency detection (and removal) can highly improve user experience with these apps.” But why can’t PWS (people who stutter) have a good user experience with their disfluencies?
A potentially related observation: while another main theme this year was “Human-Machine Interaction”, very little work actually involved direct testing or evaluation with human subjects (aka target users). Evaluations with static metrics over standard or specialized datasets were common and well respected, which felt a bit incomplete to me, coming from CHI/ASSETS land.