
November 18, 2024 | By Charan Sridhar (Research & Engineering Intern at AImpower.org)
Introduction to Speech to Text Software
Automatic Speech Recognition (ASR) technologies are something most of us take for granted. They convert spoken language into written text, and most of us use them nearly every day through tools such as Siri, Alexa, and Google Assistant. More broadly, they power a wide array of applications, including transcription services and accessibility tools.
However, ASR, like many applications of machine learning, tends to overlook minorities and marginalized groups. Machine learning involves training computer models on data to learn patterns, and this process lends itself to unintended inequities: most training data comes from majority groups, so trained models learn to understand majority groups while minorities are neglected.
ASR is no exception. African American speakers, for example, encounter far more errors because of the biased makeup of the training data. Likewise, while ASR systems perform well on fluent speech, they tend to struggle with stuttered speech because their training data is primarily fluent speech.
These limitations reveal an inherent bias in ASR technologies, which tend to be optimized for “standard” speech patterns, sidelining those who do not conform to these norms. Such inequities are particularly pronounced for individuals with speech disfluencies, where frequent pauses, repetitions, or prolonged sounds lead to poor recognition accuracy and reduced usability. Stuttering affects about 1% of the world population, which amounts to around 3 to 4 million people in the USA alone; while this is a minority, it is still a significant population. In this study, we benchmark the performance of OpenAI’s Whisper model on stuttered speech from the SEP-28k dataset, highlighting the discrepancies and discussing the need for more inclusive speech recognition models.
Transcribing & Benchmarking SEP-28k
SEP-28k stands for Stuttering Events in Podcasts, and the 28k is the number of clips in the data. It is a dataset created by Apple containing audio clips of stuttered and fluent speech drawn from public podcasts, most of which feature people who stutter or interviews with them. The clips are labeled with five event types: blocks, prolongations, sound repetitions, word repetitions, and interjections. Apple created the dataset for stuttering event detection, so it contains no transcriptions of the words spoken. To create benchmarks from this data we need transcriptions, so we manually transcribed the audio clips ourselves.
Process
Isolating the Desired Audio Clips from SEP-28k:
SEP-28k has over 28,000 audio clips, far too many for us to transcribe manually, so we decided to evaluate at least 400 of each speech type (fluent, blocks, prolongations, sound repetitions, word repetitions, and interjections). SEP-28k classifies audio clips by having three reviewers go over each clip. Each reviewer indicates whether they think a clip contains each speech event, so each clip receives a score from 0 to 3 per event, with a score of 3 meaning all reviewers agree that the event is present. We then ranked the clips for each speech event by annotator agreement.
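The selection step above can be sketched in a few lines of Python. The clip records and field names here are illustrative only, not the actual SEP-28k file format:

```python
# Toy sketch (not the actual SEP-28k tooling): the dataset gives each clip a
# 0-3 agreement score per event type, one point per reviewer who heard it.
clips = [
    {"id": "clip_001", "Block": 3, "Prolongation": 0, "WordRep": 1},
    {"id": "clip_002", "Block": 2, "Prolongation": 3, "WordRep": 0},
    {"id": "clip_003", "Block": 3, "Prolongation": 1, "WordRep": 2},
]

def clips_with_agreement(clips, event, min_score=3):
    """Return ids of clips whose agreement score for `event` is >= min_score."""
    return [c["id"] for c in clips if c.get(event, 0) >= min_score]

unanimous = clips_with_agreement(clips, "Block")               # all 3 reviewers agree
fallback = clips_with_agreement(clips, "Block", min_score=2)   # 2 of 3 agree
```

Lowering `min_score` to 2 mirrors the fallback described below for event types that ran out of unanimous clips.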
Transcribing the Audio Clips:
We then went through each stuttering event and randomly picked audio clips with unanimous consensus on that event. We manually transcribed the clips using the same annotation approach as the StammerTalk dataset: stuttering events are labeled with certain symbols and brackets, which lets us compare ASR performance against both the semantic and literal content of each clip. While transcribing, we noticed multiple mistakes in the original labeling, so we created our own manual labels. For some speech events there were fewer than 400 unanimous clips, so we fell back to clips on which two of the three reviewers agreed. In the end, we had more than 400 clips for most speech events, because many clips contained more than one stuttering event and errors in the original labeling led us to transcribe extra clips beyond the expected 2,400.
It was clear that the labels in the data were not created by people who stutter, because the errors often involved an inability to hear the nuance and change of voice when someone stutters. For example, when a speaker drags out “ummmm,” they are often just thinking rather than producing an actual prolongation. When someone stutters, they often change their speaking tempo, change their breathing, or their voice becomes strained. A short pause where someone’s voice is strained is a block, but a long pause where someone is thinking and their voice sounds fine is fluent speech.
Here is the breakdown of speech events in our data:
- Fluent: 542
- Blocks: 400
- Prolongations: 403
- SoundRep: 506
- WordRep: 450
- Interjections: 694
Evaluating Whisper:
We used OpenAI’s Whisper large-v3-turbo model, called through the OpenAI API to obtain results in a reasonable time, because running the model locally was too slow. The model ran inference on each audio clip that we had manually transcribed. Once we had all of Whisper’s predictions, we compared each one to both the literal and the semantic manual transcription of its clip. Literal transcriptions include all the stuttering events, such as word repetitions, e.g. “when [when] are you guys getting”; the corresponding semantic transcription drops the stuttering events: “when are you guys getting.” This difference is useful for determining whether Whisper focuses on literally transcribing the words spoken or on understanding their meaning at the cost of exact word accuracy. We measured Word Error Rate (WER), Bilingual Evaluation Understudy (BLEU) score, and BERTScore F1 for both semantic and literal transcriptions. We also evaluated the prevalence of hallucinations across speech categories.
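As a rough illustration, a semantic reference can be derived from a literal transcript by stripping the bracketed annotations, assuming the bracket convention shown in the example above (the full StammerTalk annotation scheme uses additional symbols not handled here):

```python
import re

def semantic_from_literal(literal: str) -> str:
    """Drop bracketed stuttering annotations from a literal transcript,
    leaving only the intended (semantic) content.
    Assumes the simple [...] convention from the example; the real
    annotation scheme is richer."""
    without = re.sub(r"\[[^\]]*\]", " ", literal)  # remove [...] spans
    return " ".join(without.split())               # normalize whitespace
```

For instance, `semantic_from_literal("when [when] are you guys getting")` yields `"when are you guys getting"`.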
WER is a metric that evaluates the model’s literal output rather than its meaning: the percentage of incorrect words out of the total transcription. It is calculated by:

WER = (S + D + I) / N

where S is the number of substitutions, D is the number of words deleted, I is the number of words inserted, and N is the total number of words in the reference transcription.
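A minimal word-level implementation of WER, which also returns the substitution/deletion/insertion counts discussed later in the results, might look like this (a sketch, not the evaluation code we actually ran):

```python
def wer_breakdown(reference: str, hypothesis: str):
    """WER = (S + D + I) / N via word-level edit distance,
    also returning the substitution/deletion/insertion counts."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [dp[i - 1][j - 1]]              # match, no cost
            else:
                c, s, d, ins = dp[i - 1][j - 1]
                cand = [(c + 1, s + 1, d, ins)]        # substitution
            c, s, d, ins = dp[i - 1][j]
            cand.append((c + 1, s, d + 1, ins))        # deletion
            c, s, d, ins = dp[i][j - 1]
            cand.append((c + 1, s, d, ins + 1))        # insertion
            dp[i][j] = min(cand)
    cost, S, D, I = dp[len(ref)][len(hyp)]
    return {"wer": cost / len(ref), "S": S, "D": D, "I": I}
```

On the example above, the repeated “when” counts as one insertion against the semantic reference, giving a WER of 1/5 = 0.2.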
BLEU is a metric that measures the geometric average of the precision scores of different n-grams in the predicted transcript. N-grams are sequences of words that are n words long: a unigram is a single word, and a bigram is two words in sequence. For this project we used BLEU-2, i.e., BLEU with the geometric average of unigram and bigram precision.
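A simplified BLEU-2 sketch conveys the idea; note that it uses clipped n-gram precision only and omits the brevity penalty of full BLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu2(reference: str, candidate: str) -> float:
    """Geometric mean of clipped unigram and bigram precision
    (simplified: no brevity penalty, single reference)."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in (1, 2):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count to how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / 2)
```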
BERTScore is a metric that measures the semantic similarity of the two transcriptions using embeddings from the BERT model. It aligns tokens from the candidate text with those from the reference based on cosine similarity, capturing meaning rather than exact word matches.
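The greedy-matching idea behind BERTScore can be illustrated with toy word vectors standing in for real contextual BERT embeddings (an assumption purely for demonstration; the real metric embeds tokens with a pretrained BERT model):

```python
import math

# Toy 2-D embeddings standing in for BERT token embeddings (illustrative only).
EMB = {
    "when": (1.0, 0.0), "are": (0.0, 1.0), "you": (1.0, 1.0),
    "guys": (0.5, 1.0), "getting": (1.0, 0.5),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def bertscore_f1(reference, candidate, emb=EMB):
    """Greedy matching as in BERTScore: each candidate token is matched to
    its most similar reference token (precision), and vice versa (recall)."""
    ref = [emb[w] for w in reference.split()]
    cand = [emb[w] for w in candidate.split()]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)
```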
Results
We begin by analyzing the WER, 2-gram BLEU, and BERTScore F1 scores for the literal transcriptions, which do not have the stutters removed. Figure 1 shows these metrics for the Whisper large-v3 model on each speech type when evaluated against the literal transcriptions. As expected, we see a large gap in performance between stuttered and fluent speech: relative to fluent speech, stuttered speech shows a 20% worse Word Error Rate, a 22% worse BLEU, and a 13% worse BERTScore.
The performance on fluent clips is lower than reported by OpenAI largely because the clips in SEP-28k are only 3 seconds long, shorter than the 10-15 second clips of Common Voice 15 that Whisper was benchmarked on. The shorter SEP-28k clips give Whisper less opportunity to pick up the context of the audio and contribute to the lower performance.
Turning to the more detailed breakdown across stuttering subcategories, the graph shows that the model handled Interjections and Word Repetitions better than Sound Repetitions and Blocks, with which it heavily struggled. Comparing Sound Repetitions against Word Repetitions, WER was 20% higher, BLEU 25% lower, and BERTScore 15% lower.
Overall, Figure 2, which uses the semantic transcription references, shows a performance trend similar to Figure 1’s literal transcription comparison. There is still a large gap between fluent and stuttered speech: 25.4% in WER, 68% in BLEU, and 20.1% in BERTScore.
When it comes to the stuttering subcategories, the model still does best on Word Repetitions and Interjections and worst on Sound Repetitions and Blocks. The BLEU and BERTScore are 40% and 23% higher, respectively, when comparing Word Repetitions and Sound Repetitions.
In Figure 3, we can see that Sound Repetitions and Blocks have very high numbers of substitutions and insertions, which suggests that Whisper hallucinations are very common for those types of speech. On the other hand, we again see significantly better metrics for Word Repetitions and Interjections compared to the other stutter types. The performance difference between Sound Repetitions and Word Repetitions is especially evident in this graph.
The differences between stuttering types are drastic: word repetitions and interjections have the best performance, while sound repetitions and blocks have the worst. This is another aspect of the data highlighting that Whisper was optimized for fluent speech. Interjections and word repetitions are disfluencies common among fluent individuals, as people often repeat words or add interjections simply because they are thinking or nervous. In contrast, sound repetitions and blocks are characteristic of people who stutter. This disparity between disfluencies common among fluent speakers and those common among people who stutter underscores that Whisper was optimized for fluent individuals.
While these results are mostly similar to those of the literal transcriptions, there are many more insertions overall. We can also see an extremely large jump in the prevalence of insertions for Word Repetitions. This suggests that Whisper does transcribe the repeated words, so when its output is compared to the semantic transcription, which drops them, the insertion rate is high.
We can also see that the model does not let the presence of stuttering affect its understanding of the audio: the BLEU and BERTScore values for the literal and semantic references are extremely similar, with only a negligible improvement for semantic. Semantic WER is higher than literal WER, though only by about 3%, which shows that Whisper transcribes the stutters in its text without letting them affect its understanding.
In Figure 6, we can once again see the gap in hallucination frequency between stuttered and fluent clips. Most noticeable is the large discrepancy in hallucination frequency between stuttering categories. Once again, Sound Repetitions, Blocks, and Prolongations have by far the worst performance among the stuttering types. In contrast, Word Repetitions and Interjections produce very few hallucinations, though Word Repetitions still show more hallucinations than fluent clips.
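How hallucinations are counted matters for this comparison. As one hypothetical proxy (an assumption for illustration, not necessarily the criterion used in our evaluation), a prediction could be flagged when the words it adds or changes relative to the reference exceed some fraction of the reference length:

```python
import difflib

def looks_like_hallucination(reference: str, prediction: str,
                             max_extra_ratio: float = 0.5) -> bool:
    """Heuristic proxy (hypothetical threshold): flag a prediction when the
    words inserted or substituted relative to the reference make up more
    than `max_extra_ratio` of the reference length."""
    ref, pred = reference.split(), prediction.split()
    sm = difflib.SequenceMatcher(a=ref, b=pred)
    # Count words on the prediction side of every insert/replace opcode.
    extra = sum(j2 - j1 for op, i1, i2, j1, j2 in sm.get_opcodes()
                if op in ("insert", "replace"))
    return extra / max(len(ref), 1) > max_extra_ratio
```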
Conclusion
The rise of Automatic Speech Recognition (ASR) technologies has benefited many, but it has also created new accessibility barriers for people with speech impediments, such as stuttering. Our research addresses this challenge by benchmarking OpenAI’s Whisper v3 on the SEP-28k dataset to assess its performance on stuttered speech. The results reveal noticeable differences between Whisper v3’s accuracy on stuttered versus fluent speech, highlighting ongoing inequities in speech-to-text models. We see significant gaps between stuttered and fluent speech across all metrics, including hallucination prevalence.
Continuous effort is essential to eliminate biases in the technologies we rely on daily. Independent researchers play a crucial role in identifying these issues and driving the push for a fairer, more equal world.




