We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial.
Benchmark results
Averaged across both tasks, Deepgram achieved the lowest WER, making it the leading speech-to-text provider for healthcare in this benchmark.
Methodology
Dataset
We wanted to evaluate each tool's transcription accuracy in a specific domain, so we conducted two tasks:
Task 1: Healthcare voice data
- Total number of samples: 100
- Total duration: 9 minutes and 25 seconds
- Average duration per sample: 5.65 seconds
- Content: Healthcare voice data including medical terminology, patient interactions, and clinical discussions
- Variety: Different speakers, varying audio quality, and diverse medical contexts spoken in English
Audio specifications:
- Format: WAV
- Channels: 1 (Mono)
- Sample width: 16-bit
- Sample rate: 16 kHz
- Consistent bitrate: 256 kbps
- Duration range: ~4.5 to 11.5 seconds per file
Task 2: An anatomy lecture
- Total number of samples: 1
- Total duration: 8 minutes and 35 seconds
- Content: An anatomy lecture given by a doctor, including medical terminology
- Variety: One speaker speaking in English; music plays in the background during the first half of the recording.
Audio specifications:
- Format: WAV
- Channels: 2 (Stereo)
- Sample width: 16-bit
- Sample rate: 48 kHz
- Consistent bitrate: 1536 kbps
Evaluation metrics
We used Word Error Rate (WER) and Character Error Rate (CER) as evaluation metrics for transcription accuracy. Word Error Rate is calculated as:
WER = (S + D + I) / N
Where:
- S = Number of substitutions
- D = Number of deletions
- I = Number of insertions
- N = Total number of words in the ground truth
The formula calculates the minimum number of word-level operations needed to transform the hypothesis into the reference, divided by the number of words in the reference. Lower WER indicates better accuracy, with 0% being a perfect match.
The Character Error Rate (CER) is calculated by dividing the total number of character-level errors (including insertions, deletions, and substitutions) by the total number of characters in the reference text.
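Both metrics reduce to a Levenshtein edit distance, computed over words for WER and over characters for CER. A minimal stdlib-only sketch (the example sentences are illustrative, not from the benchmark dataset):

```python
# Minimal WER/CER computation via word- and character-level
# Levenshtein (edit) distance; standard library only.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn hyp into ref (classic dynamic programming)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    """WER = (S + D + I) / N over words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Same formula over characters."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substitution ("shows" -> "has") over 4 reference words.
print(wer("the patient shows tachycardia",
          "the patient has tachycardia"))  # → 0.25
```

Note that splitting a medical term in two ("tachycardia" vs "tachy cardia") counts as one substitution plus one insertion, which is why WER penalizes such errors heavily on short clips.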
We used speech-to-text APIs to transcribe audio files to text.
The maximum file size each provider accepts per request is shown in the table:
Note: For providers with smaller file size limits (such as Google and OpenAI), larger audio files must be split into smaller chunks before processing; we did this in Task 2.
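Splitting a long recording into fixed-length chunks can be done with the standard-library `wave` module alone. A minimal sketch, where the file names, the 4-second chunk length, and the synthetic 10-second input are illustrative rather than the benchmark's actual data:

```python
# Split a WAV file into fixed-length chunks before sending it to a
# size-limited speech-to-text API. Standard library only.
import wave

def split_wav(path, chunk_seconds, out_prefix):
    """Write consecutive chunks of `path` as <out_prefix>_NNN.wav and
    return the list of chunk file names."""
    out_files = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{out_prefix}_{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)   # nframes is corrected on close
                dst.writeframes(frames)
            out_files.append(name)
            index += 1
    return out_files

# Create a synthetic 10-second mono 16 kHz, 16-bit WAV of silence...
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 10)

# ...and split it into chunks of at most 4 seconds: 4 s + 4 s + 2 s.
chunks = split_wav("sample.wav", 4, "chunk")
print(chunks)  # → ['chunk_000.wav', 'chunk_001.wav', 'chunk_002.wav']
```

Splitting on fixed boundaries can cut a word in half at the seam; splitting on silence instead avoids this, at the cost of a more involved implementation.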
Speech recognition
Speech recognition enables computers to transcribe audio files into text with the help of machine learning algorithms. A transcription service's API can be used from various programming languages for batch transcription. These platforms support both real-time and asynchronous transcription.
Speech recognition technology has numerous applications, including transcription, voice assistants, and language translation.
Benefits of using speech recognition for transcription
- Fast transcription of audio files
- Time and effort savings
- Real-time transcription and translation
- Accessibility for individuals with disabilities
How do speech-to-text AI tools work?
The transcription process includes:
- Audio data is uploaded or streamed to the speech-to-text tool
- Machine learning algorithms analyze the audio data and identify patterns in speech
- The tool converts the speech to text using a speech-to-text engine
- The transcribed text is displayed to the user
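The steps above can be sketched as a small batch loop. `fake_engine` below is a hypothetical placeholder standing in for a real provider's API call, so the example runs without credentials or network access:

```python
# A minimal sketch of the transcription flow as a batch loop.
# `fake_engine` is a placeholder, not any provider's real SDK.
from pathlib import Path

def fake_engine(audio_bytes):
    # Stand-in for the provider's speech-to-text engine.
    return f"<transcript of {len(audio_bytes)} audio bytes>"

def transcribe_batch(paths):
    results = {}
    for path in paths:
        audio = Path(path).read_bytes()  # step 1: load/upload the audio
        text = fake_engine(audio)        # steps 2-3: analyze and convert
        results[path] = text             # step 4: text shown to the user
    return results

# Demo with a dummy file in place of real recorded audio.
Path("visit_001.wav").write_bytes(b"\x00" * 100)
results = transcribe_batch(["visit_001.wav"])
print(results["visit_001.wav"])  # → <transcript of 100 audio bytes>
```

In a real pipeline, the engine call would be replaced by the chosen provider's SDK or HTTP endpoint, with retries and rate limiting around it.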
FAQ
What are the applications of speech recognition technology?
Speech recognition is used for transcription of audio and video recordings, as well as in:
- Voice assistants and virtual assistants
- Language translation and interpretation
- Speech-to-text (ASR) systems for individuals with disabilities
What are the features of leading speech-to-text providers?
Their pre-trained models enable automatic speech recognition (ASR) for recorded audio and video files. High-accuracy audio transcriptions include automatic punctuation and topic detection.
An open-source engine, or a speech recognition provider from a service your company already works with (e.g. Google Cloud, AWS Transcribe), can be chosen as the transcription solution for your company's needs. Some providers also offer free credits, but we recommend being cautious about data security.
How to convert audio files to text?
A speech-to-text API can help transcribe audio files into text. Processing and analysis of audio data:
- Audio data is processed using techniques such as noise reduction and echo cancellation
- The audio data is then analyzed using machine learning algorithms to identify patterns in speech
- The algorithms use acoustic models and language models to recognize spoken words and phrases
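As one deliberately simple illustration of the pre-processing step, a noise gate mutes samples whose amplitude falls below a threshold. The signal and threshold below are made up, and production systems use spectral noise-reduction methods rather than a plain amplitude gate:

```python
# A toy noise gate: samples below the threshold are treated as
# background noise and zeroed out; louder "speech" passes through.
# 16-bit samples represented as plain ints; standard library only.

def noise_gate(samples, threshold):
    """Zero out every sample whose absolute amplitude is below threshold."""
    return [s if abs(s) >= threshold else 0 for s in samples]

# Quiet samples (|s| < 100) are muted; the rest are unchanged.
signal = [12, -8, 900, -1200, 5, 400]
print(noise_gate(signal, 100))  # → [0, 0, 900, -1200, 0, 400]
```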
Converting speech to text using machine learning algorithms:
- Machine learning algorithms are trained on large datasets of audio and text data
- The algorithms learn to recognize patterns in speech and convert them into text
- The algorithms can be fine-tuned and customized for specific use cases and languages