We benchmarked the leading speech-to-text (STT) providers, focusing specifically on healthcare applications. Our benchmark used real-world examples to assess transcription accuracy in medical contexts, where precision is crucial.
Benchmark results
Averaged across both tasks, Deepgram achieved the lowest WER, making it the leading speech-to-text provider for healthcare in this benchmark.
Methodology
Dataset
We wanted to evaluate each tool's transcription accuracy in a specific domain, so we conducted two tasks:
Task 1: Healthcare voice data
- Total number of samples: 100
- Total duration: 9 minutes and 25 seconds
- Average duration per sample: 5.65 seconds
- Content: Healthcare voice data including medical terminology, patient interactions, and clinical discussions
- Variety: Different speakers, varying audio quality, and diverse medical contexts spoken in English
Audio specifications:
- Format: WAV
- Channels: 1 (Mono)
- Sample width: 16-bit
- Sample rate: 16 kHz
- Consistent bitrate: 256 kbps
- Duration range: ~4.5 to 11.5 seconds per file
Task 2: An anatomy lecture
- Total number of samples: 1
- Total duration: 8 minutes and 35 seconds
- Content: An anatomy lecture given by a doctor, including medical terminology
- Variety: One speaker speaking in English; music plays in the background during the first half of the recording.
Audio specifications:
- Format: WAV
- Channels: 2 (Stereo)
- Sample width: 16-bit
- Sample rate: 48 kHz
- Consistent bitrate: 1536 kbps
Evaluation metrics
We used Word Error Rate (WER) and Character Error Rate (CER) as evaluation metrics for transcription accuracy. Word Error Rate is calculated as:
WER = (S + D + I) / N
Where:
- S = Number of substitutions
- D = Number of deletions
- I = Number of insertions
- N = Total number of words in the ground truth
The formula calculates the minimum number of word-level operations needed to transform the hypothesis into the reference, divided by the number of words in the reference. Lower WER indicates better accuracy, with 0% being a perfect match.
The Character Error Rate (CER) is calculated by dividing the total number of character-level errors (including insertions, deletions, and substitutions) by the total number of characters in the reference text.
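Both metrics reduce to a Levenshtein edit distance, computed over words for WER and over characters for CER. A minimal stdlib-only sketch (the example sentences are illustrative, not from the benchmark dataset):

```python
# Minimal WER/CER computation via word- and character-level
# Levenshtein (edit) distance; standard library only.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn hyp into ref (classic dynamic programming)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    """WER = (S + D + I) / N over words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Same formula over characters."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substitution ("shows" -> "has") over 4 reference words.
print(wer("the patient shows tachycardia",
          "the patient has tachycardia"))  # → 0.25
```

Note that splitting a medical term in two ("tachycardia" vs "tachy cardia") counts as one substitution plus one insertion, which is why WER penalizes such errors heavily on short clips.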
We used speech-to-text APIs to transcribe audio files to text.
The maximum file size each provider accepts per request is shown in the table:
Note: For providers with smaller file size limits (such as Google and OpenAI), larger audio files must be split into smaller chunks before processing; we did this in Task 2.
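Splitting a long recording into fixed-length chunks can be done with the standard-library `wave` module alone. A minimal sketch, where the file names, the 4-second chunk length, and the synthetic 10-second input are illustrative rather than the benchmark's actual data:

```python
# Split a WAV file into fixed-length chunks before sending it to a
# size-limited speech-to-text API. Standard library only.
import wave

def split_wav(path, chunk_seconds, out_prefix):
    """Write consecutive chunks of `path` as <out_prefix>_NNN.wav and
    return the list of chunk file names."""
    out_files = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{out_prefix}_{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)   # nframes is corrected on close
                dst.writeframes(frames)
            out_files.append(name)
            index += 1
    return out_files

# Create a synthetic 10-second mono 16 kHz, 16-bit WAV of silence...
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 10)

# ...and split it into chunks of at most 4 seconds: 4 s + 4 s + 2 s.
chunks = split_wav("sample.wav", 4, "chunk")
print(chunks)  # → ['chunk_000.wav', 'chunk_001.wav', 'chunk_002.wav']
```

Splitting on fixed boundaries can cut a word in half at the seam; splitting on silence instead avoids this, at the cost of a more involved implementation.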
Speech recognition
Speech recognition enables computers to transcribe audio files into text with the help of machine learning algorithms. A transcription service's API can be used from various programming languages for batch transcription. These platforms support both real-time and asynchronous transcription.
Speech recognition technology has numerous applications, including transcription, voice assistants, and language translation.
Benefits of using speech recognition for transcription
- Fast transcription of audio files
- Time and effort savings
- Real-time transcription and translation
- Accessibility for individuals with disabilities
How do speech-to-text AI tools work?
The transcription process includes:
- Audio data is uploaded or streamed to the speech-to-text tool
- Machine learning algorithms analyze the audio data and identify patterns in speech
- The tool converts the speech to text using a speech-to-text engine
- The transcribed text is displayed to the user
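The steps above can be sketched as a small batch loop. `fake_engine` below is a hypothetical placeholder standing in for a real provider's API call, so the example runs without credentials or network access:

```python
# A minimal sketch of the transcription flow as a batch loop.
# `fake_engine` is a placeholder, not any provider's real SDK.
from pathlib import Path

def fake_engine(audio_bytes):
    # Stand-in for the provider's speech-to-text engine.
    return f"<transcript of {len(audio_bytes)} audio bytes>"

def transcribe_batch(paths):
    results = {}
    for path in paths:
        audio = Path(path).read_bytes()  # step 1: load/upload the audio
        text = fake_engine(audio)        # steps 2-3: analyze and convert
        results[path] = text             # step 4: text shown to the user
    return results

# Demo with a dummy file in place of real recorded audio.
Path("visit_001.wav").write_bytes(b"\x00" * 100)
results = transcribe_batch(["visit_001.wav"])
print(results["visit_001.wav"])  # → <transcript of 100 audio bytes>
```

In a real pipeline, the engine call would be replaced by the chosen provider's SDK or HTTP endpoint, with retries and rate limiting around it.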
FAQ
What are the applications of speech recognition technology?
Speech recognition is used for transcription of audio and video recordings, as well as in:
- Voice assistants and virtual assistants
- Language translation and interpretation
- Speech-to-text (ASR) systems for individuals with disabilities
What are the features of leading speech-to-text providers?
Their pre-trained models enable automatic speech recognition (ASR) for recorded audio and video files. High-accuracy audio transcriptions include automatic punctuation and topic detection.
An open-source engine, or a speech recognition provider from a service your company already works with (e.g. Google Cloud, AWS Transcribe), can be chosen as the transcription solution for your company's needs. Some providers also offer free credits, but we recommend being cautious about data security.
How to convert audio files to text?
A speech-to-text API can help transcribe audio files into text. Processing and analysis of audio data:
- Audio data is processed using techniques such as noise reduction and echo cancellation
- The audio data is then analyzed using machine learning algorithms to identify patterns in speech
- The algorithms use acoustic models and language models to recognize spoken words and phrases
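As one deliberately simple illustration of the pre-processing step, a noise gate mutes samples whose amplitude falls below a threshold. The signal and threshold below are made up, and production systems use spectral noise-reduction methods rather than a plain amplitude gate:

```python
# A toy noise gate: samples below the threshold are treated as
# background noise and zeroed out; louder "speech" passes through.
# 16-bit samples represented as plain ints; standard library only.

def noise_gate(samples, threshold):
    """Zero out every sample whose absolute amplitude is below threshold."""
    return [s if abs(s) >= threshold else 0 for s in samples]

# Quiet samples (|s| < 100) are muted; the rest are unchanged.
signal = [12, -8, 900, -1200, 5, 400]
print(noise_gate(signal, 100))  # → [0, 0, 900, -1200, 0, 400]
```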
Converting speech to text using machine learning algorithms:
- Machine learning algorithms are trained on large datasets of audio and text data
- The algorithms learn to recognize patterns in speech and convert them into text
- The algorithms can be fine-tuned and customized for specific use cases and languages