What is Speech to Text?
Speech to text (STT), also called automatic speech recognition (ASR), is the technology that converts spoken language into written text. When you dictate a message on your phone or when a voice AI listens to a caller, STT is the component doing the transcription. It processes raw audio — often streamed in real time — and outputs a sequence of words along with confidence scores and timing information.
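To make the shape of that output concrete, here is a minimal sketch of a word-level transcript. The `Word` class, its field names, and the 0.6 threshold are assumptions made for illustration; each provider defines its own response format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word (hypothetical schema, not any provider's format)."""
    text: str
    start: float       # seconds from the start of the audio
    end: float         # seconds from the start of the audio
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def transcript(words: List[Word]) -> str:
    """Join the recognized words into a plain transcript."""
    return " ".join(w.text for w in words)


def low_confidence_words(words: List[Word], threshold: float = 0.6) -> List[str]:
    """Return the words the recognizer was least sure about."""
    return [w.text for w in words if w.confidence < threshold]


# Example values are made up for illustration.
words = [
    Word("book", 0.00, 0.31, 0.97),
    Word("a", 0.31, 0.38, 0.97),
    Word("table", 0.38, 0.80, 0.52),
]
```

With this sample, `transcript(words)` gives "book a table" and `low_confidence_words(words)` flags "table", the kind of word an application might ask the caller to confirm.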
How ASR accuracy is measured
The standard metric is Word Error Rate (WER): the number of substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the total number of words in the reference. Because insertions count as errors, WER can exceed 100%. Leading models from providers such as Deepgram and Google, as well as OpenAI's Whisper, are commonly reported to achieve WER below 5% on clean English audio. However, accuracy degrades with background noise, strong accents, and domain-specific jargon, which is why many providers offer custom model training.
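The WER calculation can be sketched with a word-level edit distance. This is a minimal illustration that assumes whitespace tokenization and exact, case-sensitive matching; real evaluations usually normalize case, punctuation, and number formats first. The function name `word_error_rate` is chosen here for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the output "the cat sat on mat" against the reference "the cat sat on the mat" counts one deletion over six reference words, a WER of about 16.7%.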
STT in voice AI pipelines
In a live phone call handled by a voice AI, STT operates in streaming mode: it begins transcribing as soon as the caller starts speaking, emitting partial (interim) results before the caller finishes their sentence. This reduces perceived latency because the language model can start processing before the full utterance is complete. Low-latency STT is critical because its delay adds directly to the time the caller waits for a reply: every extra 100 milliseconds makes the conversation feel less natural.
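The partial-then-final pattern can be illustrated with a small simulation. Nothing here calls a real STT service: `simulate_streaming_stt`, `StreamResult`, and `handle_call` are hypothetical names, and the simulation assumes one partial result per word, whereas real recognizers emit partials on their own schedule and may revise earlier words.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class StreamResult(NamedTuple):
    text: str       # transcript so far
    is_final: bool  # True once the utterance is complete


def simulate_streaming_stt(words: Iterable[str]) -> Iterator[StreamResult]:
    """Stand-in for a streaming recognizer: one growing partial per word,
    followed by a single final result."""
    heard = []
    for word in words:
        heard.append(word)
        yield StreamResult(" ".join(heard), is_final=False)
    yield StreamResult(" ".join(heard), is_final=True)


def handle_call(results: Iterable[StreamResult]) -> Tuple[int, Optional[str]]:
    """Consume results as they arrive; return (partials seen, final text)."""
    partials_seen = 0
    final_text = None
    for result in results:
        if result.is_final:
            final_text = result.text  # commit the response to this text
        else:
            partials_seen += 1        # downstream work could begin here
    return partials_seen, final_text


partials, final = handle_call(simulate_streaming_stt("i need to reschedule".split()))
```

In this run the consumer sees four partial results before the final transcript "i need to reschedule", which is the window in which a pipeline could start preparing its response.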
Providers commonly used in production voice AI include Deepgram (known for speed and accuracy on phone audio), Google Cloud Speech-to-Text, and OpenAI Whisper. Choosing the right provider depends on latency requirements, supported languages, and cost per audio minute.