Definition

Speech to Text (STT)

Automatic speech recognition that transcribes spoken words into written text.

What is Speech to Text?

Speech to text (STT), also called automatic speech recognition (ASR), is the technology that converts spoken language into written text. When you dictate a message on your phone or when a voice AI listens to a caller, STT is the component doing the transcription. It processes raw audio, often streamed in real time, and outputs a sequence of words, typically along with confidence scores and timing information.
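To make the shape of that output concrete, here is a minimal sketch of how a transcript with per-word confidence and timing might be represented. The names `Word`, `Transcript`, and `low_confidence`, and the 0.6 threshold, are illustrative assumptions and do not correspond to any particular provider's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float       # seconds from the start of the audio (assumed unit)
    end: float         # seconds from the start of the audio (assumed unit)
    confidence: float  # assumed to range from 0.0 to 1.0


@dataclass
class Transcript:
    words: List[Word]

    def text(self) -> str:
        """Join the recognized words into a plain transcript string."""
        return " ".join(w.text for w in self.words)

    def low_confidence(self, threshold: float = 0.6) -> List[Word]:
        """Return words the recognizer was unsure about (hypothetical threshold)."""
        return [w for w in self.words if w.confidence < threshold]


transcript = Transcript(words=[
    Word("book", 0.00, 0.31, 0.98),
    Word("an", 0.31, 0.40, 0.95),
    Word("appointment", 0.40, 1.05, 0.52),
])
print(transcript.text())
print([w.text for w in transcript.low_confidence()])
```

Keeping confidence and timing per word lets a downstream system, for example, ask the caller to repeat a word it was unsure about instead of acting on a likely mistake.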

How ASR accuracy is measured

The standard metric is Word Error Rate (WER): the number of word insertions, deletions, and substitutions needed to turn the system's output into the reference transcript, divided by the total number of words in the reference. State-of-the-art models from providers such as Deepgram and Google, as well as OpenAI's Whisper, achieve WER below 5% on clean English audio. However, accuracy degrades with background noise, strong accents, and domain-specific jargon, which is why many providers offer custom model training.
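The WER definition above can be sketched in a few lines of Python. This is a minimal illustration that assumes simple lowercasing and whitespace tokenization; real evaluations usually apply more careful text normalization (punctuation, numerals, contractions), and the function name `word_error_rate` is just an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level edit distance via dynamic programming.
    # prev[j] holds the distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("me" -> "be") and one deletion ("morning") over 6 words.
wer = word_error_rate(
    "please call me back tomorrow morning",
    "please call be back tomorrow",
)
print(f"{wer:.3f}")  # prints 0.333
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100% when the output contains many extra words.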

STT in voice AI pipelines

In a live phone call handled by a voice AI, STT operates in streaming mode: it begins transcribing as soon as the caller starts speaking, emitting partial results before the caller finishes their sentence. This reduces perceived latency because the language model can start processing before the full utterance is complete. Low-latency STT is critical, because each additional 100 milliseconds of delay makes the conversation feel less natural.
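The partial-then-final pattern described above can be sketched as follows. This is a simplified simulation, not a real provider integration: the `SttEvent` class, its `is_final` flag, and the `live_transcript` function are hypothetical names, and actual streaming APIs differ in their field names and transport.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class SttEvent:
    text: str
    is_final: bool  # hypothetical flag: True once the recognizer commits the segment


def live_transcript(events: Iterable[SttEvent]) -> Iterator[str]:
    """Yield the best transcript so far after every streaming event.

    Partial results replace one another; final results are committed
    and never revised.
    """
    committed: List[str] = []
    for event in events:
        if event.is_final:
            committed.append(event.text)
            yield " ".join(committed)
        else:
            # Show committed text plus the current, still-changing partial.
            yield " ".join(committed + [event.text])


# Simulated stream for one caller turn.
events = [
    SttEvent("I'd like", is_final=False),
    SttEvent("I'd like to book", is_final=False),
    SttEvent("I'd like to book an appointment", is_final=True),
    SttEvent("for", is_final=False),
    SttEvent("for Tuesday", is_final=True),
]
snapshots = list(live_transcript(events))
for snapshot in snapshots:
    print(snapshot)
```

A downstream language model could begin work on the early snapshots and refine its response once the final segment arrives, which is where the latency saving comes from.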

Providers commonly used in production voice AI include Deepgram (known for speed and accuracy on phone audio), Google Cloud Speech-to-Text, and OpenAI Whisper. Choosing the right provider depends on latency requirements, supported languages, and cost per audio minute.

Frequently asked questions

What is speech to text?
Speech to text (STT) is technology that automatically transcribes spoken language into written text. Also called automatic speech recognition (ASR), it powers voice assistants, call transcription, live captioning, and the listening component of voice AI systems.

How accurate is speech to text technology?
Leading STT providers achieve word error rates below 5% on clean English audio. Accuracy depends on audio quality, background noise, speaker accent, and domain vocabulary. Custom-trained models can improve accuracy for specific industries like healthcare or legal.

What is the difference between STT and ASR?
They refer to the same technology. Speech to text (STT) is the more common term in product contexts, while automatic speech recognition (ASR) is preferred in academic and research settings. Both describe the process of converting spoken audio into written text.
