STT (Speech-to-Text) — AI Voice Companion Glossary

Definition

STT (Speech-to-Text), also called Automatic Speech Recognition (ASR), is the technology that converts spoken audio into written text. In AI companion platforms, STT is the first stage of a real-time voice call pipeline — your speech is captured by your microphone, converted to text by the STT system, and then processed by the AI to generate a response.

A pipeline built this way cannot support voice calls without STT, and the accuracy of the STT stage limits how accurately the AI can understand you.

How STT Works in a Voice Call

During a real-time AI companion voice call, the STT system runs continuously in the background, in four stages:

Audio capture: Your microphone captures the audio stream
Voice activity detection: The system identifies when you are speaking versus silent
Streaming transcription: As you speak, the STT system begins converting your words to text in real time — not waiting until you finish
Transcription delivery: The text output is passed to the language model for response generation
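
The voice activity detection stage can be illustrated with a simple energy threshold. This is only a minimal sketch under stated assumptions, not how any particular platform implements it: real systems often use trained models rather than a fixed threshold, and the names rms, speech_segments, threshold, and hangover are hypothetical choices made for this example.

```python
import math
from typing import Iterable, Iterator, List

Frame = List[float]  # one short chunk of audio samples in [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(
    frames: Iterable[Frame],
    threshold: float = 0.02,  # assumed energy cutoff between speech and silence
    hangover: int = 2,        # assumed number of silent frames that ends an utterance
) -> Iterator[List[Frame]]:
    """Group speech frames into utterances, dropping silent frames."""
    current: List[Frame] = []
    silent_run = 0
    for frame in frames:
        if rms(frame) > threshold:
            current.append(frame)
            silent_run = 0
        elif current:
            silent_run += 1
            if silent_run >= hangover:
                yield current
                current = []
                silent_run = 0
    if current:
        # The stream ended while an utterance was still open.
        yield current
```

Each yielded segment is what would be handed to the transcription stage. The hangover count keeps a brief pause in the middle of a sentence from splitting it into two utterances.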

Streaming transcription is what reduces total response latency: because most of the transcript already exists by the time you stop speaking, the AI can begin processing before you’ve completed your sentence.
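
The latency difference can be made concrete with a small calculation. The sketch below is a simplified model, not a measurement: it assumes audio arrives in fixed chunks, that each chunk is transcribed faster than real time, and that the per-chunk costs are made-up numbers. The function name transcript_delay_ms is hypothetical.

```python
from typing import List


def transcript_delay_ms(chunk_costs_ms: List[float], streaming: bool) -> float:
    """Time from the end of speech until the full transcript is ready.

    chunk_costs_ms holds the assumed processing cost of each audio chunk.
    """
    if not chunk_costs_ms:
        return 0.0
    if streaming:
        # Earlier chunks were transcribed while the user was still talking,
        # so only the final chunk is outstanding when speech ends.
        return chunk_costs_ms[-1]
    # Batch mode starts only after speech ends and must process every chunk.
    return sum(chunk_costs_ms)


costs = [40.0, 40.0, 40.0, 40.0, 40.0]  # illustrative values only
batch_delay = transcript_delay_ms(costs, streaming=False)
streaming_delay = transcript_delay_ms(costs, streaming=True)
```

Under these assumptions the batch delay grows with the length of the utterance, while the streaming delay stays near the cost of a single chunk.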

Why STT Quality Matters

STT quality determines whether the AI understands what you actually said. Poor STT creates a cascade of problems downstream:

Misheard words produce responses to something you didn’t say
Poor accent handling raises error rates for speakers whose accents differ from those the system recognizes best
Sensitivity to background noise makes conversations unreliable in noisy environments
Slow transcription makes the voice call feel stilted before the AI has even begun to respond

High-quality STT systems handle a wide range of accents, moderate background noise, natural variation in speaking pace, and incomplete sentences without a drop in transcription accuracy. This is why STT is one of the dimensions that separates platforms with genuinely good voice calls from those whose voice features are technically present but frustrating in practice.
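
Transcription accuracy is commonly quantified as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that metric; the function name word_error_rate and the simple lowercase-and-split normalization are assumptions for illustration, and real evaluations usually normalize punctuation and numbers more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("lights" -> "light") and one deletion ("please")
# over five reference words.
wer = word_error_rate("turn the lights off please", "turn the light off")
```

A lower WER means fewer misheard words reach the language model. Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the reference.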

STT in Practice

On Affiny, STT runs continuously during a live voice call — your speech is transcribed in real time and passed to the companion AI before you’ve finished speaking, which is what allows the companion to respond without a long pause after you stop talking. The same streaming STT principle applies on any platform offering real-time bidirectional voice.

STT vs TTS

STT and TTS (Text-to-Speech) are the two audio conversion stages that bookend every AI voice call:

STT converts your spoken voice → text the AI can process
TTS converts the AI’s text response → spoken audio you can hear

Together they create the bidirectional voice pipeline. STT quality affects whether the AI understands you; TTS quality affects whether the AI’s response sounds natural and expressive.
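
One conversational turn through this pipeline can be sketched as three function calls. The code below is a hypothetical outline, not any platform's actual API: voice_turn and its transcribe, generate_reply, and synthesize parameters are placeholder names, and the stand-in functions only mimic the shape of real STT, language model, and TTS components.

```python
from typing import Callable


def voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],      # STT: audio in, text out
    generate_reply: Callable[[str], str],    # language model: text in, text out
    synthesize: Callable[[str], bytes],      # TTS: text in, audio out
) -> bytes:
    """Run one user utterance through STT, the language model, and TTS."""
    user_text = transcribe(audio)
    reply_text = generate_reply(user_text)
    return synthesize(reply_text)


# Stand-in components so the sketch runs without any speech library.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_model(text: str) -> str:
    return "You said: " + text


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = voice_turn(b"hello", fake_stt, fake_model, fake_tts)
```

Because the language model only ever sees the output of transcribe, any STT error is carried into the reply regardless of how good the TTS stage is.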

FAQ

What is STT in AI companion apps?

STT (Speech-to-Text) is the technology that converts your speech into text during a voice call. It’s the first step in real-time voice conversation — your words are captured, transcribed, and passed to the AI so it can generate a response in your companion’s voice.

Does STT affect voice call quality?

Significantly. Poor STT produces transcription errors that cause the AI to respond to something you didn’t say. Good STT handles accents, background noise, and natural speech patterns accurately. STT quality is one of the primary reasons voice calls feel natural on some platforms and frustrating on others.

Is STT different from voice recognition?

The terms are often used interchangeably. Voice recognition technically refers to identifying who is speaking; speech-to-text refers to converting what they said to text. In the AI companion context, “voice recognition” typically means STT — converting speech to processable text.