Voice Synthesis
The AI generation of speech from text, producing audio that closely resembles a real human voice.
Voice synthesis (also called text-to-speech or TTS) converts written text into spoken audio using an AI model. Modern AI voice synthesis produces natural-sounding speech with appropriate pacing, emphasis, and emotion — often indistinguishable from a human voice to casual listeners.
How it works: Neural TTS models are trained on many hours of recorded speech. They learn to map text to audio waveforms, capturing patterns of rhythm, tone, and prosody. Advanced platforms such as ElevenLabs can clone a voice from a short audio sample.
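In practice, most products use a hosted TTS service through a REST API: you send text and a voice ID, and receive audio bytes back. Below is a minimal sketch of assembling such a request, loosely modeled on the ElevenLabs API; the endpoint path, voice ID, model name, and API key shown are illustrative assumptions, not guaranteed values — check the provider's documentation before use.

```python
import json

# Base URL modeled on the ElevenLabs text-to-speech endpoint (assumption).
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"

def build_tts_request(text: str, voice_id: str, api_key: str) -> dict:
    """Assemble the URL, headers, and JSON body for one synthesis call."""
    return {
        "url": f"{API_BASE}/{voice_id}",
        "headers": {
            "xi-api-key": api_key,           # per-account API key
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "text": text,
            "model_id": "eleven_multilingual_v2",  # assumed model name
        }),
    }

# Hypothetical voice ID and key for illustration only.
request = build_tts_request("Hello, world.", "voice_abc123", "sk-...")
# POSTing request["url"] with request["headers"] and request["body"]
# would return audio bytes (e.g. an MP3) to write to a file.
```

The point of the sketch is the shape of the exchange — plain text in, audio out — rather than any specific vendor's schema.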
Key capabilities in 2026:
- Voice cloning: Create a digital replica of any voice from 1–5 minutes of audio.
- Emotion and style control: Adjust delivery for narration, advertising, or conversational tone.
- Multilingual: Generate speech in 30+ languages while preserving a cloned voice.
- Real-time synthesis: Generate speech fast enough for live phone calls and voice agents.
Commercial uses: Voiceover for YouTube videos, podcast production, e-learning courses, customer service phone bots, and accessibility tools.
Ethical note: Cloning a voice requires consent. Most platforms prohibit cloning voices without permission.
Example
You paste a 500-word script into ElevenLabs, select a cloned voice, and download a broadcast-quality MP3 in under 10 seconds — no microphone, no recording session.
Related terms
Inference
The process of a trained model generating a response to an input. When you chat with ChatGPT, that's inference.
AI Agent
A software program that uses an AI model to take autonomous actions — browsing the web, writing files, calling APIs — to complete a goal without step-by-step human instructions.