Vowel Synthesis from Formants
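Your vocal tract behaves, to a good approximation over short stretches of time, like a linear time-invariant (LTI) filter. When you say "aaah", its lowest three resonant frequencies, called formants F₁, F₂, F₃, colour the buzz from your vocal cords. Move those three resonances and you change the vowel. This is the principle behind classic formant synthesisers such as Klatt's 1980 design; later concatenative and neural text-to-speech engines generate the waveform differently, but the same source-filter picture describes what they have to reproduce.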
Source-filter model of speech
References: Wikipedia, "Formant"; Klatt, D. H. (1980). "Software for a cascade/parallel formant synthesizer." J. Acoust. Soc. Am. 67(3), 971–995; Rabiner & Schafer (2011). Theory and Applications of Digital Speech Processing, Ch. 4.
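The source-filter idea can be sketched in a few lines: an impulse train stands in for the vocal-cord buzz, and a cascade of two-pole resonators (the difference equation from Klatt's 1980 paper) stands in for the vocal tract. This is a minimal sketch, not the demo's actual implementation. The function names, the 16 kHz sample rate, the F₃ value and all three bandwidths are assumptions chosen for illustration; only F₁ ≈ 730 Hz and F₂ ≈ 1090 Hz come from the text.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    return a, b, c

def resonate(x, freq_hz, bw_hz, fs):
    """Run the signal x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, fs, n_samples):
    """Crude glottal source: one unit pulse per pitch period."""
    period = round(fs / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def synth_vowel(formants, f0_hz=110.0, fs=16000, duration_s=0.5):
    """Source (impulse train) followed by filter (cascade of resonators)."""
    signal = impulse_train(f0_hz, fs, int(fs * duration_s))
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal

# (formant Hz, bandwidth Hz); F3 and every bandwidth here are assumed values
VOWEL_A = [(730, 80), (1090, 90), (2440, 120)]
samples = synth_vowel(VOWEL_A)
print(len(samples), "samples of a synthetic /a/")
```

Changing `f0_hz` alters only the source, and changing the `formants` list alters only the filter, which is the separation the rest of this page relies on.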
🧪 Try these experiments in order
- Click /a/ (father). F₁ jumps to ~730 Hz, F₂ to ~1090 Hz. Notice the spectrum has three bright peaks.
- Click /i/ (bee). F₁ drops, F₂ shoots way up. The vowel sound changes entirely — same vocal cords, different tract shape.
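- Manually drag the F₂ slider down toward F₁. The vowel turns into mush at the moment they merge: your ear (and any speech recogniser) can no longer separate the two resonances, so the F₁/F₂ pattern that identifies the vowel is lost.
- Try the "whispering" experiment: set the fundamental f₀ near 0. The buzz disappears and only the formant pattern remains. (In a real whisper the vocal folds do not vibrate at all; turbulent noise excites the same resonances.)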
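The first three experiments can be reproduced offline by evaluating the filter's frequency response and counting its peaks. The sketch below assumes the same cascade of two-pole resonators as Klatt's design. The /i/ formants, F₃ for /a/, and all bandwidths are commonly quoted textbook-style values picked for illustration, not readings from this demo, and the helper names are hypothetical.

```python
import cmath
import math

def cascade_gain(freq_hz, formants, fs=16000):
    """|H(f)| for a cascade of two-pole resonators, one per (F, BW) pair."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs)  # z^-1 on the unit circle
    gain = 1.0
    for f, bw in formants:
        c = -math.exp(-2 * math.pi * bw / fs)
        b = 2 * math.exp(-math.pi * bw / fs) * math.cos(2 * math.pi * f / fs)
        a = 1 - b - c
        gain *= abs(a / (1 - b * z1 - c * z1 * z1))
    return gain

def spectral_peaks(formants, fs=16000, f_max=4000, step=10):
    """Frequencies (Hz) of the local maxima of |H(f)| on a coarse grid."""
    freqs = list(range(step, f_max + 1, step))
    mags = [cascade_gain(f, formants, fs) for f in freqs]
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]

# (formant Hz, bandwidth Hz); assumed illustrative values
VOWEL_A = [(730, 80), (1090, 90), (2440, 120)]
VOWEL_I = [(270, 60), (2290, 100), (3010, 150)]
MERGED = [(730, 80), (760, 90), (2440, 120)]  # F2 dragged down onto F1

for name, formants in [("/a/", VOWEL_A), ("/i/", VOWEL_I), ("merged", MERGED)]:
    print(name, spectral_peaks(formants))
```

With these values /a/ and /i/ each show three peaks close to their formant frequencies, while the merged setting shows only two, because F₁ and F₂ sit closer together than their bandwidths and fuse into a single bump.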
Time-domain waveform (one period of the vowel)
Spectrum — three formant peaks define which vowel you hear
⚠ Watch out for
- If F₁ > F₂ (they get swapped on the slider), the vowel "label" makes no sense — by convention F₁ is the lowest formant.
- Real human vowels also have a bandwidth around each peak, typically tens of hertz or more. This demo shows pure spikes, which is one reason its voice sounds robotic compared with a real person.
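One way to see what bandwidth does: in a two-pole resonator, a bandwidth of B Hz corresponds to a pole radius of exp(−πB/fs), and the impulse response rings with an envelope exp(−πBt). The sketch below, with hypothetical function names and an assumed 16 kHz sample rate, shows how long each resonance keeps ringing; a "pure spike" is the limit where the ringing never dies away.

```python
import math

def pole_radius(bw_hz, fs=16000):
    """Radius of the resonator's pole pair; 1.0 would mean endless ringing."""
    return math.exp(-math.pi * bw_hz / fs)

def ring_time_s(bw_hz, drop_db=60.0):
    """Time for the envelope exp(-pi * B * t) to fall by drop_db decibels."""
    return (drop_db / 20.0) * math.log(10.0) / (math.pi * bw_hz)

for bw in (200.0, 50.0, 1.0):  # illustrative bandwidths in Hz
    print(f"B = {bw:5.1f} Hz  radius = {pole_radius(bw):.5f}  "
          f"rings for {ring_time_s(bw) * 1000:.0f} ms")
```

Narrower bandwidths push the pole toward the unit circle and stretch the ringing, so a near-zero bandwidth behaves like a sustained sine tone rather than a damped vocal-tract resonance.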
✅ Do
Use formant maps (F₁ vs F₂ plots) for linguistic phonetics research and dialect studies.
❌ Don't
Conflate "formant frequency" with "pitch". Pitch = f₀ (vocal-fold rate). Formants = vocal-tract resonances. You can sing the same vowel across a wide range of pitches.
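The distinction can be checked numerically. Assuming an idealised source whose harmonics all have equal strength, each harmonic k·f₀ comes out with the amplitude of the formant envelope at that frequency. The sketch below changes the pitch while leaving the envelope alone; the resonator form, F₃, and the bandwidths are illustrative assumptions, and the function names are hypothetical.

```python
import cmath
import math

# (formant Hz, bandwidth Hz) for /a/; F3 and bandwidths are assumed values
VOWEL_A = [(730, 80), (1090, 90), (2440, 120)]

def envelope(freq_hz, formants, fs=16000):
    """Formant envelope: gain of a cascade of two-pole resonators at freq_hz."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs)
    gain = 1.0
    for f, bw in formants:
        r = math.exp(-math.pi * bw / fs)
        b = 2 * r * math.cos(2 * math.pi * f / fs)
        c = -r * r
        gain *= abs((1 - b - c) / (1 - b * z1 - c * z1 * z1))
    return gain

def strongest_harmonic(f0_hz, formants, f_max=4000):
    """Harmonic of f0 (integer Hz) that the envelope boosts the most."""
    harmonics = range(f0_hz, f_max + 1, f0_hz)
    return max(harmonics, key=lambda f: envelope(f, formants))

for f0 in (100, 200):  # two pitches an octave apart
    print(f"f0 = {f0} Hz -> strongest harmonic at "
          f"{strongest_harmonic(f0, VOWEL_A)} Hz")
```

The harmonic spacing doubles when f₀ doubles, yet for these values the strongest harmonic stays next to F₁ at both pitches: the pitch moves the harmonics, while the formants decide which of them are emphasised.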
Where this matters in industry
Text-to-speech (the Klatt synthesiser; some neural TTS research also uses formant-aware losses), automatic speech recognition, speech codecs (cellular & VoIP), audio forensics, voice biometrics, hearing-aid signal enhancement, sung-vowel recognition for music information retrieval.
🎯 Learning checkpoint
The /i/ vowel has the highest F₂ of all English vowels. What physical shape must your mouth make to produce a high F₂? (Hint — it's about cavity length.)
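A tool for working on the hint: the simplest textbook idealisation treats a cavity as a uniform tube closed at one end and open at the other, whose resonances are the odd quarter-wavelength frequencies (2n − 1)·c / (4L). The sketch below assumes a speed of sound of 343 m/s and a 17 cm tube as a rough adult vocal-tract length; both numbers and the function name are illustrative assumptions, and a real mouth is not a uniform tube.

```python
def tube_resonances(length_m, n=3, c=343.0):
    """First n resonances (Hz) of a uniform tube closed at one end, open at the other."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

for length in (0.17, 0.085):  # full-length tube vs. a cavity half as long
    freqs = ", ".join(f"{f:.0f}" for f in tube_resonances(length))
    print(f"L = {length * 100:.1f} cm -> {freqs} Hz")
```

Compare how the resonances move when the tube length is halved, then think about which part of the mouth the tongue shortens when you say /i/.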