Vowel Synthesis from Formants

See how three resonant formant frequencies shape a glottal buzz into recognisable vowels.

A mind behind this: Joseph Fourier (1768–1830)

⚠ Learning tool: waveforms are drawn by a browser script and may be imperfect (floating-point rounding, finite-term truncation/Gibbs, rendering quirks). Treat as a visual aid, not an authoritative reference; cross-check against a textbook or established software (Wolfram, MATLAB, SciPy).

Formula

\[ \text{speech} = \text{glottal source} \ast \text{vocal-tract filter} \]

To a good approximation, your vocal tract acts as a linear time-invariant (LTI) filter. When you say "aaah", three resonant frequencies, called formants F₁, F₂, F₃, colour the buzz from your vocal cords. Move those three resonances and you change the vowel. This is the principle behind formant synthesisers such as Klatt's 1980 cascade/parallel design.

Source-filter model of speech

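\[ Y(f) = \underbrace{S(f)}_{\text{glottal buzz}} \;\times\; \underbrace{H(f)}_{\text{vocal tract formants}}, \qquad s(t) = \sum_{n=1}^{N} \frac{1}{n}\sin(2\pi n f_{0} t), \qquad H(f) \approx \sum_{k=1}^{3} A_{k}\,\delta(f - F_{k}) \]

Here S(f) is the spectrum of the buzz s(t), and H(f) is the vocal-tract response, idealised as three spikes of height A_k at the formant frequencies F_k. Multiplying spectra corresponds to convolving in time, which is the ∗ in the formula above.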

Reference: Wikipedia — Formant; Klatt, D. H. (1980). "Software for a cascade/parallel formant synthesizer." J. Acoust. Soc. Am. 67(3), 971–995. Rabiner & Schafer (2011). Theory and Applications of Digital Speech Processing, Ch. 4.
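As a rough illustration of the source-filter formula, the sketch below builds one period of a vowel by weighting each 1/n harmonic of the buzz with a formant envelope sampled at n·f₀. It is a minimal sketch under stated assumptions, not the script this page runs. The names `formant_envelope` and `synthesize_period`, the amplitudes A_k, and the 80 Hz peak width are all assumptions. The peaks are given a finite width because pure spikes would pass only harmonics that land exactly on a formant. Filter phase is ignored.

```python
import math

def formant_envelope(f, formants, bandwidth=80.0):
    """Illustrative |H(f)|: one bell-shaped peak of height A_k at each F_k."""
    half = bandwidth / 2.0
    return sum(a_k / (1.0 + ((f - f_k) / half) ** 2) for f_k, a_k in formants)

def synthesize_period(f0, formants, n_harmonics=30, n_samples=256):
    """One period of the vowel: harmonic n has weight (1/n) * |H(n * f0)|."""
    weights = [formant_envelope(n * f0, formants) / n
               for n in range(1, n_harmonics + 1)]
    samples = []
    for i in range(n_samples):
        phase = 2.0 * math.pi * i / n_samples  # equals 2*pi*f0*t over one period
        samples.append(sum(w * math.sin(n * phase)
                           for n, w in enumerate(weights, start=1)))
    return samples

# (F_k, A_k) pairs: frequencies are the page defaults, amplitudes are made up.
FORMANTS_A = [(730.0, 1.0), (1090.0, 0.7), (2440.0, 0.3)]
wave = synthesize_period(120.0, FORMANTS_A)
```

Plotting `wave` against sample index should give a waveform of the same general kind as the time-domain panel, though the exact shape depends on the assumed amplitudes and widths.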

🧪 Try these experiments in order

Click /a/ (father). F₁ jumps to ~730 Hz, F₂ to ~1090 Hz. Notice the spectrum has three bright peaks.
Click /i/ (bee). F₁ drops, F₂ shoots way up. The vowel sound changes entirely — same vocal cords, different tract shape.
Manually drag the F₂ slider down toward F₁. The vowel turns into mush at the moment they merge: your ear (and any speech recogniser) can no longer tell them apart.
Try the "whispering" experiment: set the fundamental f₀ near 0. The buzz disappears and only the formant pattern remains. (In a real whisper the source is breath noise rather than a very slow buzz, but the formants shaping it are the same.)
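The merging in the third experiment can be sketched numerically. The snippet below counts local maxima of a toy spectral envelope, first with the default /a/ formants and then with F₂ dragged to within 20 Hz of F₁. The envelope shape, its 80 Hz width, the equal peak heights, and the names `envelope` and `count_peaks` are assumptions for illustration. The point at which real listeners stop hearing two formants is a perceptual question this sketch does not settle.

```python
def envelope(f, formants, bandwidth=80.0):
    """Sum of equal-height bell-shaped peaks, one per formant (illustrative)."""
    half = bandwidth / 2.0
    return sum(1.0 / (1.0 + ((f - f_k) / half) ** 2) for f_k in formants)

def count_peaks(formants, f_max=4000):
    """Count local maxima of the envelope on a 1 Hz grid from 0 to f_max."""
    values = [envelope(float(f), formants) for f in range(f_max + 1)]
    return sum(1 for i in range(1, f_max)
               if values[i - 1] < values[i] >= values[i + 1])

separated = count_peaks([730.0, 1090.0, 2440.0])  # page defaults for /a/
merged = count_peaks([730.0, 750.0, 2440.0])      # F2 dragged down next to F1
```

With these assumptions `separated` is 3 and `merged` is 2: once the two peaks sit closer than roughly their own width, they fuse into one.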

Controls (default values): fundamental f₀ = 120 Hz, F₁ = 730 Hz, F₂ = 1090 Hz, F₃ = 2440 Hz. A vowel-preset selector sets all three formants at once.

Time-domain waveform (one period of the vowel)

Spectrum — three formant peaks define which vowel you hear

⚠ Watch out for

If F₁ > F₂ (they get swapped on the slider), the vowel "label" makes no sense — by convention F₁ is the lowest formant.
Real human vowels also have a bandwidth around each peak. This demo shows pure spikes, which is one reason the synthesised voice sounds robotic compared with a real person.
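To make the bandwidth point concrete, here is a sketch of the second-order digital resonator used for each formant in Klatt-style synthesisers, where a centre frequency and a bandwidth set the filter coefficients. The 10 kHz sample rate and the 60 Hz and 200 Hz bandwidths are assumed values for illustration, and the function names are hypothetical.

```python
import cmath
import math

def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    return a, b, c

def gain(f_hz, freq_hz, bandwidth_hz, sample_rate_hz=10_000.0):
    """Magnitude response |H(f)| of the resonator at frequency f_hz."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    z_inv = cmath.exp(-2j * math.pi * f_hz / sample_rate_hz)
    return abs(a / (1.0 - b * z_inv - c * z_inv ** 2))

narrow = gain(730.0, 730.0, 60.0)   # sharp, tall peak at F1
wide = gain(730.0, 730.0, 200.0)    # broader, lower peak at F1
```

A narrower bandwidth gives a taller, sharper peak (`narrow` exceeds `wide`). Letting the bandwidth shrink toward zero approaches the pure spikes this demo draws.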

✅ Do

Use formant maps (F₁ vs F₂ plots) for linguistic phonetics research and dialect studies.

❌ Don't

Conflate "formant frequency" with "pitch". Pitch = f₀ (vocal-fold rate). Formants = vocal-tract resonances. You can sing the same vowel across a wide range of pitches.
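A small sketch of why the two are independent: the formant stays where it is, and changing f₀ only changes which harmonic of the buzz falls closest to it. The helper name `nearest_harmonic` and the 220 Hz comparison pitch are assumptions for illustration.

```python
def nearest_harmonic(formant_hz, f0_hz):
    """Return (n, n * f0) for the harmonic closest to a formant frequency."""
    n = max(1, round(formant_hz / f0_hz))
    return n, n * f0_hz

low_voice = nearest_harmonic(730.0, 120.0)   # -> (6, 720.0)
high_voice = nearest_harmonic(730.0, 220.0)  # -> (3, 660.0)
```

In both cases the target is the same F₁ of 730 Hz. At the higher pitch the harmonics are spaced further apart, so they sample the formant peak more coarsely.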

Where this matters in industry

Text-to-speech (Klatt synthesiser; some modern neural TTS systems still use formant-aware losses), automatic speech recognition, speech codecs (cellular & VoIP), audio forensics, voice biometrics, hearing-aid signal enhancement, sung-vowel recognition for music information retrieval.

🎯 Learning checkpoint

The /i/ vowel has the highest F₂ of all English vowels. What physical shape must your mouth make to produce a high F₂? (Hint — it's about cavity length.)

Frequently asked questions

What is a formant?

A formant is a resonant frequency of your vocal tract: a peak in the spectrum where your mouth and throat amplify sound. The first two or three formants are what your brain uses to tell 'ee' from 'ah' from 'oo'. Move the F₁ and F₂ sliders here and watch a vowel morph into another.

How can a phone or computer recognise vowels?

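Classically, by finding the formant peaks in the spectrum, essentially doing the Fourier analysis this page visualises. Speech recognition and voice assistants rely on features of this spectral envelope; modern systems usually learn them from the short-time spectrum rather than picking peaks explicitly. Pitch-correction effects such as 'autotune', by contrast, track f₀ rather than the formants. Your vocal tract is, mathematically, a filter shaping a buzzing source.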
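The classical peak-based idea can be sketched as a nearest-neighbour lookup in the F₁/F₂ plane. The /a/ entry uses this page's default values. The /i/ and /u/ entries are assumed, commonly quoted adult-male averages that are not taken from this page. The table, the log-frequency distance, and the name `classify_vowel` are illustrative choices, and real recognisers use far richer features.

```python
import math

# Reference (F1, F2) in Hz. "a" matches the page defaults; "i" and "u" are
# assumed textbook-style averages for adult male speakers.
REFERENCE_VOWELS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def classify_vowel(f1_hz, f2_hz, references=REFERENCE_VOWELS):
    """Return the reference vowel closest to (F1, F2) on log-frequency axes."""
    def distance(ref):
        r1, r2 = ref
        return math.hypot(math.log(f1_hz / r1), math.log(f2_hz / r2))
    return min(references, key=lambda label: distance(references[label]))

guess_1 = classify_vowel(700.0, 1150.0)  # measured formants near /a/
guess_2 = classify_vowel(290.0, 2100.0)  # measured formants near /i/
```

Here `guess_1` comes out as "a" and `guess_2` as "i". Log-frequency distance is used so that a given ratio between formants counts equally at low and high frequencies.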

Why do men, women and children sound different saying the same vowel?

Two reasons. The pitch (f₀, the buzz rate of the vocal folds) differs: it is lower for larger larynxes. And formant frequencies scale inversely with vocal-tract length, so a shorter tract pushes them higher. The vowel identity stays the same because the pattern of formants stays similar even as everything shifts.
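The length scaling can be illustrated with the simplest textbook model of the vocal tract: a uniform tube closed at the glottis and open at the lips, whose resonances are odd multiples of c/(4L). The speed of sound and the two tract lengths below are assumed round numbers, and a real tract is not a uniform tube, so treat the results as a neutral-vowel approximation only.

```python
SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air in the vocal tract

def tube_formants(tract_length_m, count=3):
    """Resonances of a uniform tube closed at one end: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * SPEED_OF_SOUND_M_S / (4.0 * tract_length_m)
            for k in range(1, count + 1)]

longer_tract = tube_formants(0.175)   # roughly 500, 1500, 2500 Hz
shorter_tract = tube_formants(0.140)  # roughly 625, 1875, 3125 Hz
```

Shortening the tube by 20% raises every resonance by 25%, while the ratios between them (1 : 3 : 5) stay the same. That mirrors the answer above: the pattern survives even as everything shifts.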

What is the 'source-filter' model the formula mentions?

It splits speech into two parts: a source (the buzzing vocal folds, rich in harmonics) and a filter (the mouth and throat, which boost some frequencies and cut others). The vowel you hear is the source shaped by the filter: multiplication in the frequency domain (equivalently, convolution in the time domain), which is what the envelope on this page shows.
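The link between "shaping" in time and multiplication in frequency can be checked on a toy example. The sketch below convolves a short pulse train with a decaying impulse response directly, then repeats the calculation by multiplying their discrete Fourier transforms. The eight-sample signals and the function names are made up for illustration. The convolution is circular, which is the form the discrete transform supports.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

source = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # toy pulse train ("buzz")
filt = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]  # toy impulse response ("tract")

time_domain = circular_convolve(source, filt)
product = [s * h for s, h in zip(dft(source), dft(filt))]
freq_domain = [v.real for v in idft(product)]
max_error = max(abs(a - b) for a, b in zip(time_domain, freq_domain))
```

Both routes give the impulse response repeated at each pulse, and `max_error` is at the level of floating-point rounding.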

Is this how singers shape their voice?

Yes — trained singers consciously tune their formants. Opera singers use the 'singer's formant' around 3 kHz to project over an orchestra without amplification, and overtone singers move formants so precisely they can make individual harmonics audible as separate notes.