// computers › Probabilistic & Sequence

Beam Search

Keep only the best few partial sequences at each step - how AI text generation explores options.

at each step, expand all kept sequences, score them, keep the top-k (the beam width)

Beam width (k) Steps (length)

Frequently asked questions

Is beam search how my phone's predictive text writes a sentence?

Yes - that is the clearest everyday example. Predictive text does not commit to the single best next word and hope; it keeps a few likely word-paths alive at once and picks the smoothest overall sentence. Beam search does exactly this: at each step it keeps the top-k partial sequences (k = the beam width), extends and rescores them, and prunes back to k. Keeping several candidates alive avoids the trap where the best first word leads to an awkward sentence.

How does beam search differ from greedy search?

Greedy search commits to the single best option at each step and never reconsiders - fast, but it can paint itself into a corner, because the locally best first word may lead to a poor overall sentence. Beam search keeps the top-k options alive at every step (k is the beam width), so it can recover if an early choice that looked second-best turns out to lead somewhere better. Greedy search is exactly beam search with k=1; widening the beam explores more alternatives at proportional extra cost.

What is the beam width and how do you choose it?

The beam width k is how many partial sequences you keep at each step. Small k (like 1-3) is fast and memory-light but explores little; large k explores more thoroughly and usually finds higher-scoring sequences, but costs more time and memory and gives diminishing returns. In machine translation, beam widths around 4-10 are common. Choosing k is a direct quality-versus-cost dial: there is no universally optimal value, so it is tuned per application and per latency budget.

Why does beam search not guarantee the best possible sequence?

Because it prunes. At each step it throws away all but the top-k candidates, and a sequence that scores poorly early might have become the overall best if continued - but it was discarded before it could. Only exhaustive search (keeping every sequence) guarantees the global optimum, and that is exponentially expensive. Beam search is a heuristic compromise: it usually finds very good sequences far more cheaply than exhaustive search, accepting that it can occasionally miss the true best.

How is beam search used in language models and AI text generation?

When a model generates text, at each step it predicts a probability for every possible next token; the number of possible sequences explodes exponentially with length. Beam search keeps the k most probable partial sequences (by cumulative log-probability), extends each by the likely next tokens, rescoring and re-pruning to k each step. This tends to produce more fluent, higher-probability output than greedy decoding. It is widely used in machine translation and speech recognition, though for open-ended creative text, sampling methods are often preferred because beam search can be repetitive.

What are the downsides of beam search in text generation?

It can favour generic, repetitive, or overly short output, because high-probability sequences are often bland - the safest words are common ones. Very large beams can paradoxically make this worse, concentrating on dull but probable text. For tasks with a single correct answer (translation, transcription) beam search excels. For open-ended generation (stories, dialogue), practitioners often use sampling techniques like top-k or nucleus sampling instead, or combine beam search with length and diversity penalties to counteract its tendency toward safe, short, repetitive sequences.