// computers › String & Pattern Matching

Suffix Trees & Suffix Arrays

Pre-index ALL suffixes of a text once, then answer any 'is this substring here?' query fast.

sort all suffixes (suffix array); binary-search the sorted suffixes for the pattern

Frequently asked questions

Is this like a pre-indexed DNA database?

Yes - that is the headline use. A biology lab pre-indexes a DNA database once so researchers can find any gene sequence in milliseconds afterward. A suffix array (or suffix tree) does this: it sorts all suffixes of the text once, after which any substring query is just a fast binary search. You pay a one-time build cost, then answer unlimited queries quickly - ideal when one huge fixed text, like a genome, is searched over and over.

What is a suffix array, in plain terms?

It is a list of every suffix of the text - every ending piece, from the whole string down to the last character - sorted into alphabetical order, stored compactly as just the starting positions. For 'banana', the suffixes are 'banana', 'anana', 'nana', 'ana', 'na', 'a'; sorted, they give an order like a, ana, anana, banana, na, nana. Because they are sorted, you can binary-search them. The array stores only the start indices, so it is memory-efficient: O(n) integers for a text of length n.

How does a suffix array find a pattern quickly?

Any occurrence of a pattern in the text is a prefix of some suffix. Since the suffixes are sorted, all suffixes starting with the pattern form a contiguous block. So you binary-search the sorted suffixes for that block: O(m log n), where m is the pattern length and n the text length. Every starting position in that block is an occurrence. Crucially, this works for any pattern after a single build, so you pay the indexing cost once and then answer many queries cheaply.

What is the difference between a suffix tree and a suffix array?

A suffix tree is a compressed trie of all suffixes; it answers a pattern query in O(m) - even faster than the array's O(m log n) - and supports more complex queries (longest repeated substring, longest common substring) elegantly. But it uses more memory (a larger constant factor) and is trickier to build. A suffix array stores the same information more compactly as a sorted list of positions, is simpler and cache-friendlier, and with an auxiliary LCP array can match most of the suffix tree's power. In practice, suffix arrays are often preferred for their space efficiency.

Why pre-index the text instead of just running KMP each time?

KMP and Boyer-Moore are great for a single search, taking O(n) per query because they scan the whole text. But if you must answer many different pattern queries against the same fixed text, scanning the entire text every time is wasteful. A suffix array or tree pays a one-time O(n) or O(n log n) build cost, after which each query is O(m log n) or O(m) - independent of the (possibly huge) text length. The trade is build-cost-once versus scan-cost-every-time; indexing wins when queries are frequent.

Where are suffix structures used in the real world?

Bioinformatics is the headline application: genomes are enormous fixed texts queried constantly for DNA and protein patterns, and suffix arrays (and the related FM-index/Burrows-Wheeler transform) power read-alignment tools like Bowtie and BWA. They are also used in full-text search indexes, data compression (the Burrows-Wheeler transform behind bzip2 is built on suffix sorting), plagiarism and longest-common-substring detection, and any setting where one large body of text must support fast, repeated substring queries.