// computers › String & Pattern Matching

Aho-Corasick

Search for MANY patterns at once in a single pass, using a trie with failure links.

build a trie of all patterns + failure links; scan the text once, following links on mismatch

Frequently asked questions

Is Aho-Corasick how a chat app's profanity filter works?

Exactly. A profanity filter scans your message against an entire banned-word list in one single pass - not once per word. Aho-Corasick builds all the banned words into one automaton (a trie with failure links), then reads your message character by character, reporting any banned word the moment it completes. The cost does not grow with the size of the word list, which is why it suits filters, intrusion detection, and any 'match against a big dictionary at once' task.

How does Aho-Corasick search for many patterns at the cost of one scan?

It compiles all the patterns into a single automaton (a trie with failure links) before scanning. As it reads the text character by character, the automaton's current state encodes 'what is the longest suffix of the text so far that is a prefix of some pattern.' Whenever a state corresponds to the end of one or more patterns, those are reported. Because all patterns are baked into one machine, a single left-to-right pass over the text simultaneously checks for every pattern - the time does not grow with the number of patterns, only their total size.

What are failure links and how do they relate to KMP?

Failure links are Aho-Corasick's generalisation of KMP's failure function to many patterns. In KMP, on a mismatch you jump to the longest prefix of the single pattern that is still a suffix of what you matched. In Aho-Corasick, each trie node has a failure link to the node representing the longest proper suffix of its string that is also a node in the trie. Following these links on a mismatch lets the automaton reuse partial matches across all patterns without restarting, which is what keeps the scan linear.

How is the automaton built?

First, all patterns are inserted into a trie, so shared prefixes share paths (e.g. 'she' and 'shell' share 's-h-e'). Then failure links are added by a breadth-first traversal of the trie: each node's failure link is computed from its parent's, layer by layer. Output sets are also propagated along failure links so that finishing one pattern can simultaneously report shorter patterns that end at the same point (like 'he' inside 'she'). This preprocessing is O(total pattern length).

Why can one text position match several patterns at once?

Because patterns can be suffixes of one another or overlap. When the automaton reaches the state for 'she', that state's string also ends with 'he', so if 'he' is a pattern it is reported at the same position. The output sets, propagated along failure links during construction, ensure that every pattern ending at the current position is found - not just the longest. This is essential for dictionary matching, where you want all hits, not just one.

Where is Aho-Corasick used in the real world?

Anywhere a stream must be checked against a large fixed dictionary. Network intrusion-detection systems (like Snort) use it to scan packets for thousands of attack signatures at once. Spam and profanity filters check messages against banned-phrase lists. Antivirus scanners match against malware signatures. Search-and-highlight features, bibliographic matching, and bioinformatics (scanning sequences for many motifs) all rely on it. Its defining strength is constant-pass scanning regardless of dictionary size.