// computers › String & Pattern Matching

Rabin-Karp

Compare cheap rolling hashes of each window; only check characters when hashes match.

hash each text window with a rolling hash; if it equals the pattern hash, verify directly

Text Pattern

Frequently asked questions

Is Rabin-Karp like a plagiarism checker comparing fingerprints?

Yes - that is its standout use. A plagiarism checker compares the 'fingerprint' of your paragraph against thousands of others, rather than matching text word-by-word. Rabin-Karp turns each window of text into a numeric hash (a fingerprint) and only compares actual characters when fingerprints match. Because the rolling hash updates in constant time as the window slides, and because you can check many patterns' fingerprints at once, it scales to comparing huge numbers of documents.

What is a rolling hash and why is it the key to Rabin-Karp?

A rolling hash is a numeric fingerprint of a window of characters that can be updated in constant time when the window slides one step: you subtract the contribution of the character leaving the window and add the one entering, rather than rehashing the whole window. Without it, hashing each of the n windows would cost O(m) each, giving O(n*m) - no better than naive matching. The rolling update is what makes Rabin-Karp's average time O(n + m): each slide is O(1).

What is a 'spurious hit' in Rabin-Karp?

It is when a text window's hash equals the pattern's hash but the actual characters differ - a hash collision. Because a hash compresses many characters into one number, different strings can occasionally share a hash. So a hash match is only a strong hint, not proof; Rabin-Karp verifies a hash hit by comparing the actual characters. A spurious hit means that verification fails. Using a large prime modulus makes collisions rare, keeping the average cost low.

If it can have collisions and a bad worst case, why use Rabin-Karp?

Two big reasons. First, it naturally extends to searching for many patterns at once: hash all the patterns, store them in a set, and a single window hash can be checked against all of them together - very useful for things like detecting any of thousands of known strings. Second, the fingerprinting idea powers plagiarism and duplicate detection, where you compare hashes of many substrings across documents. For single-pattern search, KMP or Boyer-Moore are often preferred, but Rabin-Karp's hashing approach unlocks problems they cannot easily tackle.

How does Rabin-Karp handle searching for multiple patterns efficiently?

You precompute the hash of every pattern and store them in a hash set (all patterns must be the same length for a single rolling hash, or you use one rolling hash per distinct length). Then, as you roll through the text, each window hash is looked up in the set in O(1). One pass over the text can therefore screen for many patterns simultaneously, with character verification only on hits. This is its standout advantage over KMP and Boyer-Moore, which are built around a single pattern.

How is Rabin-Karp used in plagiarism detection?

Documents are broken into overlapping substrings (shingles), and each is hashed into a fingerprint. Two documents are then compared by how many fingerprints they share, rather than by slow direct text comparison. Techniques like winnowing select a representative subset of these hashes to keep the comparison efficient at scale. The rolling-hash fingerprint idea at the heart of Rabin-Karp is what makes comparing millions of document substrings computationally feasible.