A hash designed so that SIMILAR items land in the same bucket - find 'alike', not identical.
choose hashes where Pr[h(a)=h(b)] grows with similarity(a,b); group items by bucket
Frequently asked questions
Is LSH how my phone auto-groups similar selfies?
That is a great match. Your gallery's 'similar photos' grouping does not compare every picture against every other - it drops lookalikes into the same pile, then only compares within a pile. LSH is the hashing trick that makes those piles: it is designed so similar items collide into the same bucket. Finding 'things like this' then means looking only in the query's bucket instead of scanning everything, which is what makes near-duplicate and similarity search fast.
How is LSH different from an ordinary hash table?
An ordinary hash function is designed to scatter keys as evenly as possible, so even two almost-identical keys usually land in different buckets - that is good for exact lookup but useless for finding 'similar'. LSH deliberately does the opposite: it is built so that the more similar two items are, the more likely they collide into the same bucket. So a bucket becomes a pool of probably-similar items, turning hashing into a tool for nearest-neighbour search rather than exact-match lookup.
What does 'locality-sensitive' actually mean?
It means the hash preserves locality - closeness. For a locality-sensitive hash family, the probability that two items get the same hash value rises smoothly with their similarity: near-identical items almost always collide, very different items almost never do. That property is the whole trick. It lets you replace 'compare the query against every item' (slow) with 'only compare against items that share the query's hash bucket' (fast), because anything truly similar is likely to be in that bucket.
Why does LSH use several hash tables instead of one?
A single LSH hash is noisy - sometimes two similar items just happen not to collide, so a real near-match could be missed. Using several independent hash tables and taking the union of their buckets dramatically reduces that risk: a true neighbour only has to collide in one of the tables to become a candidate. This is the 'amplification' step, often described as combining several hashes into bands. More tables means higher recall (fewer missed matches) at the cost of more memory and lookups.
How does LSH relate to HNSW and modern vector search?
Both solve approximate nearest-neighbour search - finding similar items in huge high-dimensional collections - but with different structures. LSH uses similarity-seeking hash buckets; HNSW uses a navigable layered graph. LSH is simpler and was historically popular, but HNSW usually achieves better accuracy at the same speed and has become the industry standard for vector databases. LSH still appears where its simplicity and predictable behaviour are valued, and as a teaching gateway to the whole idea of approximate similarity search.
What real problems is LSH used for?
Near-duplicate detection is the classic one: finding web pages, documents, or images that are almost the same (MinHash LSH powers large-scale near-duplicate web page detection). It is used in recommendation systems to find similar users or items quickly, in audio fingerprinting to match songs, and in plagiarism and copy detection. Anywhere you need to find 'things like this one' among millions without comparing against all of them, LSH is a candidate technique.