Ravi is an armchair futurist and an aspiring mad scientist. His mission is to create simplicity out of complexity and order out of chaos.

## Wednesday, March 2, 2011

### Locality Sensitive Hashing Explored

I was catching up on some new developments in the field of computer science in the past decade and came across Locality Sensitive Hashing (LSH) among other things. It piqued my interest, especially its theoretical aspect. Here I explore this topic informally.

Intuition
The idea is to have a hash function that is sensitive to distances. By that we mean that if two points are "close" to each other, the probability that this function hashes them to the same bucket is "high". Conversely, if the points are "far" apart, the probability that they get hashed to the same bucket is "low".

The beauty of such a function is that we can probabilistically determine if two points are "close enough" by hashing them, without determining their actual distance. This can be useful, especially when points are in high dimensions, their distance computation is complex or we just have too many points to work with. We sacrifice accuracy (since the computation is probabilistic) for speed (since the hash function is presumably computationally cheap). However, as indicated later, we can increase this accuracy arbitrarily by sacrificing some speed. :-)

Distance measure
Central to LSH is the concept of distance. In fact, it is the starting point for LSH. Here are some popular distance measures:

• Jaccard distance, if your points are sets
• edit distance, if your points are strings
• euclidean distance, if points are in n dimensions.
Regardless of which distance measure you choose for your purpose (you can invent one if you like), it must satisfy 3 conditions:
1. d(x, x) = 0, i.e. distance between a point and itself is zero.
2. d(x, y) = d(y, x), i.e. the distance measure is symmetric.
3. d(x, y) + d(y, z) >= d(x, z) - known as the triangle inequality.

Hash function
Once a distance measure is decided upon, we need to come up with a locality sensitive hash function for that distance measure. Any random hash function may not necessarily work. In fact, it is not necessary that a distance measure will have an LSH at all! However, most well-known distance measures have known locality sensitive hash functions. E.g. Jaccard distance has "min-hash" functions that are locality sensitive.

More formally, a hash function f is said to be (d1, d2, p1, p2)-sensitive, if for any two points x and y:

• if d(x, y) < d1, then Probability(f(x) = f(y)) > p1, i.e. the probability that the two points get mapped by the hash function f to the same bucket is at least p1.
• and if d(x, y) > d2, then probability (f(x) = f(y)) < p2, i.e. the probability that the two points get mapped by the hash function f to the same bucket is at most p2.
There are some interesting observations here including that we only talk about d(x, y) being less than d1 or greater than d2, but nothing about when d1 < d(x, y) < d2, which as it turns out, is not important to the theory of LSH.

Family of hash functions
A locality sensitive hash function is useful in probabilistically determining if two points are "close" to each other or not. However, we are limited by the probabilities. E.g. a probability of p1 = 0.5 does not necessarily instill a reasonable confidence that the two points are close. However, if we have a family of such hash functions, then using probability theory, we can construct functions that give us arbitrarily high probabilities.

Here's how the construction goes:
• if f1, f2, ..., fn are (d1, d2, p1, p2)-sensitive, then we can AND them together to construct a (d1, d2, p1n, p2n)-sensitive hash function. This has the effect of reducing the probability.
• if f1, f2, ..., fn are (d1, d2, p1, p2)-sensitive, then we can OR them together to construct a (d1, d2, 1 - (1- p1)n, 1 - (1 - p2)n)-sensitive hash function. This has the effect of increasing the probability.
A combination of ANDs and ORs can get probabilities arbitrarily close to 1. The closer we want an LSH's p1 to 1 (and p2 to 0), the more ANDs and ORs we need to perform.

In closing
LSH is a cheap way to determine proximity of points in any space provided we can define a distance measure and come up with multiple hash functions that are locality sensitive. This technique has been applied to diverse problems like finding plagiarisms in student papers, detecting similar web pages, finger print matching, etc.

In my next post, I examine jaccard distance and min-hash function. Additionally, I look at other distance measures and some of their locality sensitive hashes.

References
1. J. Ullman and A. Rajaraman, Mining of Massive Datasets,  Chapter 3 - available at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf