- find a mismatch (in which case, we declare that a\neq{b}) or
- we reach the end (in which case, we declare that a=b).
The question we address here is whether there is a way to verify equality with high confidence (say, 99% confidence) without paying the penalty of comparing each bit, or the associated network overhead. We describe a randomized algorithm that does this with significantly fewer than O(n) bit comparisons. Conceptually, the algorithm hashes the data and compares the hashed values instead of the original data (hardly a novel idea). The interesting part of the algorithm is how it comes up with a provably good hash function.
Algorithm
- Treat the bits of the data as the bits of a large number. So 1 petabyte (PB) of data becomes a single number with 8\times{10^{15}}\approx{2^{53}} bits, meaning its value can be of the order of 2^{2^{53}}.
- Since we want to compare two pieces of data, a and b, conceptually we are looking at the number c=|a-b|. In particular, we are interested in whether c=0, because the two inputs are equal if and only if c=0.
- A number c cannot have more than \lfloor{\log_2{c}}\rfloor unique prime divisors (each prime divisor is at least 2, so their product exceeds c otherwise). So a number representing 1 PB of data has no more than 8\times{10^{15}} unique prime divisors.
- Choose \tau=2(\log_2{c})^2\log_2({\log_2{c}}). For 1 PB, \tau\approx{7\times 10^{33}}. By the prime number theorem, there are about \tau/\ln{\tau}\approx{10^{32}} prime numbers less than \tau. In other words, the number of primes less than \tau is about the square of the maximum number of prime divisors of c. So if we draw one of these primes uniformly at random, the probability that it is one of the prime divisors of c is less than 1/\log_2{c}. For our example, this probability is less than 1/(8\times{10^{15}})\approx{1.25\times 10^{-16}}.
- Choose a prime number less than \tau uniformly at random (in practice, by drawing random numbers less than \tau and testing them for primality). Note that each draw involves choosing O(\log{\tau}) random bits; in our example, about 110 bits.
- Construct a hash function F_p(x) = x \mod p.
- Determine F_p(a) and F_p(b). Since these hash functions reduce mod p, the output is no more than \log_2{\tau} bits; in our case, about 110 bits.
- If F_p(a)=F_p(b), then we say that a=b. Otherwise, we say that a\neq b. So we compared O(\log{\tau}) bits, instead of O(n).
- The probability that this algorithm errs is less than 1/\log_2{c}; in our case, less than 1.25\times 10^{-16}, which is practically zero. Note that the error is one-sided: if a=b, then F_p(a)=F_p(b) for every p, so the algorithm can only err by declaring unequal inputs equal.
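The steps above can be sketched in Python, whose built-in integers handle arbitrarily large numbers. The function names and the Miller-Rabin subroutine are my own illustration, not from the text; this is a sketch, not a hardened implementation.

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin primality test (probabilistic)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

def random_prime_below(tau):
    """Draw uniform random numbers below tau until one is prime."""
    while True:
        candidate = random.randrange(2, tau)
        if is_probable_prime(candidate):
            return candidate

def probably_equal(a: bytes, b: bytes) -> bool:
    """One-sided randomized equality test: always correct when a == b,
    wrong with small probability when a != b."""
    n = max(len(a), len(b), 1) * 8     # bit length of the inputs
    tau = 2 * n * n * n.bit_length()   # tau = 2 n^2 log2(n), as above
    p = random_prime_below(tau)
    # Fingerprints: interpret the data as big integers and reduce mod p.
    return int.from_bytes(a, 'big') % p == int.from_bytes(b, 'big') % p
```

In a real system, only the ~110-bit prime p and the ~110-bit fingerprint would cross the network; each side computes its own fingerprint locally.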
Conclusion and notes
- To determine equality of 1 PB of data, instead of comparing 2^{53} bits of data, we compared about 110 bits, at the cost of a minuscule probability of error.
- However, generating a large random prime is computationally expensive (it requires repeated primality tests), but the cost is CPU time, not bandwidth. So we replaced a network-intensive operation with a CPU-intensive one, with a small probability of error.
- Computing the hash function F_p(x) still needs a full pass over the content of x, which is potentially a disk-intensive operation.
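The disk pass cannot be avoided, but F_p(x) does not require holding x in memory as one giant integer: it can be computed chunk by chunk in a single sequential pass using Horner's rule. A minimal sketch, where the chunk-iterator interface is my own illustration:

```python
def streaming_fingerprint(chunks, p):
    """Compute (the big-endian integer formed by concatenating chunks) mod p
    in one sequential pass; the full number is never materialized."""
    h = 0
    for chunk in chunks:
        # Shifting h left by len(chunk) bytes is multiplication by 256^len(chunk).
        h = (h * pow(256, len(chunk), p) + int.from_bytes(chunk, 'big')) % p
    return h
```

Because every intermediate value is reduced mod p, the working state stays at the size of p (~110 bits in our example) no matter how large the data is.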
References
- Motwani, R. and Raghavan, P., Randomized Algorithms, Cambridge University Press, 1995.