About Me

Ravi is an armchair futurist and an aspiring mad scientist. His mission is to create simplicity out of complexity and order out of chaos.

Sunday, July 1, 2012

Verifying equality of large data

The general pattern for verifying equality of two pieces of data, say $a$ and $b$, is to compare them bit by bit until we either:
  • find a mismatch (in which case we declare that $a\neq{b}$), or
  • reach the end (in which case we declare that $a=b$).
With this approach, our answer is correct with 100% confidence. However, this absolute confidence costs $O(n)$ comparisons, which may be impractical as $a$ and $b$ can be arbitrarily large, e.g. 1 PB of data. Additionally, $a$ and $b$ may not be available in the same place (e.g. $a$ and $b$ are two geographically separated databases), so comparing bits also incurs network overhead.
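As a point of reference, the naive pattern above can be sketched as follows (a minimal Python sketch; in the geographically separated case, every byte of one side would have to cross the network):

```python
def naive_equal(a: bytes, b: bytes) -> bool:
    """Bit-for-bit (here, byte-for-byte) comparison: O(n) work."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:       # first mismatch: declare a != b
            return False
    return True          # reached the end: declare a == b
```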

The question we address here is whether there is a way to verify equality with high confidence (say 99%) without paying the penalty of comparing each bit, or the associated network overhead. We describe a randomized algorithm that does this with significantly fewer than $O(n)$ comparisons. Conceptually, the algorithm hashes the data and compares the hashed values instead of the original data (hardly a novel idea). The interesting part of the algorithm is how it comes up with a provably good hash function.

The algorithm

  1. Treat the bits of the data as the bits of one large number. So 1 petabyte (PB) of data, i.e. $2^{50}$ bytes, becomes a single number with $2^{53}$ bits, meaning its value can be on the order of $2^{2^{53}}$.
  2. Since we want to compare two pieces of data, $a$ and $b$, conceptually we are looking at the number $c=|a-b|$. In particular, we are interested in whether $c=0$, since $c=0$ if and only if $a=b$.
  3. A number $c$ cannot have more than $\lfloor{\log_2{c}}\rfloor$ unique prime divisors, since each prime divisor is at least 2. So a number representing 1 PB of data has no more than $2^{53}\approx 9\times{10^{15}}$ unique prime divisors.
  4. Choose $\tau=2(\log_2{c})^2\log_2({\log_2{c}})$. For 1 PB, $\tau\approx{9\times 10^{33}}$. By the prime number theorem, there are about $\tau/\ln{\tau}\approx{10^{32}}$ prime numbers less than $\tau$, roughly the square of the maximum number of prime divisors of $c$. That is enough primes that if we draw one at random, the probability that it is one of the prime divisors of $c$ is less than $1/\log_2{c}$. For our example, this probability is less than $1/2^{53}\approx1.1 \times 10^{-16}$.
  5. Choose a prime number $p$ less than $\tau$ uniformly at random. Note that this involves choosing $O(\log\tau)$ random bits, about 110 bits in our example.
  6. Construct the hash function $F_p(x) = x \bmod p$.
  7. Compute $F_p(a)$ and $F_p(b)$. Since these are residues modulo $p$, each output is no more than $\log_2{\tau}$ bits, in our case about 110 bits.
  8. If $F_p(a)=F_p(b)$, declare $a=b$; otherwise declare $a\neq b$. Note that the error is one-sided: if the fingerprints differ, then $a\neq b$ with certainty; only the "equal" verdict can be wrong. Either way, we compared $O(\log{\tau})$ bits instead of $O(n)$.
  9. The algorithm errs only when $a\neq b$ but the random prime $p$ happens to divide $c=|a-b|$, which by step 4 has probability less than $1/\log_2{c}$. In our case, that is about $1.1 \times 10^{-16}$, which is practically zero.
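The steps above can be sketched in Python. This is an illustrative sketch, not code from the book: the function names are mine, the random prime is found with the Miller-Rabin test (itself probabilistic, which adds a second, negligibly small source of error), and the data is assumed to already be in memory as integers per step 1.

```python
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin primality test (probabilistic)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

def random_prime_below(tau: int) -> int:
    """Step 5: draw random odd candidates below tau until one is prime."""
    while True:
        p = random.randrange(3, tau) | 1
        if p < tau and is_probable_prime(p):
            return p

def probably_equal(a: int, b: int, tau: int) -> bool:
    """Steps 6-8: compare the fingerprints F_p(a) and F_p(b).

    One-sided error: False is always correct; True can be wrong
    only when the random prime p divides c = |a - b|.
    """
    p = random_prime_below(tau)
    return a % p == b % p
```

Note that in the distributed setting, only $p$ and the two short residues need to cross the network, never the data itself.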

Conclusion and notes

  • To determine equality of 1 PB of data, instead of comparing $2^{53}$ bits of data, we compared fingerprints of about 110 bits. This introduced a minuscule probability of error.
  • However, generating a large random prime is computationally expensive (we must run primality tests on roughly 110-bit candidates). But that work is CPU-bound and local, so we have replaced a network-intensive operation with a CPU-intensive one, at the cost of a small probability of error.
  • Computing hash function $F_p(x)$ needs access to the entire content of $x$, which is potentially a disk intensive operation.

References

  • Motwani, R. and Raghavan, P., Randomized Algorithms, Cambridge University Press, 1995.
