
Friday, August 17, 2012

Schwartz-Zippel Lemma

The intuition behind this lemma seems rather straightforward. An algebraic multi-variate equation of total degree $d$ can have no more than $d$ roots. In other words, a maximum of $d$ values (or elements from a field, if you will) can satisfy that algebraic equation (unless, of course, the equation is identically zero, in which case every element satisfies it).

Examples
  • $x^2-3x+5$ has total degree 2. So it can have no more than 2 roots.
  • $x^3+x^2y^2+y^3$ has total degree 4 (from the $x^2y^2$ term). So it can have no more than 4 roots.

With the above intuition, it seems "obvious" that for any given finite set $S$ of elements (no matter how large or small), at most a $d/|S|$ fraction of the assignments drawn from that set can satisfy the given equation. So, the probability that a random assignment from this set satisfies the equation is no more than $d/|S|$. And that is exactly the statement of this lemma.

Statement
Let $Q(x_1,\cdots,x_n)\in\mathbb{F}[x_1,\cdots,x_n]$ be a multi-variate polynomial of total degree $d$. Fix any finite set $S\subseteq\mathbb{F}$, and let $r_1,\cdots,r_n$ be chosen independently and uniformly at random from $S$. Then \[\mathbf{Pr}[Q(r_1,\cdots,r_n)=0\hspace{1mm}|\hspace{1mm}Q(x_1,\cdots,x_n)\not\equiv 0]\leq\frac{d}{|S|}\]
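This is exactly what makes randomized polynomial identity testing work: to check whether two polynomial expressions agree identically, evaluate both at a random point and compare. Here is a minimal Python sketch of that idea; the prime modulus, the helper name `probably_identical`, and the example polynomials are illustrative choices, not part of the lemma itself.

```python
import random

# Work modulo a prime so that arithmetic happens in the field F_P.
P = 2**31 - 1  # an illustrative Mersenne prime

def probably_identical(p, q, num_vars, total_degree, trials=20):
    """Randomized identity test based on the Schwartz-Zippel lemma.

    p, q: functions of `num_vars` arguments, evaluated mod P.
    If p and q differ as polynomials, a single trial reports agreement with
    probability at most total_degree / P, so `trials` independent trials err
    with probability at most (total_degree / P) ** trials.
    """
    for _ in range(trials):
        point = [random.randrange(P) for _ in range(num_vars)]
        if p(*point) % P != q(*point) % P:
            return False  # definitely not identical
    return True  # identical with high probability

# Example: (x + y)^2 vs x^2 + 2xy + y^2 (identical), both of total degree 2.
lhs = lambda x, y: (x + y) ** 2
rhs = lambda x, y: x * x + 2 * x * y + y * y
print(probably_identical(lhs, rhs, num_vars=2, total_degree=2))      # True

# Example: (x + y)^2 vs x^2 + y^2 (not identical).
rhs_bad = lambda x, y: x * x + y * y
print(probably_identical(lhs, rhs_bad, num_vars=2, total_degree=2))  # almost surely False
```

A single disagreement proves the two polynomials differ; agreement on every trial only makes identity very likely, which is exactly what the lemma promises.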

Friday, June 15, 2012

On Weak and Strong Laws of Large Numbers

Introduction

In statistics, we use the mean calculated from a sample as an estimate for the mean of the population. For example, if the average height of a random sample of a thousand people from a region is 6 feet, then we estimate that the average height of all people in that region is 6 feet. Why does this work? The weak and strong laws of large numbers provide the theoretical basis. Given below are the two laws and the difference between them, which also justifies their names.


Notation

  1. Let $X_1$, $X_2$, $\cdots$, $X_n$ be independent and identically distributed random variables. (In layman terms, $X_1$ is the first observation, $X_2$ is the second and so on.)
  2. Let $M_n$ be a random variable denoting the mean of $X_1$, $X_2$, $\cdots$, $X_n$. In other words, \[M_n=\frac{1}{n}\sum_{i=1}^{n}X_i\] So this is the mean of the sample.
  3. Let $\mu$ be the mean of each of $X_1$, $X_2$, $\cdots$, $X_n$. In other words, $\mathbf{E}(X_i)=\mu$ for each $i$. So $\mu$ is the mean of the population (usually unknown, which is why we want to estimate it!).

Weak Law of Large Numbers

This law states that for any $\epsilon>0$,
\[\lim_{n\to\infty}\mathbf{P}(|M_n-\mu|>\epsilon)=0\]
Interpretation
  • For large values of $n$ (i.e. $n>n_0$ for some $n_0$), the probability that the value of $M_n$ (the sample mean) differs from the population mean $\mu$ by more than any given number $\epsilon$ becomes arbitrarily small.
  • Alternatively, in the limit, all probability is concentrated in an $\epsilon$-interval around $\mu$.
  • Alternatively, with probability approaching 1, for large samples, the sample mean is within an $\epsilon$-neighborhood of the population mean.
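One quick way to see the weak law in action is to estimate the deviation probability $\mathbf{P}(|M_n-\mu|>\epsilon)$ by simulation and watch it shrink as $n$ grows. Here is a minimal Python sketch; the coin-flip distribution, the value of $\epsilon$, and the number of trials are illustrative assumptions.

```python
import random

def deviation_probability(n, eps=0.05, trials=1000, mu=0.5):
    """Estimate P(|M_n - mu| > eps) for the mean of n fair coin flips."""
    bad = 0
    for _ in range(trials):
        m_n = sum(random.random() < 0.5 for _ in range(n)) / n  # sample mean M_n
        if abs(m_n - mu) > eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000, 10000):
    print(n, deviation_probability(n))  # the estimate tends to 0 as n grows
```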

Strong Law of Large Numbers

This law states that
\[\mathbf{P}(\lim_{n\to\infty}M_n=\mu)=1\]
Interpretation

  • The probability that the sample mean $M_n$ fails to converge to the population mean $\mu$ as $n\to\infty$ is 0.
  • Alternatively, in the limit, all probability is concentrated at $\mu$.
  • Alternatively, almost surely, the sample mean converges to the population mean as the sample grows.
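The strong law, by contrast, is about a single ever-growing run of observations: almost surely, its running mean $M_n$ converges to $\mu$. Here is a minimal Python sketch that follows one such run of fair die rolls ($\mu=3.5$); the distribution and the checkpoints are illustrative choices.

```python
import random

random.seed(1)            # fix one sample path X_1, X_2, ...
total, mu = 0, 3.5        # running sum, and the population mean of a fair die
for n in range(1, 100001):
    total += random.randint(1, 6)        # the n-th observation X_n
    if n in (10, 100, 1000, 10000, 100000):
        print(n, total / n)              # the running mean M_n drifts toward 3.5
```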

Difference between the two laws

  • The strong law is stronger than the weak law because it effectively allows $\epsilon=0$, while the weak law only works with $\epsilon>0$.
  • Per the strong law, all probability is (in the limit) concentrated at $\mu$, while per the weak law, it is concentrated in the interval $(\mu-\epsilon,\mu+\epsilon)$, which is infinitely larger because $\epsilon>0$.
  • Per the strong law, almost every sequence $X_1$, $X_2$, $\cdots$ has its sample mean $M_n$ within any given $\epsilon$ of $\mu$ for all but finitely many $n$. In other words, along a given run of observations, $M_n$ can deviate from $\mu$ only finitely often. Now that is a very strong statement!
  • Per the weak law, the probability of the sample mean $M_n$ differing from the population mean $\mu$ is merely small (although positive), so along a given run of observations, $M_n$ may still deviate from $\mu$ infinitely often. This is clearly weaker than the previous statement.


Friday, May 18, 2012

Freivalds' Algorithm

Introduction

Freivalds' algorithm is used to verify correctness of matrix multiplication. It is a randomized Monte Carlo algorithm and has a small probability of error. By repeatedly running this algorithm, one can reduce the probability of error below any threshold.


Motivation

One naive way to verify the correctness of matrix multiplication is to redo the multiplication and compare the output with the given input. However, matrix multiplication itself is quite slow: the fastest known algorithm runs in $O(n^{2.376})$ time. Freivalds' algorithm takes $\Theta(n^2)$ time to verify a product and errs with probability $\leq 1/2$, and only when the multiplication is incorrect. It therefore runs faster than matrix multiplication itself and is usually correct.


Freivalds' algorithm

  1. Input: $n\times{n}$ matrices $A$, $B$ and $C$. We would like to verify if $A\times{B}=C$.
  2. Choose an $n\times{1}$ column vector $\overrightarrow{r}\in\{0,1\}^n$ uniformly at random. In other words, each component of the vector is either 0 or 1, chosen uniformly at random.
  3. Compute $A\cdot(B\cdot\overrightarrow{r})$ and $C\cdot\overrightarrow{r}$. This takes $\Theta(n^2)$ time.
  4. Output:
    1. If $A \cdot (B\cdot\overrightarrow{r}) \neq {C\cdot\overrightarrow{r}}$, output FALSE. In other words, we say that the given matrix multiplication is incorrect. This answer is correct with probability 1.
    2. If $A \cdot (B\cdot\overrightarrow{r}) = {C\cdot\overrightarrow{r}}$, output TRUE. In other words, we say that the given matrix multiplication is correct. This answer is correct with probability $\geq{1/2}$. (See references for details.)
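Here is a small NumPy sketch of the steps above. The repetition parameter $k$ anticipates the error-reduction point made in the discussion below, and the example matrices are made up purely for illustration.

```python
import numpy as np

def freivalds(A, B, C, k=10, rng=None):
    """Verify A @ B == C using k random 0/1 vectors.

    Returns False only when the product is certainly wrong. When A @ B != C,
    it wrongly returns True with probability at most 1/2**k. Each trial costs
    Theta(n^2): three matrix-vector products, no matrix-matrix product.
    """
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    for _ in range(k):
        r = rng.integers(0, 2, size=(n, 1))           # random vector in {0,1}^n
        if not np.array_equal(A @ (B @ r), C @ r):
            return False                               # definitely A @ B != C
    return True                                        # probably A @ B == C

# Illustrative usage with tiny matrices.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(freivalds(A, B, A @ B))                            # True
print(freivalds(A, B, A @ B + np.eye(2, dtype=int)))     # almost surely False
```

Note that the verifier never forms $A\cdot B$; it only multiplies matrices by vectors, which is what keeps each trial at $\Theta(n^2)$.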

Discussion

  • If Freivalds' algorithm says that:
    • $A\times{B}\neq{C}$, then this statement is always correct.
    • $A\times{B}={C}$, then because we are only doing a small number of checks ($n$ of them, one for each row of the result), it is possible that $A\times{B}\neq{C}$ but the algorithm fails to detect it. (See references for details.)
  • Stated another way:
    • When $A\times{B}={C}$, Freivalds' algorithm is always correct.
    • When $A\times{B}\neq{C}$, the algorithm errs with probability $\leq{1/2}$.
  • By re-running this algorithm $k$ times and choosing a random $\overrightarrow{r}$ each time, we can reduce the probability of error to $\leq{1/2^{k}}$.
  • In other words, in $\Theta({kn^2})$ time, Freivalds' algorithm guarantees that the error in verifying the multiplication of two $n\times{n}$ matrices is $\leq{1/2^{k}}$.

References

  1. Wikipedia - Freivalds' algorithm
  2. Motwani, R. and Raghavan, P., Randomized Algorithms, 2007.

Tuesday, November 8, 2011

Markov and Chebyshev inequalities

Markov inequality
Intuitively, this inequality states that the probability that a random variable's value is much greater than its mean is low, and the larger this value, the smaller its probability. It states that \[\mathbf{P}(X\geq a)\leq\frac{\mu}{a},\hspace{10 mm}\forall a>0\] where $\mu$ is the mean, a.k.a. the expected value $\mathbf{E}(X)$, of the random variable. Note that this inequality requires the random variable $X$ to take only non-negative values.

Here are some examples:

  • $\mathbf{P}(X\geq 10)\leq\mu/10$, which doesn't say much since $\mu$ could be greater than $10$. But the next one is more interesting.
  • $\mathbf{P}(X\geq10\mu)\leq1/10$, which means that the probability that a random variable is at least $10$ times its mean is no more than $0.1$. Now that wasn't obvious before!
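The bound in the last example is easy to sanity-check by simulation: draw a non-negative random variable many times and compare the empirical frequency of $X\geq10\mu$ against the bound $1/10$. In this Python sketch, the exponential distribution and the sample size are illustrative choices.

```python
import random

mu = 2.0                      # mean of an exponential random variable
samples = 100_000
exceed = sum(random.expovariate(1 / mu) >= 10 * mu for _ in range(samples))
print(exceed / samples)       # empirical P(X >= 10*mu); Markov bounds it by 0.1
# For the exponential it is actually exp(-10) ~ 4.5e-5, far below the bound.
```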

Chebyshev inequality
Unlike the Markov inequality, which involves just the mean, the Chebyshev inequality involves both the mean and the variance. Intuitively, it states that the probability of a random variable taking values far from its mean is low, and the farther this value from the mean, the lower its probability. More formally, it states that \[\mathbf{P}(|X-\mu|\geq c)\leq\frac{\sigma^2}{c^2},\hspace{10 mm}\forall c>0\] Setting $c=k\sigma$ gives \[\mathbf{P}(|X-\mu|\geq k\sigma)\leq\frac{1}{k^2}\] This means that the probability that a value falls more than $k$ standard deviations from the mean is at most $1/k^2$. This inequality tends to give tighter bounds than the Markov inequality.

More concretely
  • $\mathbf{P}(|X-\mu|\geq2\sigma)\leq1/4$, which means that the probability that a random variable takes a value at least 2 standard deviations away from its mean is at most $1/4$.
  • $\mathbf{P}(|X-\mu|\geq3\sigma)\leq1/9$, which means that the probability that a random variable takes a value at least 3 standard deviations away from its mean is at most $1/9$.
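The same kind of sanity check works for the Chebyshev bound: compare the empirical frequency of $|X-\mu|\geq2\sigma$ against $1/4$. Again, the normal distribution and the sample size in this sketch are illustrative choices.

```python
import random

mu, sigma = 0.0, 1.0          # a standard normal, for illustration
samples = 100_000
outside = sum(abs(random.gauss(mu, sigma) - mu) >= 2 * sigma for _ in range(samples))
print(outside / samples)      # roughly 0.046 for a normal; Chebyshev bounds it by 0.25
```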

From Markov inequality to Chebyshev inequality
Deriving the Chebyshev inequality from the Markov inequality is relatively straightforward. Let $Y=|X-\mu|^2$. Since $Y$ is non-negative, the Markov inequality applies: \[\mathbf{P}(Y\geq c^2)\leq \mathbf{E}(Y)/c^2\] Since $\mathbf{E}(Y)=\sigma^2$ and the event $Y\geq c^2$ is the same as the event $|X-\mu|\geq c$, substituting gives the Chebyshev inequality \[\mathbf{P}(|X-\mu|\geq c)\leq\frac{\sigma^2}{c^2}\]