
Monday, August 1, 2011

Evaluating interviewers - Part 1

This post mathematically answers the question: how do we determine how good an interviewer someone is?

We evaluate interviewers in two complementary ways:
  1. With respect to other interviewers (covered in this blog post).
  2. With respect to candidates' actual job performance after being hired (covered in the next blog post).
Each method has its strengths. For example, we usually have plenty of data for method 1 (more candidates are interviewed than hired), making it easier to evaluate an interviewer relative to other interviewers. However, the true test of any interviewer is how consistently they can predict a candidate's job performance should the candidate be hired. That data may be hard to come by (or to integrate with HRM systems). The two methods can be used independently or together.

Definitions
(Note: subscript $i$ is used for candidates and $j$ for interviewers)

Symbol definitions:
  • $C_i$: the $i^{th}$ candidate
  • $R_j$: the $j^{th}$ interviewer
  • $s_{ij}$: score for the $i^{th}$ candidate by the $j^{th}$ interviewer (the grade, usually between 1 and 5, given by the interviewer to the candidate based on the interview)
  • $m_i$: number of interviewers on the interview panel for candidate $i$ (usually between 4 and 8, faced during the course of the interview process)
  • $n_j$: number of candidates interviewed by interviewer $j$ (can be large, in the tens or hundreds, especially for popular interviewers)
  • $\hat{n}_j$: number of candidates interviewed by interviewer $j$ that joined the company/group
  • $p_i$: job performance of the $i^{th}$ candidate after joining the company/group (usually between 1 and 5, captured in a company-internal HRM system)
  • $s_i$: average score given by the interview panel to the $i^{th}$ candidate, $s_i=\sum_{j}s_{ij}/{m_i}$ (usually between 1 and 5)

Evaluating an interviewer w.r.t. other interviewers
Consider the random variable $X_j$ defined below, which is the difference between the score given by the $j^{th}$ interviewer and the average score given by the interview panel, for any candidate interviewed by $R_j$:
\[X_j=\{s_{ij}-s_i | R_j ~ \text{interviewed} ~ C_i\}\] We need to answer the following questions:
  • What does a probability distribution of $X_j$ look like?
  • What does the probability distribution of $X_j$ tell us about the interviewer $R_j$?
Before answering the above questions, consider the random variable $X$ obtained by pooling all the $X_j$: \[X=\bigcup_j{X_j}\]
This is the random variable for the difference between the score given by any interviewer and the panel average received by the candidate. Since, for each candidate, the deviations from the panel average sum to zero, $E(X)=0$. Moreover, we expect $X$ to be approximately normally distributed. So we expect some $X_j$ to be centered to the left of 0, some to the right of 0, but most others around 0.

So the answers to the above questions are:
  • We expect the probability distribution of $X_j$ to be approximately normal and, on average, centered around 0.
  • $E(X_j)$ tells us about the type of interviewer $R_j$ is:
    • $E(X_j)=0$ or more accurately, $|E(X_j)| < a\sigma$ (where $\sigma$ is the standard deviation of $X$, and $a>0$ is appropriately chosen) implies that the interviewer $R_j$ is a normal interviewer.
    • $E(X_j)\geq a\sigma$ implies that interviewer $R_j$ generally gives higher scores than average.
    • $E(X_j)\leq -a\sigma$ implies that interviewer $R_j$ generally gives lower scores than average.
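The classification rule above can be sketched in code. The sketch below is illustrative and not from the original post; the score layout, the threshold $a=0.5$, and the use of the pooled standard deviation for $\sigma$ are my own assumptions.

```python
from statistics import mean, pstdev

def classify_interviewers(scores, a=0.5):
    """Classify interviewers as easy/hard/normal based on E(X_j).

    scores: dict mapping candidate -> {interviewer: s_ij}.
    """
    diffs = {}  # interviewer j -> list of s_ij - s_i
    for panel in scores.values():
        s_i = mean(panel.values())          # panel average for this candidate
        for j, s_ij in panel.items():
            diffs.setdefault(j, []).append(s_ij - s_i)
    # sigma: standard deviation of the pooled differences X
    pooled = [d for ds in diffs.values() for d in ds]
    sigma = pstdev(pooled)
    result = {}
    for j, ds in diffs.items():
        e = mean(ds)                         # E(X_j)
        if e >= a * sigma:
            result[j] = "easy"
        elif e <= -a * sigma:
            result[j] = "hard"
        else:
            result[j] = "normal"
    return result
```

For example, an interviewer who always scores above the panel mean comes out "easy", one always below comes out "hard".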

Categorization of interviewers
From the above answers regarding $X_j$, we can categorize interviewer $R_j$ into a few distinct types:

The "regular" interviewer

Most interviewers' distribution of $X_j$ would look like the above. (Perhaps not as sharp a drop off as shown above around 0 - I just couldn't draw a good graph!) This indicates that most interviewers' scores would be close to the average score given by the interview panel.

The "easy" interviewer

The easy interviewer tends to give, on average, higher scores than the rest of the interview panel. This causes the bulk of the graph to shift to the right of 0. The farther this hump is from 0, the easier the interviewer. If you are an interviewee, this is the kind of interviewer you want!

The "hard-ass" interviewer

The hard interviewer tends to give lower scores than the rest of the interview panel. This causes the hump of the graph to shift below zero. The farther below zero, the harder the interviewer. If you are an interviewee, this is the kind of interviewer you want to avoid!


The "activist/extremist" interviewer

Some interviewers tend to artificially influence the outcome of the interview process by giving extreme scores - like "definitely hire" or "definitely no hire". The graph would not be as sharp as the one depicted above, but the idea is that there would be an above-average frequency of extreme values.


The "clueless" interviewer

Some interviewers cannot correctly judge the potential of a candidate. Their scores would be all over the range, instead of a well-defined hump.

In closing
We presented a mathematical basis for comparing interviewers, which yields a categorization of interviewers. In the next blog post, Evaluating Interviewers - Part 2, we analyze how to evaluate an interviewer with respect to a hired candidate's job performance.

Sunday, July 10, 2011

Lebesgue Measure

Lebesgue measure is a measure of the size of a set of points, also called the $n$-dimensional "volume" of the set. So, intuitively:
  • 1-dimensional volume is the length of a line segment,
  • 2-dimensional volume is the area of a surface,
  • 3-dimensional volume is the volume that we know of
  • and so on.
These measures are defined over the space of real numbers $\mathbb{R}$, as opposed to, say, the space of natural numbers $\mathbb{N}$.

The basics
  • Lebesgue measure of a set of real numbers between $[a, b]$ is $b-a$. This is consistent with our intuitive understanding of the length of a line segment.
  • Lebesgue measure is additive over disjoint sets. So the measure of the union of the disjoint sets $[a, b]$ and $[c, d]$ is $(b-a) + (d-c)$. I.e. $\text{measure}([a,b]\bigcup[c,d])=(b-a)+(d-c)$. This is again consistent with our intuitive understanding of the combined length of two line segments.
  • In fact, Lebesgue measure is countably additive: for pairwise disjoint sets $A_i$, \[\text{measure}(\bigcup_{i=1}^{\infty} A_i)=\sum_{i=1}^{\infty}\text{measure}(A_i)\] The above applies to any $n$-dimensional measure (length, area, volume, etc.) and is consistent with our intuitive understanding of summing multiple line segments, areas, volumes, etc.
Those were the basics. So far so good. We now look at Lebesgue measure for:
  • countable sets (set of natural numbers, for example),
  • uncountable sets (set of real numbers) and
  • some seemingly bizarre sets for which a Lebesgue measure cannot be defined.
Lebesgue measure for countable sets
  • Lebesgue measure of integer points on the real line is $0$. This is because points are just locations on a real line and intuitively, integers are few and far between on a real line.
  • Lebesgue measure of a countable set is $0$. This is a stronger statement than the above. It is interesting, since it implies that even a set containing countably infinitely many points has measure $0$. This is not surprising though, since between any two distinct integer points on the real line lie uncountably many real points. So intuitively, a countable set of points is just not large enough to have a non-zero Lebesgue measure!
  • One implication of the above is that the rational numbers have measure $0$. This is in spite of the fact that the rationals are dense - between any 2 rational numbers there are countably infinitely many rationals! So countable infinity is no match for Lebesgue measure.
Lebesgue measure for uncountable sets
  • As hinted earlier, Lebesgue measure for an uncountable number of contiguous points is non-zero. So for the set of points $[a, b]$, this is $b-a$.
  • Removing a countable number of points from an uncountable set does not change its measure. E.g. the measure of the half-open set $(a, b]$ (which excludes just the point $a$) or the open set $(a, b)$ (which excludes both $a$ and $b$) is $b-a$.
  • One implication of the above is that if we were to remove, say, all rational numbers from the set $[a, b]$, its measure is still $b-a$. So we removed all countably infinite number of rationals, and yet the measure doesn't budge. Cool!
  • Lebesgue measure of an uncountable set need not always be non-zero. Georg Cantor gave the construction of an uncountable set, called the Cantor set, which has uncountably infinite points, and yet has a measure of $0$, as shown in the next section.
Cantor Set - an uncountable set with 0 measure
  1. Start with the set of points $\mathbb{R}[0, 1]$.
  2. Divide it into 3 parts - $[0, 1/3]$, $(1/3, 2/3)$, $[2/3, 1]$ - and remove the middle part $(1/3, 2/3)$ from the set, as depicted in the diagram below. (Note: the first line in the diagram represents $\mathbb{R}[0,1]$ and those below it represent the construction of the Cantor set as detailed in this and subsequent steps.)
  3. Of the remaining parts, divide each into 3 parts, as shown in the diagram above. E.g.
    1. Divide $[0, 1/3]$ into $[0, 1/9]$, $(1/9, 2/9)$, $[2/9, 1/3]$ and again remove the middle part $(1/9, 2/9)$ from the set.
    2. Similarly, divide $[2/3, 1]$ into $[2/3, 7/9]$, $(7/9, 8/9)$ and $[8/9, 1]$ and remove the middle part $(7/9, 8/9)$ from the set.
  4. Keep repeating the above step, as shown in the diagram above.
  • The segments that we end up removing have a combined measure of \[\frac{1}{3}+\frac{2}{9}+\frac{4}{27}+\frac{8}{81}\cdots=\frac{1}{3}\sum_{i=0}^{\infty}\bigg(\frac{2}{3}\bigg)^i=1\]
  • This means that the number of points left has a Lebesgue measure of 0, even though the number of points left is uncountable!
  • This is bizarre indeed. However, there is something even more bizarre. There are some sets that are not Lebesgue measurable! I.e. we cannot define a measure for them. A Vitali set is one such non-measurable set.
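The geometric series above is easy to check numerically. A small illustrative sketch (not part of the original post) computes the total length removed after a given number of rounds of the construction:

```python
from fractions import Fraction

def removed_measure(steps):
    """Total length removed after `steps` rounds of the Cantor construction.

    Round i removes 2**i middle-third intervals, each of length 3**-(i+1).
    """
    return sum(Fraction(2**i, 3**(i + 1)) for i in range(steps))
```

The partial sums approach 1, so what remains - the Cantor set - has measure 0.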
Vitali Set - a non-measurable set
Construction of this set relies on the Axiom of Choice and is slightly involved.
  1. Definition: We define 2 points, $a$ and $b$, as equivalent if $a-b$ is a rational number. We denote this as $a\sim b$. Note that $a$ and $b$ themselves don't have to be rational to be equivalent. Examples of equivalence are:
    1. $\frac{1}{2}\sim \frac{1}{3}$, since $\frac{1}{2}-\frac{1}{3}=\frac{1}{6}$, a rational number.
    2. $\pi\sim\pi+\frac{1}{2}$, since $\pi-(\pi+\frac{1}{2})=-\frac{1}{2}$, a rational number.
  2. We partition the set of real numbers $\mathbb{R}[0, 1]$ into equivalence classes using the above definition. All points within an equivalence class are equivalent to each other, of course.
  3. Each equivalence class has countably many points, one corresponding to each rational number.
  4. There are uncountably many such equivalence classes (for example, two irrational numbers that are not separated by a rational distance lie in different classes).
  5. By the axiom of choice, we can construct a set containing one point from each of these equivalence classes, which we call $V$, the Vitali set.
  6. Note that since elements of $V$ are drawn from $\mathbb{R}[0,1]$, $\text{measure}(V)\leq 1$.
  7. Definition: We define a translation of $V$ by $q$ as $V_q=\{v+q \mid v\in V\}$, where $q\in\mathbb{Q}[0,1]$, the set of rationals between $[0,1]$. So for each rational number $q$, we have a corresponding uncountable set $V_q$ as shown in the diagram above.
  8. Note that $\text{measure}(V)=\text{measure}(V_q)\quad\forall q\in\mathbb{Q}[0,1]$, since each $V_q$ is merely a translation of $V$. Note also that the $V_q$ are pairwise disjoint: if $V_q$ and $V_{q'}$ shared a point, two distinct elements of $V$ would differ by a rational, contradicting the construction of $V$.
  9. Also note that \[\mathbb{R}[0,1]\subseteq\bigcup_{q\in\mathbb{Q}[0,1]}V_q\subseteq\mathbb{R}[0,2]\] i.e. the countably many disjoint sets $V_q, q\in\mathbb{Q}[0,1]$, each containing an uncountable number of points, together cover $\mathbb{R}[0,1]$ and fit inside $\mathbb{R}[0,2]$.
  10. With points (8) and (9) in mind, what is the Lebesgue measure of $V$?
    1. If it is $0$, then from (8) each $V_q$ has measure $0$, and by countable additivity $\text{measure}(\bigcup_q V_q)=0$. This is false, since from (9) the union contains $\mathbb{R}[0,1]$, which has measure $1$.
    2. If it is greater than $0$, say $a$, then from (8) each $V_q$ has measure $a$. Since there are countably infinitely many $q$s, by countable additivity $\text{measure}(\bigcup_q V_q)=\infty$. This is false, since from (9) the union fits inside $\mathbb{R}[0,2]$, which has measure $2$.
  11. Hence, $V$ is not Lebesgue measurable!
In closing
We saw that some sets have Lebesgue measure $0$ and some have non-zero measure. We also saw that some uncountable sets have measure $0$, and that some sets have no Lebesgue measure at all.


    Saturday, July 2, 2011

    Hyperwebster - an uncountable dictionary


    Introduction
    There are an infinite number of points on the real line. In fact, there are more points on any segment of the real line, no matter how small, than all of natural numbers combined. This was proved by Cantor using the diagonal argument.

    To get an idea of the number of points on a real line, Dr. Ian Stewart gave the following construction of an infinite dictionary, the "hyperwebster".

    Construction of the Hyperwebster


    An enterprising publishing company decides to print a book containing all the words that can possibly be created from the English alphabet A-Z. Since it is really a collection of letters of the alphabet in any order, of any length, we will see:
    • some nonsensical words like "XWPBQI" and "NKPMZ",
    • some real words like "SUN" and "MOON" and
    • even some composite words like "SUNMOON" and "MOONSUN".
    Since not all such words have meaning, the publishing company decides to include only the words themselves in the book, without their associated meanings, if any.

    So the "Hyperwebster" book looks like this:
    A, AA, AAA, ..., AB, ABA, ABAA, ..., AC, ..., AZ, AZA, ...

    B, BA, BAA, ..., BB, BBA, BBAA, ..., BC, ..., BZ, BZA, ...

    C, CA, CAA, ..., CB, CBA, CBAA, ..., CC, ..., CZ, CZA, ...

    Z, ZA, ZAA, ..., ZB, ZBA, ZBAA, ..., ZC, ..., ZZ, ZZA, ...

    The staff at the publishing company realizes that it can partition the words into 26 volumes, one for each letter of the alphabet. So the Hyperwebster now looks like the following:

    Volume A: A, AA, AAA, ..., AB, ABA, ABAA, ..., AC, ..., AZ, AZA, ...
    Volume B: B, BA, BAA, ..., BB, BBA, BBAA, ..., BC, ..., BZ, BZA, ...

    Volume C: C, CA, CAA, ..., CB, CBA, CBAA, ..., CC, ..., CZ, CZA, ...

    Volume Z: Z, ZA, ZAA, ..., ZB, ZBA, ZBAA, ..., ZC, ..., ZZ, ZZA, ...

    Next, the staff realizes that all words in volume A start with the letter A, all words in volume B start with the letter B and so on. This means that the first letter in each word can be inferred from its volume and hence, the first letter can be dropped. Excellent! The publishing company just saved some ink by not printing an infinite number of letters.

    The new volumes now look like this:

    Volume A: A, AA, AAA, ..., B, BA, BAA, ..., C, ..., Z, ZA, ...
    Volume B: A, AA, AAA, ..., B, BA, BAA, ..., C, ..., Z, ZA, ...
    Volume C: A, AA, AAA, ..., B, BA, BAA, ..., C, ..., Z, ZA, ...
    Volume Z: A, AA, AAA, ..., B, BA, BAA, ..., C, ..., Z, ZA, ...

    The staff realizes that each volume now looks identical, except for the name of the volume. Why would anyone buy 26 identical copies of the same content? So, the decision is made to publish a single volume called "Hyperwebster", which looks like the following:

    New hyperwebster:
    A, AA, AAA, ..., B, BA, BAA, ..., C, ..., Z, ZA, ...

    This turns out to be identical to the original hyperwebster that they started out with.

    Original hyperwebster:
    A, AA, AAA, ..., AB, ABA, ABAA, ..., AC, ..., AZ, AZA, ...

    B, BA, BAA, ..., BB, BBA, BBAA, ..., BC, ..., BZ, BZA, ...

    C, CA, CAA, ..., CB, CBA, CBAA, ..., CC, ..., CZ, CZA, ...

    Z, ZA, ZAA, ..., ZB, ZBA, ZBAA, ..., ZC, ..., ZZ, ZZA, ...

    The staff realizes that:

    1. the original volume can be partitioned into 26 different volumes,
    2. the first letter in each volume can be dropped, making each volume identical,
    3. and each volume now is really identical to the original volume
    4. and steps 1-3 can be applied ad infinitum.
    The publishing company wisely abandons publishing the hyperwebster, even though each execution of steps 1-3 represents an infinite amount of savings!

    Moral of the story
    The content of the hyperwebster is equivalent to the points on a real line (replace A-Z above with the digits 0-9 and observe that the infinite strings are exactly the decimal expansions of the real numbers in $[0, 1)$). Any subset of the real line can be chopped up into infinitely many parts, each of which has the same number of points as the original. Each of the parts in turn can be chopped up into infinitely many subparts, each having the same number of points as the original, ad infinitum. Yeah, that's a lot of points! The continuum hypothesis states that the number of such points is aleph-1.
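The "drop the first letter" trick can be checked on a finite slice of the dictionary. The sketch below is illustrative only, using a 3-letter alphabet as a stand-in for A-Z: Volume A, with its leading A removed, reproduces every word up to one letter shorter.

```python
from itertools import product

def words_up_to(alphabet, max_len):
    """All nonempty words over `alphabet` of length <= max_len."""
    return {"".join(p) for n in range(1, max_len + 1)
            for p in product(alphabet, repeat=n)}

alphabet = "ABC"  # small stand-in for A-Z
dictionary = words_up_to(alphabet, 4)
volume_a = {w for w in dictionary if w.startswith("A")}
# Dropping the leading 'A' from Volume A yields every shorter word
# (plus the empty string, which comes from the one-letter word "A").
stripped = {w[1:] for w in volume_a} - {""}
```

On this finite slice, `stripped` equals the full dictionary one letter shorter; in the infinite dictionary, the truncation disappears and each volume reproduces the whole.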

    References
    • Leonard M. Wapner, "The Pea and the Sun", 2005.

    Wednesday, June 29, 2011

    Continuum Hypothesis

    This post looks at Cantor's continuum hypothesis and the relevant historical results around it.

    Georg Cantor
    Georg Cantor defined infinity in two steps. First he defined an infinite set and then "infinity":

    • an infinite set is one that can be put in one-to-one correspondence with a proper subset of itself. E.g. the natural numbers can be put in a one-to-one correspondence with the set of even numbers, which is a proper subset of the natural numbers. Here is the correspondence: {0, 1, 2, 3, ...} maps to {0, 2, 4, 6, ...}, i.e. $n \mapsto 2n$.
    • The cardinality of such a set is "infinity".
    Cantor went a step further and defined a family of "infinities", each larger than the previous one.
    • Cantor proved that even an infinite set cannot be put in one-to-one correspondence with its power set (i.e. the set of all its subsets), whose cardinality is 2^n (if the cardinality of the original set is n).
    • This defines a family of "transfinite" cardinalities, each larger than the one before:
      • aleph-0 (the cardinality of the natural numbers),
      • 2^aleph-0,
      • 2^(2^aleph-0)
      • and so on.
    • (Strictly speaking, aleph-1 is defined as the smallest cardinality larger than aleph-0, not as 2^aleph-0; identifying the two is exactly what the continuum hypothesis, below, asserts.)
    • Real numbers cannot be put in one-to-one correspondence with natural numbers. (See Cantor's diagonal argument). The cardinality of real numbers is called the cardinality of the continuum, c.
    • Cantor hypothesized that c = aleph-1, which became known as the "Continuum Hypothesis". In other words, there is no transfinite cardinality between aleph-0 (cardinality of natural numbers) and c (cardinality of real numbers). But Cantor could not prove it. It turns out that there is a very good reason for that!
    • By the way, Cantor called natural numbers or any subset thereof as countable, since they can be counted, i.e. put in one-to-one correspondence with 0, 1, 2, 3, ... . He called real numbers and higher transfinite cardinalities as uncountable, because they cannot be put in one-to-one correspondence with 0, 1, 2, 3, ... . (See Cantor's diagonal argument).

    Kurt Godel
    Kurt Godel came along and gave the world two "incompleteness" theorems.
    • Informally, the first incompleteness theorem says that in any axiomatic system involving natural numbers, there are statements that cannot be proved or disproved within that system. In other words, there are some "undecidable" statements within the system.
    • Stated differently, we can never come up with a finite set of axioms that can prove or disprove every statement in that system. Hence the system is always "incomplete".
    • The intuition behind this is that there are only countably many formal proofs, and hence countably many provable statements, while there are uncountably many possible properties of the natural numbers (one for each subset of $\mathbb{N}$) that a complete theory would have to settle. In that informal, "pigeonhole" sense, some truths must go unprovable. (The actual proof is Godel's diagonalization, not a counting argument.)
    • The second incompleteness theorem says that for some axiomatic systems, consistency cannot be proved within the system itself.
    Axiom of Choice
    • Simply stated, the axiom of choice says that given a bunch of non-empty sets, it is always possible to choose one element from each set to construct a new set.
    • It seems logical and harmless. But complications arise when the original set has a large cardinality, e.g. aleph-1. How do we go about choosing one element from uncountably many sets? Where do we start - it cannot be put in one-to-one correspondence with natural numbers and hence cannot be labeled 1, 2, 3, ..., etc. So we wouldn't know which set is the first one, which is second and so on.
    • One way to go about this complication is to just assume that there exists such a choice set and circumvent the above complexity.
    • Not everyone agrees! So there are two different axiomatic set theories - one without the axiom of choice (ZF, for Zermelo-Fraenkel, the formulators of set theory) and another with the axiom of choice (ZFC).
    • Godel proved that the axiom of choice is consistent with ZF. In other words, if ZF is consistent, so is ZFC.

    The Finale
    • Godel also proved that the continuum hypothesis is consistent with axiomatic set theory. In other words, it cannot be disproved within ZF (assuming ZF is consistent).
    • Paul Cohen then proved that the continuum hypothesis is independent of the other axioms of ZF. In other words, neither the continuum hypothesis nor its negation can be proved within ZF. No wonder Cantor couldn't prove the "Continuum Hypothesis" from the ZF axioms!
    • Well, what does it all mean? The continuum hypothesis must be either true or false, since aleph-1 must be either equal to c or not equal to c. The bottom line of the results of Godel and Cohen is that neither the equality nor the inequality can be proved within the axiomatic system, no matter how smart you are or how hard you try!
    • I, for one, believe that aleph-1 is indeed equal to c. In other words, there are no cardinalities in between that of the natural numbers and that of the real numbers.

    References
    • Leonard M. Wapner, "The Pea and the Sun", 2005.

    Friday, May 27, 2011

    Experiments in graph 3-coloring - block-cutpoint graph

    Introduction
    One way to reduce the runtime complexity of an algorithm is to partition the problem into smaller sub-problems, solve the sub-problems and combine these solutions to solve the original, bigger problem. This approach can be used when coloring a graph.

    I investigated an approach that uses a block-cutpoint representation of the given graph. On a high level, here is the solution:

    1. Given an undirected graph, construct its block-cutpoint graph.
    2. Solve the 3-coloring problem for each of the blocks.
    3. Combine the solution for each of the blocks to 3-color the original graph.

    Definitions
    Bi-connected graph
    An undirected graph is bi-connected if any two vertices lie on a common cycle. In other words, there are at least 2 internally vertex-disjoint paths between any two vertices. This concept is loosely analogous to a strongly connected component in a directed graph.

    Block (aka Bi-connected component)
    A bi-connected subgraph of a graph is called a block.

    Cutpoint (aka cut vertex)
    A vertex whose removal disconnects a graph is called a cutpoint or a cut vertex. Such vertices are "on the boundary" of a block. In Figure 1, they are shown in red.

    Block-Cutpoint graph

    • A graph where each block is collapsed into a single vertex,
    • each cut vertex is represented as a vertex in this graph,
    • there is an edge between a block vertex and a cut vertex only if the cut vertex is part of the block,
    • and there is an edge between two cut vertices only if they are adjacent in the original graph.



    An interesting point to note is that any undirected graph can be represented this way as a tree of blocks and cutpoints, as shown in the figure above.


    Coloring algorithm

    1. Partition given graph G into blocks B1, B2, ..., Bm.
    2. Color each of the blocks independently.
    3. The reconciliation step: two adjacent blocks, say B1 and B2, can share at most one vertex (a cut vertex), say X.
      1. If X has the same color in both blocks, then we simply merge the colorings for B1 and B2.
      2. If X is colored differently in B1 and B2, say c1 and c2 respectively, then:
        1. we make X's color in B2 c1 (its color in B1),
        2. all vertices in B2 colored c2 are now colored c1, and
        3. all vertices in B2 colored c1 are now colored c2.
      3. So in essence, we swap colors c1 and c2 in B2.
    4. If we perform the reconciliation step in breadth first search order (BFS) of the block-cutpoint tree, we have a valid coloring for the original graph, provided each of the blocks has a valid coloring.
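The reconciliation step can be sketched in code. This is an illustrative sketch, not the author's implementation; the data layout (dicts from vertex to color) is my own assumption.

```python
def merge_block_coloring(coloring, block2_coloring, cut_vertex):
    """Merge block B2's coloring into the global `coloring`.

    If the shared cut vertex disagrees, swap its two colors inside B2
    so that the cut vertex keeps its already-fixed color.
    """
    c1 = coloring.get(cut_vertex)       # color already fixed (from B1), if any
    c2 = block2_coloring[cut_vertex]    # color chosen independently in B2
    for v, c in block2_coloring.items():
        if c1 is not None and c1 != c2:
            # swap c1 <-> c2 within B2
            if c == c2:
                c = c1
            elif c == c1:
                c = c2
        coloring[v] = c
    return coloring
```

Swapping two colors inside a block preserves the validity of that block's coloring, which is why the merge is safe.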


    Runtime complexity

    • Let the size of G be n.
    • Let the sizes of the blocks B1, B2, ..., Bm be n1, n2, ..., nm, respectively.

    Since the worst case running time is exponential in the number of nodes, say a^n for G, the runtime complexity of the above algorithm is a^n1 + a^n2 + ... + a^nm + O(n). This translates to O(a^max(n1, n2, ..., nm)) asymptotically.

    In closing
    This algorithm can be applied to n-coloring, not just 3-coloring. It reduces the time complexity whenever the graph has cut vertices. The more even the sizes of the blocks, the greater the speed-up in run time.

    Wednesday, May 18, 2011

    Experiments in graph 3-coloring

    3-coloring using independent sets
    • By definition, vertices in an independent set (IS) share no edges. So these vertices can all be colored with one color.
    • If after the removal of these vertices (and incident edges), the graph is 2-colorable, then we have a valid 3-coloring.
    Fig 1: An independent set (red vertices)
    Fig 2: Graph without independent set
    Fig 3: Graph without independent set is 2-colorable
    Fig 4: Fully colored graph with 3 colors


    Interesting points about this approach
    • Not all independent sets work, since G - IS may not be 2-colorable.
    • Enumerating all independent sets is equivalent to enumerating the power set of the vertices and determining if they form an independent set. This has 2^v iterations, each taking about v^2 time. This is asymptotically better than 3^v iterations required for a brute force 3-coloring.
    • If a graph is 3-colorable, then we can always find one color in a valid 3-coloring that is applied to at most floor(v/3) vertices. This implies that when enumerating all independent sets, we need only find independent sets of size floor(v/3) or less. This is still O(2^v), by the way.
    Optimization for this approach

    • We start with brute force enumeration, as below, where 1 implies inclusion of the vertex and 0 its exclusion from the independent set.
      • {1, 0, 0, ..., 0} - start by choosing v0
      • {1, 1, 0, ..., 0} - after we choose v0 and v1 in the independent set.
      • {1, 0, 1, ..., 0} - if we decide that v0 and v1 cannot be in an independent set (because they share an edge) and move on to v0 and v2.
    • In general, if current selection is an independent set, keep the latest added vertex and include the next available vertex.
    • In general, if current selection is not an independent set (because the latest vertex has one or more edges with previous vertices in the independent set), drop the latest added vertex and choose the next available one.
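The pruned enumeration above can be sketched as a recursive generator: extend the current selection while it stays independent, otherwise drop the latest vertex and move on. An illustrative sketch under my own assumptions (adjacency as a dict of neighbour sets), not the author's code:

```python
def independent_sets(adj, max_size):
    """Yield all independent sets (as frozensets) of size <= max_size.

    adj: dict vertex -> set of neighbours.
    """
    vertices = sorted(adj)

    def extend(chosen, start):
        yield frozenset(chosen)
        if len(chosen) == max_size:
            return
        for i in range(start, len(vertices)):
            v = vertices[i]
            # prune: only extend if v shares no edge with the current selection
            if all(v not in adj[u] for u in chosen):
                chosen.append(v)
                yield from extend(chosen, i + 1)
                chosen.pop()

    yield from extend([], 0)
```

Each yielded set can then be tested by checking whether the remaining graph is 2-colorable.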

    3-coloring with brute force
    • This is exhaustive, brute force approach. We try each combination, till we find a valid coloring.
    • Let the colors be 0, 1 and 2. Each color combination is an n-tuple. E.g. {0, 2, 1, 0, 0, 1, 1, 2, ...}. This colors vertex v0 with color 0, v1 with color 2, v2 with color 1 and so on.
    • This is O(3^v), but seems to run faster in practice. Perhaps it has a tighter bound than O(3^v).
    Optimization for this approach
    • We order vertices arbitrarily as v0, v1, v2, ..., vn.
    • We enumerate current color selection lexicographically. E.g.
      1. 0
      2. 0, 0, if the previous selection is a valid coloring
      3. 0, 0, 0 if the previous selection is a valid coloring
      4. 0, 0, 1 if the previous selection is not a valid coloring
      5. 0, 0, 2 if the previous selection is not a valid coloring
      6. 0, 1 if the previous selection is not a valid coloring
      7. 0, 1, 0 if the previous selection is a valid coloring
      8. and so on.
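The lexicographic enumeration above amounts to backtracking: extend the partial coloring with the lowest color when it is valid, and advance to the next color when it is not. An illustrative sketch, not the author's implementation (adjacency as a dict of neighbour sets is my own assumption):

```python
def three_color(adj):
    """Backtracking 3-coloring, trying colors 0, 1, 2 in order.

    adj: dict vertex -> set of neighbours.
    Returns a dict vertex -> color, or None if not 3-colorable.
    """
    vertices = sorted(adj)
    coloring = {}

    def ok(v, c):
        # the partial coloring stays valid if no colored neighbour already has c
        return all(coloring.get(u) != c for u in adj[v])

    def solve(i):
        if i == len(vertices):
            return True
        v = vertices[i]
        for c in (0, 1, 2):          # lexicographic: lowest color first
            if ok(v, c):
                coloring[v] = c
                if solve(i + 1):
                    return True
                del coloring[v]      # backtrack
        return False

    return coloring if solve(0) else None
```

An odd cycle such as C5 is 3-colorable; K4 is not, and the search correctly reports failure.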

    3-coloring greedily
    • This is a simple approach, very fast in practice, but cannot color all colorable graphs.
    • Order vertices by descending order of degree.
    • Greedily choose the lowest color that is valid.
    • Repeat above step till all vertices are colored.
    • Runs in O(v^2) time.
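A minimal sketch of the greedy approach, under the same assumed adjacency representation (not the author's code):

```python
def greedy_color(adj):
    """Greedy coloring: vertices in descending degree order, lowest valid color.

    adj: dict vertex -> set of neighbours. Returns dict vertex -> color.
    """
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    coloring = {}
    for v in order:
        used = {coloring[u] for u in adj[v] if u in coloring}
        c = 0
        while c in used:   # lowest color not used by a neighbour
            c += 1
        coloring[v] = c
    return coloring
```

The coloring is always valid, but it may use more than 3 colors on some 3-colorable graphs, which is why this method cannot color all colorable graphs with 3 colors.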

    Sunday, April 3, 2011

    Flajolet-Martin algorithm

    Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one pass. If the stream contains $n$ elements with $m$ of them unique, this algorithm runs in $O(n)$ time and needs $O(log(m))$ memory. So the real innovation here is the memory usage, in that an exact, brute-force algorithm would need $O(m)$ memory (e.g. think "hash map").

    As noted, this is an approximate algorithm. It gives an approximation for the number of unique objects, along with a standard deviation $\sigma$, which can then be used to determine bounds on the approximation with a desired maximum error $\epsilon$, if needed.

    Given below are the following:
    • intuition behind the algorithm
    • the algorithm itself
    • a java-based implementation
    • some results using that implementation and
    • some closing thoughts.

    Intuition
    If we had a good, random hash function that acted on strings and generated integers, what can we say about the generated integers? Since they are random themselves, we would expect:
    • $1/2$ of them to have their binary representation end in $0$ (i.e. divisible by $2$),
    • $1/4$ of them to have their binary representation end in $00$ (i.e. divisible by $4$)
    • $1/8$ of them to have their binary representation end in $000$ (i.e. divisible by $8$)
    • and in general, $1/2^n$ of them to have their binary representation end in $0^n$.
    Turning the problem around, if the hash function generated an integer ending in $0^m$ bits (and it also generated integers ending in $0^{m-1}$ bits, $0^{m-2}$ bits, ..., $0^1$ bits), intuitively, the number of unique strings is around $2^m$.

    To facilitate the above, this algorithm maintains 1 bit for each $0^i$ seen - i.e. 1 bit for 0, another for 00, another for 000, and so on. The output of the algorithm is based on the maximum of consecutive $0^i$ seen.

    The Flajolet-Martin algorithm
    This is an informal description. Formal treatment can be found in the original paper listed in the reference section.
    1. Create a bit vector (bit array) of sufficient length $L$, such that $2^L>n$, the number of elements in the stream. Usually a 64-bit vector is sufficient since $2^{64}$ is quite large for most purposes.
    2. The i-th bit in this vector/array represents whether we have seen a hash function value whose binary representation ends in $0^i$. So initialize each bit to 0.
    3. Generate a good, random hash function that maps input (usually strings) to natural numbers.
    4. Read input. For each word, hash it and determine the number of trailing zeros. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
    5. Once the input is exhausted, get the index of the first 0 in the bit array (call this R). This is just the number of consecutive 1s at the start of the bit vector (i.e. we have seen hash values ending in 0, 00, ..., up to $0^{R-1}$).
    6. Calculate the number of unique words as $2^R/\phi$, where $\phi$ is 0.77351. A proof for this can be found in the original paper listed in the reference section.
    7. The standard deviation of R is a constant: $\sigma(R)=1.12$. (In other words, by the empirical rule of statistics, R can be off by about 1 for roughly 1-0.68=32% of observations, by about 2 for roughly 1-0.95=5%, and by about 3 for roughly 1-0.997=0.3%.) This implies that our count can be off by a factor of 2 for about 32% of observations, by a factor of 4 for about 5%, by a factor of 8 for about 0.3%, and so on.
    To improve accuracy of this approximation algorithm, we do the following:
    1. (Averaging) Use multiple hash functions and use the average R instead.
    2. (Bucketing) Averages are susceptible to large fluctuations. So use multiple buckets of hash functions from the above step and use the median of the average R. This gives fairly good accuracy.
    3. Overall accuracy of this algorithm can be tuned by using appropriate number of hash functions in the averaging and bucketing steps. Of course, if more accuracy is desired, more hash functions need to be used, which implies higher computation cost.
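The post's implementation is in Java; below is a rough single-hash Python sketch of steps 1-6, without the averaging and bucketing refinements. The md5-based hash is a stand-in of my own, not the paper's linear hash.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin paper

def trailing_zeros(n):
    """Number of trailing zero bits in the binary representation of n (n > 0)."""
    return (n & -n).bit_length() - 1

def flajolet_martin(stream, num_bits=64):
    """Single-hash Flajolet-Martin estimate of the number of distinct elements."""
    bits = 0
    for item in stream:
        # stand-in hash: md5 truncated to num_bits
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) % (1 << num_bits)
        if h:
            bits |= 1 << trailing_zeros(h)
    # R = index of the lowest unset bit
    r = 0
    while bits & (1 << r):
        r += 1
    return (1 << r) / PHI
```

With a single hash function the estimate is only accurate to within a power of 2 or so, which is exactly why the averaging and bucketing steps above are needed.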

    Java-based implementation
    The code can be found here: FlajoletMartin.java. It uses lucene-core-3.0.3.jar or better.

    Results using the above implementation
    • Wikipedia article on "United States Constitution" had 3978 unique words. When run ten times, the Flajolet-Martin algorithm reported values of 4902, 4202, 4202, 4044, 4367, 3602, 4367, 4202, 4202 and 3891, for an average of 4198. As can be seen, the average is about right, but individual runs deviate by roughly -400 to +1000.
    • Wikipedia article on "George Washington" had 3252 unique words. When run ten times, the reported values were 4044, 3466, 3466, 3466, 3744, 3209, 3335, 3209, 3891 and 3088, for an average of 3492.
    Closing thoughts
    • The Flajolet-Martin algorithm approximates the number of unique elements quite well, using just O(log m) memory, where m is the number of unique words.
    • During implementation, it was observed that this algorithm is quite sensitive to the hash function parameters. The hash functions suggested in the original paper, $h(x)=\big(M+N\sum_j \mathrm{ord}(x_j)\cdot 128^j\big)\bmod 2^L$, work properly only when $N$ is odd. Otherwise, the algorithm always reports the number of unique elements as 1 or 2! I chose both $M$ and $N$ as odd, as can be seen in the implementation.
    References
    1. P. Flajolet and G. N. Martin, "Probabilistic Counting Algorithms for Data Base Applications", Journal of Computer and System Sciences, 1985. PDF - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.3869&rep=rep1&type=pdf
    2. J. Ullman and A. Rajaraman, Mining of Massive Datasets, Chapter 4 - available at http://infolab.stanford.edu/~ullman/mmds/ch4.pdf