Ravi is an armchair futurist and an aspiring mad scientist. His mission is to create simplicity out of complexity and order out of chaos.

Saturday, April 17, 2010

CAP theorem and its implications

CAP theorem
No distributed system can guarantee "consistency" (C), "availability" (A) and "partition tolerance" (P) at the same time.

Definition of Terms
The words "consistency", "availability" and "partition tolerance" are a common source of confusion as they mean different things to different people. Here is the definition by Gilbert and Lynch in their formal proof of the theorem, followed by my notes.

• Consistency: "There must exist a total order on all operations, such that each operation looks as if it were completed at a single instant." In software engineering parlance, this is equivalent to having the system behave like it is executing requests in a single thread, in some sequence. A system is therefore consistent if each request sees the state of the system as if it were the only request being processed by the system at that time.
• Availability: "For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response." In other words, the system is available if it is able to respond. The formal proof makes no distinction between arbitrarily long response times and failure to respond.
• Partition Tolerance: "In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another." A system is partitioned if messages from one component to another are lost. This partitioning could be temporary or permanent, but that doesn't affect the outcome of the proof. In other words, a system is partition-tolerant if it responds correctly in the event of a network or node failure. Also, per the formal proof, "no set of failures less than total network failure is allowed to cause the system to respond incorrectly". As a quick mnemonic, I prefer to think of this property as "fault tolerance".

Implications of the theorem
Large systems need horizontal scale and hence, partition-tolerance. E.g. if one machine becomes unavailable, the system shouldn't fail. This implies that large systems only have "consistency" and "availability" to play with.

Consistency over availability
Some systems just need to be consistent. Banks and financial institutions fall in this category. It is easy to see why.  If money is transferred from account A to account B, you don't want to see it disappear from both A and B (the bank may like it but A and B won't) nor do you want to see it present in both (A and B will like it, but the banks won't!). value consistency over availability (e.g. banks), while others value availability over consistency .

Availability over consistency
This is the choice made by most large systems today (e.g. eBay, Amazon). It turns out that sacrificing consistency isn't as big a deal as it might initially feel if inconsistencies get reconciled within an acceptable time frame. This is the concept of  "eventual consistency", as opposed to total "consistency". If you list an item on eBay, you may be able to search for it only when the search sub-system gets consistent with the listing subsystem; however, both remain available while the system is in an inconsistent state.

Closing remarks
Just as a physical system has inherent limits (e.g. the speed of light or absolute zero temperature), a distributed system has one such limit involving the parameters of consistency, availability and partition tolerance. This limit helps us in making informed decisions when optimizing on the parameters that the system values most.

References

1 comment:

1. The more I read on, the more it seems like common sense. Except, it aint all that common.

Good work.