
Figure 1: Replicated state machine architecture. The con-
sensus al gorithm manages a replicated log containing state
machine commands f rom clients. The state machines process
identical sequences of commands from the logs, so they pro-
duce the same outputs.
used to solve a variety of fault tolerance problems in dis-
tributed systems. For example, large-scale systems that
have a single cluster leader, such as GFS [8], HDFS [38],
and RAMCloud [33], typically use a separate replicated
state machine to manage leader election and stor e config-
uration information that must survive leader crashes. Ex-
amples of replicated state machines include Chubby [2]
and ZooKeeper [11].
Replicated state machines are typically implemented
using a replicated log, as shown in Figure 1. Each server
stores a log containing a series of comma nds, which its
state ma chine executes in order. Each log contains the
same comma nds in the same order, so each state ma-
chine processes the same seq uence of commands. Since
the state machines are determin istic, each computes the
same state and the same sequen c e of outputs.
Keeping the replicated log consistent is the job of the
consensus algorithm. The consensus modu le on a server
receives commands from clients and adds them to its log.
It communicate s with the consensus m odules on other
servers to ensure that every log eventually contains the
same requests in the same order, even if some servers fail.
Once c ommands are properly re plicated, each server’s
state m a chine processes them in log order, and the out-
puts are returned to clients. As a result, the servers appear
to form a single, high ly reliable state machine.
Consensus algorithms for practica l systems typically
have the f ollowing properties:
• They ensure safety (never returning an incorrect re-
sult) under all non-Byzantine conditions, including
network delays, partitions, and packet loss, duplica-
tion, and reo rdering.
• They are fully functional (available) as long as any
majority o f the servers are operational and can com-
municate with each other and with clients. Thus, a
typical cluster of five servers can tolerate the failure
of any two servers. Servers are assumed to fail by
stopping; they may later recover from state on stable
storage and rejoin the cluster.
• They do not depend on timing to ensure the consis-
tency of the logs: faulty clocks and extreme message
delays can, at worst, cause availability problems.
• In the co mmon case, a command can complete as
soon as a majority of the cluster has r e sponded to a
single round of remote procedure c alls; a minority of
slow servers need not impact overall system perfor-
mance.
3 What’s wrong with Paxos?
Over the last ten years, Leslie Lamport’s Paxos proto-
col [15] has become almost synonymous with consensus:
it is the protocol most commonly taught in courses, and
most imp le mentations o f consensus use it as a starting
point. Paxos first defines a protocol c apable of reaching
agreement on a single decision, such as a single replicated
log entry. We refer to this subset as single-decree Paxos.
Paxos then combines multiple instances of this protocol to
facilitate a series of decisions such as a log (m ulti-Paxos).
Paxos ensure s both safety and liveness, and it supports
changes in c luster membership. Its correctness has been
proven, and it is efficient in the normal case.
Unfortu nately, Paxos has two significant drawbacks.
The first drawback is that Paxos is exceptionally diffi-
cult to understand. The full explanation [15] is noto ri-
ously opaqu e; few pe ople succeed in und e rstanding it, and
only with great effort. As a result, there have been several
attempts to explain Paxos in simpler terms [16, 20, 21].
These explanations focus on the single-decree subset, yet
they are still ch allenging. In an informal survey of atten-
dees at NSDI 2012, we found few people who were com-
fortable with Paxos, even amon g seasoned researchers.
We struggled with Paxos ourselves; we were not able to
understand the complete protocol until after reading sev-
eral simplified explanations and designing our own a lter-
native protocol, a process that took almost a year.
We hypothesize that Paxos’ opaqueness derives from
its choice of the single-decr ee subset as its foundation.
Single-decree Paxos is dense and subtle: it is divided into
two stages that do n ot have simple intuitive explanations
and cannot be understood independently. Because of this,
it is difficult to develop intuitions about why the single-
decree protocol works. The composition rules for multi-
Paxos add significant additional complexity and subtlety.
We believe that the overall problem of reaching consensus
on multip le decisions (i.e., a log in stead of a single entry)
can be decomposed in other ways that are more direct and
obvious.
The second problem with Paxos is that it does not pro-
vide a good foundation for building practical implemen-
tations. One re ason is that there is no widely agreed-
upon algorithm for multi-Paxos. Lamport’s descriptions
are mostly about single-decree Paxos; he sketched possi-
ble approaches to multi-Paxos, but many details are miss-
ing. There have been several attempts to flesh out and op-
timize Paxos, such as [26], [39], and [13], but these differ
2
评论