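etcd (https://github.com/etcd-io/etcd) has already reached v3 and is a core component of many middleware systems, Kubernetes (k8s) among them. In this series of articles we will analyze its source code and design. Parts of the content are translated from the official documentation at https://etcd.io/docs/v3.5/install/.
First, let's try installing from source. Enter the source directory and build: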
% cd etcd
% ./scripts/build.sh
(cd etcdctl && env GO_BUILD_FLAGS= CGO_ENABLED=0 GO_BUILD_FLAGS= GOOS=darwin GOARCH=amd64 go build -trimpath -installsuffix=cgo -ldflags=-X=go.etcd.io/etcd/api/v3/version.GitSHA=8da2a5b -o=../bin/etcdctl .)
SUCCESS: etcd_build (GOARCH=amd64)
After the build finishes, check the version:
% ./bin/etcd --version
etcd Version: 3.6.0-alpha.0
Git SHA: 8da2a5b
Go Version: go1.19
Go OS/Arch: darwin/amd64
Add the bin directory to PATH:
% export PATH="$PATH:`pwd`/bin"
Then start the server:
% etcd
{"level":"warn","ts":"2023-06-07T09:17:23.681691+0800","caller":"embed/config.go:708","msg":"Running http and grpc server on single port. This is not recommended for production."}
The smallest unit etcd operates on is a key-value pair, which we can manipulate with etcdctl:
% etcdctl put greeting "Hello, etcd"
OK
% etcdctl get greeting
greeting
Hello, etcd
etcd's services fall into two main categories.
Services for dealing with etcd’s key space:
KV - Creates, updates, fetches, and deletes key-value pairs.
Watch - Monitors changes to keys.
Lease - Primitives for consuming client keep-alive messages.
Services which manage the cluster itself:
Auth - Role based authentication mechanism for authenticating users.
Cluster - Provides membership information and configuration facilities.
Maintenance - Takes recovery snapshots, defragments the store, and returns per-member status information.
For example, the core interface for KV queries looks like this:
service KV {
  Range(RangeRequest) returns (RangeResponse)
  ...
}
All responses from the etcd API have an attached response header which includes cluster metadata for the response. It is defined as follows:
message ResponseHeader {
  uint64 cluster_id = 1;
  uint64 member_id = 2;
  int64 revision = 3;
  uint64 raft_term = 4;
}
A key-value pair is the smallest unit the API can manipulate. It is defined as follows:
message KeyValue {
  bytes key = 1;
  int64 create_revision = 2;
  int64 mod_revision = 3;
  int64 version = 4;
  bytes value = 5;
  int64 lease = 6;
}
Distributed locks built on etcd use the creation revision (create_revision) to decide who owns the lock. The modification revision (mod_revision) is used under MVCC to detect version conflicts and to implement compare-and-swap (CAS) logic.
etcd maintains a 64-bit cluster-wide counter, the store revision, that is incremented each time the key space is modified. The revision serves as a global logical clock, sequentially ordering all updates to the store. The change represented by a new revision is incremental; the data associated with a revision is the data that changed the store. Internally, a new revision means writing the changes to the backend’s B+tree, keyed by the incremented revision.
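The bookkeeping behind these revision fields can be illustrated with a small in-memory model. This is only a sketch for intuition, not etcd's implementation: `ToyStore` and its methods are hypothetical names, the starting revision of 1 is a choice made for the toy, and there is no B+tree, Raft, or history retention here.

```python
from dataclasses import dataclass


@dataclass
class KeyValue:
    key: bytes
    create_revision: int  # store revision at which the key was created
    mod_revision: int     # store revision of the key's last modification
    version: int          # number of modifications since creation
    value: bytes


class ToyStore:
    """Hypothetical in-memory model of the revision counters (not etcd code)."""

    def __init__(self):
        self.revision = 1  # toy choice for the initial store revision
        self.kvs = {}

    def put(self, key, value):
        self.revision += 1  # every change to the key space bumps the counter
        old = self.kvs.get(key)
        if old is None:
            kv = KeyValue(key, self.revision, self.revision, 1, value)
        else:
            kv = KeyValue(key, old.create_revision, self.revision,
                          old.version + 1, value)
        self.kvs[key] = kv
        return self.revision

    def delete(self, key):
        if key in self.kvs:
            self.revision += 1
            del self.kvs[key]
        return self.revision


store = ToyStore()
store.put(b"greeting", b"Hello, etcd")
store.put(b"other", b"x")
store.put(b"greeting", b"Hello again")
kv = store.kvs[b"greeting"]
print(kv.create_revision, kv.mod_revision, kv.version, store.revision)
```

Note how the write to `other` advances the store revision even though `greeting` is untouched: the revision is global, while `version` is per key.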
etcd's data model indexes all keys over a flat binary key space. The range (query) request and response are defined as follows:
message RangeRequest {
  enum SortOrder {
    NONE = 0; // default, no sorting
    ASCEND = 1; // lowest target value first
    DESCEND = 2; // highest target value first
  }
  enum SortTarget {
    KEY = 0;
    VERSION = 1;
    CREATE = 2;
    MOD = 3;
    VALUE = 4;
  }
  bytes key = 1;
  bytes range_end = 2;
  int64 limit = 3;
  int64 revision = 4;
  SortOrder sort_order = 5;
  SortTarget sort_target = 6;
  bool serializable = 7;
  bool keys_only = 8;
  bool count_only = 9;
  int64 min_mod_revision = 10;
  int64 max_mod_revision = 11;
  int64 min_create_revision = 12;
  int64 max_create_revision = 13;
}
message RangeResponse {
  ResponseHeader header = 1;
  repeated mvccpb.KeyValue kvs = 2;
  bool more = 3;
  int64 count = 4;
}
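To make the roles of `key`, `range_end`, `limit`, `more`, and `count` concrete, here is a minimal sketch over a sorted list of keys. It assumes the half-open interval `[key, range_end)`, a single-key lookup when `range_end` is empty, and `limit = 0` meaning "no limit". The function name `range_query` is hypothetical, and sorting options, revisions, and the special `range_end` values are deliberately left out.

```python
import bisect


def range_query(sorted_keys, key, range_end=b"", limit=0):
    """Toy Range: returns (keys, more, count) for a sorted list of byte keys."""
    if not range_end:
        # No range_end: look up exactly one key.
        i = bisect.bisect_left(sorted_keys, key)
        found = i < len(sorted_keys) and sorted_keys[i] == key
        matched = [key] if found else []
    else:
        # Half-open interval [key, range_end).
        lo = bisect.bisect_left(sorted_keys, key)
        hi = bisect.bisect_left(sorted_keys, range_end)
        matched = sorted_keys[lo:hi]
    count = len(matched)  # total keys in the range, regardless of limit
    if limit > 0 and count > limit:
        return matched[:limit], True, count
    return matched, False, count


keys = sorted([b"a", b"b", b"c", b"d"])
print(range_query(keys, b"a", b"c"))
print(range_query(keys, b"a", b"e", limit=2))
print(range_query(keys, b"b"))
```

Because the key space is flat and ordered by raw bytes, a "directory listing" is just a range whose bounds bracket a common prefix.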
The put request is defined in a similar way, and there is a corresponding delete request as well:
message PutRequest {
  bytes key = 1;
  bytes value = 2;
  int64 lease = 3;
  bool prev_kv = 4;
  bool ignore_value = 5;
  bool ignore_lease = 6;
}
message PutResponse {
  ResponseHeader header = 1;
  mvccpb.KeyValue prev_kv = 2;
}
etcd models a transaction as an atomic If/Then/Else construct over the key-value store. Transactions can be used for protecting keys from unintended concurrent updates, building compare-and-swap operations, and developing higher-level concurrency control. All comparisons are applied atomically; if all comparisons are true, the transaction is said to succeed and etcd applies the transaction’s then / success request block, otherwise it is said to fail and applies the else / failure request block.
This model maps onto the following messages:
message Compare {
  enum CompareResult {
    EQUAL = 0;
    GREATER = 1;
    LESS = 2;
    NOT_EQUAL = 3;
  }
  enum CompareTarget {
    VERSION = 0;
    CREATE = 1;
    MOD = 2;
    VALUE = 3;
  }
  CompareResult result = 1;
  // target is the key-value field to inspect for the comparison.
  CompareTarget target = 2;
  // key is the subject key for the comparison operation.
  bytes key = 3;
  oneof target_union {
    int64 version = 4;
    int64 create_revision = 5;
    int64 mod_revision = 6;
    bytes value = 7;
  }
}
message RequestOp {
  // request is a union of request types accepted by a transaction.
  oneof request {
    RangeRequest request_range = 1;
    PutRequest request_put = 2;
    DeleteRangeRequest request_delete_range = 3;
  }
}
All together, a transaction is issued with a Txn API call, which takes a TxnRequest:
message TxnRequest {
  repeated Compare compare = 1;
  repeated RequestOp success = 2;
  repeated RequestOp failure = 3;
}
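The If/Then/Else construct, and the way a lock can be taken by comparing a key's creation revision against 0, can be sketched with a toy store. Everything here is hypothetical illustration: `ToyKV`, its tuple-based compare format, and the put-only branches are simplifications. The sketch assumes a missing key compares as 0 for the numeric targets, supports only VERSION, CREATE, and MOD, and is single-threaded, so atomicity is trivial rather than enforced.

```python
import operator

# CompareResult names mapped to Python comparison functions.
OPS = {
    "EQUAL": operator.eq,
    "GREATER": operator.gt,
    "LESS": operator.lt,
    "NOT_EQUAL": operator.ne,
}


class ToyKV:
    """Hypothetical single-threaded model of Txn (not etcd code)."""

    def __init__(self):
        self.revision = 1
        self.data = {}  # key -> {"value", "CREATE", "MOD", "VERSION"}

    def _field(self, key, target):
        meta = self.data.get(key)
        return meta[target] if meta else 0  # missing key compares as 0

    def txn(self, compares, success, failure):
        # compares: list of (key, target, result, expected)
        # success / failure: lists of (key, value) puts
        succeeded = all(
            OPS[result](self._field(key, target), expected)
            for key, target, result, expected in compares
        )
        branch = success if succeeded else failure
        if branch:
            self.revision += 1  # the toy bumps the revision once per txn
        for key, value in branch:
            old = self.data.get(key)
            self.data[key] = {
                "value": value,
                "CREATE": old["CREATE"] if old else self.revision,
                "MOD": self.revision,
                "VERSION": old["VERSION"] + 1 if old else 1,
            }
        return succeeded


def try_lock(kv, owner):
    # "If the lock key does not exist yet, then create it with my name."
    return kv.txn([(b"lock", "CREATE", "EQUAL", 0)], [(b"lock", owner)], [])


kv = ToyKV()
first = try_lock(kv, b"client-a")
second = try_lock(kv, b"client-b")
print(first, second, kv.data[b"lock"]["value"])
```

The second caller fails the comparison because the key now has a non-zero creation revision, so its empty failure block runs and the store is left unchanged.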
The transaction result is defined as follows:
message TxnResponse {
  ResponseHeader header = 1;
  bool succeeded = 2;
  repeated ResponseOp responses = 3;
}
message ResponseOp {
  oneof response {
    RangeResponse response_range = 1;
    PutResponse response_put = 2;
    DeleteRangeResponse response_delete_range = 3;
  }
}
Each change to a key is reported to watchers as an Event:
message Event {
  enum EventType {
    PUT = 0;
    DELETE = 1;
  }
  EventType type = 1;
  KeyValue kv = 2;
  KeyValue prev_kv = 3;
}
Watches are long-running requests and use gRPC streams to stream event data. A single watch stream can multiplex many distinct watches by tagging events with per-watch identifiers.
Watches make three guarantees about events: they are ordered, reliable, and atomic.
Ordered - events are ordered by revision; an event will never appear on a watch if it precedes an event in time that has already been posted.
Reliable - a sequence of events will never drop any subsequence of events; if there are events ordered in time as a < b < c, then if the watch receives events a and c, it is guaranteed to receive b.
Atomic - a list of events is guaranteed to encompass complete revisions; updates in the same revision over multiple keys will not be split over several lists of events.
message WatchCreateRequest {
  bytes key = 1;
  bytes range_end = 2;
  int64 start_revision = 3;
  bool progress_notify = 4;
  enum FilterType {
    NOPUT = 0;
    NODELETE = 1;
  }
  repeated FilterType filters = 5;
  bool prev_kv = 6;
}
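A toy event hub helps show how `start_revision` and `filters` shape what a watcher sees. This sketch assumes that `start_revision` is inclusive, that 0 means "only changes from now on", and that NOPUT / NODELETE drop put / delete events respectively. `ToyWatchHub` is a hypothetical name; there is no gRPC stream, multiplexing, or compaction, and events are plain tuples appended to a list.

```python
class ToyWatchHub:
    """Hypothetical in-memory model of watch delivery (not etcd code)."""

    def __init__(self):
        self.revision = 1
        self.keys = set()
        self.history = []  # (revision, type, key, value), ordered by revision
        self.watches = []

    def _matches(self, watch, event):
        _, etype, key, _ = event
        if watch["range_end"]:
            in_range = watch["key"] <= key < watch["range_end"]
        else:
            in_range = key == watch["key"]
        dropped = (etype == "PUT" and "NOPUT" in watch["filters"]) or (
            etype == "DELETE" and "NODELETE" in watch["filters"])
        return in_range and not dropped

    def _record(self, etype, key, value):
        self.revision += 1
        event = (self.revision, etype, key, value)
        self.history.append(event)
        for watch in self.watches:
            if self._matches(watch, event):
                watch["events"].append(event)

    def put(self, key, value):
        self.keys.add(key)
        self._record("PUT", key, value)

    def delete(self, key):
        if key in self.keys:  # deleting a missing key produces no event
            self.keys.remove(key)
            self._record("DELETE", key, None)

    def watch(self, key, range_end=b"", start_revision=0, filters=()):
        watch = {"key": key, "range_end": range_end,
                 "filters": set(filters), "events": []}
        if start_revision > 0:
            # Replay history from start_revision (inclusive) before going live.
            watch["events"] = [e for e in self.history
                               if e[0] >= start_revision
                               and self._matches(watch, e)]
        self.watches.append(watch)
        return watch["events"]


hub = ToyWatchHub()
hub.put(b"a", b"1")   # revision 2
hub.put(b"b", b"1")   # revision 3
events = hub.watch(b"a", start_revision=2, filters=("NODELETE",))
hub.put(b"a", b"2")   # revision 4
hub.delete(b"a")      # revision 5, dropped by the NODELETE filter
print([e[0] for e in events])
```

Replaying from history first and only then registering for live events is what keeps the toy's event list ordered by revision with no gaps.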
Leases are a mechanism for detecting client liveness: when heartbeats stop arriving, the client is considered dead. The cluster grants leases with a time-to-live. A lease expires if the etcd cluster does not receive a keepAlive within a given TTL period.
message LeaseGrantRequest {
  int64 TTL = 1;
  int64 ID = 2;
}
message LeaseRevokeRequest {
  int64 ID = 1;
}
Leases are refreshed using a bi-directional stream created with the LeaseKeepAlive API call.
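The grant / keep-alive / expire cycle can be sketched with a manually advanced clock. This is a toy model under stated assumptions: `ToyLessor`, `attach`, and `advance` are hypothetical names, time is an injected float rather than a real clock, an `ID` of 0 is taken to mean "let the lessor pick one", and keys attached to an expired or revoked lease are simply returned so the caller can delete them.

```python
class ToyLessor:
    """Hypothetical lease bookkeeping with an injected clock (not etcd code)."""

    def __init__(self):
        self.now = 0.0      # seconds, advanced manually via advance()
        self.leases = {}    # lease ID -> (ttl, deadline)
        self.attached = {}  # lease ID -> set of keys bound to the lease
        self.next_id = 1

    def grant(self, ttl, lease_id=0):
        if lease_id == 0:   # assumption: 0 means "choose an ID for me"
            lease_id = self.next_id
            self.next_id += 1
        self.leases[lease_id] = (ttl, self.now + ttl)
        self.attached[lease_id] = set()
        return lease_id

    def attach(self, lease_id, key):
        self.attached[lease_id].add(key)

    def keep_alive(self, lease_id):
        if lease_id not in self.leases:
            return False    # already expired or revoked
        ttl, _ = self.leases[lease_id]
        self.leases[lease_id] = (ttl, self.now + ttl)  # push the deadline out
        return True

    def revoke(self, lease_id):
        self.leases.pop(lease_id, None)
        return self.attached.pop(lease_id, set())

    def advance(self, seconds):
        """Move the clock forward; return {lease ID: keys} for expired leases."""
        self.now += seconds
        expired = [i for i, (_, deadline) in self.leases.items()
                   if deadline <= self.now]
        return {i: self.revoke(i) for i in expired}


lessor = ToyLessor()
lease = lessor.grant(ttl=10)
lessor.attach(lease, b"lock")
lessor.advance(6)
refreshed = lessor.keep_alive(lease)  # deadline moves from 10 to 16
still_alive = lessor.advance(6)       # clock at 12, nothing expires
expired = lessor.advance(10)          # clock at 22, lease is gone
print(refreshed, still_alive, expired)
```

Binding a lock key to a lease in this way is what lets a crashed client's lock disappear on its own once keep-alives stop.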






