Pierre Zemb's Blog

Notes about ETCD

Notes About is a blog post series in which you will find a lot of links, videos, quotes, and podcasts to click on about a specific topic. Today we will discover ETCD.

πŸ”—Overview of ETCD

As stated in the official documentation:

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.

πŸ”—History

ETCD was initially developed by CoreOS:

CoreOS built etcd to solve the problem of shared configuration and service discovery.

πŸ”—Overall architecture

The etcd key-value store is a distributed system intended for use as a coordination primitive. Like Zookeeper and Consul, etcd stores a small volume of infrequently-updated state (by default, up to 8 GB) in a key-value map, and offers strict-serializable reads, writes and micro-transactions across the entire datastore, plus coordination primitives like locks, watches, and leader election. Many distributed systems, such as Kubernetes and OpenStack, use etcd to store cluster metadata, to coordinate consistent views over data, to choose leaders, and so on.

ETCD is:

πŸ”—Consensus? Raft?

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

Raft separates the key elements of consensus, such as leader election, log replication, and safety.

ETCD contains several raft optimizations:

πŸ”—Exposed API

ETCD exposes several APIs through different gRPC services:

Keys and values are byte-oriented, but keys are ordered.
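
To make this a bit more concrete, here is a minimal sketch of a put and a prefix read using the official Go client (clientv3). The endpoint, key names and import path are assumptions that depend on your setup and etcd version.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // "go.etcd.io/etcd/clientv3" on etcd 3.4
)

func main() {
	// Connect to a local etcd member (endpoint is an assumption).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Put goes through the KV gRPC service.
	if _, err := cli.Put(ctx, "/services/api/1", "10.0.0.1:8080"); err != nil {
		panic(err)
	}

	// Because keys are ordered, a range read over a prefix is possible.
	resp, err := cli.Get(ctx, "/services/", clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s (mod revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}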

πŸ”—Transactions

// From google paxosdb paper:
// Our implementation hinges around a powerful primitive which we call MultiOp. All other database
// operations except for iteration are implemented as a single call to MultiOp. A MultiOp is applied atomically
// and consists of three components:
// 1. A list of tests called guard. Each test in guard checks a single entry in the database. It may check
// for the absence or presence of a value, or compare with a given value. Two different tests in the guard
// may apply to the same or different entries in the database. All tests in the guard are applied and
// MultiOp returns the results. If all tests are true, MultiOp executes t op (see item 2 below), otherwise
// it executes f op (see item 3 below).
// 2. A list of database operations called t op. Each operation in the list is either an insert, delete, or
// lookup operation, and applies to a single database entry. Two different operations in the list may apply
// to the same or different entries in the database. These operations are executed if guard evaluates to true.
// 3. A list of database operations called f op. Like t op, but executed if guard evaluates to false.
message TxnRequest {
  // compare is a list of predicates representing a conjunction of terms.
  // If the comparisons succeed, then the success requests will be processed in order,
  // and the response will contain their respective responses in order.
  // If the comparisons fail, then the failure requests will be processed in order,
  // and the response will contain their respective responses in order.
  repeated Compare compare = 1;
  // success is a list of requests which will be applied when compare evaluates to true.
  repeated RequestOp success = 2;
  // failure is a list of requests which will be applied when compare evaluates to false.
  repeated RequestOp failure = 3;
}
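
As a sketch of how this maps to the Go client, clientv3 exposes the guard/success/failure triplet through the Txn builder's If/Then/Else. The key and values below are made up for the example.

package example

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// compareAndSwap flips a key from oldValue to newValue atomically:
// the Compare is the guard, Then is the success branch, Else the failure branch.
func compareAndSwap(ctx context.Context, cli *clientv3.Client, key, oldValue, newValue string) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(key), "=", oldValue)).
		Then(clientv3.OpPut(key, newValue)).
		Else(clientv3.OpGet(key)).
		Commit()
	if err != nil {
		return false, err
	}
	// Succeeded tells us which branch was executed.
	return resp.Succeeded, nil
}

This primitive is the usual building block for optimistic concurrency on top of etcd.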

πŸ”—Versioned data

Each key/value pair carries revision metadata. The keyspace has a single, store-wide revision that is incremented on every change, and each key additionally has a version that starts at 1 when the key is created and is incremented each time the key is updated.

To keep the keyspace history from growing indefinitely, one can call the Compact gRPC method:

Compacting the keyspace history drops all information about keys superseded prior to a given keyspace revision.
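
Here is a hedged sketch of playing with revisions from the Go client: WithRev pins a read to an older revision, and Compact drops history below a given revision. The revision arithmetic is only illustrative.

package example

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readOldRevision reads a key as it was at an earlier revision,
// then compacts everything below the current revision.
func readOldRevision(ctx context.Context, cli *clientv3.Client, key string) error {
	// The response header carries the store's current revision.
	cur, err := cli.Get(ctx, key)
	if err != nil {
		return err
	}

	// Read the same key pinned at an older revision (current - 1 is illustrative;
	// it must be a revision that has not been compacted yet).
	old, err := cli.Get(ctx, key, clientv3.WithRev(cur.Header.Revision-1))
	if err != nil {
		return err
	}
	fmt.Printf("now %d kvs, one revision ago %d kvs\n", len(cur.Kvs), len(old.Kvs))

	// Compact: history before the current revision is no longer readable.
	_, err = cli.Compact(ctx, cur.Header.Revision)
	return err
}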

πŸ”—Lease

// this message represents a Lease
message Lease {
  // TTL is the advisory time-to-live in seconds. Expired lease will return -1.
  int64 TTL = 1;
  // ID is the requested ID for the lease. If ID is set to 0, the lessor chooses an ID.
  int64 ID = 2;

  int64 insert_timestamp = 3;
}
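
In practice a lease is usually granted from a client, attached to a key, and refreshed with keep-alives; if the keep-alives stop, etcd deletes the key after the TTL. A minimal sketch with the Go client (TTL and key name are arbitrary):

package example

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// registerWithLease attaches a key to a 10-second lease and keeps it alive.
// If this process stops sending keep-alives, etcd deletes the key after the TTL.
func registerWithLease(ctx context.Context, cli *clientv3.Client) error {
	// Grant a lease with a 10-second TTL; the lessor picks the ID.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		return err
	}

	// Attach the key to the lease.
	if _, err := cli.Put(ctx, "/liveness/worker-1", "alive", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}

	// KeepAlive refreshes the lease in the background for as long as ctx lives.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ch { // drain keep-alive responses
		}
	}()
	return nil
}

This pattern is the basis of liveness keys and service registration on top of etcd.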

πŸ”—Watches

message Watch {
  // key is the key to register for watching.
  bytes key = 1;

  // range_end is the end of the range [key, range_end) to watch. If range_end is not given,
  // only the key argument is watched. If range_end is equal to '\0', all keys greater than
  // or equal to the key argument are watched.
  // If the range_end is one bit larger than the given key,
  // then all keys with the prefix (the given key) will be watched.
  bytes range_end = 2;

  // If watch_id is provided and non-zero, it will be assigned to this watcher.
  // Since creating a watcher in etcd is not a synchronous operation,
  // this can be used to ensure that ordering is correct when creating multiple
  // watchers on the same stream. Creating a watcher with an ID already in
  // use on the stream will cause an error to be returned.
  int64 watch_id = 7;
}
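
Through the Go client, the same idea looks like this: Watch returns a channel of events, and WithPrefix builds the range_end trick described above. The prefix is made up.

package example

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchPrefix streams every change under /services/ until ctx is cancelled.
func watchPrefix(ctx context.Context, cli *clientv3.Client) {
	for resp := range cli.Watch(ctx, "/services/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			// ev.Type is PUT or DELETE; ev.Kv carries the key, value and revisions.
			fmt.Printf("%s %s -> %s (revision %d)\n", ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
		}
	}
}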

πŸ”—Linearizable reads

Section 8 of the raft paper explains the issue:

Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.

ETCD implements ReadIndex reads (more info in Diving into ETCD’s linearizable reads).
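
On the client side this shows up as an option: a plain Get is linearizable and goes through the ReadIndex path, while WithSerializable trades freshness for a cheaper local read. A small sketch:

package example

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// twoKindsOfReads contrasts the default linearizable read with a serializable one.
func twoKindsOfReads(ctx context.Context, cli *clientv3.Client, key string) error {
	// Default: linearizable. The serving member confirms it is still up to date
	// via ReadIndex before answering, so the result is never stale.
	if _, err := cli.Get(ctx, key); err != nil {
		return err
	}

	// Serializable: answered from the local member's state without the ReadIndex
	// round-trip. Faster, but possibly stale.
	_, err := cli.Get(ctx, key, clientv3.WithSerializable())
	return err
}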

πŸ”—How ETCD is using bbolt

bbolt is the underlying key-value store used in etcd. A bucket called key stores the data, keyed by revision. An in-memory B-tree index then maps user keys to their revisions, so reads first go through the B-tree to find the revision, then through bbolt to fetch the value.

From a GitHub issue:

Note that the underlying bbolt mmaps its file in memory. For better performance, usually it is a good idea to ensure the physical memory available to etcd is larger than its data size.
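
As an illustration, the bbolt file of a stopped member can be opened directly and the key bucket iterated: the bbolt keys are revisions and the values are protobuf-encoded key-value entries. The path below is an assumption about the data directory layout, and this should only be done offline.

package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Typical location inside an etcd data directory (assumption).
	db, err := bolt.Open("default.etcd/member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		// etcd keeps its MVCC data in a bucket named "key".
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf("bucket %q not found", "key")
		}
		c := b.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			// k encodes the revision, v is a protobuf-encoded mvccpb.KeyValue.
			fmt.Printf("revision bytes %x -> %d value bytes\n", k, len(v))
		}
		return nil
	}); err != nil {
		panic(err)
	}
}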

πŸ”—ETCD in K8S

The interface can be found here.

πŸ”—Jepsen

The Jepsen team tested etcd 3.4.3; here are some quotes:

In our tests, etcd 3.4.3 lived up to its claims for key-value operations: we observed nothing but strict-serializable consistency for reads, writes, and even multi-key transactions, during process pauses, crashes, clock skew, network partitions, and membership changes.

Watches appear correct, at least over single keys. So long as compaction does not destroy historical data while a watch isn’t running, watches appear to deliver every update to a key in order.

However, etcd locks (like all distributed locks) do not provide mutual exclusion. Multiple processes can hold an etcd lock concurrently, even in healthy clusters with perfectly synchronized clocks.

If you use etcd locks, consider whether those locks are used to ensure safety, or simply to improve performance by probabilistically limiting concurrency. It’s fine to use etcd locks for performance, but using them for safety might be risky.

πŸ”—Operation notes

πŸ”—Deployment tips

From the official documentation:

Since etcd writes data to disk, SSD is highly recommended. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 8GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it.

πŸ”—Defrag

After compacting the keyspace, the backend database may exhibit internal fragmentation. Defragmentation is issued on a per-member basis so that cluster-wide latency spikes may be avoided.

Defrag basically rewrites the bbolt file on disk, copying the live keys into a new file to reclaim the free space left by compaction, and then reopens it.
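
For example, each member can be defragmented in turn with etcdctl (endpoint flags omitted here):

etcdctl defrag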

πŸ”—Snapshot

An ETCD snapshot is related to Raft's snapshot:

Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded.

A snapshot can be saved using etcdctl:

etcdctl snapshot save backup.db
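
A new member's data directory can later be seeded from that file; depending on the etcd version, the restore subcommand lives in etcdctl or etcdutl:

etcdctl snapshot restore backup.db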

πŸ”—Lease

Be careful with leader changes and leases, as they can create some issues:

The new leader extends timeouts automatically for all leases. This mechanism ensures no lease expires due to server side unavailability.

πŸ”—War stories

Tags: #notes-about #etcd