etcd's Improvements to Raft

Introduction

I recently studied the course "etcd 实战课", or "Practical etcd" in English. The author is truly an expert, getting straight to the point on many issues, which has been very insightful for me. etcd is a vast and complex system; explaining the principles of just one component like MVCC, Watch, or Lease would be impossible in a single blog post, let alone the entire etcd. Therefore, this post focuses on etcd's improvements to Raft. Specifically, we'll start from the algorithm presented in the Raft paper, using the MIT 6.824 lab 3 project (a simple K/V service built atop Raft) as a baseline, to see what optimizations etcd has made. This article covers the ReadIndex optimization for read operations, the PreCandidate mechanism to avoid unnecessary elections, and some other minor optimizations.

ReadIndex: Optimizing Read Operations

The approach of MIT 6.824's K/V service for handling read operations is the same as for writes: only the Leader can process the request. When a request is received, the Leader replicates the new log entry to Followers over the network. Once a majority of Followers have persisted this log entry, the Leader commits it, applies it to the state machine, and then returns the result to the client.

The advantage of this method is the simplicity of coding, but the problem is that disk I/O latency significantly impacts request latency. The state machine is stored in memory; it seems unreasonable to require writing a log entry to disk first just to read data from memory.

The Raft paper mentions an optimization for read operations to avoid persisting log entries: after receiving a read request, the Leader first sends heartbeat packets to other nodes in the cluster. If a majority responds, confirming the Leader's legitimacy, it then reads the data directly from memory and returns it to the client.

etcd's linearizable reads take this a step further. Through the ReadIndex mechanism, it allows Followers to handle read requests while still guaranteeing linearizability. When a Follower receives a read request, it first queries the Leader for its commitIndex. Upon receiving this ReadIndex request, the Leader also sends heartbeat packets to other nodes in the cluster to confirm its legitimacy, then returns its commitIndex to the Follower. After receiving it, the Follower continuously applies log entries to its state machine until its lastApplied catches up to the Leader's commitIndex. Only then does the Follower process the read request, reading data from its state machine and returning it to the client.

etcd's support for linearizable read operations on Followers undoubtedly greatly improves QPS. Additionally, we can see that etcd is suitable for read-heavy, write-light scenarios. If there are many write requests, the time for a Follower to catch up to the Leader's commitIndex will be long, leading to high latency for read requests.

PreCandidate: Avoiding Unnecessary Elections

When a Follower loses connection with other nodes due to network issues, it may not receive the Leader's heartbeat packets, triggering an election timeout and transitioning to Candidate state. Since no nodes vote for it, the election timeout fires repeatedly, causing it to continuously increment its term and enter a new Candidate state. When the network recovers, this Follower's term is far greater than the current Leader's term, forcing the Leader to step down and triggering a new election. Although the cluster will eventually return to an available state, it's undesirable to disrupt the existing available state due to a few disconnected nodes. Therefore, we want to avoid such unnecessary elections to maintain high availability.

etcd introduces a PreCandidate state to solve this problem. When an election timeout fires, the Follower first transitions to PreCandidate. At this stage, it does not increment its term. Instead, it initiates a pre-vote. Similar to a formal vote, other nodes will only agree to this pre-vote if the node's data is sufficiently up-to-date and it has the potential to become a Leader. Only after receiving pre-votes from a majority of nodes can the PreCandidate transition to Candidate and begin a formal election.

Other Optimizations

1️⃣ etcd implements a rate-limiting check for write requests: if currently commitIndex - lastApplied > 5000, it returns a "too many requests" error to the client instead of generating a new log entry. This proactively helps avoid a series of problems like high read request latency.

2️⃣ After taking a snapshot, etcd still retains the latest 5000 log entries. This allows the Leader to minimize the need to send a full snapshot to a Follower, reducing network and other overheads.

References