Reading Reflection on the Dynamo Paper
Introduction
From a learning perspective, the Dynamo paper is well-written. Firstly, it explains many fundamental concepts in plain language, such as consistent hashing, vector clocks, and Merkle Trees. Secondly, its design is outstanding, offering much for study and reference. Finally, it further decouples and analyzes problems we often take for granted, deepening my understanding of the relevant knowledge.
This post will cover the following:
- Some fundamental concepts: Consistent Hashing, Vector Clocks, SLA (Service Level Agreement).
- Design considerations for Dynamo, including the "always writable" design goal and the problem of data conflict and its resolution.
- Dynamo's workflow: A brief introduction to its read and write processes, followed by a focus on the configuration and principles of the three parameters
(N, R, W). - Comparison of different partitioning strategies, primarily contrasting consistent hashing with fixed partitioning strategies.
Fundamental Concepts
Consistent Hashing
I initially considered omitting consistent hashing since I already understood it before reading the paper. However, for the convenience of describing Dynamo later, I'll include it here.
First, connect the range of the hash function (typically an unsigned integer, e.g., $0 \sim 2^{64} - 1$) end-to-end (after $2^{64} - 1$ comes $0$), forming a hash ring. Then, map nodes onto this hash ring. When we want to store or retrieve data, compute the hash of the data's key, map it to a point on the ring, and then move clockwise from that point; the first node encountered is the node responsible for storing that data.
The primary advantage of consistent hashing is that when nodes leave or join the cluster, only adjacent nodes on the ring are involved in data migration, thereby improving service availability.
A common optimization is the use of virtual nodes, where a single physical node corresponds to multiple virtual nodes on the ring. This helps achieve a more balanced load during storage and data migration and reflects node heterogeneity (e.g., a node with larger storage capacity can be assigned more virtual nodes).
Vector Clocks
Vector clocks are an optimistic concurrency control mechanism proposed to handle conflicts arising from concurrent data updates. They are a form of metadata formatted as [(node_1, counter_1), ..., (node_n, counter_n)]. For index k, node_k is the node's ID, and counter_k is an integer representing the number of times that node has modified the current data. For example, suppose a piece of data is modified sequentially by nodes A, B, A, and C. Its corresponding vector clocks would be [(A, 1)], [(A, 1), (B, 1)], [(A, 2), (B, 1)], and [(A, 2), (B, 1), (C, 1)].
Of course, real-world systems are more complex: concurrent updates from multiple nodes at the same moment can generate multiple vector clocks; or network jitter causing message loss might lead a node to see vector clocks that are not sequential. In any case, based on the definition of vector clocks, we can determine if data conflicts have occurred. If all counter values for the nodes in one vector clock are less than or equal to their counterparts in another vector clock, then there is a causality between the two data versions, allowing a clear distinction between old and new data. Otherwise, we know the data was updated concurrently, resulting in a conflict.
Additionally, it's easy to see that version number mechanisms can be viewed as a highly simplified and specialized subset of vector clocks.
SLA
Online explanations of SLA tend to be somewhat complicated, whereas the paper presents it very straightforwardly. I'll quote it directly here:
Clients and services engage in a Service Level Agreement (SLA), a formally negotiated contract where a client and a service agree on several system-related characteristics, which most prominently include the client's expected request rate distribution for a particular API and the expected service latency under those conditions. An example of a simple SLA is a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
Design Considerations for Dynamo
Based on Amazon's business requirements, Dynamo's design goal is an "always writable" storage service—in other words, write operations are highly available. To guarantee this, it must allow for data conflicts, temporarily sacrificing consistency. Dynamo is also an eventually consistent storage service, so data conflicts must be resolved eventually. This introduces two problems: when to resolve conflicts and who resolves them.
Most systems resolve data conflicts during write operations, which implies that write operations are not always available, while read operations have much higher availability. This is unacceptable for Dynamo, so Dynamo chooses to resolve conflicts during read operations. Specifically, for a read request, Dynamo reads the same data from multiple nodes, uses the vector clocks associated with those data to determine if conflicts exist, and attempts to resolve them. Dynamo does not perform conflict detection when processing write requests.
Using vector clocks to resolve conflicts is relatively simple: if multiple vector clocks do not conflict, the latest version can be used. However, when a conflict occurs, external intervention is required: either the storage service resolves it, or the application does. If the storage service resolves conflicts, it can only take simple strategies like "last write wins." If the application resolves them, developers can choose appropriate strategies based on business logic. In Dynamo, when multiple data versions are detected during a read request, Dynamo returns all these data versions along with their vector clocks to the client. Since the client must specify the version it is writing to during each write operation, reading multiple versions and then performing a write effectively resolves the conflict.
Dynamo's Workflow
This section explains Dynamo's workflow under normal conditions, as considering all edge cases would be overly complex.
First, Dynamo is a key-value database. Second, it uses consistent hashing for data partitioning. To ensure reliable data storage, Dynamo replicates each piece of data across N nodes. Here, N is a configurable parameter. When a data item's key is hashed and mapped onto the ring, the first N nodes encountered clockwise from that point will store the data.
Client read/write requests are typically routed to the first one among the N nodes to handle the request; this node is called the coordinator. For a read request, the coordinator sends requests to the N nodes. If it receives R responses, the coordinator checks if the data has conflicts, attempts conflict resolution, and returns the result to the client. Similarly, for a write request, the coordinator sends requests to the N nodes. If it receives W successful write acknowledgments, the coordinator returns a write success to the client. Here, W and R are also configurable parameters.
Thus, we have three parameters: (N, R, W). To ensure data consistency, we require $R + W > N$. We can understand this inequality as follows: let $S_R$ be the set of nodes corresponding to R and $S_W$ the set corresponding to W. Since only N nodes store a copy of the data, when $R+W > N$, it necessarily implies $S_R \cap S_W \ne \emptyset$. This means that during a read-after-write, there is always at least one node that handled both requests, enabling it to return the latest version of the data to the client. Similarly, the Raft algorithm requires a majority of nodes to persist a log entry before it can be committed because the sets of nodes forming a majority in two different rounds of consensus must have a non-empty intersection—i.e., there is always at least one node that persisted both log entries. Dynamo further decouples this notion of "majority" to enhance write availability.
Most applications can use a common configuration: setting (N, R, W) to (3, 2, 2). To guarantee "always writable", one would configure it as (N, R, 1). The challenge here lies in setting R: if R is set to N, read performance and availability suffer; if set to less than N, data consistency cannot be guaranteed, potentially losing updates. Therefore, by appropriately configuring (N, R, W), developers can tailor the performance, availability, and consistency of data storage to their business needs. This indicates that Dynamo is not merely an AP database; it allows users to make informed trade-offs between AP and CP, and between read and write operations.
Comparison of Partitioning Strategies
I previously thought that consistent hashing and fixed partitioning strategies (like Redis's hash slots) each had their pros and cons as different partitioning approaches. However, the Dynamo paper provides a comparison, showing that industry consensus on this matter was established long ago.
Consistent hashing ensures load balancing in storage and performs quite well in request load balancing. However, it has the following drawbacks: First, when a new node joins, its adjacent nodes need to scan their local databases to determine which data to migrate. This scanning process is very resource-intensive and may affect the nodes' normal service. If resource usage for scanning is limited, it could lead to a new node taking a long time to acquire the necessary data before it can begin working. Second, data archiving is difficult because it requires scanning the entire data space, which is also resource-intensive.
Fixed partitioning strategies, on the other hand, divide the entire data space into a fixed number of partitions Q. Algorithms are used to distribute these Q partitions as evenly as possible across all nodes. Experiments show that this strategy is superior in request load balancing and also addresses the shortcomings of consistent hashing. Each partition can be stored as a separate file, allowing direct file transfer during data migration and archiving. This significantly reduces resource consumption and operational complexity.
Furthermore, since every node in a Dynamo cluster stores global partition information, the space occupied by fixed partition information ($O(Q)$) is three orders of magnitude smaller than that occupied by consistent hashing partition information ($O(ST)$, where $S$ is the number of nodes in the cluster and $T$ is the number of virtual nodes per node). This is yet another advantage.
References
- Dynamo: Amazon's Highly Available Key-value Store
- LLM