One of the last ongoing activities of Summer of Shipping is a book club, where we meet weekly to work through the excellent book - Designing Data Intensive Applications.
In my opinion, this book is the gold standard for technical writing. Martin Kleppmann commands an expert level understanding of data architectures, yet delivers that expertise in a way that any engineer can follow. This is a great accomplishment with a tremendous impact - it makes the field of databases and distributed systems "legible" for many.
While I love the book - it's still a book. The medium is the message and it's not meant to be glanced at for quick recall. Thus, with the help of the SoS cohort, we've provided an accompaniment. This effort is ongoing, but the goal here is to provide a glossary of "hard to remember" terms paired with "easy to grasp" explanations.
Chapter 5 - Replication
Single Leader Replication
- Semi-synchronous replication - A configuration of replicated databases where there is one synchronous follower, and all others are asynchronous. This means there is at least one guaranteed in sync follower, without a terrible performance penalty. It's a popular configuration
- Failover - the process of picking a new leader once a leader has failed. Complications may arise, such as the old leader rejoining after the new leader is chosen (in which case, the old leader's stale data is discarded, and starts following) and split brain (where multiple nodes think they're the leader. One of them eventually wins out)
Types of Replication Logs
- Statement Based replication - In this form of replication, statements are streamed to followers as is. It is conceptually easy and compact, but problems may arise when statements use things like rand() and now() and autoincrement.
- Write ahead log replication - this form of replication is done by sending the on-disk representation deltas to followers (recall that the write ahead log is used in crash recovery purposes). This may be a problem if upgrading database versions, but the greater principle is that by coupling the replication log with its representation, you can encounter difficulties (as you loose a degree of flexibility)
- Logical log (or row based) replication - describes the exact changes that are happening on each row. Avoids the problems of statement based (i.e. now and rand) and write ahead (coupling). Similar to write ahead log, this can result in a lot of data streamed to followers
- Trigger based replication - basically running some code to run some custom replication logic
Replication Lag
- Replication lag - the phenomenon where a follower is significantly behind the leader, such that someone who reads from the follower might observe a database that is in an old state.
- Read-after-write Consistency - when you write to a database you would expect the change to be there upon subsequent reads. This consistency makes that guarantee
- Monotonic Reads - If several replicas are utilized, we can encounter a situation where some replicas are more behind than others. In such cases, read requests from replica to replica will look like we’re jumping around in time (possibly back in time). Monotonic reads are a guarantee which prevent such jumping around in time
- Consistent Prefix Reads - If there are two requests A and B, and B depends upon A, then we want to make sure no one sees B, unless they can also see A. This is a causal relationship, and the term “consistent prefix reads” simply means this causal ordering is maintained.
Leaderless Replication
- Read repair - When a client reads a value from multiple nodes and detects a stale value from one of the nodes, they will then write back the correct value to the node that contained the wrong value.
- Anti-entropy process - a background daemon that runs on some datastores that will persistently look for differences in data between replicas and restore them to a consistent state.
- Quorum - a subset of a cluster that must be in agreement for a decision to be recognized. For Leaderless systems (the Dynamo family of databases), this is w + r > n
- w + r > n - this inequality is a rather sneaky way to guarantee fresh reads. It states that if there are n nodes in a cluster, then the number of nodes that must be written to is w, and the nodes that must be read from is r. If this inequality holds, then ever read will contain at least one of the fresh writes due to an inevitable overlap. Clever
- Last write wins - a very destructive way to handle concurrent writes, but is conceptually easy to remember as the name describes it perfectly: the last write to the DB should win. Used by Cassandra and optionally by Riak. (Note on Cassandra: Because Cassandra uses this strategy, it is recommended to make each write operation to Cassandra with a UUID and treat the entry as immutable. This is similar to a log, or event sourcing)
- "Happens Before" relationships - if we handle concurrent writes with care, we need to keep track of happens before relationships. This is typically done with version vectors
Chapter 6 - Partitioning
- Hot Spot - a partition receiving a unfairly high amount of load, e.g. it receives much more queries than other partitions. The presence of hot spots are antithetical to the goal of partitioning, which is to evenly distribute data and query load across a set of nodes.
- Secondary Index - an index that utilizes some other information other than the primary key. The keys in a secondary index are not necessarily unique, unlike a primary index.
- Document-partitioned Secondary Index - each partition maintains its own secondary index for its own data. Writing to the database in this case only requires that you interact with the partition containing the document you are writing to; however this complicates read queries as you must perform scatter/gather
- Scatter/gather - involves sending a particular query to all partitions (the scatter) and combining the returned results (the gather). This method can make read queries on secondary indices expensive.
- Term-partitioned Secondary Index - Instead of each partition maintaining its own secondary index, a global index is created covering the data in all partitions and is itself partitioned by term (or hash of the term). This makes reads more efficient, but writing to a document may affect different partitions of the global index.
A note on "rebalancing partitions"
Instead of thinking of this as rebalancing partitions, a better way is to think of mapping keys to partitions, and then partitions to nodes.
The idea here is to create the partition abstraction that is NOT 1-1 to nodes. This provides flexibility.
We can define partitions as ranges over the keys (if they’re naturally ranges) or over the hashes of keys (if we’re using hashed keys). Mod N is bad because changes to N cause a lot of partition changes. Once we’ve sliced out P partitions, we can assign them to the N nodes. Generally, P >> N and the assignment of the P partitions to the N nodes is kept in some consensus system. More capable machines can own more partitions than less capable ones and addition of new machines can “steal” partitions from other machines.
Partitions can be dynamically determined too. If there are more nodes, we can choose strategies (e.g. smaller ranges) so that there are more partitions and keep partitions small.
A note on routing to the correct partition
Supposing we have selected a partitioning scheme (e.g. hash or range keys), how things are routed depends on where the routing information can be:
- with the client
- with some intermediary
- or with the final destination
If the client has full awareness of the destinations, it can send requests directly to the correct destination. Sometimes, there are many clients or the clients don’t have full visibility into the destinations. An intermediary like a routing proxy could route to the right place. Otherwise, destinations can function as intermediaries when they receive requests destined for a partition that they’re not responsible for. Some systems are a hybrid of these routing strategies. In all cases, the router must be aware of the possible destinations or at least the next hop and that may require use of a membership system that is typically a consensus system.