Interesting bits from Fundamentals of Software Architecture

Architects should question axioms from the past: the software world changes fast, and some of those axioms may no longer hold. Hopefully the book gives examples. I used to worry about not knowing non-AWS technologies (for example, Kubernetes) or older technologies deeply enough; that's actually okay. I don't have more notes because I've dropped the book for now.

April 3, 2022

CRDTs and Operational Transform

CRDTs and Operational Transform (OT) algorithms both try to achieve the same purpose: supporting Google Docs style collaborative edits. OT requires a central server, whereas CRDTs allow decentralization. The latter matters when you just want to sync your mobile phone with your computer while both are offline: Google Docs doesn't support that. Automerge, a library implementation of CRDTs, uses a neat algorithm to ensure convergence: conflicts are unavoidable in some situations but, at a minimum, all parties should converge to one version. Automerge only provides the data structures; network transport is a separate concern and can vary. I think participants only exchange operations, not state. They arrive at a common state by applying their own and incoming operations, and the order in which operations are applied doesn't matter: the end state will be consistent.
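This isn't how Automerge resolves conflicts internally, but a minimal last-writer-wins register (the SetOp/LWWRegister names and the tie-break rule are my own illustration) shows the order-independence described above: because the conflict rule is deterministic, replicas that apply the same operations in different orders converge to the same state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SetOp:
    timestamp: int   # e.g., a Lamport clock value
    node_id: str
    value: str

class LWWRegister:
    def __init__(self):
        self.winner = None  # the SetOp currently winning

    def apply(self, op: SetOp):
        # An op wins if its (timestamp, node_id) pair is larger; the comparison
        # is deterministic, so the order of application is irrelevant.
        if self.winner is None or (op.timestamp, op.node_id) > (self.winner.timestamp, self.winner.node_id):
            self.winner = op

ops = [SetOp(1, "phone", "draft A"), SetOp(2, "laptop", "draft B")]

r1, r2 = LWWRegister(), LWWRegister()
for op in ops:            # one delivery order
    r1.apply(op)
for op in reversed(ops):  # the opposite delivery order
    r2.apply(op)
assert r1.winner == r2.winner == ops[1]   # both replicas converge to "draft B"
```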

February 8, 2022

Zookeeper and chain replication

Zookeeper: One of the most interesting parts of ZK is how its APIs were designed so that one system can act as a general-purpose coordination service. It supports hierarchy in its key structure, similar to how filesystems work. Two benefits: the hierarchy allows namespacing, so multiple unrelated systems can share a single ZK instance without stepping on each other's toes. It probably also allows fallback: e.g., if you don't find config for a node, keep going up to its parents until you find what you're looking for. However, the lecture doesn't talk about this, so it's just a hunch. The API is mostly CRUD operations, but with an interesting extra trick: watch points. For example, the "exists" API lets you atomically set a "watch" if the file you are looking for doesn't exist. Watch notifications are sent out in the order the watches were set (or something like that).

How to implement locks using ZK: one way is to let all clients try to create a file and add a watch if the file already exists. Whoever creates the file gets the lock; the others wait. When the file gets deleted (either explicitly by the client that created it, once it no longer needs the lock, or implicitly if that client dies), all remaining clients try again and the cycle repeats. This suffers from the herd problem, though. We can avoid the herd problem by letting all clients "get in line" and wait for their turn: each gets a sequence id, similar to a file version, for a particular file. A client then acquires the lock only once all clients that came before it have released their file versions (see the kazoo sketch at the end of this entry). You can also do optimistic concurrency control through ZK. It's good for storing small amounts of configuration/state: e.g., which hosts make up my service, which of them are alive, etc.

Chain replication: Seems like an interesting idea: you have a chain of servers, writes go to the head and are replicated one by one through the chain to the tail, and reads are served from the tail. Good parts: fully linearizable, so that's interesting. Also, failure handling (not necessarily failure detection) is somewhat straightforward because you can just remove the culprit host. The head has to do less work than a Raft leader. (The latter has to listen to write requests from clients and talk to 2 or 4 replicas.) Issues: split brain is possible: e.g., if there is a network partition between the head and its successor, both of them may think they are the head. So you need an external system such as Zookeeper to manage chain membership. Whenever ZK detects failures, it updates the chain to add or remove nodes. This is still hard: e.g., what happens if the head and its successor can both talk to ZK but not to each other? ZK might think they are both up and leave the chain as is. If one server in the chain is slow, it affects the entire chain; Raft, instead, can continue to make progress if one server is slow. CRAQ seems to be an improvement on chain replication, but I don't think the lecture explained how.
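Here's a hedged sketch of the "get in line" recipe using the kazoo Python client; the connection string, lock path, and node names are my own illustration, not from the lecture.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/locks/my-lock")

def acquire(on_acquired):
    # Each contender creates an ephemeral, sequential znode; ZK appends a
    # monotonically increasing, zero-padded sequence number to the name.
    me = zk.create("/locks/my-lock/contender-", b"", ephemeral=True, sequence=True)
    my_seq = me.rsplit("-", 1)[1]

    def check(_event=None):
        children = sorted(zk.get_children("/locks/my-lock"),
                          key=lambda c: c.rsplit("-", 1)[1])
        if children[0].rsplit("-", 1)[1] == my_seq:
            on_acquired()          # no one ahead of us: we hold the lock
        else:
            # Watch only the znode immediately ahead of us, so its deletion
            # wakes a single waiter instead of the whole herd.
            idx = [c.rsplit("-", 1)[1] for c in children].index(my_seq)
            prev_path = "/locks/my-lock/" + children[idx - 1]
            if zk.exists(prev_path, watch=check) is None:
                check()            # it vanished already; re-evaluate now

    check()
    return me  # delete this path (or let the session expire) to release
```

kazoo also ships a ready-made Lock recipe (kazoo.recipe.lock.Lock) that implements essentially this.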

February 8, 2022

Linearizability, Zookeeper etc.

Linearizability: Defined on the operation history as observed by clients, not on the system design itself. Even though a distributed database may have multiple copies of the data (because of, say, replication or caching), it should behave as if it had only one copy if it is to be linearizable. So operations on the data have a total order. Same as "strong consistency". This forbids serving reads from replicas; you can only serve them from the leader, otherwise clients may receive stale data. Kind of obvious.

Zookeeper: A full-blown system as opposed to a Raft-like library. Not fully linearizable, because reads are allowed to go to replicas, which may be stale. Guarantees: writes are globally linearizable. I guess this is similar to how an atomic variable behaves in Java, in that concurrent updates to it still result in an accurate value. All read or write requests from a single client are linearizable, irrespective of whether they go to different machines or whether some writes are sent asynchronously. ZK ensures this through the zxid it sends back for every request; the zxid seems similar to the latest-committed-log-index in Raft. Different clients may still see stale data (because their zxid might come from a stale replica), hence the system overall isn't linearizable: there is no global order. This is how Zookeeper speeds up reads, by allowing replicas to handle them; otherwise, a fully linearizable system has to use the leader for reads and so won't scale. If a client wants to read the globally latest data, it can apply a neat trick: send a dummy "write" request, called a "sync", and then a read tagged with that sync's zxid. I was thinking of a way to send the read request to the leader, but this approach seems better.
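A purely hypothetical sketch (not the real ZooKeeper client API) of how zxid tracking can enforce the per-client guarantee and the sync-then-read trick described above: a replica serves a read only once its own state has caught up to the zxid the client last saw.

```python
class Replica:
    def __init__(self):
        self.applied_zxid = 0
        self.store = {}

    def apply_committed_write(self, zxid, key, value):
        # Writes arrive from the leader in zxid order.
        self.store[key] = value
        self.applied_zxid = zxid

    def read(self, key, client_last_zxid):
        # Refuse to answer from a state older than what the client has seen.
        if self.applied_zxid < client_last_zxid:
            raise RuntimeError("replica is behind; wait or retry elsewhere")
        return self.store.get(key), self.applied_zxid

# Client side: remember the zxid returned by every request. To read the
# globally latest data, first issue a "sync" (a no-op write that goes through
# the leader) and use the zxid it returns for the subsequent read.
```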

January 25, 2022

Rate limiting

Anycast is a neat trick: the IP address you vend out to a client remains constant, but their requests go to the datacenter closest to them. That way, you only need to do rate limiting for them within a datacenter, not across datacenters; the latter would be hard because you'd have to replicate data in real time. There are various algorithms for local rate limiting; good ones seem to be token bucket, leaky bucket, and sliding window. Factors to evaluate these algorithms: whether they allow bursts, and how much state you need to track per client id (a token bucket sketch follows at the end of this entry). Distributed rate limiting: gossip between hosts vs. hosts sharing a single distributed cache: the difference probably is just that we don't have to handle replication ourselves when we use the latter; in other words, we offload replication to the cache instead of managing it via gossip. One issue with a distributed cache is that we need to ensure it scales with the service fleet size, so that a large service fleet doesn't overwhelm a small cache fleet. I find gossip difficult to visualize, and it involves a lot of chattiness. An alternative to gossip is to elect a coordinator/leader that reads data from all the hosts and updates them back with the combined numbers; such leader election may not have to be super accurate. Consistent hashing may be a good way to map a key to the host that stores its counters, so service hosts know which host, out of a cluster of rate trackers, to go to for a given client id. Probably think of consistent hashing whenever you want to deal with data placement.
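A minimal token bucket sketch, assuming a simple in-process limiter (class and parameter names are my own); the per-client state is just a token count and a refill timestamp, which is why this algorithm is cheap to track.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = burst           # bucket size = maximum allowed burst
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# e.g., 5 requests/sec steady state, bursts of up to 10
limiter = TokenBucket(rate_per_sec=5, burst=10)
print(limiter.allow())   # True while tokens remain
```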

January 5, 2022

Transactions, ACID etc.

D = durability. Easiest to understand. C = consistency, i.e., application-level constraints or invariants are met. It's a property of the application that's using the DB, not of the DB itself, so it's kind of useless here. A = atomicity or, better explained as, abortability: either all commands in a transaction go through or none of them do. I = isolation is the trickiest one.

Concurrent operations in a DB can lead to various race conditions. Dirty reads: reading something that's not yet committed. Dirty writes: transaction T1 writes on top of partial, not-yet-committed writes by another transaction T2. Read skew: say a long-running client reads a lot of records from a DB; if transactions keep changing records during its run, the state of the DB the client captures may not be consistent. Write skew: say a transaction checks a premise and then writes based on it; what happens if the premise is no longer valid by the time the write happens? (A small sketch of write skew is at the end of this entry.)

DB isolation levels: Read committed: doesn't allow dirty reads or writes. Snapshot isolation (or repeatable read): no read skew; you need to track transaction-level information per record to implement this. Serializable: no write skew. The strongest level, but it isn't cheap to implement, so it's not necessarily the default in all DBs.

Algorithms to implement serializability: literally serialize transactions on the DB, executing them one at a time; this can work if all transactions are guaranteed to be fast. 2PL: lock all rows that match the premise before updating them, i.e., pessimistic locking. SSI, i.e., serializable snapshot isolation: allow all transactions to run concurrently but don't let them commit if they'd introduce a conflict, i.e., optimistic. 2-phase commit and 2-phase locking (2PC and 2PL) are different things: you use 2PC to support transactions across multiple systems. (I still need to learn about those in more detail.) Open questions: ...
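A tiny illustration of write skew, using a plain Python dict as a stand-in for a database running below serializable isolation (the on-call example and names are my own): both "transactions" check the same premise and then write, but neither sees the other's pending write, so the invariant breaks.

```python
# Invariant the application wants: at least one doctor stays on call.
on_call = {"alice": True, "bob": True}

# T1 and T2 both read the premise first (snapshot reads)...
t1_premise_ok = sum(on_call.values()) >= 2   # True
t2_premise_ok = sum(on_call.values()) >= 2   # True

# ...then each writes based on its own (now stale) premise.
if t1_premise_ok:
    on_call["alice"] = False   # Alice goes off call
if t2_premise_ok:
    on_call["bob"] = False     # Bob goes off call

# The invariant is violated even though each transaction, taken alone,
# preserved it. Serializable isolation (2PL or SSI) prevents this.
assert sum(on_call.values()) == 0
```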

January 2, 2022

Today I learned - Raft

The problems of distributed state machines and distributed logs are essentially the same. There are 3 roles: leader, follower, and candidate. Only 2 RPC APIs in the algorithm: RequestVotes and AppendEntries (which also serves as heartbeats from leader to followers). Terms are tied to leadership changes, not to time; there can be terms without any leader because of election failures. Log indices advance with commands from clients, so multiple indices in sequence can have the same term. The leader is always right, so it can overwrite old log entries on other nodes; however, during election, the algorithm tries to elect a leader that is most up to date. Invariant: if, at an index, the term number is the same between two nodes, everything before that index is also the same. If followers find discrepancies during replication, they reject the AppendEntries call from the leader, and it's the leader's responsibility to retry with smaller indices. So, just those two APIs in the system (followers don't make any API calls to resolve discrepancies). During election, a node will vote for a candidate only if the candidate's log is at least as up to date as its own, checked by comparing 1) the term of the last log entry and then 2) the log length, in that order (a small sketch of this check follows). Two different majorities always contain at least one common member; while obvious, this has an important implication: if two different leaders are elected in sequence, at least one node is common between those that voted for them. Open questions: ...
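A small sketch of that voting check, with my own function and argument names:

```python
def candidate_log_is_up_to_date(candidate_last_term: int, candidate_last_index: int,
                                my_last_term: int, my_last_index: int) -> bool:
    # Raft's "at least as up to date" rule: compare the term of the last log
    # entry first; if the terms are equal, the longer log wins.
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term
    return candidate_last_index >= my_last_index

# A voter grants its vote only if this returns True (and it hasn't already
# voted for someone else in the current term).
```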

December 20, 2021

Today I learned - consensus algorithms

Paxos: Paxos is notorious for being difficult to understand. However, the vanilla version seems somewhat straightforward, albeit with one limitation: it only lets you reach consensus on one value. For practical purposes, you want to reach consensus on multiple values in sequence, similar to a replicated log, and you need multi-Paxos for that. And that is hard. Think of basic Paxos as asking "do we already know where we'll go for dinner" and, if not, proposing "let's go for burgers". Three roles: proposers, acceptors, and learners, with an odd number of acceptors. While both proposers and acceptors use proposal numbers in the algorithm, those are just mechanisms to reach consensus on one value. Once consensus is reached, they can continue using more numbers, but that value will never change. So one run of the algorithm gets you one consensus value; if you want more, use multi-Paxos. The algorithm, the way I understand it, is as follows: ...
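Separately from the step-by-step description above, here is a hedged sketch of what a basic Paxos acceptor's two phases can look like (class, method, and message names are my own illustration, not from these notes):

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised so far
        self.accepted_n = -1        # highest proposal number accepted so far
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1: promise not to accept proposals numbered below n,
        # and report any value already accepted.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def on_accept(self, n, value):
        # Phase 2: accept unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject",)
```

On the proposer side, if any promise in phase 1 reports an already-accepted value, the proposer must propose that value in phase 2; that is what makes a chosen value stick even as higher proposal numbers keep arriving.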

December 19, 2021

Today I learned

I realized an important thing while reading the Remote Inc. book: I should focus on impact or outcomes, not on other small stuff such as responding to Slack, the number of hours worked, etc. No one remembers the latter after a while; only impact matters. Not so much something I learned as something I remembered: perfectionism sucks. After a point, the people who evaluate my work won't notice the tiny details that make it perfect (in my mind). In many cases, getting the work done and creating impact is more important than doing it perfectly.

September 4, 2021

Today I learned

PyTorch: You can initialize parameters through the nn.init module. The weight decay parameter is supplied to the optimizer (i.e., to classes in torch.optim), not to the loss function.
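A short sketch of both points (layer shapes and hyperparameters are arbitrary):

```python
import torch
from torch import nn

layer = nn.Linear(10, 1)
nn.init.xavier_uniform_(layer.weight)   # parameter initialization via nn.init
nn.init.zeros_(layer.bias)

# Weight decay is an optimizer argument, not part of the loss function.
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01, weight_decay=1e-4)
```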

November 20, 2020