
Aurora DB, quorums etc.

  • EBS seems to implement chain replication with just 2 copies. How does it recover if one of those servers dies?
  • EBS volumes aren’t meant to be shared across instances. While obvious once stated, I imagine this leads to some interesting design trade-offs for EBS.
  • Problems with EBS when used for a DB:
    • Too much network I/O.
    • Both replicas are stored in the same AZ, so no resiliency to AZ failures.
  • How to use a relational DB in the cloud?
    • One way is to just install it on EC2 with EBS. However, since EBS is a zonal service, the arrangement isn’t resilient to AZ failures.
    • You could do it RDS style: replicate the DB in 2 (or more?) AZs and have the service manage the replication on the customer’s behalf. However, MySQL writes B-tree pages, which are large (think 8 KB) and therefore slow to ship over the network. (It also writes entries to its write-ahead log, but those are small and unavoidable.) See the sketch after this list for a rough sense of the difference.
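
A rough, back-of-the-envelope sketch of why shipping full pages costs so much more network I/O than shipping log records. All the numbers here (page size, log-record size, pages touched per transaction) are illustrative assumptions, not measured MySQL or Aurora figures.

```python
# Write amplification when replicating full B-tree pages vs. only WAL records.
# Every number below is an illustrative guess, not a measured figure.

PAGE_SIZE = 8 * 1024        # assumed B-tree page size (bytes)
LOG_RECORD_SIZE = 100       # assumed size of one write-ahead-log record (bytes)
PAGES_TOUCHED = 4           # assumed pages dirtied by one small transaction
REPLICAS = 6                # Aurora-style replica count (2 per AZ x 3 AZs)

bytes_for_pages = PAGES_TOUCHED * PAGE_SIZE * REPLICAS
bytes_for_log = PAGES_TOUCHED * LOG_RECORD_SIZE * REPLICAS

print(f"shipping pages: {bytes_for_pages} bytes per txn")   # 196608
print(f"shipping log:   {bytes_for_log} bytes per txn")     # 2400
print(f"ratio:          {bytes_for_pages / bytes_for_log:.0f}x")
```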

Quorums:

  • N replicas; W and R are the number of servers you wait for on writes and reads respectively. Require R + W = N + 1 (more generally, R + W > N). This ensures that any read set overlaps any write set in at least 1 server.
  • While reading, how do you decide which value is the latest? Majority voting doesn’t work because the latest value could be on just 1 server. So, use versions: each write carries a version number and the reader keeps the highest-versioned response. Versioning is easy if all writes go through 1 server but must be hard if that’s not the case. (A minimal sketch follows this list.)
  • Tweak W and R to optimize for latency and fault tolerance.
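
A minimal sketch of the versioned quorum read/write idea, assuming a single writer that assigns monotonically increasing version numbers. The class and function names are made up for illustration; this is not any real system’s protocol.

```python
# Quorum read/write with versioned values. Assumes a single writer that
# assigns increasing version numbers; illustrates the R + W > N overlap.

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

    def write(self, version, value):
        if version > self.version:           # ignore stale/duplicate writes
            self.version, self.value = version, value

    def read(self):
        return self.version, self.value


N, W, R = 6, 4, 3                            # Aurora-style sizes; R + W > N
replicas = [Replica() for _ in range(N)]
next_version = 1

def quorum_write(value, targets):
    """Succeeds once W replicas acknowledge; here we simply write to W of them."""
    global next_version
    for r in targets[:W]:
        r.write(next_version, value)
    next_version += 1

def quorum_read(targets):
    """Read R replicas and keep the response with the highest version."""
    responses = [r.read() for r in targets[:R]]
    return max(responses, key=lambda vr: vr[0])

quorum_write("v1", replicas)        # lands on replicas 0..3
print(quorum_read(replicas[3:]))    # reads replicas 3..5 -> (1, 'v1') via the overlap at replica 3
```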

Aurora:

  • 6 replicas (2 per AZ across 3 AZs); only log entries are transferred, not full data pages (so far less data transfer), and quorums are used for replication.
    • W=4 and R=3; they chose these numbers based on their availability goals. So, if 3 replicas are down, writes can resume only once a new replica is rebuilt from the remaining 3 (reads can still proceed, since R=3).
    • Writes can still happen if 1 whole AZ (2 replicas) is down, since 4 replicas remain. (The arithmetic is sketched after this list.)
  • Database and storage servers are separate. I guess that has two advantages: compute and storage can scale independently, and it lets them offer serverless options to customers.
  • In normal operation, the DB server knows which storage servers are up to date, so it can read from one of those without needing a read quorum. However, quorum reads are needed during crash recovery of the DB server.
  • Writes ship log entries to the replicas, but reads fetch data pages (from the B-tree, or something like it, that the storage servers maintain).
  • Data is stored in 10 GB partitions. So, if your database is huge, its data will be sharded into 10 GB chunks across the storage fleet.
  • If a storage server dies, all partitions that were hosted on it need to be re-replicated. That happens in parallel across many storage servers in the fleet, so recovery is quick.
  • DB servers are also replicated. It seems writes are still managed by 1 of them (so that versioning happens correctly), but reads can be served by the replicas.
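
A quick check of the quorum arithmetic above: a sketch of the failure-tolerance reasoning under the stated N=6, W=4, R=3 configuration, not Aurora’s actual implementation.

```python
# Availability of an Aurora-style quorum (N=6: 2 replicas per AZ x 3 AZs).

N, W, R = 6, 4, 3
assert R + W > N                     # read and write quorums always overlap

def available(replicas_down):
    alive = N - replicas_down
    return {"writes": alive >= W, "reads": alive >= R}

print(available(replicas_down=2))    # one whole AZ down -> writes and reads both OK
print(available(replicas_down=3))    # an AZ plus 1 more -> reads OK, writes blocked
                                     # until a replacement replica is rebuilt
```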

Open questions:

  • How to choose between leader election and quorums?
  • Unclear whether “commit” is a separate command sent to the storage servers. If it is, the system looks similar to Raft. But probably it isn’t: in regular operation, the system might keep retrying until all replicas receive a command, and only during crash recovery does it need to issue undos.