How do you rebalance partitions (apart from the simple concern of whether to do so automatically or manually)? Conceptually, there are 4 ways:
“Hash % n”: not a good idea, because changing n forces most keys to map to a different node, so rebalancing ends up moving data across nearly the whole cluster.
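A quick back-of-the-envelope check (a toy sketch, not any particular system's code) shows the scale of the problem: going from n = 10 to n = 11 nodes remaps roughly 90% of keys.

```python
# Toy illustration: fraction of keys whose "hash % n" node changes
# when the node count goes from 10 to 11.
keys = range(100_000)  # stand-in for real keys
moved = sum(1 for k in keys if hash(k) % 10 != hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys would move")  # ~91%
```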
Fix the number of partitions upfront:
Similar to what Elasticsearch does. For example, with 10 nodes you might choose to create 1000 partitions upfront.
The size of each partition grows proportionally with the data size.
You need to get the number of partitions just right, and that's hard. If you choose a very high number, you end up with lots of tiny partitions and a high overhead of managing them. If you choose too low a number, each partition gets large and moving them around becomes expensive.
Things get complicated when your dataset starts small but grows large over time (a sketch of the core idea follows this list).
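A minimal sketch of the fixed-partition scheme, under assumed names (this is not Elasticsearch's actual implementation): the key → partition mapping is frozen forever, and rebalancing only rewrites the partition → node table.

```python
NUM_PARTITIONS = 1000  # fixed upfront, much larger than the node count

def partition_for(key: str) -> int:
    # Stable: NUM_PARTITIONS never changes, so a key never re-hashes.
    return hash(key) % NUM_PARTITIONS

# partition -> node table; rebalancing mutates only this
assignment = {p: p % 10 for p in range(NUM_PARTITIONS)}  # 10 nodes to start

def add_node(new_node: int, total_nodes: int) -> None:
    # Hand the newcomer a fair share of partitions; only these whole
    # partitions move over the network, and every key keeps its partition.
    share = NUM_PARTITIONS // total_nodes
    for p in list(assignment)[:share]:
        assignment[p] = new_node

def node_for(key: str) -> int:
    return assignment[partition_for(key)]
```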
Fix the size of every partition = dynamic partitioning:
Similar to how DynamoDB maintains an upper cap of 10 GB per partition.
Requires splitting or merging partitions as they grow or shrink in size.
Number of partitions grows proportionally to data size.
Pre-splitting is a good idea when you are just starting: otherwise, if you begin with a single partition and wait for it to split once it hits the 10 GB (or whatever) threshold, only one node does any work while the rest sit idle. A sketch of the split logic follows.
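A sketch of the split path, with hypothetical names (`median_key` is an assumed helper; this is not DynamoDB's actual code):

```python
MAX_PARTITION_BYTES = 10 * 1024**3  # e.g. DynamoDB's 10 GB cap

class Partition:
    def __init__(self, lo: str, hi: str):
        self.lo, self.hi = lo, hi  # this partition owns keys in [lo, hi)
        self.size_bytes = 0

    def maybe_split(self) -> list["Partition"]:
        if self.size_bytes <= MAX_PARTITION_BYTES:
            return [self]
        mid = median_key(self)  # hypothetical helper: key that halves the data
        left, right = Partition(self.lo, mid), Partition(mid, self.hi)
        left.size_bytes = right.size_bytes = self.size_bytes // 2
        return [left, right]
```

Merging is the mirror image: two adjacent ranges whose combined size falls below some lower threshold get folded back into one partition.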
Partitioning proportionally to nodes:
In the fixed-count and dynamic schemes above, the number of partitions is independent of the number of nodes. In this approach, however, we fix the number of partitions per node, say P.
Say there are X nodes in the system at some point. While the node count stays at X, each of the X·P partitions grows proportionally with the data size. If we add Y nodes, partition sizes shrink back down, since the same data now spreads over (X+Y)·P partitions.
Each of the Y new nodes picks P existing partitions at random, splits each one, and takes over half of its data.
Picking partitions at random can produce unfair splits in the short term, but averaged over a large number of partitions, the load each node carries evens out (a sketch follows).
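A sketch under assumed names (not any real system's code) of what a node join looks like in this scheme:

```python
import random

P = 4  # fixed number of partitions per node

class Partition:
    def __init__(self, size_bytes: int):
        self.size_bytes = size_bytes

def join_new_node(cluster_partitions: list[Partition]) -> list[Partition]:
    # The joining node splits P randomly chosen partitions and keeps one
    # half of each; the victims shrink in place rather than moving wholesale.
    stolen = []
    for victim in random.sample(cluster_partitions, P):
        victim.size_bytes //= 2
        stolen.append(Partition(victim.size_bytes))
    cluster_partitions.extend(stolen)
    return stolen  # the new node's P partitions
```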
Request routing:
How are client requests routed to the right node for a given key?
The client connects to any node; if that node doesn't own the key, it forwards the request to the node that does (sketched after this list).
IIRC, this is how Elasticsearch behaves.
The client sends the request to a load-balancer-style “request router”. This is how DynamoDB behaves.
The client itself knows which node is right and sends the request to it directly.
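A sketch of the first option, reusing `node_for` from the fixed-partition sketch above; `local_read` and `forward_to` are hypothetical stand-ins for the storage engine and the inter-node RPC:

```python
def handle_get(self_node: int, key: str) -> bytes:
    owner = node_for(key)          # consult the partition -> node map
    if owner == self_node:
        return local_read(key)     # hypothetical: read from local storage
    return forward_to(owner, key)  # hypothetical: RPC to the owning node
```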
Either way, how do you keep track of the key → partition → node mapping?
Similar to the “service discovery” problem: ZooKeeper is a good solution.
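For example, a request router could keep its partition map fresh by watching a znode, here using the kazoo client for Python; the znode path and the JSON payload format are my assumptions, not a standard layout:

```python
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

routing_table: dict[int, str] = {}  # partition id -> node address

@zk.DataWatch("/cluster/partition-assignment")  # assumed znode path
def on_assignment_change(data, stat):
    # ZooKeeper invokes this on every update, so the router's view of the
    # mapping is pushed to it rather than polled.
    if data:
        fresh = {int(p): node for p, node in json.loads(data).items()}
        routing_table.clear()
        routing_table.update(fresh)
```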
What I find interesting is this: the ideas behind partitioning (and possibly other distributed-systems concepts) become conceptually straightforward once we introduce a low-level coordination service like ZooKeeper.
If ZK weren’t available, I would try to design a partition-management service that pushes this metadata to the request routers. I’d also consider consistent hashing to partition and replicate the metadata itself, to further increase availability and fault tolerance.
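For the consistent-hashing part, a minimal ring sketch (a toy, with virtual nodes to smooth the distribution) that such a hypothetical metadata service could use to decide which replica owns a piece of metadata; adding or removing a node only remaps the keys between it and its neighbours on the ring:

```python
import bisect
import hashlib

class Ring:
    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Each node appears at `vnodes` points on the ring for smoothness.
        self._points = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def owner(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._points[i][1]
```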