How do you rebalance partitions (apart from the simple concern of whether to do so automatically or manually)? Conceptually, there are 4 ways:
“Hash % n”: not a good idea, because changing n forces most keys to map to a different node, so rebalancing ends up moving data across nearly the whole cluster.
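A quick back-of-the-envelope check (a toy sketch, not any particular system's code) shows the scale of the problem: going from n = 10 to n = 11 nodes remaps roughly 90% of keys.

```python
# Toy illustration: fraction of keys whose "hash % n" node changes
# when the node count goes from 10 to 11.
keys = range(100_000)  # stand-in for real keys
moved = sum(1 for k in keys if hash(k) % 10 != hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys would move")  # ~91%
```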
Fix the number of partitions upfront:
Similar to what Elasticsearch does. For example, with 10 nodes you might choose to create 1000 partitions upfront.
The size of each partition grows proportionally with the data size.
You need to get the number of partitions just right, and that's hard. If you choose a very high number, you end up with lots of tiny partitions and a high overhead of managing them. If you choose too low a number, each partition gets large and moving them around becomes expensive.
Things get complicated when your dataset starts small but grows large over time (a sketch of the core idea follows this list).
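A minimal sketch of the fixed-partition scheme, under assumed names (this is not Elasticsearch's actual implementation): the key → partition mapping is frozen forever, and rebalancing only rewrites the partition → node table.

```python
NUM_PARTITIONS = 1000  # fixed upfront, much larger than the node count

def partition_for(key: str) -> int:
    # Stable: NUM_PARTITIONS never changes, so a key never re-hashes.
    return hash(key) % NUM_PARTITIONS

# partition -> node table; rebalancing mutates only this
assignment = {p: p % 10 for p in range(NUM_PARTITIONS)}  # 10 nodes to start

def add_node(new_node: int, total_nodes: int) -> None:
    # Hand the newcomer a fair share of partitions; only these whole
    # partitions move over the network, and every key keeps its partition.
    share = NUM_PARTITIONS // total_nodes
    for p in list(assignment)[:share]:
        assignment[p] = new_node

def node_for(key: str) -> int:
    return assignment[partition_for(key)]
```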
Fix the size of every partition = dynamic partitioning:
Similar to how DynamoDB maintains an upper cap of 10 GB per partition.
Requires splitting or merging partitions as they grow or shrink in size.
Number of partitions grows proportionally to data size.
Pre-splitting is a good idea when you are just starting: otherwise, if you begin with a single partition and wait for it to split once it hits the 10 GB (or whatever) threshold, only one node does any work while the rest sit idle. A sketch of the split logic follows.
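A sketch of the split path, with hypothetical names (`median_key` is an assumed helper; this is not DynamoDB's actual code):

```python
MAX_PARTITION_BYTES = 10 * 1024**3  # e.g. DynamoDB's 10 GB cap

class Partition:
    def __init__(self, lo: str, hi: str):
        self.lo, self.hi = lo, hi  # this partition owns keys in [lo, hi)
        self.size_bytes = 0

    def maybe_split(self) -> list["Partition"]:
        if self.size_bytes <= MAX_PARTITION_BYTES:
            return [self]
        mid = median_key(self)  # hypothetical helper: key that halves the data
        left, right = Partition(self.lo, mid), Partition(mid, self.hi)
        left.size_bytes = right.size_bytes = self.size_bytes // 2
        return [left, right]
```

Merging is the mirror image: two adjacent ranges whose combined size falls below some lower threshold get folded back into one partition.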
Partitioning proportionally to nodes:
In the fixed-count and dynamic schemes above, the number of partitions is independent of the number of nodes. In this approach, however, we fix the number of partitions per node, say P.
Say there are X nodes in the system at some point. While the node count stays at X, each of the X·P partitions grows proportionally with the data size. If we add Y nodes, partition sizes shrink back down, since the same data now spreads over (X+Y)·P partitions.
Each of the Y new nodes picks P existing partitions at random, splits each one, and takes over half of its data.
Picking partitions at random can produce unfair splits in the short term, but averaged over a large number of partitions, the load each node carries evens out (a sketch follows).
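A sketch under assumed names (not any real system's code) of what a node join looks like in this scheme:

```python
import random

P = 4  # fixed number of partitions per node

class Partition:
    def __init__(self, size_bytes: int):
        self.size_bytes = size_bytes

def join_new_node(cluster_partitions: list[Partition]) -> list[Partition]:
    # The joining node splits P randomly chosen partitions and keeps one
    # half of each; the victims shrink in place rather than moving wholesale.
    stolen = []
    for victim in random.sample(cluster_partitions, P):
        victim.size_bytes //= 2
        stolen.append(Partition(victim.size_bytes))
    cluster_partitions.extend(stolen)
    return stolen  # the new node's P partitions
```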
Request routing:
How are client requests routed to the right node for a given key?
The client connects to any node; if that node doesn't own the key, it forwards the request to the node that does (sketched after this list).
IIRC, this is how Elasticsearch behaves.
The client sends the request to a load-balancer-style “request router”. This is how DynamoDB behaves.
The client itself knows which node is right and sends the request to it directly.
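A sketch of the first option, reusing `node_for` from the fixed-partition sketch above; `local_read` and `forward_to` are hypothetical stand-ins for the storage engine and the inter-node RPC:

```python
def handle_get(self_node: int, key: str) -> bytes:
    owner = node_for(key)          # consult the partition -> node map
    if owner == self_node:
        return local_read(key)     # hypothetical: read from local storage
    return forward_to(owner, key)  # hypothetical: RPC to the owning node
```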
Either way, how do you keep track of the key → partition → node mapping?
Similar to the “service discovery” problem: ZooKeeper is a good solution.
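For example, a request router could keep its partition map fresh by watching a znode, here using the kazoo client for Python; the znode path and the JSON payload format are my assumptions, not a standard layout:

```python
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

routing_table: dict[int, str] = {}  # partition id -> node address

@zk.DataWatch("/cluster/partition-assignment")  # assumed znode path
def on_assignment_change(data, stat):
    # ZooKeeper invokes this on every update, so the router's view of the
    # mapping is pushed to it rather than polled.
    if data:
        fresh = {int(p): node for p, node in json.loads(data).items()}
        routing_table.clear()
        routing_table.update(fresh)
```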
What I find interesting is this: the ideas behind partitioning (and possibly other distributed-systems concepts) become conceptually straightforward once we introduce a low-level coordination service like ZooKeeper.
If ZK weren’t available, I would try to design a partition-management service that pushes this metadata to the request routers. I’d also consider consistent hashing to partition and replicate the metadata itself, to further increase availability and fault tolerance.
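For the consistent-hashing part, a minimal ring sketch (a toy, with virtual nodes to smooth the distribution) that such a hypothetical metadata service could use to decide which replica owns a piece of metadata; adding or removing a node only remaps the keys between it and its neighbours on the ring:

```python
import bisect
import hashlib

class Ring:
    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Each node appears at `vnodes` points on the ring for smoothness.
        self._points = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def owner(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._points[i][1]
```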