
Probabilistic data structures

  • Bloom filters.
    • Set membership.
    • 1D bit array with k hash functions. (A minimal sketch follows the Bloom filter bullets.)
    • False positives possible but no false negatives.
    • Applications:
      • One-hit wonders can take up to 75% of cache space. A BF can help identify them so that you can skip caching such items and save cache space.
      • Check for weak passwords, malicious URLs, usernames, etc. If you want 100% accuracy, check the BF first and fall back to the real data store as required.
      • In a NoSQL-style database, check the BF for whether an item exists before going to disk, thereby reducing disk accesses.
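
A minimal Bloom filter sketch, assuming a fixed bit-array size, k positions derived from seeded SHA-256, and string items; the class name, sizes, and hashing scheme are illustrative choices, not from the notes above.

```python
import hashlib


class BloomFilter:
    """1D bit array + k hash functions: false positives possible, no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # the 1D bit array

    def _positions(self, item: str):
        # Derive k bit positions by hashing the item with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("malicious.example.com")
print(bf.might_contain("malicious.example.com"))  # True
print(bf.might_contain("safe.example.com"))       # False (with high probability)
```

In the malicious-URL use case, a False answer lets you skip the authoritative lookup entirely; a True answer still needs it to rule out a false positive.
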
  • Count-min sketch.
    • Estimate frequency of all elements in a set.
    • A 2D array with R rows and C columns. Each row corresponds to one hash function, so there are R hash functions, and each function maps an item to a column index between 0 and C-1. (A minimal sketch follows the count-min sketch bullets.)
    • Once all elements in the stream have been added, how do you find the ones with the highest frequency?
      • If you have the full set of distinct elements, you can make another pass over it, query the sketch for each element, and keep track of the most frequent ones.
      • A better solution is to keep a separate min-heap of size k alongside the sketch.
        • While adding an element to the sketch, compare its updated frequency estimate with the minimum element in the heap.
        • If the new frequency is higher than the minimum in the heap, remove the minimum and add the current element.
        • Remember: a min-heap helps to find the top-k elements, and a max-heap the bottom-k.
    • Can over-estimate, never under-estimate (hash collisions only ever inflate the counters).
    • Applications:
      • Top-k elements in a stream (e.g., trending products).
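
A minimal count-min sketch plus top-k tracker along the lines described above: an R x C counter table with one seeded hash function per row, and a size-k min-heap updated as elements stream in. The class and function names, table dimensions, SHA-256 hashing, and the lazy re-heapify when a tracked item's estimate changes are all illustrative assumptions.

```python
import hashlib
import heapq


class CountMinSketch:
    """R x C counter table; estimates can over-count, never under-count."""

    def __init__(self, rows: int = 4, cols: int = 256):
        self.rows = rows
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]

    def _col(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.cols

    def add(self, item: str) -> None:
        for row in range(self.rows):
            self.table[row][self._col(row, item)] += 1

    def estimate(self, item: str) -> int:
        # Minimum across rows: collisions only ever inflate counters.
        return min(self.table[row][self._col(row, item)] for row in range(self.rows))


def top_k(stream, k: int = 3):
    """One pass over the stream, keeping a size-k min-heap of (estimate, item)."""
    cms = CountMinSketch()
    heap: list = []     # min-heap of [estimate, item] entries
    entries: dict = {}  # item -> its heap entry, so estimates can be refreshed
    for item in stream:
        cms.add(item)
        est = cms.estimate(item)
        if item in entries:
            entries[item][0] = est  # refresh the stored estimate...
            heapq.heapify(heap)     # ...and restore the heap property
        elif len(heap) < k:
            entries[item] = [est, item]
            heapq.heappush(heap, entries[item])
        elif est > heap[0][0]:
            # New estimate beats the heap's minimum: evict it, insert this item.
            _, evicted = heapq.heappop(heap)
            del entries[evicted]
            entries[item] = [est, item]
            heapq.heappush(heap, entries[item])
    return sorted(((item, est) for est, item in heap), key=lambda pair: -pair[1])


stream = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "c"]
print(top_k(stream, k=2))  # [('a', 4), ('b', 3)]; estimates may be inflated by collisions
```

The heap makes the top-k answer available at any point in the stream for O(k) extra memory, at the cost of an O(k) re-heapify whenever a tracked item's estimate is refreshed.
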
  • LogLog.
    • Cardinality estimation.
      • I was thinking one way to estimate cardinality would be to use a Bloom filter in conjunction with a counter: when a new item comes in, check if it already exists in the filter and, if it doesn't, add it and increase the counter by 1.
        • Should probably work but I could be missing some edge case(s).
    • Mental model:
      • Let’s say you hash items such that the binary representation of the result is an X-bit number. (X is a design parameter - up to you.) If the hash function is truly random, the bits should be uniformly distributed in the result.
      • If the cardinality of the dataset is N, we expect about N/2 hashes to look like 1XXX…, N/4 like 01XXX…, N/8 like 001XXX…, etc.
      • If the maximum position of the first 1 bit across all hashes is k (counting from the start), then roughly N/2^k hashes have their first 1 at position k, and for the maximum observed position that count is approximately 1. So, the cardinality is approximately 2^k.
      • Randomness can cause a large variance in this estimate. There are two ways to reduce it:
        • Multiple hash functions:
          • Apply multiple hash functions; each gives its own cardinality estimate. The final result is the average of all of these.
          • Expensive.
        • Buckets with single hash function:
          • Use the first Y bits (out of X) to bucketize the results. So, Y == 10 results in 2^10 = 1024 buckets.
          • Use the remaining X-Y bits per the original idea. So, every bucket gives its own max bit position and hence its own cardinality estimate.
          • Combine the bucket-level estimates and scale by 2^Y to get the final result. (LogLog specifically averages the max bit positions - a geometric mean of the per-bucket estimates - and applies a bias-correction constant, since a plain arithmetic mean is skewed by outlier buckets.)
      • There is nothing special about choosing 00…1XX… (i.e. leading zeros). You are free to choose XX…100… (i.e. zeros at the end), 11…0XX… (i.e. leading 1s instead of 0s) etc.
    • HyperLogLog is a variant of LogLog that reduces the error rate, but the core idea is the same. (A sketch of the bucketed estimator follows below.)
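
A minimal LogLog-style estimator following the bucketed scheme above: the first Y bits of each hash pick a bucket, each bucket keeps the maximum position of the first 1 bit in the remaining X - Y bits, and the buckets are combined with a geometric mean and LogLog's bias-correction constant (about 0.397). The function name, X = 32, Y = 10, SHA-256 hashing, and the handling of an all-zero remainder are illustrative assumptions.

```python
import hashlib


def estimate_cardinality(items, x_bits: int = 32, y_bits: int = 10) -> float:
    """LogLog-style estimate: bucket on the first Y bits, track the max first-1-bit position."""
    num_buckets = 1 << y_bits    # 2^Y buckets
    max_pos = [0] * num_buckets  # per bucket: max position of the first 1 bit

    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big") & ((1 << x_bits) - 1)  # X-bit hash
        bucket = h >> (x_bits - y_bits)            # first Y bits pick the bucket
        rest = h & ((1 << (x_bits - y_bits)) - 1)  # remaining X - Y bits
        width = x_bits - y_bits
        # Position of the first 1 bit, counted from the most significant side;
        # an all-zero remainder is treated as position `width` (a simplification).
        pos = width - rest.bit_length() + 1 if rest else width
        max_pos[bucket] = max(max_pos[bucket], pos)

    # 2^(average position) is the geometric mean of the per-bucket 2^k estimates;
    # scale by the bucket count and LogLog's correction constant (~0.397).
    mean_pos = sum(max_pos) / num_buckets
    return 0.39701 * num_buckets * 2 ** mean_pos


data = [f"user-{i}" for i in range(50_000)] * 3  # 50,000 distinct values, each seen 3 times
print(estimate_cardinality(data))                # roughly 50,000; typical error is a few percent
```

HyperLogLog keeps the same per-bucket state but combines the buckets with a harmonic mean (plus small- and large-range corrections), which is where the lower error rate comes from.
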