
Probabilistic data structures

  • Bloom filters.
    • Set membership.
    • 1D bit array with k hash functions. (A minimal sketch follows the Bloom filter bullets.)
    • False positives possible but no false negatives.
    • Applications:
      • One-hit wonders can take up to 75% of cache space. A BF can help identify them so that you can skip caching such items and save cache space.
      • Check for weak passwords, malicious URLs, usernames, etc. If you want 100% accuracy, check the BF first and fall back to the real data store as required.
      • In a NoSQL-style database, check the BF for whether an item exists before going to disk, thereby reducing disk accesses.
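
A minimal Bloom filter sketch, assuming a fixed bit-array size, k positions derived from seeded SHA-256, and string items; the class name, sizes, and hashing scheme are illustrative choices, not from the notes above.

```python
import hashlib


class BloomFilter:
    """1D bit array + k hash functions: false positives possible, no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # the 1D bit array

    def _positions(self, item: str):
        # Derive k bit positions by hashing the item with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("malicious.example.com")
print(bf.might_contain("malicious.example.com"))  # True
print(bf.might_contain("safe.example.com"))       # False (with high probability)
```

In the malicious-URL use case, a False answer lets you skip the authoritative lookup entirely; a True answer still needs it to rule out a false positive.
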
  • Count-min sketch.
    • Estimate frequency of all elements in a set.
    • A 2D array with R rows and C columns. Each row corresponds to one hash function, so there are R hash functions, and each function maps an item to a column index between 0 and C-1. (A minimal sketch follows the count-min sketch bullets.)
    • Once all elements in the stream have been added, how do you find the ones with the highest frequency?
      • If you have the full set of distinct elements, you can make another pass over it, query the sketch for each element, and keep track of the most frequent ones.
      • A better solution is to keep a separate min-heap of size k alongside the sketch.
        • While adding an element to the sketch, compare its updated frequency estimate with the minimum element in the heap.
        • If the new frequency is higher than the minimum in the heap, remove the minimum and add the current element.
        • Remember: a min-heap helps to find the top-k elements, and a max-heap the bottom-k.
    • Can over-estimate, never under-estimate (hash collisions only ever inflate the counters).
    • Applications:
      • Top-k elements in a stream (e.g., trending products).
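
A minimal count-min sketch plus top-k tracker along the lines described above: an R x C counter table with one seeded hash function per row, and a size-k min-heap updated as elements stream in. The class and function names, table dimensions, SHA-256 hashing, and the lazy re-heapify when a tracked item's estimate changes are all illustrative assumptions.

```python
import hashlib
import heapq


class CountMinSketch:
    """R x C counter table; estimates can over-count, never under-count."""

    def __init__(self, rows: int = 4, cols: int = 256):
        self.rows = rows
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]

    def _col(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.cols

    def add(self, item: str) -> None:
        for row in range(self.rows):
            self.table[row][self._col(row, item)] += 1

    def estimate(self, item: str) -> int:
        # Minimum across rows: collisions only ever inflate counters.
        return min(self.table[row][self._col(row, item)] for row in range(self.rows))


def top_k(stream, k: int = 3):
    """One pass over the stream, keeping a size-k min-heap of (estimate, item)."""
    cms = CountMinSketch()
    heap: list = []     # min-heap of [estimate, item] entries
    entries: dict = {}  # item -> its heap entry, so estimates can be refreshed
    for item in stream:
        cms.add(item)
        est = cms.estimate(item)
        if item in entries:
            entries[item][0] = est  # refresh the stored estimate...
            heapq.heapify(heap)     # ...and restore the heap property
        elif len(heap) < k:
            entries[item] = [est, item]
            heapq.heappush(heap, entries[item])
        elif est > heap[0][0]:
            # New estimate beats the heap's minimum: evict it, insert this item.
            _, evicted = heapq.heappop(heap)
            del entries[evicted]
            entries[item] = [est, item]
            heapq.heappush(heap, entries[item])
    return sorted(((item, est) for est, item in heap), key=lambda pair: -pair[1])


stream = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "c"]
print(top_k(stream, k=2))  # [('a', 4), ('b', 3)]; estimates may be inflated by collisions
```

The heap makes the top-k answer available at any point in the stream for O(k) extra memory, at the cost of an O(k) re-heapify whenever a tracked item's estimate is refreshed.
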
  • LogLog.
    • Cardinality estimation.
      • I was thinking one way to estimate cardinality would be to use a Bloom filter in conjunction with a counter: when a new item comes in, check if it already exists in the filter and, if it doesn't, add it and increase the counter by 1.
        • Should probably work but I could be missing some edge case(s).
    • Mental model:
      • Let’s say you hash items such that the binary representation of the result is an X-bit number. (X is a design parameter - up to you.) If the hash function is truly random, the bits should be uniformly distributed in the result.
      • If the cardinality of the dataset is N, we expect about N/2 hashes to look like 1XXX…, N/4 like 01XXX…, N/8 like 001XXX…, etc.
      • If the maximum position of the first 1 bit across all hashes is k (counting from the start), then roughly N/2^k hashes have their first 1 at position k, and for the maximum observed position that count is approximately 1. So, the cardinality is approximately 2^k.
      • Randomness can cause a large variance in this estimate. There are two ways to reduce it:
        • Multiple hash functions:
          • Apply multiple hash functions; each gives its own cardinality estimate. The final result is the average of all of these.
          • Expensive.
        • Buckets with single hash function:
          • Use the first Y bits (out of X) to bucketize the results. So, Y == 10 results in 2^10 = 1024 buckets.
          • Use the remaining X-Y bits per the original idea. So, every bucket gives its own max bit position and hence its own cardinality estimate.
          • Combine the bucket-level estimates and scale by 2^Y to get the final result. (LogLog specifically averages the max bit positions - a geometric mean of the per-bucket estimates - and applies a bias-correction constant, since a plain arithmetic mean is skewed by outlier buckets.)
      • There is nothing special about choosing 00…1XX… (i.e. leading zeros). You are free to choose XX…100… (i.e. zeros at the end), 11…0XX… (i.e. leading 1s instead of 0s) etc.
    • HyperLogLog is a variant of LogLog that reduces the error rate, but the core idea is the same. (A sketch of the bucketed estimator follows below.)
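
A minimal LogLog-style estimator following the bucketed scheme above: the first Y bits of each hash pick a bucket, each bucket keeps the maximum position of the first 1 bit in the remaining X - Y bits, and the buckets are combined with a geometric mean and LogLog's bias-correction constant (about 0.397). The function name, X = 32, Y = 10, SHA-256 hashing, and the handling of an all-zero remainder are illustrative assumptions.

```python
import hashlib


def estimate_cardinality(items, x_bits: int = 32, y_bits: int = 10) -> float:
    """LogLog-style estimate: bucket on the first Y bits, track the max first-1-bit position."""
    num_buckets = 1 << y_bits    # 2^Y buckets
    max_pos = [0] * num_buckets  # per bucket: max position of the first 1 bit

    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big") & ((1 << x_bits) - 1)  # X-bit hash
        bucket = h >> (x_bits - y_bits)            # first Y bits pick the bucket
        rest = h & ((1 << (x_bits - y_bits)) - 1)  # remaining X - Y bits
        width = x_bits - y_bits
        # Position of the first 1 bit, counted from the most significant side;
        # an all-zero remainder is treated as position `width` (a simplification).
        pos = width - rest.bit_length() + 1 if rest else width
        max_pos[bucket] = max(max_pos[bucket], pos)

    # 2^(average position) is the geometric mean of the per-bucket 2^k estimates;
    # scale by the bucket count and LogLog's correction constant (~0.397).
    mean_pos = sum(max_pos) / num_buckets
    return 0.39701 * num_buckets * 2 ** mean_pos


data = [f"user-{i}" for i in range(50_000)] * 3  # 50,000 distinct values, each seen 3 times
print(estimate_cardinality(data))                # roughly 50,000; typical error is a few percent
```

HyperLogLog keeps the same per-bucket state but combines the buckets with a harmonic mean (plus small- and large-range corrections), which is where the lower error rate comes from.
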