Probabilistic data structures
- Bloom filters.
- Set membership.
- A 1-D bit array with k hash functions; adding an item sets k bits, and a membership check tests those same k bits (see the example after this list).
- False positives possible but no false negatives.
- Applications:
- One-hit wonders can take up to 75% of cache space. A BF can help identify them so that you can skip caching such items and save cache space.
- Check for weak passwords, malicious URLs, usernames, etc. If you want 100% accuracy, check the BF first and fall back to the real data store as required.
- In a NoSQL-style database, check the BF for whether an item exists before going to disk, which reduces disk accesses.
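To make the structure concrete, here is a minimal Bloom filter sketch in Python. The bit-array size, the number of hashes, and deriving the k hashes by salting blake2b are illustrative choices on my part, not tuned parameters:

```python
import hashlib


class BloomFilter:
    """k salted hashes over an m-bit array; supports add and membership test."""

    def __init__(self, m_bits=1024, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting the same hash function k ways.
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bf = BloomFilter()
bf.add("user:alice")
print(bf.might_contain("user:alice"))    # True
print(bf.might_contain("user:mallory"))  # False (or, rarely, a false positive)
```

The password and NoSQL use cases above boil down to the same pattern: call `might_contain` before the expensive lookup and skip the lookup when the answer is False.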
- Count min sketch.
- Estimate frequency of all elements in a set.
- A 2-D array with R rows (one per hash function) and C columns; each hash function maps an item to a number between 0 and C-1. Adding an item increments one counter per row; querying its frequency takes the minimum across the rows.
- Once all elements in the stream have been added, how do you find the one with the highest frequency?
- If you have the superset of all elements, you can make another pass over it and keep track of the most frequent element.
- A better solution is to keep a separate min heap.
- While adding an element to the sketch, compare its latest frequency with the minimum element in the heap.
- If the new frequency is higher than the minimum in the heap, remove the minimum and add the current element.
- Remember: a min heap helps find the top-k elements, and a max heap the bottom-k.
- Can over-estimate, never under-estimate (collisions only ever inflate the counters, and queries take the minimum).
- Applications:
- Top-k elements in a stream (e.g., trending products). See the example after this list.
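Here is a sketch of a count-min sketch plus the min-heap top-k idea in Python. The row/column dimensions, the salted-hash scheme, and the heap bookkeeping (rebuilding the small heap whenever a tracked item's estimate changes) are my own illustrative choices:

```python
import hashlib
import heapq


class CountMinSketch:
    """R rows x C columns; row r uses its own salted hash into [0, C)."""

    def __init__(self, rows=4, cols=1024):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _col(self, item, row):
        h = hashlib.blake2b(item.encode(), salt=str(row).encode()).digest()
        return int.from_bytes(h, "big") % self.cols

    def add(self, item):
        for r in range(self.rows):
            self.table[r][self._col(item, r)] += 1

    def estimate(self, item):
        # Collisions only ever inflate counters, so the min is the tightest bound.
        return min(self.table[r][self._col(item, r)] for r in range(self.rows))


def top_k(stream, k=3):
    """Approximate top-k heavy hitters: CMS for counts, min heap of size k."""
    cms = CountMinSketch()
    heap = []      # (estimated_count, item); smallest estimate sits at heap[0]
    tracked = {}   # item -> latest estimate, for items currently in the heap

    for item in stream:
        cms.add(item)
        est = cms.estimate(item)
        if item in tracked:
            # Already tracked: refresh its estimate and restore the heap order.
            tracked[item] = est
            heap = [(c, i) for i, c in tracked.items()]
            heapq.heapify(heap)
        elif len(heap) < k:
            tracked[item] = est
            heapq.heappush(heap, (est, item))
        elif est > heap[0][0]:
            # Beats the current minimum: evict the minimum, track this item.
            _, evicted = heapq.heappop(heap)
            del tracked[evicted]
            tracked[item] = est
            heapq.heappush(heap, (est, item))

    return sorted(heap, reverse=True)


print(top_k(["a", "b", "a", "c", "a", "b", "d", "a", "b", "e"]))
# [(4, 'a'), (3, 'b'), (1, 'c')] -- exact here, since collisions are unlikely at this size
```

Rebuilding the k-element heap whenever a tracked item's count changes keeps the code short; that is fine for small k, but a production version would update the heap more carefully.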
- LogLog.
- Cardinality estimation.
- I was thinking one way to estimate cardinality would be to use a Bloom filter in conjunction with a counter: when a new item comes in, check whether it already exists in the filter and, if it doesn't, add it and increase the counter by 1.
- Should probably work, but I could be missing some edge case(s). (A sketch of this idea follows below.)
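A quick self-contained sketch of that idea, with an inlined bit-array Bloom filter (sizes and the salted-blake2b hashes are arbitrary choices). One edge case worth noting: a false positive makes a genuinely new item look already seen, so it never gets counted, and the estimate drifts low as the filter fills up.

```python
import hashlib


def bloom_counting_estimate(stream, m_bits=1 << 16, k=4):
    """Count an item only when the Bloom filter claims it hasn't been seen."""
    bits = bytearray(m_bits // 8)
    count = 0
    for item in stream:
        positions = [
            int.from_bytes(
                hashlib.blake2b(item.encode(), salt=str(i).encode()).digest(), "big"
            ) % m_bits
            for i in range(k)
        ]
        if not all(bits[p // 8] & (1 << (p % 8)) for p in positions):
            count += 1                        # treated as new: count it ...
            for p in positions:
                bits[p // 8] |= 1 << (p % 8)  # ... and remember it
    return count


print(bloom_counting_estimate(["a", "b", "a", "c", "b", "a"]))  # 3
```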
- Mental model:
- Let’s say you hash items such that the binary representation of the result is an X-bit number. (X is a design parameter, up to you.) If the hash function is truly random, each bit of the result should be 0 or 1 with equal probability.
- If the cardinality of the dataset is N, we expect N/2 of the hashes to be 1XXX…, N/4 to be 01XXX…, N/8 to be 001XXX…, etc.
- If the maximum observed position of the first 1 bit is k from the start, then N/2^k is approximately 1, so the cardinality is approximately 2^k. (For example, if the first 1 bit never shows up later than position 20 across all hashes, the estimate is 2^20, roughly a million distinct items.)
- Randomness can cause issues with the calculation: a single unlucky hash with many leading zeros can skew the estimate badly. Two ways to fix it:
- Multiple hash functions:
- Apply multiple hash functions; each gives its own cardinality estimate. The final result is the average of all of these.
- Expensive.
- Buckets with a single hash function:
- Use the first Y bits (out of X) to bucketize the results. So, Y == 10 will result in 2^10 buckets.
- Use the remaining X-Y bits per the original idea. So, for every bucket, you get some cardinality estimate.
- Average the bucket-level cardinality numbers and multiply by 2^Y to get the final result. (A code sketch of this follows at the end of these notes.)
- Note: there is nothing special about choosing 00…1XX… (i.e. leading zeros). You are free to choose XX…100… (i.e. zeros at the end), 11…0XX… (i.e. leading 1s instead of 0s), etc.
- HyperLogLog is a variant on top of LogLog that decreases the error percentage (it combines the bucket estimates with a harmonic mean, which is less sensitive to outlier buckets). But the core idea is the same.
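Here is a rough LogLog-style sketch in Python following the bucketing idea above: the first Y bits of the hash pick a bucket, and each bucket remembers the furthest position at which a first 1 bit has been seen. One deviation from the note: instead of arithmetically averaging the per-bucket 2^position estimates (which a single outlier bucket can dominate), it averages the positions themselves and applies the standard LogLog correction constant of roughly 0.397. Bucket count and hash width are illustrative:

```python
import hashlib
import random


def loglog_estimate(stream, bucket_bits=10, hash_bits=64):
    """LogLog-style cardinality estimate with 2^bucket_bits buckets."""
    num_buckets = 1 << bucket_bits
    width = hash_bits - bucket_bits          # bits left over for the "first 1" trick
    max_pos = [0] * num_buckets

    for item in stream:
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        bucket = h >> width                   # first Y bits pick the bucket
        rest = h & ((1 << width) - 1)         # remaining X - Y bits
        pos = width - rest.bit_length() + 1 if rest else width  # first 1 bit, 1-based from the top
        max_pos[bucket] = max(max_pos[bucket], pos)

    # Average the positions across buckets, exponentiate, then scale by the
    # bucket count and the LogLog correction constant (~0.397).
    mean_pos = sum(max_pos) / num_buckets
    return 0.397 * num_buckets * 2 ** mean_pos


data = [str(random.randrange(50_000)) for _ in range(200_000)]
print("true:", len(set(data)), "estimate:", round(loglog_estimate(data)))
```

With 2^10 buckets the relative error of plain LogLog is on the order of a few percent; HyperLogLog's harmonic mean tightens it a bit further.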