One-hit wonders can take up to 75% of cache space. A BF can help identify them so that you can skip caching such items and save cache space.
Check for weak passwords, malicious URLs, usernames, etc. If you want 100% accuracy, check the BF first and fall back to the real data store as required.
In a NoSQL-style database, check the BF for whether an item exists before going to disk, thereby decreasing disk accesses.
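A minimal sketch of this check-then-fall-back pattern (my own illustration, not from the notes): a tiny Bloom filter built on SHA-256, plus a hypothetical `check_malicious_db` lookup standing in for the real data store.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m_bits=1 << 20, k_hashes=5):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a big int used as the bit array

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False positives are possible, false negatives are not.
        return all(self.bits >> pos & 1 for pos in self._positions(item))


def is_malicious(url: str, bf: BloomFilter, check_malicious_db) -> bool:
    """BF as the cheap first check; the real store remains the source of truth."""
    if not bf.might_contain(url):
        return False                    # definitely not in the set, skip the expensive lookup
    return check_malicious_db(url)      # possible false positive, confirm against real data
```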
Count-min sketch.
Estimates the frequency of each element in a stream.
A 2D array with R rows and C columns: each row corresponds to one hash function (so R hash functions in total), and each function maps an element to a column index between 0 and C-1.
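A minimal sketch of that R x C table (my naming; the row hashes are just SHA-256 salted with the row index):

```python
import hashlib

class CountMinSketch:
    """R rows (one per hash function) x C columns of counters."""

    def __init__(self, rows=4, cols=1000):
        self.R, self.C = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _col(self, row: int, item: str) -> int:
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.C

    def add(self, item: str, count: int = 1):
        for r in range(self.R):
            self.table[r][self._col(r, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum across rows
        # can over-estimate but never under-estimate the true count.
        return min(self.table[r][self._col(r, item)] for r in range(self.R))
```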
Once all elements of the stream have been added, how do you find the one with the highest frequency?
If you have a superset of all elements, you can make another pass over it and keep track of the most frequent element.
A better solution is to keep a separate min heap.
While adding an element to the sketch, compare its latest frequency with the minimum element in the heap.
If the new frequency is higher than the minimum in the heap, remove the minimum and add the current element.
Remember: a min-heap helps find the top-k elements, and a max-heap the bottom-k.
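A sketch of the min-heap idea on top of the CountMinSketch above (function and variable names are mine; heapq gives a min-heap, and stale entries are cleaned up lazily):

```python
import heapq

def approx_top_k(stream, k, cms):
    """Track roughly the k most frequent items of a stream, using a
    CountMinSketch (cms) for counts and a min-heap for the current top-k."""
    heap = []      # min-heap of (estimated_freq, item); may contain stale entries
    best = {}      # item -> latest estimate, for the items currently tracked

    for item in stream:
        cms.add(item)
        est = cms.estimate(item)

        if item in best or len(best) < k:
            best[item] = est
            heapq.heappush(heap, (est, item))   # older entries for this item become stale
        else:
            # Discard stale heap entries (evicted items or outdated estimates).
            while heap and best.get(heap[0][1]) != heap[0][0]:
                heapq.heappop(heap)
            # New element beats the tracked minimum: evict it, track the newcomer.
            if heap and est > heap[0][0]:
                _, evicted = heapq.heappop(heap)
                del best[evicted]
                best[item] = est
                heapq.heappush(heap, (est, item))

    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Usage (hypothetical stream): approx_top_k(product_view_events, 10, CountMinSketch())
```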
Can over-estimate, never under-estimate.
Applications:
Top-k elements in a stream (e.g., trending products).
LogLog.
Cardinality estimation.
I was thinking one way to estimate cardinality would be to use a Bloom filter in conjunction with a counter: when a new item comes in, check whether it already exists in the filter and, if it doesn't, add it and increase the counter by 1.
Should probably work but I could be missing some edge case(s).
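A compact sketch of that idea (my own illustration). One edge case is the BF's false positives: a genuinely new item can look "already seen", so the estimate can only under-count.

```python
import hashlib

def approx_distinct_count(stream, m_bits=1 << 20, k_hashes=5):
    """Cardinality estimate via Bloom-filter-plus-counter, per the idea above."""
    bits = 0       # big int used as the filter's bit array
    count = 0

    def positions(item):
        for i in range(k_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % m_bits

    for item in stream:
        pos = list(positions(item))
        if not all(bits >> p & 1 for p in pos):
            # Not (probably) seen before: count it and add it to the filter.
            count += 1
            for p in pos:
                bits |= 1 << p
    return count
```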
Mental model:
Let’s say you hash items such that the binary representation of the result is an X-bit number. (X is a design parameter, up to you.) If the hash function is truly random, each bit of the result is equally likely to be 0 or 1.
If the cardinality of the dataset is N, we expect about N/2 hashes to look like 1XXX…, N/4 like 01XXX…, N/8 like 001XXX…, etc.
If the furthest position of a leading 1 across all items is k (counting from the left), then N/2^k is approximately 1, so the cardinality is approximately 2^k.
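A sketch of this single-estimator mental model, assuming X = 64 and SHA-256 truncated to 64 bits (both arbitrary choices of mine):

```python
import hashlib

def rough_cardinality(stream, x_bits=64):
    """Mental-model estimator: N ~= 2^k, where k is the furthest position
    (1-based, from the left) of a leading 1 among all X-bit hashes seen."""
    max_k = 0
    for item in stream:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[: x_bits // 8], "big")   # X-bit hash
        # Leading zeros + 1 = position of the leading 1 (use X if h == 0).
        k = (x_bits - h.bit_length() + 1) if h else x_bits
        max_k = max(max_k, k)
    return 2 ** max_k
```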
A single unlucky hash with many leading zeros can skew the estimate badly, so the variance is high. Two ways to fix it:
Multiple hash functions:
Apply multiple hash functions; each gives its own cardinality estimate. The final result is the average of these estimates.
Expensive.
Buckets with single hash function:
Use the first Y bits (out of X) to bucketize the results. So Y == 10 gives 2^10 = 1024 buckets.
Apply the original idea to the remaining X-Y bits, so every bucket yields its own cardinality estimate.
Average the bucket-level cardinality estimates and multiply by 2^Y to get the final result.
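A sketch of the bucketed version as described above (my naming and parameter choices). Note that the actual LogLog algorithm averages the leading-1 positions themselves and applies a bias-correction constant, and HyperLogLog uses a harmonic mean; this follows the note's simplified averaging.

```python
import hashlib

def bucketed_cardinality(stream, x_bits=64, y_bits=10):
    """First Y bits pick a bucket, the remaining X-Y bits feed the
    leading-1 estimator; average per-bucket estimates, multiply by 2^Y."""
    num_buckets = 1 << y_bits
    rest_bits = x_bits - y_bits
    max_k = [0] * num_buckets     # furthest leading-1 position per bucket

    for item in stream:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[: x_bits // 8], "big")
        bucket = h >> rest_bits                   # first Y bits
        rest = h & ((1 << rest_bits) - 1)         # remaining X-Y bits
        k = (rest_bits - rest.bit_length() + 1) if rest else rest_bits
        max_k[bucket] = max(max_k[bucket], k)

    # Buckets that never saw an item contribute 2^0 = 1; good enough for a sketch.
    avg = sum(2 ** k for k in max_k) / num_buckets
    return avg * num_buckets
```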
There is nothing special about choosing 00…1XX… (i.e., counting leading zeros). You are free to choose XX…100… (i.e., trailing zeros), 11…0XX… (i.e., leading 1s instead of 0s), etc.
HyperLogLog is a variant of LogLog that decreases the error percentage, but the core idea is the same.