Bloom Filters - Michael's Notes

> See also: > - [[Hash Functions]] # Bloom Filters A Bloom Filter is a space-efficient probabilistic data structure that allows you to determine if a certain item potentially belongs in a set Adding an element never fails, however, the false positive rate will steadily increase as more are added until all bits within the filter are set to 1. At this point *all queries will yield a positive* (true or false) result. ![[Pasted image 20240305001718.png|600]] Bloom Filters have an extremely *fast runtime* and are *space-efficient*. In order to achieve this, a *static array* is used that retains its size regardless of the amount of values inserted ($n$). This component of a bloom filter is the **bit array**, which is a 1D array which stores bits (0 or 1), all of which are *initialized to 0*. - The bits within the bit array are shared with all items in the set. The bits are not unique for each item The number of indices and their values to be used when adding items are determined by the $k$ number of hash functions, each of which calculate the index ## Bloom Filter Operations Standard bloom filters *do not support deletion of elements*, as it could potentially disrupt the bit values of previously hashed values. ### Insertions ### Searching The search operation begins similarly to an insertion, where the hash values for the element being search for are calculated, and the corresponding indices of the bit array are accessed. If any of the values accessed within the bit array are 0, then we know that the value does not exist in the set. ## Probability of False Positives ### Probability of Outcomes | Type | Implication | | -------------- | -------------------------------------------------------- | | True Positive | If the item exists, and is present in the set | | False Positive | If the item exists, but is not present in the set | | True Negative | If the item does not exist and is not present in the set | | False Negative | N/A | Simply increasing the number of hash functions used ($k$) has several drawbacks: - While it might initially reduce the number of false positives Given a bloom filter with $m$ bits and $k$ hash functions, the possibility of a false positive occuring is: $p(E)=(1-(1-\frac{1}{m})^{nk})^k$