Bloom Filter
A Bloom Filter is a space-efficient probabilistic set membership test that relies on a bit vector and a family of hash functions (a minimal sketch is given after the outline below).
- Context:
- It can (typically) not make a False Negative Error.
- It can (typically) not support a Remove Record Operation.
- It can make a False Positive Error.
- It can be a Compressed Bloom Filter.
- It can be a Weighted Bloom Filter.
- Example(s):
- The Bloom Filter at http://billmill.org/bloomfilter-tutorial/
- The Bloom Filter at http://www.jasondavies.com/bloomfilter/
- Counter-Example(s):
- See: Probabilistic DS, Feature Vector Hashing Task, Murmur Hash.
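Below is a minimal illustrative sketch of the structure described in the outline above (a bit vector written by k hash functions). The class name, the double-hashing scheme, and the sizing formulas are assumptions for illustration, not taken from any of the references.
```python
# Minimal Bloom filter sketch: a bit vector plus k hash functions.
# The class name, sizing formulas, and double-hashing scheme are
# illustrative assumptions, not taken from any reference cited below.
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.m = max(1, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, int(round((self.m / capacity) * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set the k bits for this item; items cannot be removed later.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set => "possibly in set"; any bit clear => "definitely not in set".
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

bf = BloomFilter(capacity=1000, error_rate=0.01)
bf.add("apple")
print("apple" in bf)   # True  ("possibly in set"; never a false negative)
print("durian" in bf)  # usually False ("definitely not in set"); rarely a false positive
```
A True result from the membership check means "possibly in set" and a False result means "definitely not in set", matching the behavior described in the references below.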
References
2017
- https://cacm.acm.org/magazines/2017/9/220427-data-sketching/fulltext
- QUOTE: The Bloom filter is a compact data structure that summarizes a set of items. Any computer science data-structures class is littered with examples of "dictionary" data structures, such as arrays, linked lists, hash tables, and many esoteric variants of balanced tree structures. The common feature of these structures is that they can all answer "membership questions" of the form: Is a certain item stored in the structure or not? The Bloom filter can also respond to such membership questions. The answers given by the structure, however, are either "the item has definitely not been stored" or "the item has probably been stored." This introduction of uncertainty over the state of an item (it might be thought of as introducing potential false positives) allows the filter to use an amount of space that is much smaller than its exact relatives. The filter also does not allow listing the items that have been placed into it. Instead, you can pose membership questions only for specific items.
2013
- http://en.wikipedia.org/wiki/Bloom_filter
- A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, of which 90% could be hyphenated by following simple rules but all the remaining 50,000 words required expensive disk access to retrieve their specific patterns. With unlimited core memory, an error-free hash could be used to eliminate all the unnecessary disk access. But if core memory was insufficient, a smaller hash area could be used to eliminate most of the unnecessary access. For example, a hash area only 15% of the error-free size would still eliminate 85% of the disk accesses.
More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.
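As a quick sanity check of the "fewer than 10 bits per element" figure quoted above, the standard sizing formulas for a Bloom filter (m/n = -ln(p) / (ln 2)^2 bits per element and k = -log2(p) hash functions, for a target false positive rate p) can be evaluated directly; this small calculation is illustrative and not part of the quoted source.
```python
# Check the "fewer than 10 bits per element for a 1% false positive rate"
# claim using the standard Bloom filter sizing formulas (illustrative only).
import math

p = 0.01                                               # target false positive probability
bits_per_element = -math.log(p) / (math.log(2) ** 2)   # m/n
optimal_hashes = -math.log2(p)                         # k = (m/n) * ln 2

print(f"bits per element  ~ {bits_per_element:.2f}")   # ~ 9.59, i.e. fewer than 10
print(f"optimal hash count ~ {optimal_hashes:.2f}")    # ~ 6.64
```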
2006
- (Chang et al., 2006) ⇒ Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. (2006). “Bigtable: A Distributed Storage System for Structured Data.” In: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation.
- QUOTE: As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.
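The read-path pattern described in this quote (consult a per-SSTable Bloom filter before paying for a disk seek) could be sketched as follows; the function, object, and method names are hypothetical and are not Bigtable's actual API.
```python
# Hypothetical read path: consult each SSTable's Bloom filter before doing
# the (expensive) disk lookup, in the spirit of the Bigtable quote above.
# Names and structures are illustrative, not Bigtable's actual API.
def read_cell(row, column, sstables):
    key = f"{row}/{column}"
    for sstable in sstables:
        # Filter says "definitely not present": skip the disk seek entirely.
        if key not in sstable.bloom_filter:
            continue
        # Filter says "possibly present": only now touch disk (may still miss,
        # since the filter can return false positives).
        value = sstable.disk_lookup(key)
        if value is not None:
            return value
    return None
```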
2004
- http://search.cpan.org/dist/Bloom-Filter/Filter.pm
- A Bloom filter is a probabilistic algorithm for doing existence tests in less memory than a full list of keys would require. The tradeoff to using Bloom filters is a certain configurable risk of false positives. This module implements a simple Bloom filter with configurable capacity and false positive rate.
- http://www.perl.com/pub/2004/04/08/bloom_filters.html
- Many people don't realize that there is an elegant alternative to the lookup hash, in the form of a venerable algorithm called a Bloom filter. Bloom filters allow you to perform membership tests in just a fraction of the memory you'd need to store a full list of keys, so you can avoid the performance hit of having to use a disk or database to do your lookups. As you might suspect, the savings in space comes at a price: you run an adjustable risk of false positives, and you can't remove a key from a filter once you've added it in. But in the many cases where those constraints are acceptable, a Bloom filter can make a useful tool.
1970
- (Bloom, 1970) ⇒ Burton H. Bloom. (1970). “Space/time Trade-offs in Hash Coding with Allowable Errors.” In: Communications of the ACM, 13(7). doi:10.1145/362686.362692