MinHash Hashing Scheme

Example(s):
- as used in (Chum et al., 2008).
- as used in (Broder, 2000).
- …
Counter-Example(s):
- Bloom Filter.
See: Locality Sensitive Hashing, Cluster Analysis.

References

(Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/MinHash Retrieved:2018-3-12.
- In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broeder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.
  It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

(Ondov et al., 2016) ⇒ Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. (2016). “Mash: Fast Genome and Metagenome Distance Estimation Using MinHash.” Genome biology 17, no. 1
- QUOTE: … The MinHash technique is a form of locality-sensitive hashing [5] that has been widely used for the detection of near-duplicate Web pages and images [6, 7], but has seen limited use in genomics despite initial applications over ten years ago [8]. More recently, MinHash has been applied to the relevant problems of genome assembly [9], 16S rDNA gene clustering [10, 11], and metagenomic sequence clustering [12]. Because of the extremely low memory and CPU requirements of this probabilistic approach, MinHash is well suited for data-intensive problems in genomics. To facilitate this, we have developed Mash for the flexible construction, manipulation, and comparison of MinHash sketches from genomic data. We build upon past applications of MinHash by deriving a new significance test to differentiate chance matches when searching a database, and derive a new distance metric, the Mash distance, which estimates the mutation rate between two sequences directly from their MinHash sketches. Similar “alignment-free” methods have a long history in bioinformatics [13, 14].

(Chum et al., 2008) ⇒ Ondrej Chum, James Philbin, and Andrew Zisserman. (2008). “Near D Rabin (1981)uplicate Image Detection: Min-Hash and Tf-idf Weighting.” In: BMVC, vol. 810, pp. 812-815.

(Broder, 2000) ⇒ Andrei Z. Broder. (2000). “Identifying and Filtering Near-duplicate Documents.” In: Annual Symposium on Combinatorial Pattern Matching, pp. 1-10 . Springer, Berlin, Heidelberg,
(Broder et al., 2000b) ⇒ Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. (2000). “Min-wise Independent Permutations.” Journal of Computer and System Sciences 60, no. 3

(Broder, 1997) ⇒ Andrei Z. Broder. (1997). “On the Resemblance and Containment of Documents.” In: Compression and Complexity of Sequences
- ABSTRACT: Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using fingerprints.