MinHash Hashing Scheme

From GM-RKB
Jump to navigation Jump to search

A MinHash Hashing Scheme is a locality sensitive hashing scheme that ...



References

2018

2016

  • (Ondov et al., 2016) ⇒ Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. (2016). “Mash: Fast Genome and Metagenome Distance Estimation Using MinHash.” Genome biology 17, no. 1
    • QUOTE: … The MinHash technique is a form of locality-sensitive hashing [5] that has been widely used for the detection of near-duplicate Web pages and images [6, 7], but has seen limited use in genomics despite initial applications over ten years ago [8]. More recently, MinHash has been applied to the relevant problems of genome assembly [9], 16S rDNA gene clustering [10, 11], and metagenomic sequence clustering [12]. Because of the extremely low memory and CPU requirements of this probabilistic approach, MinHash is well suited for data-intensive problems in genomics. To facilitate this, we have developed Mash for the flexible construction, manipulation, and comparison of MinHash sketches from genomic data. We build upon past applications of MinHash by deriving a new significance test to differentiate chance matches when searching a database, and derive a new distance metric, the Mash distance, which estimates the mutation rate between two sequences directly from their MinHash sketches. Similar “alignment-free” methods have a long history in bioinformatics [13, 14].

2008

2000

1997

  • (Broder, 1997) ⇒ Andrei Z. Broder. (1997). “On the Resemblance and Containment of Documents.” In: Compression and Complexity of Sequences
    • ABSTRACT: Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using fingerprints.