Locality-Sensitive Hashing (LSH) Algorithm
A Locality-Sensitive Hashing (LSH) Algorithm is an approximate similarity search algorithm that uses multiple "bucket" hash functions to guarantee that there is a high probability of collision for points which are close to each other and low collision probability for dissimilar points.
- Context:
- It computes Hash Function for mapping points in a multidimensional coordinate space to scalar values.
- It outputs a Hash Tables.
- It can (typically) be a Data-Independent Hashing Method.
- It can be specific for a Distance Measures, such as: Hamming norm, Lp norms, cosine distance, earth movers distance (EMD), and Jaccard coefficient.
- It can be used for High-Dimensional Data.
- …
- Example(s):
- Counter-Example(s):
- See: LSH Family, Nearest Neighbor Search, Jaccard similarity, Cosine Distance, Hamming Distance.
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Locality-sensitive_hashing Retrieved:2020-5-5.
- In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets are much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
Hashing-based approximate nearest neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as Locality-preserving hashing (LPH).
- In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets are much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
2017
- (Jin & Han, 2017) ⇒ Xin Jin, and Jiawei Han. (2011). "Locality Sensitive Hashing Based Clustering”. In: (Sammut & Webb, 2017). DOI:10.1007/978-1-4899-7687-1_950
- QUOTE: The basic idea of the LSH (Gionis et al. 1999) technique is using multiple hash functions to hash the data points and guarantee that there is a high probability of collision for points which are close to each other and low collision probability for dissimilar points. LSH schemes exist for many distance measures, such as Hamming norm, Lp norms, cosine distance, earth movers distance (EMD), and Jaccard coefficient.
In LSH, define a family [math]\displaystyle{ H=\{h:S \to U\} }[/math] as locality-sensitive, if for any [math]\displaystyle{ a }[/math], the function [math]\displaystyle{ p(t)=P_{r_H}[h(a)=h(b):\parallel a -b \parallel = x }[/math] is decreasing in [math]\displaystyle{ x }[/math]. Based on this definition, the probability of collision of points [math]\displaystyle{ a }[/math] and [math]\displaystyle{ b }[/math] is decreasing with their distance.
Although LSH was originally proposed for approximate nearest neighbor search in high dimensions, it can be used for clustering as well (Das et al. 2007; Haveliwala et al. 2000). The buckets could be used as the bases for clustering. Seeding the hash functions several times can help getting better quality clustering.
- QUOTE: The basic idea of the LSH (Gionis et al. 1999) technique is using multiple hash functions to hash the data points and guarantee that there is a high probability of collision for points which are close to each other and low collision probability for dissimilar points. LSH schemes exist for many distance measures, such as Hamming norm, Lp norms, cosine distance, earth movers distance (EMD), and Jaccard coefficient.
2016
- (Liu et al., 2016) ⇒ Haishan Liu, David Pardoe, Kun Liu, Manoj Thakur, Frank Cao, and Chongzhe Li. (2016). “Audience Expansion for Online Social Network Advertising.” In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 165-174.
- QUOTE: ... Given the large size of the problem, we employ a Locality Sensitive Hashing (LSH) technique, named Arcos (Charikar, 2002), to assist in finding members with high cosine similarity. Each member is mapped to one of $2^n$ clusters, where n is chosen to make our nearest neighbor search manageable. ...
2015
- (Gao et al., 2015) ⇒ Jinyang Gao, H.V. Jagadish, Beng Chin Ooi, and Sheng Wang. (2015). “Selective Hashing: Closing the Gap Between Radius Search and K-NN Search.” In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2015). ISBN:978-1-4503-3664-2 doi:10.1145/2783258.2783284
- QUOTE: Locality Sensitive Hashing (LSH) [10] is an efficient approximate algorithm for high-dimensional similarity search. It is efficient and provides a rigorous quality guarantee for finding similar points within a radius [math]\displaystyle{ r }[/math]. We denote the sphere centered at point [math]\displaystyle{ q }[/math] by [math]\displaystyle{ B(q, r) }[/math]. Formally, [math]\displaystyle{ B(q, r) = \{p|p \in R^d , \parallel p, q\parallel \leq r\} }[/math]. The general idea of LSH is that objects that are within a given distance will be hashed to the same value with high probability, where the set of hash functions called LSH family:
DEFINITION 1. (LSH Family, H) A family H = {h : R^d \in U} of functions is called ([math]\displaystyle{ r, cr, p_1, p_2 }[/math])-sensitive if [math]\displaystyle{ \forall p, q \in R^d }[/math]:
- if [math]\displaystyle{ p \in B(q, r) }[/math], then [math]\displaystyle{ P_{r_H}(h(p) = h(q)) \geq p_1 }[/math];
- if [math]\displaystyle{ p \notin B(q, cr) }[/math], then [math]\displaystyle{ P_{rH}(h(p) = h(q)) \leq p_2 }[/math];
- QUOTE: Locality Sensitive Hashing (LSH) [10] is an efficient approximate algorithm for high-dimensional similarity search. It is efficient and provides a rigorous quality guarantee for finding similar points within a radius [math]\displaystyle{ r }[/math]. We denote the sphere centered at point [math]\displaystyle{ q }[/math] by [math]\displaystyle{ B(q, r) }[/math]. Formally, [math]\displaystyle{ B(q, r) = \{p|p \in R^d , \parallel p, q\parallel \leq r\} }[/math]. The general idea of LSH is that objects that are within a given distance will be hashed to the same value with high probability, where the set of hash functions called LSH family:
- LSH family for Euclidean distance is accomplished by random Gaussian projection. There are also various versions of LSH family for Jaccard similarity, cosine distance, Hamming distance etc.
LSH usually applies a concatenation of m hashing functions randomly selected from the hashing family to build the hash table(reducing check-rate), and builds l independent hash tables and takes the union of results (raising recall). m and l are parameters of LSH and the best value is related to the data size.
- LSH family for Euclidean distance is accomplished by random Gaussian projection. There are also various versions of LSH family for Jaccard similarity, cosine distance, Hamming distance etc.
2014
- (Leskovec et al., 2014) ⇒ Jure Leskovec, Anand Rajaraman, and Jeff Ullman (2014). "Mining of massive datasets". Cambridge university press.
2012
- (Har-Peled et al., 2012) ⇒ Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani (2012). "Approximate nearest neighbor: Towards removing the curse of dimensionality". Theory of computing, 8(1), 321-350.
2004
- (Datar et al., 2004) ⇒ Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. (2004). “Locality-sensitive Hashing Scheme based on P-stable Distributions.” In: Proceedings of the twentieth annual symposium on Computational geometry, pp. 253-262 . ACM,
1999
- (Gionise et al., 1999) ⇒ Aristides Gionis, Piotr Indyk, and Rajeev Motwani (1999) "Similarity search in high dimensions via hashing". In: Proceedings of the 25th International Conference on very large data bases (VLDB’99). Morgan Kaufmann Publishers, San Francisco, pp 518–529