Locality Sensitive Hashing Based Clustering Algorithm

Example(s):
- algorithm described in (Ozdem & Akcayol, 2021),
- algorithm described in (Jin & Han, 2017),
- …
Counter-Example(s):
- a Graph Clustering Algorithm,
- a Hash Algorithm,
- a Text Clustering Algorithm.
See: Hash Function, Hash Table, Nearest Neighbor Search, Distance Measure, Nearest Neighbor Search, Jaccard Similarity, Cosine Distance, Hamming Distance.

References

(Ozdem & Akcayol, 2021) ⇒ Kevser Ozdem, and M. Ali Akcayol "Locality Sensitive Hashing Based Clustering for Large Scale Documents". In: In 2021 6th International Conference on Mathematics and Artificial Intelligence (ICMAI 2021). In: DOI:10.1145/3460569.3460590.

(Jin & Han, 2017) ⇒ Xin Jin; Jiawei Han. (2011). "Locality Sensitive Hashing Based Clustering”. In: (Sammut & Webb, 2017). DOI:10.1007/978-1-4899-7687-1_950
- QUOTE: The basic idea of the LSH (Gionis et al. 1999) technique is using multiple hash functions to hash the data points and guarantee that there is a high probability of collision for points which are close to each other and low collision probability for dissimilar points. LSH schemes exist for many distance measures, such as Hamming norm, Lp norms, cosine distance, earth movers distance (EMD), and Jaccard coefficient.
  In LSH, define a family [math]\displaystyle{ H=\{h:S \to U\} }[/math] as locality-sensitive, if for any [math]\displaystyle{ a }[/math], the function [math]\displaystyle{ p(t)=P_{r_H}[h(a)=h(b):\parallel a -b \parallel = x }[/math] is decreasing in [math]\displaystyle{ x }[/math]. Based on this definition, the probability of collision of points [math]\displaystyle{ a }[/math] and [math]\displaystyle{ b }[/math] is decreasing with their distance.
  Although LSH was originally proposed for approximate nearest neighbor search in high dimensions, it can be used for clustering as well (Das et al. 2007; Haveliwala et al. 2000). The buckets could be used as the bases for clustering. Seeding the hash functions several times can help getting better quality clustering.

(Haveliwala et al., 2000) ⇒ Taher Haveliwala, Aristides Gionis, and Piotr Indyk (2000). "Scalable techniques for clustering the web". In: Proceedings of the third international workshop on the web and databases. Stanford University, Stanford, pp 129–134.

(Gionise et al., 1999) ⇒ Aristides Gionis, Piotr Indyk, and Rajeev Motwani (1999) "Similarity search in high dimensions via hashing". In: Proceedings of the 25th International Conference on very large data bases (VLDB’99). Morgan Kaufmann Publishers, San Francisco, pp 518–529