1998 ApproximateNearestNeighbors

(Indyk & Motwani, 1998) ⇒ Piotr Indyk, and Rajeev Motwani. (1998). “Approximate Nearest Neighbors: Towards removing the curse of dimensionality.” In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. doi:10.1145/276698.276876

Subject Headings: Nearest Neighbor Search Task, Curse of Dimensionality.

Notes

Cited By

~1078 http://scholar.google.com/scholar?q=%22Approximate+Nearest+Neighbors%3A+Towards+removing+the+curse+of+dimensionality%22+1998

Quotes

Abstract

The nearest neighbor problem is the following: Given a set of [math]\displaystyle{ n }[/math] points P={p_1,...,p_n} in some metric space [math]\displaystyle{ X }[/math], preprocess P/ so as to efficiently answer queries which require finding the point in [math]\displaystyle{ P }[/math] closest to a query point q ∈ X. We focus on the particularly interesting case of the [math]\displaystyle{ d }[/math]-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large [math]\displaystyle{ d }[/math], in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p ∈ P that is an e-approximate nearest neighbor of the query [math]\displaystyle{ q }[/math] in that for all p’ ∈ P, d(p, q) < (1 + e)d(p’, q).

We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in [math]\displaystyle{ n }[/math] and [math]\displaystyle{ d }[/math], and a truly sublinear query time (for [math]\displaystyle{ e }[/math] > 1); and, (b) query time polynomial in log-n and [math]\displaystyle{ d }[/math], and only a mildly exponential preprocessing cost Õ(n) × O(1/e^d). Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in [math]\displaystyle{ d }[/math] and log n. Unfortunately, for small [math]\displaystyle{ e }[/math], the latter is a purely theoretical result since the exponent depends on l/e. Experimental results indicate that our first algorithm offers orders of magnitude improvement on running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.

References

1 Pankaj K. Agarwal, Jirí Matoušek, Ray shooting and parametric search, Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, p.517-526, May 04-06, 1992, Victoria, British Columbia, Canada doi:10.1145/129712.129763
2 Sunil Arya, David M. Mount, Approximate nearest neighbor queries in fixed dimensions, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.271-280, January 25-27, 1993, Austin, Texas, United States
3 13, Arya, D,M, Mount, and O. Narayan, Accounting for boundary effects in nearest-neighbor searching. Discrete and Computational Geometry, 16(1996):155-176.
4 Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, Angela Wu, An optimal algorithm for approximate nearest neighbor searching, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.573-582, January 23-25, 1994, Arlington, Virginia, United States
5 A. Andersson, Static Dictionaries on RAMs: Query Time is Necessary and Sufficient, Proceedings of the 37th Annual Symposium on Foundations of Computer Science, p.441, October 14-16, 1996
6 Jon Louis Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM, v.18 n.9, p.509-517, Sept. 1975 doi:10.1145/361002.361007
7 Marshall Bern, Approximate closest-point queries in high dimensions, Information Processing Letters, v.45 n.2, p.95-99, Feb. 26, 1993 doi:10.1016/0020-0190(93)90222-U
8 M. W. Berry, S. T. D %A, A. T. Shippy, A Case Study of Latent Semantic Indexing, University of Tennessee, Knoxville, TN, 1995
9 Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth International Conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
10 C. Buddey, A. Singhal, M. Mitra, and Gerard M. Salton. New Retrieval Approaches Using SMART: TREC 4. In: Pro. ceedings of the Fourth Text Retrieval Conference, National Institute of Standards and Technology, 1995.
11 W. A. Burkhard, R. M. Keller, Some approaches to best-match file searching, Communications of the ACM, v.16 n.4, p.230-236, April 1973 doi:10.1145/362003.362025
12 Tolga Bozkaya, Meral Ozsoyoglu, Distance-based indexing for high-dimensional metric spaces, Proceedings of the 1997 ACM SIGMOD Conference, p.357-368, May 11-15, 1997, Tucson, Arizona, United States
13 F. Cazals, Effective nearest neighbors searching on the hyper-cube, with applications to molecular clustering, Proceedings of the fourteenth annual symposium on Computational geometry, p.222-230, June 07-10, 1998, Minneapolis, Minnesota, United States doi:10.1145/276884.276910
14 Timothy M. Chan, Approximate nearest neighbor queries revisited, Proceedings of the thirteenth annual symposium on Computational geometry, p.352-358, June 04-06, 1997, Nice, France doi:10.1145/262839.263001
15 Kenneth L. Clarkson. (1988). “A Randomized Algorithm for Closest-Point Queries.” In: SIAM Journal on Computing, 17(4). doi:10.1137/0217052
16 Kenneth L. Clarkson. (1994). “An Algorithm for Approximate Closest-Point Queries.” In: Proceedings of the tenth annual symposium on Computational Geometry. doi:10.1145/177424.177609.
17 Kenneth L. Clarkson. (1997). “Nearest neighbor queries in metric spaces.” In: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. doi:10.1145/258533.258655
18 Scott Cost, Steven Salzberg, A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, v.10 n.1, p.57-78, Jan. 1993 doi:10.1023/A:1022664626993
19 T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. {EEE Transactions on Information The. ory, 13(1967):21-27.
20 $. Deerwester, S. T. Dumais, T.K. Landaner, G.W. Furhas, and R.A. Harshman. Indexing by latent semantic analysis. Journal of the Society fo~ Information Sci. ences, 41(1990):391-407.
21 L. Devroye and T.J. Wagner, Nearest neighbor methods in discrimination. In: Handbook of Statistics, vol. 2, P.R. Krishnaiah and L.N. Kanal, eds., North-Holland, 1982.
22 D. Dobkin and R. Lipton. Multidimensional search problems. SIAM Journal on Computing, 5(1976):181- 186.
23 Danny Dolev, Yuval Harari, Nathan Linial, Noam Nisan, Michal Parnas, Neighborhood preserving hashing and approximate queries, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.251-259, January 23-25, 1994, Arlington, Virginia, United States
24 D. Dolev, Y. Harari, and M. Parnas. Finding the neighborhood of a query in a dictionary. In: Proceedings of the ~nd Israel Symposium on Theory and Computing Systems, 1993, pp. 33-42.
25 Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2000
26 Herbert Edelsbrunner. (1987). “Algorithms in Combinatorial Geometry. Springer-Verlag
27 David Eppstein, Fast hiearchical clustering and other applications of dynamic closet pairs, Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, p.619-628, January 25-27, 1998, San Francisco, California, United States
28 Christos Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994 doi:10.1007/BF00962238
29 W. Feller. An introduction to Probability Theory and its Applications. John Wiley & Sons, NY, 1991.
30 Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, Peter Yanker, Query by Image and Video Content: The QBIC System, Computer, v.28 n.9, p.23-32, September 1995 doi:10.1109/2.410146
31 William B. Frakes, Ricardo Baeza-Yates, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
32 P. Frankl, H. Maehara, The Johnson-Lindenstrauss Lemma and the sphericity of some graphs, Journal of Combinatorial Theory Series A, v.44 n.3, p.355-362, June 1987
33 Michael L. Fredman, János Komlós, Endre Szemerédi, Storing a Sparse Table with 0(1) Worst Case Access Time, Journal of the ACM (JACM), v.31 n.3, p.538-544, July 1984 doi:10.1145/828.1884
34 Jerome H. Freidman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software (TOMS), v.3 n.3, p.209-226, Sept. 1977 doi:10.1145/355744.355745
35 Allen Gersho, Robert M. Gray, Vector quantization and signal compression, Kluwer Academic Publishers, Norwell, MA, 1991
36 A. Gionis, P. Indyk, and Rajeev Motwani. Similarity Search in High Dimensions via Hashing. Manuscript, 1997.
37 D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. In: Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 722-731.
38 Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. In: First InternaSonal Conference on Knowledge Discovery f.4 Data Mining, 1995, pp. 142-149.
39 H. HoteUing. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27( 1933):417-441.
40 P. Indyk, Rajeev Motwani, and S. Venkatasubramanian. Geometric Matching Under Noise- Combinatorial Bounds and Algorithms. Manuscript, 1997.
41 W.B. Johnson and J. Lindenstrauss. Extensions of Lipshitz mapping into Hilbert space. Contemporary Mathematics, 26(1984):189-206.
42 W.B. Jotmson and G. Schechtman. Embedding l~~ into l{'. Acta Mathematica, 149(1982):71-85.
43 K. Karhunen. Uber lineare Methoden in dew Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Set. A137, 1947.
44 V. Koivune and S. Kassam. Nearest neighbor filters for multivariate data. IEEE l'~rkshop on Nonlinear Signal and Image Processing, 1995.
45 Jon Kleinberg, Two algorithms for nearest-neighbor search in high dimensions, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.599-608, May 04-06, 1997, El Paso, Texas, United States doi:10.1145/258533.258653
46 Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.614-623, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276877
47 R. M. Karp, O. Waarts, G. Zweig, The bit vector intersection problem, Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS'95), p.621, October 23-25, 1995
48 N. Linial, E. London, and Y. Rabinovich. The geomctry of graphs and some of its algorithmic appllcation~. In: Proceedings of 85th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 577-591.
49 M. Loire. Fonctions aleastoires de second ordere. Processus Stochastiques et mouvcment Brownian, Hermann, Paris, 1948.
50 Jirí Matoušek, Reporting points in halfspaces, Computational Geometry: Theory and Applications, v.2 n.3, p.169-186, Nov. 1992 doi:10.1016/0925-7721(92)90006-E
51 S. Meiser, Point location in arrangements of hyperplanes, Information and Computation, v.106 n.2, p.286-303, Oct. 1993 doi:10.1006/inco.1993.1057
52 Nimrod Megiddo, Applying Parallel Computation Algorithms in the Design of Serial Algorithms, Journal of the ACM (JACM), v.30 n.4, p.852-865, Oct. 1983 doi:10.1145/2157.322410
53 M. Minsky and S. Papert. Perceptrona. MIT Prcs~, Cambridge, MA, 1969.
54 Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
55 A. Pentland, R.W. Picard, and S. Sdaroff. Photobook: tools for content-based manipulation of image databases. In: Proceedings of the SPiE Confercncc on Storage and Retrieval of Image and Vidco Databases II, 1994.
56 G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge Universit4' Press, 1989.
57 Gerard M. Salton, Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986
58 Hanan Samet, The design and analysis of spatial data structures, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1990
59 Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos, Multidimensional Access Methods: Trees Have Grown Everywhere, Proceedings of the 23rd International Conference on Very Large Data Bases, p.13-14, August 25-29, 1997
60 A.W.M. Smeulders and R. JaJn, eds. Image Databaaca and Multi-media Search. Proceedings of the First International Workshop, IDB-MIvIS '96, Amsterdam Unlversky Press, Amsterdam, 1996.
61 J.K. Uhlm~. Satisfying General Pro.,dmity/Similarit.y Queries with Metric Trees. Information ProccaMng Letters, 40(1991):175-179.
62 Peter N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.311-321, January 25-27, 1993, Austin, Texas, United States
63 T. A. Welch, BOUNDS ON INFORMATION RETRIEVAL EFFICIENCY IN STATIC FILE STRUCTURES., Massachusetts Institute of Technology, Cambridge, MA, 1971
64 A C Yao, F F Yao, A general approach to d-dimensional geometric queries, Proceedings of the seventeenth annual ACM symposium on Theory of computing, p.163-168, May 06-08, 1985, Providence, Rhode Island, United States doi:10.1145/22145.22163,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
1998 ApproximateNearestNeighbors	Rajeev Motwani Piotr Indyk			Approximate Nearest Neighbors: Towards removing the curse of dimensionality		Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing	http://dx.doi.org/10.1145/276698.276876	10.1145/276698.276876		1998