1998 ApproximateNearestNeighbors
- (Indyk & Motwani, 1998) ⇒ Piotr Indyk, and Rajeev Motwani. (1998). “Approximate Nearest Neighbors: Towards removing the curse of dimensionality.” In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. doi:10.1145/276698.276876
Subject Headings: Nearest Neighbor Search Task, Curse of Dimensionality.
Notes
Cited By
Quotes
Abstract
The nearest neighbor problem is the following: Given a set of [math]\displaystyle{ n }[/math] points P={p_1,...,p_n} in some metric space [math]\displaystyle{ X }[/math], preprocess P/ so as to efficiently answer queries which require finding the point in [math]\displaystyle{ P }[/math] closest to a query point q ∈ X. We focus on the particularly interesting case of the [math]\displaystyle{ d }[/math]-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large [math]\displaystyle{ d }[/math], in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p ∈ P that is an e-approximate nearest neighbor of the query [math]\displaystyle{ q }[/math] in that for all p’ ∈ P, d(p, q) < (1 + e)d(p’, q).
We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in [math]\displaystyle{ n }[/math] and [math]\displaystyle{ d }[/math], and a truly sublinear query time (for [math]\displaystyle{ e }[/math] > 1); and, (b) query time polynomial in log-n and [math]\displaystyle{ d }[/math], and only a mildly exponential preprocessing cost Õ(n) × O(1/e^d). Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in [math]\displaystyle{ d }[/math] and log n. Unfortunately, for small [math]\displaystyle{ e }[/math], the latter is a purely theoretical result since the exponent depends on l/e. Experimental results indicate that our first algorithm offers orders of magnitude improvement on running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.
References
- 1 Pankaj K. Agarwal, Jirí Matoušek, Ray shooting and parametric search, Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, p.517-526, May 04-06, 1992, Victoria, British Columbia, Canada doi:10.1145/129712.129763
- 2 Sunil Arya, David M. Mount, Approximate nearest neighbor queries in fixed dimensions, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.271-280, January 25-27, 1993, Austin, Texas, United States
- 3 13, Arya, D,M, Mount, and O. Narayan, Accounting for boundary effects in nearest-neighbor searching. Discrete and Computational Geometry, 16(1996):155-176.
- 4 Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, Angela Wu, An optimal algorithm for approximate nearest neighbor searching, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.573-582, January 23-25, 1994, Arlington, Virginia, United States
- 5 A. Andersson, Static Dictionaries on RAMs: Query Time is Necessary and Sufficient, Proceedings of the 37th Annual Symposium on Foundations of Computer Science, p.441, October 14-16, 1996
- 6 Jon Louis Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM, v.18 n.9, p.509-517, Sept. 1975 doi:10.1145/361002.361007
- 7 Marshall Bern, Approximate closest-point queries in high dimensions, Information Processing Letters, v.45 n.2, p.95-99, Feb. 26, 1993 doi:10.1016/0020-0190(93)90222-U
- 8 M. W. Berry, S. T. D %A, A. T. Shippy, A Case Study of Latent Semantic Indexing, University of Tennessee, Knoxville, TN, 1995
- 9 Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth International Conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
- 10 C. Buddey, A. Singhal, M. Mitra, and Gerard M. Salton. New Retrieval Approaches Using SMART: TREC 4. In: Pro. ceedings of the Fourth Text Retrieval Conference, National Institute of Standards and Technology, 1995.
- 11 W. A. Burkhard, R. M. Keller, Some approaches to best-match file searching, Communications of the ACM, v.16 n.4, p.230-236, April 1973 doi:10.1145/362003.362025
- 12 Tolga Bozkaya, Meral Ozsoyoglu, Distance-based indexing for high-dimensional metric spaces, Proceedings of the 1997 ACM SIGMOD Conference, p.357-368, May 11-15, 1997, Tucson, Arizona, United States
- 13 F. Cazals, Effective nearest neighbors searching on the hyper-cube, with applications to molecular clustering, Proceedings of the fourteenth annual symposium on Computational geometry, p.222-230, June 07-10, 1998, Minneapolis, Minnesota, United States doi:10.1145/276884.276910
- 14 Timothy M. Chan, Approximate nearest neighbor queries revisited, Proceedings of the thirteenth annual symposium on Computational geometry, p.352-358, June 04-06, 1997, Nice, France doi:10.1145/262839.263001
- 15 Kenneth L. Clarkson. (1988). “A Randomized Algorithm for Closest-Point Queries.” In: SIAM Journal on Computing, 17(4). doi:10.1137/0217052
- 16 Kenneth L. Clarkson. (1994). “An Algorithm for Approximate Closest-Point Queries.” In: Proceedings of the tenth annual symposium on Computational Geometry. doi:10.1145/177424.177609.
- 17 Kenneth L. Clarkson. (1997). “Nearest neighbor queries in metric spaces.” In: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. doi:10.1145/258533.258655
- 18 Scott Cost, Steven Salzberg, A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, v.10 n.1, p.57-78, Jan. 1993 doi:10.1023/A:1022664626993
- 19 T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. {EEE Transactions on Information The. ory, 13(1967):21-27.
- 20 $. Deerwester, S. T. Dumais, T.K. Landaner, G.W. Furhas, and R.A. Harshman. Indexing by latent semantic analysis. Journal of the Society fo~ Information Sci. ences, 41(1990):391-407.
- 21 L. Devroye and T.J. Wagner, Nearest neighbor methods in discrimination. In: Handbook of Statistics, vol. 2, P.R. Krishnaiah and L.N. Kanal, eds., North-Holland, 1982.
- 22 D. Dobkin and R. Lipton. Multidimensional search problems. SIAM Journal on Computing, 5(1976):181- 186.
- 23 Danny Dolev, Yuval Harari, Nathan Linial, Noam Nisan, Michal Parnas, Neighborhood preserving hashing and approximate queries, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.251-259, January 23-25, 1994, Arlington, Virginia, United States
- 24 D. Dolev, Y. Harari, and M. Parnas. Finding the neighborhood of a query in a dictionary. In: Proceedings of the ~nd Israel Symposium on Theory and Computing Systems, 1993, pp. 33-42.
- 25 Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2000
- 26 Herbert Edelsbrunner. (1987). “Algorithms in Combinatorial Geometry. Springer-Verlag
- 27 David Eppstein, Fast hiearchical clustering and other applications of dynamic closet pairs, Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, p.619-628, January 25-27, 1998, San Francisco, California, United States
- 28 Christos Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994 doi:10.1007/BF00962238
- 29 W. Feller. An introduction to Probability Theory and its Applications. John Wiley & Sons, NY, 1991.
- 30 Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, Peter Yanker, Query by Image and Video Content: The QBIC System, Computer, v.28 n.9, p.23-32, September 1995 doi:10.1109/2.410146
- 31 William B. Frakes, Ricardo Baeza-Yates, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
- 32 P. Frankl, H. Maehara, The Johnson-Lindenstrauss Lemma and the sphericity of some graphs, Journal of Combinatorial Theory Series A, v.44 n.3, p.355-362, June 1987
- 33 Michael L. Fredman, János Komlós, Endre Szemerédi, Storing a Sparse Table with 0(1) Worst Case Access Time, Journal of the ACM (JACM), v.31 n.3, p.538-544, July 1984 doi:10.1145/828.1884
- 34 Jerome H. Freidman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software (TOMS), v.3 n.3, p.209-226, Sept. 1977 doi:10.1145/355744.355745
- 35 Allen Gersho, Robert M. Gray, Vector quantization and signal compression, Kluwer Academic Publishers, Norwell, MA, 1991
- 36 A. Gionis, P. Indyk, and Rajeev Motwani. Similarity Search in High Dimensions via Hashing. Manuscript, 1997.
- 37 D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. In: Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 722-731.
- 38 Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. In: First InternaSonal Conference on Knowledge Discovery f.4 Data Mining, 1995, pp. 142-149.
- 39 H. HoteUing. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27( 1933):417-441.
- 40 P. Indyk, Rajeev Motwani, and S. Venkatasubramanian. Geometric Matching Under Noise- Combinatorial Bounds and Algorithms. Manuscript, 1997.
- 41 W.B. Johnson and J. Lindenstrauss. Extensions of Lipshitz mapping into Hilbert space. Contemporary Mathematics, 26(1984):189-206.
- 42 W.B. Jotmson and G. Schechtman. Embedding l~~ into l{'. Acta Mathematica, 149(1982):71-85.
- 43 K. Karhunen. Uber lineare Methoden in dew Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Set. A137, 1947.
- 44 V. Koivune and S. Kassam. Nearest neighbor filters for multivariate data. IEEE l'~rkshop on Nonlinear Signal and Image Processing, 1995.
- 45 Jon Kleinberg, Two algorithms for nearest-neighbor search in high dimensions, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.599-608, May 04-06, 1997, El Paso, Texas, United States doi:10.1145/258533.258653
- 46 Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.614-623, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276877
- 47 R. M. Karp, O. Waarts, G. Zweig, The bit vector intersection problem, Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS'95), p.621, October 23-25, 1995
- 48 N. Linial, E. London, and Y. Rabinovich. The geomctry of graphs and some of its algorithmic appllcation~. In: Proceedings of 85th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 577-591.
- 49 M. Loire. Fonctions aleastoires de second ordere. Processus Stochastiques et mouvcment Brownian, Hermann, Paris, 1948.
- 50 Jirí Matoušek, Reporting points in halfspaces, Computational Geometry: Theory and Applications, v.2 n.3, p.169-186, Nov. 1992 doi:10.1016/0925-7721(92)90006-E
- 51 S. Meiser, Point location in arrangements of hyperplanes, Information and Computation, v.106 n.2, p.286-303, Oct. 1993 doi:10.1006/inco.1993.1057
- 52 Nimrod Megiddo, Applying Parallel Computation Algorithms in the Design of Serial Algorithms, Journal of the ACM (JACM), v.30 n.4, p.852-865, Oct. 1983 doi:10.1145/2157.322410
- 53 M. Minsky and S. Papert. Perceptrona. MIT Prcs~, Cambridge, MA, 1969.
- 54 Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
- 55 A. Pentland, R.W. Picard, and S. Sdaroff. Photobook: tools for content-based manipulation of image databases. In: Proceedings of the SPiE Confercncc on Storage and Retrieval of Image and Vidco Databases II, 1994.
- 56 G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge Universit4' Press, 1989.
- 57 Gerard M. Salton, Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986
- 58 Hanan Samet, The design and analysis of spatial data structures, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1990
- 59 Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos, Multidimensional Access Methods: Trees Have Grown Everywhere, Proceedings of the 23rd International Conference on Very Large Data Bases, p.13-14, August 25-29, 1997
- 60 A.W.M. Smeulders and R. JaJn, eds. Image Databaaca and Multi-media Search. Proceedings of the First International Workshop, IDB-MIvIS '96, Amsterdam Unlversky Press, Amsterdam, 1996.
- 61 J.K. Uhlm~. Satisfying General Pro.,dmity/Similarit.y Queries with Metric Trees. Information ProccaMng Letters, 40(1991):175-179.
- 62 Peter N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.311-321, January 25-27, 1993, Austin, Texas, United States
- 63 T. A. Welch, BOUNDS ON INFORMATION RETRIEVAL EFFICIENCY IN STATIC FILE STRUCTURES., Massachusetts Institute of Technology, Cambridge, MA, 1971
- 64 A C Yao, F F Yao, A general approach to d-dimensional geometric queries, Proceedings of the seventeenth annual ACM symposium on Theory of computing, p.163-168, May 06-08, 1985, Providence, Rhode Island, United States doi:10.1145/22145.22163,
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
1998 ApproximateNearestNeighbors | Rajeev Motwani Piotr Indyk | Approximate Nearest Neighbors: Towards removing the curse of dimensionality | Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing | http://dx.doi.org/10.1145/276698.276876 | 10.1145/276698.276876 | 1998 |