2023 SurveyofVectorDatabaseManagemen
- (Pan, Wang et al., 2023) ⇒ James Jie Pan, Jianguo Wang, and Guoliang Li. (2023). “Survey of Vector Database Management Systems.” doi:10.48550/arXiv.2310.14021
Subject Headings: Vector DBMS Platform, Vector DBMS.
Notes
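- It surveys the more than 20 commercial Vector Database Management Systems (VDBMSs) produced within the past five years, and identifies five main obstacles to vector data management: vagueness of semantic similarity, large vector size, high cost of similarity comparison, lack of natural partitioning for indexing, and difficulty of efficiently answering hybrid queries that involve both attributes and vectors.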
- It details the process of query processing in VDBMSs, discussing aspects like similarity scores, query types, and interfaces, along with the complexities of basic search query operators.
- It outlines storage and indexing strategies used in VDBMSs, including vector compression by quantization, partitioning techniques (randomization, learned partitioning, and navigable partitioning), and index types such as table-based, tree-based, and graph-based indexes.
- It delves into the optimization and execution aspects of VDBMSs, explaining plan enumeration, selection, hybrid operators for predicated queries, and the utilization of hardware acceleration and distributed search techniques.
- It classifies current VDBMSs into categories such as native, extended, and search engines/libraries, analyzing their design and runtime characteristics to highlight each type's strengths.
- It acknowledges the importance of benchmarks in evaluating VDBMSs, but it doesn't provide an in-depth analysis of specific benchmarks, suggesting an area for future exploration.
- It summarizes the existing challenges in vector data management and points towards potential directions for future research, highlighting the need for comprehensive studies and advanced techniques in this evolving field.
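To make the notes on similarity scores and predicated (hybrid) queries concrete, here is a minimal sketch, not taken from the paper. It assumes an exact linear-scan search over a hypothetical toy collection. The names `knn`, `hybrid_pre_filter`, and `hybrid_post_filter` are illustrative and are not the survey's operator names. The sketch contrasts applying an attribute predicate before the vector search with applying it afterwards.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (higher means more similar)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """L2 distance between a and b (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, items, k):
    """Exact k-nearest-neighbor search by cosine similarity (linear scan)."""
    ranked = sorted(
        items,
        key=lambda it: cosine_similarity(query, it["vec"]),
        reverse=True,
    )
    return ranked[:k]

def hybrid_pre_filter(query, items, predicate, k):
    """Apply the attribute predicate first, then search the survivors."""
    return knn(query, [it for it in items if predicate(it)], k)

def hybrid_post_filter(query, items, predicate, k):
    """Search first, then drop results that fail the predicate.
    This may return fewer than k results."""
    return [it for it in knn(query, items, k) if predicate(it)]

def is_english(item):
    """Hypothetical attribute predicate used for the hybrid query."""
    return item["lang"] == "en"

# Hypothetical toy collection: 2-dimensional vectors with one attribute.
items = [
    {"id": 1, "vec": [1.0, 0.0], "lang": "en"},
    {"id": 2, "vec": [0.9, 0.1], "lang": "de"},
    {"id": 3, "vec": [0.0, 1.0], "lang": "en"},
    {"id": 4, "vec": [0.7, 0.7], "lang": "en"},
]
query = [1.0, 0.0]

pre = [it["id"] for it in hybrid_pre_filter(query, items, is_english, k=2)]
post = [it["id"] for it in hybrid_post_filter(query, items, is_english, k=2)]
print(pre)   # [1, 4]
print(post)  # [1]
```

On this toy data, filtering first returns two matching items. Filtering last returns only one, because a non-matching item took one of the top-2 slots. This trade-off between result completeness and search cost is one reason hybrid queries are treated as a distinct problem.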
Cited By
Quotes
Abstract
There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search a staggering half century and more. Driving this shift from algorithms to systems are new data intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs, however there is no comprehensive survey to thoroughly review these techniques and systems. We start by identifying five main obstacles to vector data management, namely vagueness of semantic similarity, large size of vectors, high cost of similarity comparison, lack of natural partitioning that can be used for indexing, and difficulty of efficiently answering hybrid queries that require both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning based on randomization, learning partitioning, and navigable partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, and hardware accelerated execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including native systems specialized for vectors and extended systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally we outline research challenges and point the direction for future work.
References
| | Author | title | doi | year |
|---|---|---|---|---|
| 2023 SurveyofVectorDatabaseManagemen | Guoliang Li, James Jie Pan, Jianguo Wang | Survey of Vector Database Management Systems | 10.48550/arXiv.2310.14021 | 2023 |