optimization and execution, including plan enumeration and selection for hybrid query plans and hardware-accelerated search. For the storage manager, several techniques for large-scale vector indexing and vector storage are now available, including hashing and quantization-based approaches like PQ [1] and IVFADC [49] that tend to be easy to update; tree-based indexes like FLANN [62] and ANNOY [2] that tend to support logarithmic search complexity; and graph-based indexes like HNSW [58] and others that are efficient in practice but with less theoretical understanding. Together, these techniques culminate in a variety of native systems, extended systems, and search engines and libraries over a spectrum of performance and accuracy characteristics and capabilities.
In this tutorial, we review techniques and systems for vector
data management along with benchmarks, followed by remaining
challenges and open problems.
Tutorial Overview. This tutorial contains three parts and the intended length is 1.5 hours. The first part discusses specific techniques and will last 50 minutes.
(1) Query Processing (10 min.). Query processing begins with a search specification that defines the search parameters. These include considerations for similarity score design, score selection and the curse of dimensionality [22, 30, 61], the query type such as basic, multi-vector, and hybrid queries [79, 84], and the query interface.
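To make the query types concrete, below is a minimal sketch of a top-k search over a toy in-memory collection, with an optional attribute predicate standing in for a hybrid query. The function name search and the attr_filter parameter are illustrative assumptions, not the API of any particular VDBMS.

    import numpy as np

    def search(vectors, attrs, q, k, attr_filter=None):
        # Score every vector against the query; inner product as the basic score.
        scores = vectors @ q
        if attr_filter is not None:
            # Hybrid query: mask out entities whose attributes fail the predicate.
            mask = np.array([attr_filter(a) for a in attrs])
            scores = np.where(mask, scores, -np.inf)
        # Return ids of the k highest-scoring vectors.
        # (If fewer than k entities pass the filter, filtered-out ids can
        # appear at the tail; acceptable for a sketch.)
        return np.argsort(-scores)[:k]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))
    A = [{"price": float(p)} for p in rng.uniform(0, 100, 1000)]
    q = rng.normal(size=64)
    print(search(X, A, q, k=5))                                         # basic query
    print(search(X, A, q, k=5, attr_filter=lambda a: a["price"] < 20))  # hybrid query

A multi-vector query would issue several such searches and merge their per-vector scores with an aggregate function, as discussed in Section 2.1.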
(2) Storage and Indexing (30 min.). Vector indexes tend to rely on novel partitioning techniques such as randomization, learned partitioning, and navigable partitioning. The performance, accuracy, and storage characteristics of an index depend on the techniques used and the index structure. We classify indexes as table-, tree-, or graph-based, and then describe the techniques for each index type. To deal with large collections, we also discuss vector compression using quantization [49, 59] and disk-resident indexes [32, 74]; a minimal quantization sketch follows.
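As an illustration of the quantization idea, the following is a sketch of product quantization with asymmetric distance computation, assuming plain numpy and a naive k-means; production systems such as IVFADC [49] pair this with an inverted file and far more careful training.

    import numpy as np

    def train_pq(X, m=8, ksub=256, iters=10, seed=0):
        # Learn one codebook of ksub centroids per subspace (requires len(X) >= ksub).
        n, d = X.shape
        dsub = d // m
        rng = np.random.default_rng(seed)
        books = []
        for j in range(m):
            sub = X[:, j * dsub:(j + 1) * dsub]
            C = sub[rng.choice(n, ksub, replace=False)]  # random init
            for _ in range(iters):  # naive Lloyd iterations
                assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
                for c in range(ksub):
                    pts = sub[assign == c]
                    if len(pts):
                        C[c] = pts.mean(axis=0)
            books.append(C)
        return books

    def encode(X, books):
        # Compress each vector to m one-byte codes (nearest centroid per subspace).
        m, dsub = len(books), books[0].shape[1]
        codes = np.empty((len(X), m), dtype=np.uint8)
        for j, C in enumerate(books):
            sub = X[:, j * dsub:(j + 1) * dsub]
            codes[:, j] = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
        return codes

    def adc(q, codes, books):
        # Asymmetric distances: the query stays exact, the data stays compressed.
        m, dsub = len(books), books[0].shape[1]
        tables = [((C - q[j * dsub:(j + 1) * dsub]) ** 2).sum(-1)
                  for j, C in enumerate(books)]  # one lookup table per subspace
        return sum(tables[j][codes[:, j]] for j in range(m))

With m = 8 and ksub = 256, each 64-dimensional float32 vector (256 bytes) compresses to 8 one-byte codes, a 32x reduction.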
(3) Optimization and Execution (10 min.). For processing hybrid queries, several plan enumeration and plan selection techniques have been proposed, including rule-based [3, 4] and cost-based selection [79, 84]; these also introduce new hybrid operators, including block-first scan [43, 79, 84, 87] and visit-first scan [43, 87]. Several techniques also speed up vector search via hardware acceleration using SIMD [26, 27], GPUs [50], and distributed search. To address slow index updates, some VDBMSs also rely on out-of-place updates.
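The sketch below contrasts the two hybrid-scan strategies on a toy example, using a brute-force score ordering as a stand-in for a real index traversal; the function names and the fixed visit budget are simplifying assumptions.

    import numpy as np

    def block_first_scan(vectors, attrs, q, k, pred):
        # Evaluate the predicate first, then search only within the passing block.
        ids = np.array([i for i, a in enumerate(attrs) if pred(a)])
        scores = vectors[ids] @ q
        return ids[np.argsort(-scores)[:k]]

    def visit_first_scan(vectors, attrs, q, k, pred, budget=200):
        # Visit candidates in similarity order, keeping those that pass the predicate.
        order = np.argsort(-(vectors @ q))  # stand-in for a graph/IVF traversal
        hits = []
        for i in order[:budget]:
            if pred(attrs[i]):
                hits.append(i)
                if len(hits) == k:
                    break
        return np.array(hits)

Block-first scans pay the predicate cost up front but search a smaller set, which favors selective predicates; visit-first scans favor permissive predicates, since most visited candidates pass.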
The second part discusses vector database management systems
and will last 30 minutes.
(1) Native Systems (10 min). Native systems such as Pinecone [5], Milvus [6, 79], and Manu [45] aim at high-performance vector search applications by offering a narrower range of capabilities.
(2) Extended Systems (10 min). For applications that require more sophisticated capabilities, several extended systems such as AnalyticDB-V [84], PASE [90], pgvector [7], and Vespa [4] have been developed based on NoSQL or relational systems.
(3) Search Engines and Libraries (5 min). Several search engines such as Apache Lucene [8] and Elasticsearch [9] now also incorporate vector search capability via integrated vector indexes. Several libraries are also available, such as Meta Faiss [1], that provide vector search functionality (see the usage sketch after this list).
(4) Benchmarks (5 min). We describe two notable benchmarks that
evaluate a wide variety of search algorithms and systems over a
range of workloads [29, 91].
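For concreteness, the following is a small usage sketch of the Faiss library mentioned above, based on its public Python API (the faiss-cpu package); exact calls may vary across versions.

    import numpy as np
    import faiss

    d = 64
    xb = np.random.random((10000, d)).astype('float32')  # database vectors
    xq = np.random.random((5, d)).astype('float32')      # query vectors

    index = faiss.IndexFlatL2(d)  # exact (brute-force) L2 index
    index.add(xb)                 # index the database vectors
    D, I = index.search(xq, 4)    # distances and ids of the 4 nearest neighbors
    print(I)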
The final part discusses challenges and open problems (10 min.). We describe several fundamental challenges, including how to perform similarity score selection, design more efficient hybrid operators and indexes, and estimate the cost of hybrid plans. We also describe future applications, including index-supported incremental search, multi-vector search, and enhancing security and privacy.
Target Audience. This tutorial is intended for database researchers interested in understanding and advancing the state-of-the-art techniques for large-scale vector database management and modern applications beyond similarity search. This tutorial may also benefit industry practitioners interested in learning about the latest commercial systems. There are no prerequisites beyond a basic understanding of database concepts.
Related Tutorials. A recent tutorial [28] discusses how vector search can be used for retrieval-based LLMs. There are also separate tutorials on similarity search techniques [37, 67, 68].
Our tutorial aims to complement these tutorials by focusing on vector database management systems as a whole, and most of this tutorial has not been covered elsewhere. Specifically, most of the overlap is confined to Section 2.2, and the extent is not large.
In [37], a broad taxonomy of search techniques is given, along with representative examples. Similarly, in [67, 68], an overview of various exact and approximate search techniques is given. Some of the material in Section 2.2, mainly locality-sensitive hashing, learning-to-hash, ANNOY, and HNSW, overlaps with these past tutorials. But Section 2.2 also discusses key indexing trends in VDBMSs that have not been discussed in past tutorials, including disk-based indexes, quantization-based compression approaches for handling large vector collections, and the diversity of graph-based indexes. Aside from this section, all other sections in this tutorial have not been covered elsewhere, as they pertain to the VDBMS as a whole, including query processing, hybrid operators, plan enumeration and selection, and a survey of existing VDBMSs.
2 TUTORIAL
2.1 Query Processing
Similarity Scores. Dense retrieval is based on similarity: a similarity score quantifies the degree of similarity between two feature vectors. While many scores have been proposed, different scores may lead to different query results, and so how to perform score selection is an important problem for VDBMSs, in addition to score design.
(1) Score Design. A similarity score is designed to accurately capture similarity relationships between feature vectors. Existing similarity scores can be classified as basic scores, aggregate scores, and learned scores. Basic scores are derived directly from the vector space and include Hamming distance, inner product, cosine angle, Minkowski distance, and Mahalanobis distance. For certain workloads involving multiple query or feature vectors per entity, aggregate scores such as mean, weighted sum, and others [79] combine multiple scores into a single scalar score that can be more easily compared.
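Below is a minimal sketch of the basic scores listed above, plus a weighted-sum aggregate, assuming numpy; whether larger or smaller values mean "more similar" varies by score and by system.

    import numpy as np

    def hamming(a, b):                 # basic score over binary codes
        return int(np.count_nonzero(a != b))

    def inner_product(a, b):
        return float(a @ b)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def minkowski(a, b, p=2):          # p = 2 gives Euclidean distance
        return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

    def mahalanobis(a, b, sigma_inv):  # sigma_inv: inverse covariance matrix
        d = a - b
        return float(np.sqrt(d @ sigma_inv @ d))

    def weighted_sum(scores, weights): # aggregate: one scalar from many scores
        return float(np.dot(scores, weights))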