optimization and execution, including plan enumeration and selection for hybrid query plans and hardware-accelerated search. For the storage manager, several techniques for large-scale vector indexing and vector storage are now available, including hashing and quantization-based approaches like PQ [1] and IVFADC [49] that tend to be easy to update; tree-based indexes like FLANN [62] and ANNOY [2] that tend to support logarithmic search complexity; and graph-based indexes like HNSW [58] and others that are efficient in practice but with less theoretical understanding. Together, these techniques culminate in a variety of native systems, extended systems, and search engines and libraries over a spectrum of performance and accuracy characteristics and capabilities.
In this tutorial, we review techniques and systems for vector
data management along with benchmarks, followed by remaining
challenges and open problems.
Tutorial Overview. This tutorial contains three parts and the intended length is 1.5 hours. The first part discusses specific techniques and will last 50 minutes.
(1) Query Processing (10 min.). Query processing begins with a search specification that defines the search parameters. These include considerations for similarity score design, score selection and the curse of dimensionality [22, 30, 61], the query type such as basic, multi-vector, and hybrid queries [79, 84], and the query interface.
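To make the query types concrete, below is a minimal sketch of a top-k search over a toy in-memory collection, with an optional attribute predicate standing in for a hybrid query. The function name search and the attr_filter parameter are illustrative assumptions, not the API of any particular VDBMS.

    import numpy as np

    def search(vectors, attrs, q, k, attr_filter=None):
        # Score every vector against the query; inner product as the basic score.
        scores = vectors @ q
        if attr_filter is not None:
            # Hybrid query: mask out entities whose attributes fail the predicate.
            mask = np.array([attr_filter(a) for a in attrs])
            scores = np.where(mask, scores, -np.inf)
        # Return ids of the k highest-scoring vectors.
        # (If fewer than k entities pass the filter, filtered-out ids can
        # appear at the tail; acceptable for a sketch.)
        return np.argsort(-scores)[:k]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))
    A = [{"price": float(p)} for p in rng.uniform(0, 100, 1000)]
    q = rng.normal(size=64)
    print(search(X, A, q, k=5))                                         # basic query
    print(search(X, A, q, k=5, attr_filter=lambda a: a["price"] < 20))  # hybrid query

A multi-vector query would issue several such searches and merge their per-vector scores with an aggregate function, as discussed in Section 2.1.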
(2) Storage and Indexing (30 min.). Vector indexes tend to rely on novel partitioning techniques such as randomization, learned partitioning, and navigable partitioning. The performance, accuracy, and storage characteristics of an index depend on the techniques used and the index structure. We classify indexes as table-, tree-, or graph-based, and then describe the techniques for each index type. To deal with large collections, we also discuss vector compression using quantization [49, 59] and disk-resident indexes [32, 74]; a minimal quantization sketch follows.
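As an illustration of the quantization idea, the following is a sketch of product quantization with asymmetric distance computation, assuming plain numpy and a naive k-means; production systems such as IVFADC [49] pair this with an inverted file and far more careful training.

    import numpy as np

    def train_pq(X, m=8, ksub=256, iters=10, seed=0):
        # Learn one codebook of ksub centroids per subspace (requires len(X) >= ksub).
        n, d = X.shape
        dsub = d // m
        rng = np.random.default_rng(seed)
        books = []
        for j in range(m):
            sub = X[:, j * dsub:(j + 1) * dsub]
            C = sub[rng.choice(n, ksub, replace=False)]  # random init
            for _ in range(iters):  # naive Lloyd iterations
                assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
                for c in range(ksub):
                    pts = sub[assign == c]
                    if len(pts):
                        C[c] = pts.mean(axis=0)
            books.append(C)
        return books

    def encode(X, books):
        # Compress each vector to m one-byte codes (nearest centroid per subspace).
        m, dsub = len(books), books[0].shape[1]
        codes = np.empty((len(X), m), dtype=np.uint8)
        for j, C in enumerate(books):
            sub = X[:, j * dsub:(j + 1) * dsub]
            codes[:, j] = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
        return codes

    def adc(q, codes, books):
        # Asymmetric distances: the query stays exact, the data stays compressed.
        m, dsub = len(books), books[0].shape[1]
        tables = [((C - q[j * dsub:(j + 1) * dsub]) ** 2).sum(-1)
                  for j, C in enumerate(books)]  # one lookup table per subspace
        return sum(tables[j][codes[:, j]] for j in range(m))

With m = 8 and ksub = 256, each 64-dimensional float32 vector (256 bytes) compresses to 8 one-byte codes, a 32x reduction.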
(3) Optimization and Execution (10 min.). For processing hybrid queries, several plan enumeration and plan selection techniques have been proposed, including rule-based [3, 4] and cost-based selection [79, 84]; these also introduce new hybrid operators, including block-first scan [43, 79, 84, 87] and visit-first scan [43, 87]. Several techniques also speed up vector search via hardware acceleration using SIMD [26, 27], GPUs [50], and distributed search. To address slow index updates, some VDBMSs also rely on out-of-place updates.
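The sketch below contrasts the two hybrid-scan strategies on a toy example, using a brute-force score ordering as a stand-in for a real index traversal; the function names and the fixed visit budget are simplifying assumptions.

    import numpy as np

    def block_first_scan(vectors, attrs, q, k, pred):
        # Evaluate the predicate first, then search only within the passing block.
        ids = np.array([i for i, a in enumerate(attrs) if pred(a)])
        scores = vectors[ids] @ q
        return ids[np.argsort(-scores)[:k]]

    def visit_first_scan(vectors, attrs, q, k, pred, budget=200):
        # Visit candidates in similarity order, keeping those that pass the predicate.
        order = np.argsort(-(vectors @ q))  # stand-in for a graph/IVF traversal
        hits = []
        for i in order[:budget]:
            if pred(attrs[i]):
                hits.append(i)
                if len(hits) == k:
                    break
        return np.array(hits)

Block-first scans pay the predicate cost up front but search a smaller set, which favors selective predicates; visit-first scans favor permissive predicates, since most visited candidates pass.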
The second part discusses vector database management systems
and will last 30 minutes.
(1) Native Systems (10 min). Native systems such as Pinecone [5], Milvus [6, 79], and Manu [45] aim at high-performance vector search applications by offering a narrower range of capabilities.
(2) Extended Systems (10 min). For applications that require more sophisticated capabilities, several extended systems such as AnalyticDB-V [84], PASE [90], pgvector [7], and Vespa [4] have been developed based on NoSQL or relational systems.
(3) Search Engines and Libraries (5 min). Several search engines such as Apache Lucene [8] and Elasticsearch [9] now also incorporate vector search capability via integrated vector indexes. Several libraries are also available, such as Meta Faiss [1], that provide vector search functionality (see the usage sketch after this list).
(4) Benchmarks (5 min). We describe two notable benchmarks that
evaluate a wide variety of search algorithms and systems over a
range of workloads [29, 91].
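For concreteness, the following is a small usage sketch of the Faiss library mentioned above, based on its public Python API (the faiss-cpu package); exact calls may vary across versions.

    import numpy as np
    import faiss

    d = 64
    xb = np.random.random((10000, d)).astype('float32')  # database vectors
    xq = np.random.random((5, d)).astype('float32')      # query vectors

    index = faiss.IndexFlatL2(d)  # exact (brute-force) L2 index
    index.add(xb)                 # index the database vectors
    D, I = index.search(xq, 4)    # distances and ids of the 4 nearest neighbors
    print(I)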
The final part discusses challenges and open problems (10 min.). We describe several fundamental challenges, including how to perform similarity score selection, design more efficient hybrid operators and indexes, and estimate the cost of hybrid plans. We also describe future applications, including index-supported incremental search, multi-vector search, and enhancing security and privacy.
Target Audience. This tutorial is intended for database researchers interested in understanding and advancing the state-of-the-art techniques for large-scale vector database management and modern applications beyond similarity search. This tutorial may also benefit industry practitioners interested in learning about the latest commercial systems. There are no prerequisites beyond a basic understanding of database concepts.
Related Tutorials. A recent tutorial [28] discusses how vector search can be used for retrieval-based LLMs. There are also separate tutorials on similarity search techniques [37, 67, 68].
Our tutorial aims to complement these tutorials by focusing on vector database management systems as a whole, and most of this tutorial has not been covered elsewhere. Specifically, most of the overlap is confined to Section 2.2, and the extent is not large.
In [37], a broad taxonomy of search techniques is given, along with representative examples. Similarly, in [67, 68], an overview of various exact and approximate search techniques is given. Some of the material in Section 2.2, mainly locality-sensitive hashing, learning-to-hash, ANNOY, and HNSW, overlaps with these past tutorials. But Section 2.2 also discusses key indexing trends in VDBMSs that have not been discussed in past tutorials, including disk-based indexes, quantization-based compression approaches for handling large vector collections, and the diversity of graph-based indexes. Aside from this section, all other sections in this tutorial have not been covered elsewhere, as they pertain to the VDBMS as a whole, including query processing, hybrid operators, plan enumeration and selection, and a survey of existing VDBMSs.
2 TUTORIAL
2.1 Query Processing
Similarity Scores. Dense retrieval is based on similarity: a similarity score quantifies the degree of similarity between two feature vectors. While many scores have been proposed, different scores may lead to different query results, and so how to perform score selection is an important problem for VDBMSs, in addition to score design.
(1) Score Design. A similarity score is designed to accurately capture similarity relationships between feature vectors. Existing similarity scores can be classified as basic scores, aggregate scores, and learned scores. Basic scores are derived directly from the vector space and include Hamming distance, inner product, cosine angle, Minkowski distance, and Mahalanobis distance. For certain workloads involving multiple query or feature vectors per entity, aggregate scores such as mean, weighted sum, and others [79] combine multiple scores into a single scalar score that can be more easily compared.
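Below is a minimal sketch of the basic scores listed above, plus a weighted-sum aggregate, assuming numpy; whether larger or smaller values mean "more similar" varies by score and by system.

    import numpy as np

    def hamming(a, b):                 # basic score over binary codes
        return int(np.count_nonzero(a != b))

    def inner_product(a, b):
        return float(a @ b)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def minkowski(a, b, p=2):          # p = 2 gives Euclidean distance
        return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

    def mahalanobis(a, b, sigma_inv):  # sigma_inv: inverse covariance matrix
        d = a - b
        return float(np.sqrt(d @ sigma_inv @ d))

    def weighted_sum(scores, weights): # aggregate: one scalar from many scores
        return float(np.dot(scores, weights))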