69:2 Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng
1 Introduction
In recent years, various machine learning models, e.g.,
word2vec
[
33
,
35
],
doc2vec
[
25
], and
node2vec
[
16
], have been developed to eectively represent real-world objects such as images,
documents, and graphs as high-dimensional feature vectors. In the meanwhile, real-world objects
are often associated with structured attributes/elds such as timestamps, prices, and quantities. In
many scenarios, the feature vectors and the structured attributes of the objects need to be jointly
queried, as illustrated in the motivation examples below.
Example 1: Product Search. On e-commerce platforms like Amazon, a customer may search for
a wardrobe with conditions on price (less than $200) and style (visually similar to the one in an
image). The condition on the style can be formulated as an approximate nearest neighbor search
(ANNS) over the image feature vectors. However, the search should be conducted not over all the
wardrobes but over a subset of the wardrobes whose prices are less than $200.
Example 2: Vehicle Search. Consider a trac camera that monitors cars passing on a state
highway. The camera detects each car in the video stream, extracts a feature vector to represent it,
and stores the feature vector along with its corresponding timestamp in a database. When a query
arrives with a specic car image and a specied time interval, the query processing performs an
ANNS on the feature vectors within the specied time interval.
The above queries can be formulated as the range-ltering approximate nearest neighbor search
queries. Consider a collection of objects where each object is represented as a pair of a data vector
(i.e., feature vector) and an attribute value whose domain has a total order. A range-ltering ANNS
query consists of a query vector and a query range. The query reports the approximate nearest
neighbors of the query vector among all the data vectors whose attribute values fall in the query
range. The scenarios of range-ltering ANNS can be found in many applications such as face
recognition [15], person/vehicle re-identication [27, 47], and recommendation [39].
Existing methods for range-ltering ANNS rely on two simple strategies, ANNS-rst and range-
rst [
40
]. ANNS-rst builds an ANNS index over all the data vectors during the oine indexing
phase. In the online querying phase, when a range-ltering ANNS query arrives, it progressively
nds data vectors nearest to the query vector using the ANNS index. It stops when a data vector
whose attribute value falls in the query range. However, ANNS-rst is very slow especially when the
query range is small: the portion of the data vectors falling into a small query range is small and it
is thus hard to meet one in the ANNS. On the other hand, the range-rst strategy lters the data
vector with the query range rst based on their attribute values. Since there is no ANNS index over
the remaining data vectors available, it can only linearly scan all the data vectors in the query range
to nd the nearest neighbor, which is rather inecient, especially when the query range is large.
Hierarchical Navigable Small World graph (HNSW) [
30
] is an ecient index for ANNS query,
which has been widely used in both academia and industry. For example, the multi-modal feature
vector dataset LAION-5B, which contains 5.85 billion vectors [
37
], comes with an HNSW index for
ecient ANNS. Lucene 9.0 implements HNSW to support ANNS [
1
]. Moreover, HNSW index also
serves as the backbone of various vector databases such as Weaviate [
3
], Zilliz [
4
], and Pinecone [
2
].
Using HNSW as a building block, one can build one ANNS index (HNSW) for every possible query
range: when a range-ltering ANNS query arrives, use the corresponding ANNS index of the query
range to nd and report a set of approximate nearest neighbors of the query vector. This solution
is query-ecient, yet it faces the challenge of the excessively large index size and long index time:
the total number of possible query ranges is quadratic to the number of data vectors.
To address the above challenges, for a set of
𝑛
data vectors, we rst consider half-b ounded range
queries where the query range includes all the attribute values smaller than a user-provided value.
Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 69. Publication date: February 2024.
评论