SIGMOD-2024-SeRF Segment Graph for Range-Filtering Approximate Nearest Neighbor Search_阿里云.pdf

赤井秀一

1034

26页

0次

2024-06-24

免费下载

SeRF: Segment Graph for Range-Filtering Approximate

Nearest Neighbor Search

CHAOJI ZUO

∗

, Rutgers University, USA

MIAO QIAO, The University of Auckland, New Zealand

WENCHAO ZHOU, Alibaba Group, China

FEIFEI LI, Alibaba Group, China

DONG DENG

†

, Rutgers University, USA

Eective vector representation models, e.g., word2vec and node2vec, embed real-world objects such as images

and documents in high dimensional vector space. In the meanwhile, the objects are often associated with

attributes such as timestamps and prices. Many scenarios need to jointly query the vector representations

of the objects together with their attributes. These queries can be formalized as range-ltering approximate

nearest neighbor search (ANNS) queries. Specically, given a collection of data vectors, each associated with

an attribute value whose domain has a total order. The range-ltering ANNS consists of a query range and

a query vector. It nds the approximate nearest neighbors of the query vector among all the data vectors

whose attribute values fall in the query range. Existing approaches suer from a rapidly degrading query

performance when the query range width shifts. The query performance can be optimized by a solution

that builds an ANNS index for every possible query range; however, the index time and index size become

prohibitive – the number of query ranges is quadratic to the number

𝑛

of data vectors. To overcome these

challenges, for the query range contains all attribute values smaller than a user-provided threshold, we design

a structure called the segment graph whose index time and size are the same as a single ANNS index, yet can

losslessly compress the

𝑛

ANNS indexes, reducing the indexing cost by a factor of

Ω(𝑛)

. To handle general

range queries, we propose a 2D segment graph with average-case index size

𝑂 (𝑛 log𝑛)

to compress

𝑛

segment

graphs, breaking the quadratic barrier. Extensive experiments conducted on real-world datasets show that our

proposed structures outperformed existing methods signicantly; our index also exhibits superior scalability.

CCS Concepts: • Information systems

→

Nearest-neighbor search; • Theory of computation

→

Nearest

neighbor algorithms.

Additional Key Words and Phrases: Similarity Search, Multi-Model Search, High-Dimensional Vector

ACM Reference Format:

Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. 2024. SeRF: Segment Graph for Range-

Filtering Approximate Nearest Neighbor Search. Proc. ACM Manag. Data 2, 1 (SIGMOD), Article 69 (Febru-

ary 2024), 26 pages. https://doi.org/10.1145/3639324

∗

Work performed during an internship at Alibaba Group

†

Corresponding author

Authors’ addresses: Chaoji Zuo, Rutgers University, Piscataway, USA, chaoji.zuo@rutgers.edu; Miao Qiao, The University

of Auckland, Auckland, New Zealand, miao.qiao@auckland.ac.nz; Wenchao Zhou, Alibaba Group, Hangzhou, China,

zwc231487@alibaba-inc.com; Feifei Li, Alibaba Group, Hangzhou, China, lifeifei@alibaba-inc.com; Dong Deng, Rutgers

University, Piscataway, USA, dong.deng@rutgers.edu.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee

provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the

full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires

prior specic permission and/or a fee. Request permissions from permissions@acm.org.

ACM 2836-6573/2024/2-ART69

https://doi.org/10.1145/3639324

Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 69. Publication date: February 2024.

69:2 Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng

1 Introduction

In recent years, various machine learning models, e.g.,

word2vec

[

doc2vec

[

], and

node2vec

[

], have been developed to eectively represent real-world objects such as images,

documents, and graphs as high-dimensional feature vectors. In the meanwhile, real-world objects

are often associated with structured attributes/elds such as timestamps, prices, and quantities. In

many scenarios, the feature vectors and the structured attributes of the objects need to be jointly

queried, as illustrated in the motivation examples below.

Example 1: Product Search. On e-commerce platforms like Amazon, a customer may search for

a wardrobe with conditions on price (less than $200) and style (visually similar to the one in an

image). The condition on the style can be formulated as an approximate nearest neighbor search

(ANNS) over the image feature vectors. However, the search should be conducted not over all the

wardrobes but over a subset of the wardrobes whose prices are less than $200.

Example 2: Vehicle Search. Consider a trac camera that monitors cars passing on a state

highway. The camera detects each car in the video stream, extracts a feature vector to represent it,

and stores the feature vector along with its corresponding timestamp in a database. When a query

arrives with a specic car image and a specied time interval, the query processing performs an

ANNS on the feature vectors within the specied time interval.

The above queries can be formulated as the range-ltering approximate nearest neighbor search

queries. Consider a collection of objects where each object is represented as a pair of a data vector

(i.e., feature vector) and an attribute value whose domain has a total order. A range-ltering ANNS

query consists of a query vector and a query range. The query reports the approximate nearest

neighbors of the query vector among all the data vectors whose attribute values fall in the query

range. The scenarios of range-ltering ANNS can be found in many applications such as face

recognition [15], person/vehicle re-identication [27, 47], and recommendation [39].

Existing methods for range-ltering ANNS rely on two simple strategies, ANNS-rst and range-

rst [

]. ANNS-rst builds an ANNS index over all the data vectors during the oine indexing

phase. In the online querying phase, when a range-ltering ANNS query arrives, it progressively

nds data vectors nearest to the query vector using the ANNS index. It stops when a data vector

whose attribute value falls in the query range. However, ANNS-rst is very slow especially when the

query range is small: the portion of the data vectors falling into a small query range is small and it

is thus hard to meet one in the ANNS. On the other hand, the range-rst strategy lters the data

vector with the query range rst based on their attribute values. Since there is no ANNS index over

the remaining data vectors available, it can only linearly scan all the data vectors in the query range

to nd the nearest neighbor, which is rather inecient, especially when the query range is large.

Hierarchical Navigable Small World graph (HNSW) [

] is an ecient index for ANNS query,

which has been widely used in both academia and industry. For example, the multi-modal feature

vector dataset LAION-5B, which contains 5.85 billion vectors [

], comes with an HNSW index for

ecient ANNS. Lucene 9.0 implements HNSW to support ANNS [

]. Moreover, HNSW index also

serves as the backbone of various vector databases such as Weaviate [

], Zilliz [

], and Pinecone [

Using HNSW as a building block, one can build one ANNS index (HNSW) for every possible query

range: when a range-ltering ANNS query arrives, use the corresponding ANNS index of the query

range to nd and report a set of approximate nearest neighbors of the query vector. This solution

is query-ecient, yet it faces the challenge of the excessively large index size and long index time:

the total number of possible query ranges is quadratic to the number of data vectors.

To address the above challenges, for a set of

𝑛

data vectors, we rst consider half-b ounded range

queries where the query range includes all the attribute values smaller than a user-provided value.

Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 69. Publication date: February 2024.

of 26

免费下载

关注

评论