暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
SIGMOD-2024-SeRF Segment Graph for Range-Filtering Approximate Nearest Neighbor Search_阿里云.pdf
1034
26页
0次
2024-06-24
免费下载
SeRF: Segment Graph for Range-Filtering Approximate
Nearest Neighbor Search
CHAOJI ZUO
, Rutgers University, USA
MIAO QIAO, The University of Auckland, New Zealand
WENCHAO ZHOU, Alibaba Group, China
FEIFEI LI, Alibaba Group, China
DONG DENG
, Rutgers University, USA
Eective vector representation models, e.g., word2vec and node2vec, embed real-world objects such as images
and documents in high dimensional vector space. In the meanwhile, the objects are often associated with
attributes such as timestamps and prices. Many scenarios need to jointly query the vector representations
of the objects together with their attributes. These queries can be formalized as range-ltering approximate
nearest neighbor search (ANNS) queries. Specically, given a collection of data vectors, each associated with
an attribute value whose domain has a total order. The range-ltering ANNS consists of a query range and
a query vector. It nds the approximate nearest neighbors of the query vector among all the data vectors
whose attribute values fall in the query range. Existing approaches suer from a rapidly degrading query
performance when the query range width shifts. The query performance can be optimized by a solution
that builds an ANNS index for every possible query range; however, the index time and index size become
prohibitive – the number of query ranges is quadratic to the number
𝑛
of data vectors. To overcome these
challenges, for the query range contains all attribute values smaller than a user-provided threshold, we design
a structure called the segment graph whose index time and size are the same as a single ANNS index, yet can
losslessly compress the
𝑛
ANNS indexes, reducing the indexing cost by a factor of
Ω(𝑛)
. To handle general
range queries, we propose a 2D segment graph with average-case index size
𝑂 (𝑛 log𝑛)
to compress
𝑛
segment
graphs, breaking the quadratic barrier. Extensive experiments conducted on real-world datasets show that our
proposed structures outperformed existing methods signicantly; our index also exhibits superior scalability.
CCS Concepts: Information systems
Nearest-neighbor search; Theory of computation
Nearest
neighbor algorithms.
Additional Key Words and Phrases: Similarity Search, Multi-Model Search, High-Dimensional Vector
ACM Reference Format:
Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. 2024. SeRF: Segment Graph for Range-
Filtering Approximate Nearest Neighbor Search. Proc. ACM Manag. Data 2, 1 (SIGMOD), Article 69 (Febru-
ary 2024), 26 pages. https://doi.org/10.1145/3639324
Work performed during an internship at Alibaba Group
Corresponding author
Authors’ addresses: Chaoji Zuo, Rutgers University, Piscataway, USA, chaoji.zuo@rutgers.edu; Miao Qiao, The University
of Auckland, Auckland, New Zealand, miao.qiao@auckland.ac.nz; Wenchao Zhou, Alibaba Group, Hangzhou, China,
zwc231487@alibaba-inc.com; Feifei Li, Alibaba Group, Hangzhou, China, lifeifei@alibaba-inc.com; Dong Deng, Rutgers
University, Piscataway, USA, dong.deng@rutgers.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the
full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specic permission and/or a fee. Request permissions from permissions@acm.org.
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 2836-6573/2024/2-ART69
https://doi.org/10.1145/3639324
Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 69. Publication date: February 2024.
69:2 Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng
1 Introduction
In recent years, various machine learning models, e.g.,
word2vec
[
33
,
35
],
doc2vec
[
25
], and
node2vec
[
16
], have been developed to eectively represent real-world objects such as images,
documents, and graphs as high-dimensional feature vectors. In the meanwhile, real-world objects
are often associated with structured attributes/elds such as timestamps, prices, and quantities. In
many scenarios, the feature vectors and the structured attributes of the objects need to be jointly
queried, as illustrated in the motivation examples below.
Example 1: Product Search. On e-commerce platforms like Amazon, a customer may search for
a wardrobe with conditions on price (less than $200) and style (visually similar to the one in an
image). The condition on the style can be formulated as an approximate nearest neighbor search
(ANNS) over the image feature vectors. However, the search should be conducted not over all the
wardrobes but over a subset of the wardrobes whose prices are less than $200.
Example 2: Vehicle Search. Consider a trac camera that monitors cars passing on a state
highway. The camera detects each car in the video stream, extracts a feature vector to represent it,
and stores the feature vector along with its corresponding timestamp in a database. When a query
arrives with a specic car image and a specied time interval, the query processing performs an
ANNS on the feature vectors within the specied time interval.
The above queries can be formulated as the range-ltering approximate nearest neighbor search
queries. Consider a collection of objects where each object is represented as a pair of a data vector
(i.e., feature vector) and an attribute value whose domain has a total order. A range-ltering ANNS
query consists of a query vector and a query range. The query reports the approximate nearest
neighbors of the query vector among all the data vectors whose attribute values fall in the query
range. The scenarios of range-ltering ANNS can be found in many applications such as face
recognition [15], person/vehicle re-identication [27, 47], and recommendation [39].
Existing methods for range-ltering ANNS rely on two simple strategies, ANNS-rst and range-
rst [
40
]. ANNS-rst builds an ANNS index over all the data vectors during the oine indexing
phase. In the online querying phase, when a range-ltering ANNS query arrives, it progressively
nds data vectors nearest to the query vector using the ANNS index. It stops when a data vector
whose attribute value falls in the query range. However, ANNS-rst is very slow especially when the
query range is small: the portion of the data vectors falling into a small query range is small and it
is thus hard to meet one in the ANNS. On the other hand, the range-rst strategy lters the data
vector with the query range rst based on their attribute values. Since there is no ANNS index over
the remaining data vectors available, it can only linearly scan all the data vectors in the query range
to nd the nearest neighbor, which is rather inecient, especially when the query range is large.
Hierarchical Navigable Small World graph (HNSW) [
30
] is an ecient index for ANNS query,
which has been widely used in both academia and industry. For example, the multi-modal feature
vector dataset LAION-5B, which contains 5.85 billion vectors [
37
], comes with an HNSW index for
ecient ANNS. Lucene 9.0 implements HNSW to support ANNS [
1
]. Moreover, HNSW index also
serves as the backbone of various vector databases such as Weaviate [
3
], Zilliz [
4
], and Pinecone [
2
].
Using HNSW as a building block, one can build one ANNS index (HNSW) for every possible query
range: when a range-ltering ANNS query arrives, use the corresponding ANNS index of the query
range to nd and report a set of approximate nearest neighbors of the query vector. This solution
is query-ecient, yet it faces the challenge of the excessively large index size and long index time:
the total number of possible query ranges is quadratic to the number of data vectors.
To address the above challenges, for a set of
𝑛
data vectors, we rst consider half-b ounded range
queries where the query range includes all the attribute values smaller than a user-provided value.
Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 69. Publication date: February 2024.
of 26
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜