Manu: A Cloud Native Vector Database Management System

Rentong Guo†∗, Xiaofan Luan†∗, Long Xiang‡∗, Xiao Yan‡∗, Xiaomeng Yi†∗, Jigao Luo†§, Qianya Cheng†, Weizhi Xu†, Jiarui Luo‡, Frank Liu†, Zhenshan Cao†, Yanliang Qiao†, Ting Wang†, Bo Tang‡, Charles Xie†

† Zilliz
‡ Southern University of Science and Technology
§ Technical University of Munich

† {firstname.lastname}@zilliz.com
‡ {xiangl3@mail., yanx@, 11911419@mail., tangb3@}sustech.edu.cn
§ jigao.luo@tum.de

ABSTRACT

With the development of learning-based embedding models, embedding vectors are widely used for analyzing and searching unstructured data. As vector collections exceed billion-scale, fully managed and horizontally scalable vector databases are necessary. In the past three years, through interaction with our 1200+ industry users, we have sketched a vision for the features that next-generation vector databases should have, which include long-term evolvability, tunable consistency, good elasticity, and high performance.

We present Manu, a cloud native vector database that implements these features. It is difficult to integrate all these features if we follow traditional DBMS design rules. As most vector data applications do not require complex data models and strong data consistency, our design philosophy is to relax the data model and consistency constraints in exchange for the aforementioned features. Specifically, Manu firstly exposes the write-ahead log (WAL) and binlog as backbone services. Secondly, write components are designed as log publishers while all read-only analytic and search components are designed as independent subscribers to the log services. Finally, we utilize multi-version concurrency control (MVCC) and a delta consistency model to simplify the communication and cooperation among the system components. These designs achieve a low coupling among the system components, which is essential for elasticity and evolution. We also extensively optimize Manu for performance and usability with hardware-aware implementations and support for complex search semantics. Manu has been used for many applications, including, but not limited to, recommendation, multimedia, language, medicine and security. We evaluated Manu in three typical application scenarios to demonstrate its efficiency, elasticity, and scalability.

PVLDB Reference Format:

Rentong Guo, Xiaofan Luan, Long Xiang, Xiao Yan, Xiaomeng Yi, Jigao Luo, Qianya Cheng, Weizhi Xu, Jiarui Luo, Frank Liu, Zhenshan Cao, Yanliang Qiao, Ting Wang, Bo Tang, and Charles Xie. Manu: A Cloud Native Vector Database Management System. PVLDB, 15(12): 3548-3561, 2022.

doi:10.14778/3554821.3554843

∗ Co-first-authors are ordered alphabetically.

‡ Work done while working with Zilliz, correspondence to Bo Tang.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International

License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of

this license. For any use beyond those covered by this license, obtain permission by

emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights

licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 15, No. 12 ISSN 2150-8097.

doi:10.14778/3554821.3554843

PVLDB Artifact Availability:

The source code, data, and/or other artifacts have been made available at

https://github.com/milvus-io/milvus/tree/2.0.

1 INTRODUCTION

According to IDC, unstructured data, such as text, images, and video, took up about 80% of the 40,000 exabytes of new data generated in 2020, and this share keeps rising due to the increasing amount of human-generated rich media [ ]. With the rise of learning-based embedding models, especially deep neural networks, using embedding vectors to manage unstructured data has become commonplace in many applications such as e-commerce, social media, and drug discovery [ ]. A core feature of these applications is that they encode the semantics of unstructured data into a high-dimensional vector space. Given the representation power of embedding vectors, operations like recommendation, search, and analysis can be implemented via similarity-based vector search. To support these applications, many specialized vector databases have been built to manage vector data [10, 12, 17–19, 80].

In 2019, we open sourced Milvus [ ], our previous vector database, under the LF AI & Data Foundation. Since then, we have collected feedback from more than 1200 industry users and found that some of the design principles adopted by Milvus are not suitable. Milvus followed the design principles of relational databases, which are optimized for either transactional [ ] or analytical [ ] workloads, and focused on functionality support (e.g., attribute filtering and multi-vector search) and execution efficiency (e.g., SIMD and cache optimizations). However, vector database applications have different requirements in the following three aspects, which motivated us to redesign Manu from scratch with a focus on a cloud-native architecture.

• Support for complex transactions is not necessary. Instead of decomposing entity representations into different fields or tables, learning-based models encode complex and hybrid data semantics into a single vector. As a result, multi-row or multi-table transactions are not necessary; row-level ACID is sufficient for the majority of vector database applications.

• A tunable performance-consistency trade-off is important. Different users have different consistency requirements; some users prefer high throughput and eventual consistency, while others require some level of guaranteed consistency, i.e., newly inserted data should be visible to queries either immediately or within a pre-configured time. Traditional relational databases generally support either strong consistency or eventual consistency; there is little to no room for customization between these two extremes. As such, tunable consistency is a crucial attribute for cloud-native vector databases.

• High hardware cost calls for fine-grained elasticity. Some vector database operations (e.g., vector search and index building) are computationally intensive, and hardware accelerators (e.g., GPUs or FPGAs) and/or a large working memory are required for good performance. However, depending on the application type, the workload differs among database functionalities. Thus, resources can be wasted or improperly allocated if the vector database does not have fine-grained elasticity. This necessitates a careful decoupling of functional and hardware layers. System-level decoupling, such as separation of read and write logic, is insufficient; elasticity and resource isolation should be managed at the functionality level rather than the system level.

In summary, modern vector databases should have tunable consistency, functionality-level decoupling, and per-component scalability. Following the design principles of traditional relational databases makes achieving these design goals extremely difficult, if not impossible. A key opportunity for achieving these design goals lies in the potential for relaxing transaction complexity.

Manu follows the “log as data” paradigm. Specifically, Manu structures the entire system as a group of log publish/subscribe microservices. The write-ahead log (WAL) and inter-component messages are published as “logs”, i.e., durable data streams that can be subscribed to. Read-side components, such as search and analytical engines, are all built as log subscribers. This architecture provides a simple yet effective way to decouple system functionalities; it enables the decoupling of read from write, stateless from stateful, and storage from computing. Each log entry is assigned a globally unique timestamp, and special log entries called time-ticks (similar to watermarks in Apache Flink [ ]) are periodically inserted into each log channel to signal the progress of event time to log subscribers. The timestamps and time-ticks form the basis of the tunable consistency mechanism and multi-version concurrency control (MVCC). To control the consistency level, a user can specify a tolerable time lag between a query’s timestamp and the latest time-tick consumed by a subscriber.
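The interplay of timestamps, time-ticks, and the tolerable lag can be illustrated with a minimal sketch. It is not Manu's implementation: the names `LogEntry`, `Subscriber`, `can_serve`, and `visible_rows` are hypothetical, timestamps are plain integers, and the inclusive comparison against the lag is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: int            # assumed globally unique and increasing
    payload: Optional[dict]   # None marks a time-tick entry (assumption)


@dataclass
class Subscriber:
    """Hypothetical read-side component consuming one log channel."""
    latest_tick: int = 0
    rows: List[LogEntry] = field(default_factory=list)

    def consume(self, entry: LogEntry) -> None:
        if entry.payload is None:
            # A time-tick signals that event time has advanced to this point.
            self.latest_tick = entry.timestamp
        else:
            self.rows.append(entry)

    def can_serve(self, query_ts: int, tolerable_lag: int) -> bool:
        # The query may run once the subscriber's consumed time-tick is
        # within the user-specified lag of the query timestamp.
        return query_ts - self.latest_tick <= tolerable_lag

    def visible_rows(self, query_ts: int) -> List[LogEntry]:
        # MVCC-style snapshot: only entries written at or before query_ts.
        return [e for e in self.rows if e.timestamp <= query_ts]


sub = Subscriber()
sub.consume(LogEntry(100, {"id": 1}))
sub.consume(LogEntry(105, None))       # time-tick
sub.consume(LogEntry(110, {"id": 2}))
ready = sub.can_serve(query_ts=108, tolerable_lag=5)
stale = sub.can_serve(query_ts=120, tolerable_lag=5)
```

In this sketch a lag of zero behaves like strong consistency (the query waits until the subscriber has caught up to its timestamp), while a very large lag approaches eventual consistency.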

Additionally, we extensively optimize Manu for performance and usability. Manu supports various indexes for vector search, including vector quantization [ ], inverted index [ ], and proximity graphs [ ]. In particular, we tailor the implementations to better utilize the parallelization capabilities of modern CPUs and GPUs along with the improved read/write speeds of SSDs over HDDs. Manu also integrates refactored functionalities from Milvus [ ], such as attribute filtering and multi-vector search. Moreover, we build a visualization tool that allows users to track the performance of Manu in real time and include an auto-configuration tool that recommends indexing algorithm parameters using machine learning.

To summarize, this paper makes the following contributions:

• We summarize lessons learned from communicating with over 1200 industry users over three years. We shed light on typical application requirements of vector databases and show how they differ from those of traditional relational databases. We then outline the key design goals that vector databases should meet.

• We introduce Manu’s key architectural designs as a cloud native vector database, building around the core design philosophy of relaxing transaction complexity in exchange for tunable consistency and fine-grained elasticity.

• We present important usability and performance-related enhancements, e.g., high-level API, a GUI tool, automatic parameter configuration, and SSD support.

The rest of the paper is organized as follows. Section 2 provides background on the requirements and design goals of vector databases. Section 3 dives deep into Manu’s design. Section 4 highlights the key features for usability and performance. Section 5 discusses representative use cases for Manu. Section 6 reviews related work. Section 7 concludes the paper and outlines future work.

2 BACKGROUND AND MOTIVATION

Consider video recommendation as a typical use case of vector databases. The goal is to help users discover new videos based on their personal preferences and previous browsing history. Using machine learning models (especially deep neural networks), features of users and videos, such as search history, watch history, age, gender, video language, and tags, are converted to embedding vectors. These models are carefully designed and trained to encode the similarity between user and video vectors into a common vector space. Recommendation is conducted by retrieving candidate videos from the collection of video vectors via similarity scores with respect to the specified user vector. The system also needs to handle updates to vectors when new videos are uploaded, when videos are deleted, and when the embedding model is changed.
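As a rough illustration of the retrieval step, the sketch below scores every video vector against a user vector and keeps the top k. The function names `cosine` and `recommend` are hypothetical, cosine similarity is only one possible similarity score, and the exhaustive scan stands in for the indexes a vector database would use at scale.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(
    user_vec: Sequence[float],
    video_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k videos most similar to the user vector (exhaustive scan)."""
    scored = ((vid, cosine(user_vec, vec)) for vid, vec in video_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy collection; real embeddings have hundreds of dimensions.
videos = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = recommend([1.0, 0.1], videos, k=2)

# Uploading or deleting a video is an insert or delete on the collection;
# changing the embedding model means re-encoding every stored vector.
videos["d"] = [0.9, 0.2]
del videos["b"]
```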

Video recommendation and other applications of vector databases can involve hundreds of billions of vectors with daily growth at hundred-million scale, and serve million-scale queries per second (QPS). Existing DBMSs (e.g., relational databases [ ], NoSQL [ ], NewSQL [ ]) were not built to manage vector data at that scale. Moreover, the data management requirements of the applications they target differ greatly from those of vector database applications.

First, when compared with relational databases, both the architecture and theory of vector databases are far from mature. A key reason for this is that AI- and data-driven applications are still in a state of constant evolution, thereby necessitating continued architectural and functionality changes to vector databases as well.

Second, complex transactions are unnecessary for vector databases. In the above example, the recommendation system encodes all semantic features of users and videos into standalone vectors as opposed to multi-row or multi-column entity fields in a relational database. As a result, row-level ACID is sufficient; multi-table operations (such as joins) are inessential.

Third, vector database applications need a flexible performance-consistency trade-off. While some applications adopt a strong or eventual consistency model, there are others that fall between the two extremes. Users may wish to relax consistency constraints in exchange for better system throughput. In the video recommendation example, observing a newly uploaded video after several seconds is acceptable, but keeping users waiting for recommendations harms user experience. Thus, the application can configure the
