SALI: A Scalable Adaptive Learned Index Framework based on Probability Models

Jiake Ge, Huanchen Zhang§, Boyu Shi, Yuanhui Luo, Yunda Guo, Yunpeng Chai, Yuxing Chen+, Anqun Pan+

Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China
Engineering Research Center of Database and Business Intelligence, MOE, China
School of Information, Renmin University of China
§Tsinghua University and Shanghai Qi Zhi Institute
+Tencent Inc.

gejiake@ruc.edu.cn, huanchen@tsinghua.edu.cn, shiboyu5687@ruc.edu.cn, losk@ruc.edu.cn, guoyunda@ruc.edu.cn, ypchai@ruc.edu.cn, axingguchen@tencent.com, aaronpan@tencent.com
ABSTRACT
The growth in data storage capacity and the increasing demands for high performance have created several challenges for concurrent indexing structures. One promising solution is learned indexes, which use a learning-based approach to fit the distribution of stored data and predictively locate target keys, significantly improving lookup performance. Despite their advantages, prevailing learned indexes exhibit constraints and encounter issues of scalability on multi-core data storage.
This paper introduces SALI, the Scalable Adaptive Learned Index framework, which incorporates two strategies aimed at achieving high scalability, improving efficiency, and enhancing the robustness of the learned index. Firstly, a set of node-evolving strategies is defined to enable the learned index to adapt to various workload skews and enhance its concurrency performance in such scenarios. Secondly, a lightweight strategy is proposed to maintain statistical information within the learned index, with the goal of further improving the scalability of the index. Furthermore, to validate their effectiveness, SALI applied the two strategies mentioned above to the learned index structure that utilizes fine-grained write locks, known as LIPP. The experimental results have demonstrated that SALI significantly enhances the insertion throughput with 64 threads by an average of 2.04× compared to the second-best learned index. Furthermore, SALI accomplishes a lookup throughput similar to that of LIPP+.
CCS CONCEPTS
• Information systems → Data access methods.
Yunpeng Chai is the corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference SIGMOD '24, June 09–15, 2024, Santiago, Chile
© 2024 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/XXXXXXX.XXXXXXX
KEYWORDS
Learned Index; Probability Models; Adaptive Index
ACM Reference Format:
Jiake Ge, Huanchen Zhang§, Boyu Shi, Yuanhui Luo, Yunda Guo, Yunpeng Chai, Yuxing Chen+, Anqun Pan+. 2024. SALI: A Scalable Adaptive Learned Index Framework based on Probability Models. In Proceedings of the 2024 International Conference on Management of Data (Conference SIGMOD '24). ACM, Santiago, Chile, 17 pages.
1 INTRODUCTION
With the exponential growth of the data volume today, efficient indexing data structures are crucial for a big data system to support timely information retrieval. To improve the performance and memory efficiency of traditional tree-based indexes, Kraska et al. introduced a learned index, called the Recursive Model Indexes (RMI), that uses machine learning models to replace the internal nodes of a B+tree [15, 20]. An outstanding problem of the original RMI is that it is static: inserting or updating a key in the index requires a significant portion of the data structure to be rebuilt, thus limiting the use cases of the learned index.
Previous work has proposed two strategies to address the updatability issue of learned indexes. The first (i.e., the buffer-based strategy) is to accommodate new entries in separate insert buffers first to amortize the index reconstruction cost. XIndex [36] and FINEdex [21] fall into this category. The other strategy (i.e., the model-based strategy), adopted by ALEX [3] and LIPP [39], is to reserve slot gaps within nodes to handle new entries with an in-place insertion. Upon an insert collision (i.e., the mapped slot is already occupied), ALEX reorganizes the node by shifting the existing entries, while LIPP utilizes a chaining scheme, creating a new node for the corresponding slot to transform the last-mile search problem into a sub-tree traversal problem.
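To make the contrast between the two strategies concrete, the sketch below illustrates the model-based, LIPP-style behavior: a per-node linear model maps a key directly to a slot, an empty slot (a reserved gap) absorbs the insert in place, and a collision chains a new child node under that slot. The structure, the names, and the way the child model is derived are our own simplifications, not the actual ALEX or LIPP code.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

// Simplified sketch of a model-based node in the LIPP style.
// Names (Node, Slot, insert) are illustrative, not the real LIPP API.
struct Node {
    double slope = 0.0, intercept = 0.0;        // per-node linear model
    struct Slot {
        enum class Kind { Empty, Data, Child } kind = Kind::Empty;
        uint64_t key = 0;
        std::unique_ptr<Node> child;            // used when kind == Child
    };
    std::vector<Slot> slots;

    explicit Node(size_t capacity) : slots(capacity) {}

    size_t predict(uint64_t key) const {        // model maps key -> slot
        double pos = slope * static_cast<double>(key) + intercept;
        pos = std::clamp(pos, 0.0, static_cast<double>(slots.size() - 1));
        return static_cast<size_t>(pos);
    }

    void insert(uint64_t key) {
        size_t i = predict(key);
        Slot& s = slots[i];
        switch (s.kind) {
        case Slot::Kind::Empty:                 // gap available: in-place insert
            s.kind = Slot::Kind::Data;
            s.key = key;
            break;
        case Slot::Kind::Data: {                // collision: chain a child node
            auto child = std::make_unique<Node>(slots.size());
            // Stretch the fractional part of the parent's prediction over the
            // child's slots; a real index would retrain the child model instead.
            child->slope = slope * static_cast<double>(slots.size());
            child->intercept = (intercept - static_cast<double>(i)) *
                               static_cast<double>(slots.size());
            uint64_t displaced = s.key;
            s.kind = Slot::Kind::Child;
            s.child = std::move(child);
            s.child->insert(displaced);
            s.child->insert(key);
            break;
        }
        case Slot::Kind::Child:                 // descend into the sub-tree
            s.child->insert(key);
            break;
        }
    }
};

int main() {
    Node root(16);
    root.slope = 16.0 / 1000.0;                 // toy model fitted to keys in [0, 1000)
    for (uint64_t k : {10, 500, 505, 990}) root.insert(k);
    std::cout << "slot predicted for key 505: " << root.predict(505) << "\n";
}

Because every key sits exactly where its node's model predicts, a lookup needs no last-mile search; a collision instead becomes a short descent through the chained sub-tree, which is the trade-off discussed below.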
We found, however, that none of the above index designs scale at high concurrency. We performed an experiment where we inserted 200 million random integer keys into a learned index, with a varying number of threads each time. Figure 1 shows the results. Note that the number of threads in the grey area of the figure is larger than the number of hardware threads of the machine. This is common in practice, as a database/key-value server typically handles a large number of user connections simultaneously.
Figure 1: Write-only performance of state-of-the-art learned indexes on the FACE dataset [11]. The evaluation is conducted on a two-socket machine with two 16-core CPUs.
As shown in Figure 1, indexes with a buffer-based strategy (i.e., XIndex and FINEdex) exhibit inferior base performance and worse scalability compared to those with a model-based strategy (i.e., ALEX+ and LIPP+¹). This shows that a larger margin of prediction errors prevents scaling because the concurrent "last mile" searches saturate the memory bandwidth quickly, which becomes the system's bottleneck [38].
The problem is solved in LIPP+ where each position prediction
is accurate (i.e., no “last mile” search). However, LIPP+ requires
maintaining statistics, such as access counts and collision counts,
in each node to trigger node retraining to prevent performance
degradation. These per-node counters create high contention
among threads and cause severe cacheline ping-pong [38]. The
model-based strategy in ALEX+ requires shifting existing entries
upon an insert collision. Therefore, ALEX+ must acquire coarse-
grained write locks for this operation. As the number of threads
increases, more and more threads are blocked, waiting for those
exclusive locks.
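The contention described here can be made concrete with a small sketch: exact per-node statistics mean that every lookup or insert performs an atomic read-modify-write on counters stored inside the node, so at high thread counts all threads keep writing the same cacheline. The field names and the retrain condition below are illustrative assumptions, not LIPP+'s actual code.

#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Illustrative sketch of exact per-node statistics: every operation performs
// an atomic read-modify-write on counters stored in the node header, so all
// threads repeatedly write the same cacheline ("cacheline ping-pong").
// Field names and the retrain condition are assumptions, not LIPP+'s code.
struct alignas(64) NodeHeader {
    std::atomic<uint64_t> num_accesses{0};
    std::atomic<uint64_t> num_collisions{0};
};

// Returns true when the node should be scheduled for retraining
// (the threshold is purely illustrative).
bool record_access(NodeHeader& node, bool collided) {
    uint64_t a = node.num_accesses.fetch_add(1, std::memory_order_relaxed) + 1;
    if (collided) node.num_collisions.fetch_add(1, std::memory_order_relaxed);
    uint64_t c = node.num_collisions.load(std::memory_order_relaxed);
    return a > 1024 && c * 10 > a;
}

int main() {
    NodeHeader hot_node;                       // one frequently accessed node
    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < 1000000; ++i)  // every operation hits the counters
                record_access(hot_node, i % 16 == 0);
        });
    for (auto& w : workers) w.join();
    std::cout << "accesses: " << hot_node.num_accesses.load()
              << ", collisions: " << hot_node.num_collisions.load() << "\n";
}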
In this paper, we propose SALI, the Scalable Adaptive Learned Index framework based on probability models, to solve the scalability issues in existing solutions. To solve the scalability bottleneck of maintaining per-node statistics, we developed lightweight probability models that can trigger node retraining and other structural evolving operations in SALI with accurate timing (as if the timing were determined by accurate statistics). In addition, we developed a set of node-evolving strategies, including expanding an insert-heavy node to contain more gaps, flattening the tree structure for frequently-accessed nodes, and compacting the rarely-touched nodes to save memory. SALI applies these node-evolving strategies adaptively according to the probability models so that it can self-adjust to changing workloads while maintaining excellent scalability. Finally, SALI adopts the learned index structure that utilizes fine-grained write locks, i.e., LIPP+, to validate the effectiveness of the aforementioned two strategies. Note that the lightweight probability models and node-evolving strategies are highly versatile and can be applied to various index scenarios, as detailed in Section 5.
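SALI's probability models are detailed later in the paper; the sketch below only illustrates the underlying intuition for why they remove this bottleneck: instead of atomically updating a per-node counter on every operation, each thread updates it only with a small sampling probability, so the expected count still tracks the true access frequency while the shared cacheline is written far less often. The constants, names, and trigger condition are our own illustration, not SALI's actual model.

#include <atomic>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// Illustrative sketch: replace an exact, contended per-node counter with a
// probabilistic one. With probability 1/kSampleEvery a thread updates the
// (now rarely touched) counter, so the expected count still tracks real
// accesses while the cacheline is written ~kSampleEvery times less often.
// This is the intuition only, not SALI's actual probability model.
constexpr uint64_t kSampleEvery = 256;      // expected accesses per sampled update
constexpr uint64_t kRetrainThreshold = 64;  // sampled updates before evolving a node

struct NodeStats {
    std::atomic<uint64_t> sampled_accesses{0};
};

thread_local std::mt19937_64 rng{std::random_device{}()};

// Called on every node access; returns true when the node should evolve.
bool on_access(NodeStats& stats) {
    if (std::uniform_int_distribution<uint64_t>(0, kSampleEvery - 1)(rng) != 0)
        return false;                       // most accesses touch no shared state
    uint64_t sampled =
        stats.sampled_accesses.fetch_add(1, std::memory_order_relaxed) + 1;
    // Estimated real accesses ~= sampled * kSampleEvery.
    return sampled >= kRetrainThreshold;
}

int main() {
    NodeStats stats;
    std::atomic<uint64_t> triggers{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < 100000; ++i)
                if (on_access(stats)) triggers.fetch_add(1);
        });
    for (auto& w : workers) w.join();
    std::cout << "sampled: " << stats.sampled_accesses.load()
              << " (estimated real accesses ~"
              << stats.sampled_accesses.load() * kSampleEvery
              << "), evolution triggers: " << triggers.load() << "\n";
}

A real implementation would reset or decay the counter after an evolution and choose probabilities per strategy; the sketch only shows the sampling trick that keeps the expected trigger timing close to what exact statistics would produce.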
Our microbenchmark with real-world datasets shows that SALI improves the insertion throughput with 64 threads by 2.04× on average compared to the second-best learned index, i.e., ALEX+, while achieving a lookup throughput comparable to LIPP+.
We make the following primary contributions in this paper. Firstly, we proposed SALI, a high-concurrency learned index framework designed to improve the scalability of learned indexes. Secondly, we defined a set of node-evolving strategies in addition to model retraining to allow the learned index to self-adapt to different workload skews. Thirdly, we replaced the per-node statistics in existing learned indexes with lightweight probability models to remove the scalability bottleneck of statistics maintenance while keeping the timing accuracy of node retraining/evolving. Finally, we demonstrated the effectiveness of the proposed approaches by showing that SALI outperforms the state-of-the-art learned indexes under high concurrency.

¹ALEX+ and LIPP+ are concurrent implementations of ALEX and LIPP, respectively [38].

Figure 2: The scheme of the learned index.
The rest of this paper is organized as follows. Section 2 summarizes the basics of learned indexes and further motivates the scalability problem. Section 3 introduces the structure of SALI with an emphasis on the node-evolving strategies and the probability models. Section 4 presents our experimental results. Section 5 discusses the generalizability of the node-evolving strategies and the probability models, as well as the limitations of SALI, followed by a related work discussion in Section 6. Section 7 concludes the paper. The source code of this paper has been made available at https://github.com/YunWorkshop/SALI.
2 BACKGROUND AND MOTIVATION
2.1 The Principle of Learned Indexes
The core concept of the learned index is to employ a set of learning models to estimate the cumulative distribution function (CDF) of the stored data [29], allowing for the prediction of the data's storage location, as depicted in the CDF diagram in Figure 2.
Figure 2 shows the scheme of the learned index structure. Each node, or only the lowest leaf nodes, stores the slope and intercept of a linear function [3, 5, 8, 15, 21, 36, 39]. Each index segment corresponds to a linear model that estimates the approximate position of the target key, eliminating the need for multiple indirect search operations in traditional tree-based indexes. This approach has the potential to improve indexed lookup performance significantly.
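As a concrete illustration of the slope-and-intercept scheme, the minimal sketch below fits one linear segment to a sorted key array by least squares (approximating its CDF), predicts a position for a lookup key, and restricts the remaining "last mile" search to the observed maximum prediction error. It is a single-segment toy under our own assumptions, not code from any of the indexes cited above.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

// Minimal single-segment learned index: fit position ~= slope * key + intercept
// by least squares over (key, position) pairs, then search only within the
// observed maximum prediction error. Simplified sketch, not a cited implementation.
struct LinearSegment {
    double slope = 0.0, intercept = 0.0;
    long max_err = 0;

    long predict(uint64_t key, size_t n) const {
        long p = std::lround(slope * static_cast<double>(key) + intercept);
        return std::clamp(p, 0L, static_cast<long>(n) - 1);
    }

    void fit(const std::vector<uint64_t>& keys) {
        const double n = static_cast<double>(keys.size());
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (size_t i = 0; i < keys.size(); ++i) {
            double x = static_cast<double>(keys[i]), y = static_cast<double>(i);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        intercept = (sy - slope * sx) / n;
        for (size_t i = 0; i < keys.size(); ++i)   // record the worst-case error
            max_err = std::max(max_err,
                               std::labs(predict(keys[i], keys.size()) -
                                         static_cast<long>(i)));
    }

    // "Last mile": binary search restricted to [pred - max_err, pred + max_err].
    bool lookup(const std::vector<uint64_t>& keys, uint64_t key) const {
        long pred = predict(key, keys.size());
        auto lo = keys.begin() + std::max(0L, pred - max_err);
        auto hi = keys.begin() + std::min<long>(keys.size(), pred + max_err + 1);
        return std::binary_search(lo, hi, key);
    }
};

int main() {
    std::vector<uint64_t> keys;
    for (uint64_t k = 0; k < 1000; ++k) keys.push_back(k * k);  // skewed toy data
    LinearSegment seg;
    seg.fit(keys);
    std::cout << "max prediction error: " << seg.max_err
              << ", found 250000: " << seg.lookup(keys, 250000) << "\n";
}

A full learned index keeps many such segments (for example, arranged hierarchically as in RMI) so that each segment's error, and hence the residual search range, stays small.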
2.2 Scalable Evaluation of Learned Index Structures
Currently, learned indexes demonstrate good performance in single-threaded environments. However, their scalability remains limited [38]. In this part, our objective is to conduct a thorough investigation into the factors that contribute to the concurrent performance bottlenecks in existing learned indexes. To achieve this, we begin by introducing the insertion strategies employed in learned indexes, along with their corresponding index structures, as these design choices significantly influence the concurrent performance of the indexes. Additionally, we conduct a comprehensive