
Conference SIGMOD ’24, June 09–15, 2024, Santiago, Chile Jiake Ge and Huanchen Zhang, et al.
Figure 1: Write-only performance of state-of-the-art learned indexes on the FACE dataset [11]. The evaluation is conducted on a two-socket machine with two 16-core CPUs.
As shown in Figure 1, indexes with a buffer-based strategy (i.e., XIndex and FINEdex) exhibit inferior base performance and worse scalability compared to those with a model-based strategy (i.e., ALEX+ and LIPP+¹). This shows that a larger margin of prediction errors prevents scaling because the concurrent "last mile" searches quickly saturate the memory bandwidth, which becomes the system's bottleneck [38].
The problem is solved in LIPP+, where each position prediction is accurate (i.e., no "last mile" search). However, LIPP+ requires maintaining statistics, such as access counts and collision counts, in each node to trigger node retraining to prevent performance degradation. These per-node counters create high contention among threads and cause severe cacheline ping-pong [38]. The model-based strategy in ALEX+ requires shifting existing entries upon an insert collision. Therefore, ALEX+ must acquire coarse-grained write locks for this operation. As the number of threads increases, more and more threads are blocked, waiting for those exclusive locks.
In this paper, we propose SALI, the Scalable Adaptive Learned Index framework based on probability models to solve the scalability issues in existing solutions. To solve the scalability bottleneck of maintaining per-node statistics, we developed lightweight probability models that can trigger node retraining and other structural evolving operations in SALI with accurate timing (as if the timing were determined by accurate statistics). In addition, we developed a set of node-evolving strategies, including expanding an insert-heavy node to contain more gaps, flattening the tree structure for frequently-accessed nodes, and compacting the rarely-touched nodes to save memory. SALI applies these node-evolving strategies adaptively according to the probability models so that it can self-adjust to changing workloads while maintaining excellent scalability. Finally, SALI adopts the learned index structure that utilizes fine-grained write locks, i.e., LIPP+, to validate the effectiveness of the aforementioned two strategies. Note that the lightweight probability models and node-evolving strategies are highly versatile and can be applied to various index scenarios, as detailed in Section 5.
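SALI's probability models are detailed in Section 3; the general idea of replacing exact shared counters with randomized, write-free sampling can be sketched as follows. This is our own illustrative sketch, not SALI's actual formulation: the `Node` class, the constant `RETRAIN_PROB`, and the method names are hypothetical. A node accessed n times fires on average n·p evolution checks, with no shared-memory write on the hot path and hence none of the cacheline ping-pong caused by exact per-node counters.

```python
import random

RETRAIN_PROB = 1.0 / 1024  # assumed sampling rate: ~one check per 1024 accesses

class Node:
    """Sketch of a node that triggers evolution probabilistically.

    The hot path performs no shared-memory write: each access draws a
    thread-local random number, so concurrent threads never contend on
    a statistics cacheline the way exact per-node counters do.
    """

    def __init__(self):
        self.retrain_events = 0

    def on_access(self, rng=random):
        # Hot path of a lookup/insert: only a local coin flip.
        if rng.random() < RETRAIN_PROB:
            self.maybe_evolve()

    def maybe_evolve(self):
        # Placeholder for retraining or node evolution
        # (expand with gaps, flatten, or compact).
        self.retrain_events += 1
```

Over a million accesses with p = 1/1024, such a node fires roughly a thousand evolution checks, matching in expectation what an exact counter with a threshold of 1024 would do, but without any cross-thread contention.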
Our microbenchmark with real-world data sets shows that SALI improves the insertion throughput with 64 threads by 2.04× on average compared to the second-best learned index, i.e., ALEX+, while achieving a lookup throughput comparable to LIPP+.
We make the following primary contributions in this paper. Firstly, we proposed SALI, a high-concurrency learned index framework designed to improve the scalability of learned indexes. Secondly, we defined a set of node-evolving strategies in addition to model retraining to allow the learned index to self-adapt to different workload skews. Thirdly, we replaced the per-node statistics in existing learned indexes with lightweight probability models to remove the scalability bottleneck of statistics maintenance while keeping the timing accuracy of node retraining/evolving. Finally, we proved the effectiveness of the proposed approaches by showing that SALI outperforms the SOTA learned indexes under high concurrency.

¹ALEX+ and LIPP+ are concurrent implementations of ALEX and LIPP, respectively [38].

Figure 2: The scheme of the learned index.
The rest of this paper is organized as follows. Section 2 summarizes the basics of learned indexes and further motivates the scalability problem. Section 3 introduces the structure of SALI with an emphasis on the node-evolving strategies and the probability models. Section 4 presents our experimental results. Section 5 discusses the generalizability of the node-evolving strategies and the probability models, as well as the limitations of SALI, followed by a related work discussion in Section 6. Section 7 concludes the paper. The source code of this paper has been made available at https://github.com/YunWorkshop/SALI.
2 BACKGROUND AND MOTIVATION
2.1 The Principle of Learned Indexes
The core concept of the learned index is to employ a set of learning models to estimate the cumulative distribution function (CDF) of the stored data [29], allowing for the prediction of the data's storage location, as depicted in the CDF diagram in Figure 2.
Figure 2 shows the scheme of the learned index structure. Each node, or only the lowest leaf nodes, stores the slope and intercept of a linear function [3, 5, 8, 15, 21, 36, 39]. Each segment corresponds to a linear model that estimates the approximate position of the target key, eliminating the multiple indirect search operations required in traditional tree-based indexes. This approach has the potential to improve lookup performance significantly.
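The per-segment linear model described above can be sketched in a few lines. This is a generic illustration of the learned index principle, not SALI's implementation; the `LearnedSegment` class and its fields are hypothetical. The model fits position ≈ slope · key + intercept to the keys' CDF, and the recorded maximum prediction error bounds the "last mile" binary search.

```python
import bisect

class LearnedSegment:
    """One segment of a learned index: a linear model over sorted keys.

    Illustrative sketch only; the class and its fields are our own,
    not taken from SALI.
    """

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit slope/intercept so that position ~ slope * key + intercept,
        # i.e., a linear approximation of the CDF of the stored keys.
        x0, x1 = self.keys[0], self.keys[-1]
        self.slope = (n - 1) / (x1 - x0) if x1 != x0 else 0.0
        self.intercept = -self.slope * x0
        # Maximum prediction error bounds the "last mile" search window.
        self.err = max(abs(self.predict(k) - i) for i, k in enumerate(self.keys))

    def predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        pos = self.predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        # "Last mile": binary search confined to the error-bounded window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

The larger the error bound, the wider the final search window; this margin is exactly what, under concurrency, saturates memory bandwidth for the buffer-based designs in Figure 1.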
2.2 Scalability Evaluation of Learned Index Structures
Currently, learned indexes demonstrate good performance in single-threaded environments. However, their scalability remains limited [38]. In this section, our objective is to conduct a thorough investigation into the factors that contribute to the concurrent performance bottlenecks in existing learned indexes. To achieve this, we begin by introducing the insertion strategies employed in learned indexes, along with their corresponding index structures, as these design choices significantly influence the concurrent performance of the indexes. Additionally, we conduct a comprehensive