Leveraging Dynamic and Heterogeneous Workload Knowledge to Boost the Performance of Index Advisors

Zijia Wang, Haoran Liu, Chen Lin
School of Informatics, Xiamen University, China
zjwang@stu.xmu.edu.cn, liuhr@stu.xmu.edu.cn, chenlin@xmu.edu.cn

Zhifeng Bao
RMIT University, Australia
zhifeng.bao@rmit.edu.au

Guoliang Li
Department of Computer Science, Tsinghua University, China
liguoliang@tsinghua.edu.cn

Tianqing Wang
Huawei Company, China
wangtianqing2@huawei.com
ABSTRACT
Current index advisors often struggle to balance efficiency and effectiveness when dealing with workload shifts. This arises from ignorance of the continual similarity and distant variety in workloads. This paper proposes a novel learning-based index advisor called BALANCE, which boosts indexing performance by leveraging knowledge obtained from dynamic and heterogeneous workloads. Our approach consists of three components. First, we build separate Lightweight Index Advisors (LIAs) on sequential chunks of similar workloads, where each LIA is trained with a small batch of workloads drawn from the chunk, and it provides direct index recommendations for all workloads in the same chunk. Second, we perform a policy transfer mechanism by adapting the LIA's index selection strategy from historical knowledge, substantially reducing the training overhead. Third, we employ a self-supervised contrastive learning method to provide an off-the-shelf workload representation, enabling the LIA to generate more accurate index recommendations. Extensive experiments across various benchmarks demonstrate that BALANCE improves the state-of-the-art learning-based index advisor, SWIRL, by 10.03% while reducing training overhead by 35.70% on average.
PVLDB Reference Format:
Zijia Wang, Haoran Liu, Chen Lin, Zhifeng Bao, Guoliang Li, and Tianqing
Wang. Leveraging Dynamic and Heterogeneous Workload Knowledge to
Boost the Performance of Index Advisors. PVLDB, 17(7): 1642 - 1654, 2024.
doi:10.14778/3654621.3654631
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at
https://github.com/XMUDM/BALANCE.
1 INTRODUCTION
Selecting appropriate attributes to build indexes can accelerate data retrieval at the cost of storage overhead and index maintenance [27]. This can be formalized as the Index Selection Problem (ISP), which is to select the set of indexes for a given workload to maximize the workload's performance while satisfying the storage cost constraints.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 17, No. 7, ISSN 2150-8097. doi:10.14778/3654621.3654631

Figure 1: An illustrative example of the frequency of columns accessed by the workload every hour (x-axis: time in hours; y-axis: access frequency; one series per accessed column c1-c9)
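The ISP objective can be written as a constrained optimization. The notation below is ours, given for exposition; it is a common textbook formalization rather than the paper's exact statement:

```latex
\min_{I \subseteq \mathcal{C}} \; \sum_{q \in W} f_q \cdot \mathrm{Cost}(q, I)
\quad \text{s.t.} \quad \sum_{i \in I} \mathrm{Size}(i) \le B
```

where $W$ is the workload with per-query frequencies $f_q$, $\mathcal{C}$ is the set of candidate indexes, $\mathrm{Cost}(q, I)$ is the estimated execution cost of query $q$ under index configuration $I$, and $B$ is the storage budget.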
The ISP has been proven to be NP-hard [28], and many index advisors have been proposed to solve it. Early Index Advisors (IAs) are usually heuristic-based [3, 4, 6, 12, 32, 36, 37], which use predefined rules to add or remove indexes to the output index configuration iteratively. Recently, learning-based IAs have emerged, and they are mostly based on deep Reinforcement Learning (RL) [17, 18, 22, 31] or Monte Carlo tree search [38, 41] to make index selection decisions.
However, in real-world scenarios, workloads evolve over time. Existing IAs struggle to deliver effective index configurations efficiently for dynamic and heterogeneous workloads due to the costly tuning overhead (e.g., time cost to train IAs, time cost to determine index selections, etc.). We argue that they neglect two important temporal properties of workloads, i.e., continual similarity and distant variety.
The continual similarity refers to situations where the workload experiences relatively minor changes in the short term. As shown in Figure 1, workloads from hour 1 to hour 8 have similar query patterns, e.g., they access columns c1, c2, c5 with the same frequency. This property has been observed in many production database systems. For example, BusTracker¹ is a mobile phone application for live-tracking of the public transit bus system. On weekdays during commuting hours, the workloads contain many commute-related queries, such as checking bus arrival times, planning routes to desired destinations, etc. Another example is App-X [34] for trading on the stock exchange. During the opening bell, most queries are value-based, such as stock buyers searching for the current price of specific stocks or checking market trends.

¹ http://www.cs.cmu.edu/~malin199/data/tiramisu-sample
The distant variety pertains to situations where the workload undergoes significant evolution over the long term. As shown in Figure 1, workloads from hour 9 to hour 16 access columns c2, c3, c6, c8, which is a different set of columns than hours 1-8. Distant variety is also observed in real scenarios, and it can result from a shift in user intent or workload evolution. For example, BusTracker queries during weekends are more activity-related, such as users exploring bus schedules and alternative routes for leisure activities. In the case of App-X [34], queries during the market closing time are more technology-based, such as securities company staff performing transaction monitoring of specific stock categories. Moreover, workload shift is observed [23] in MOOC² whenever new features are released, because more queries exploring and utilizing new functionalities emerge.
The continual similarity and distant variety raise two questions
for learning-based index advisors.
First, existing learning-based index advisors [17, 18, 31] often overlook continual similarity and need to implement expensive trials to fit each workload. A natural question arises: can we train the IA with a small batch of samples and use it to directly predict indexes for similar workloads in a certain period of time? Since workloads in adjacent hours have common query patterns and demands, they would benefit from employing similar index selection strategies. Here, we emphasize that directly recommending indexes for similar workloads is non-trivial, because the indexing policy relies on an accurate workload representation that captures the subtle, index-aware differences among similar workloads.
Second, when workloads evolve, existing learning-based index advisors utilize only the index trials on the current workload and discard potentially valuable historical index selection experiences on distant workloads in the past. Can we improve the learning efficiency and reduce the training overhead of index advisors by fully exploiting historical samples on past workloads? Since machine learning models can improve themselves through training experiences, it is plausible that, with appropriate treatments, past experiences can give the index advisor a better initialization on the current workload, and the index advisor will converge to the optimal index selection faster. Nonetheless, integrating historical experiences into the current training process is challenging because of the distant variety. Clearly, the index selection strategies need to adapt and undergo considerable changes to cater to the varied requirements of distant workloads. The indexing policy must be carefully designed to transfer knowledge from past experiences without negatively impacting the current workload.
Most RL-based index advisors take a long training time to achieve sufficiently good index recommendations, primarily because they overlook the importance of continual similarity and distant variety in their training process. For example, SWIRL [17] demonstrates state-of-the-art effectiveness for dynamic workloads. Yet, it still requires a considerable training duration on the TPC-H 10GB dataset. By effectively leveraging the continual similarity and distant variety, as we will show shortly, we only need 3.3 min of training duration to achieve the same indexing performance.
² https://www.mooc.org/
Our Contributions. To address the above issues, we propose BALANCE: a transfer RL-based index advisor for dynamic workloads in real scenarios.
First, BALANCE builds separate Lightweight Index Advisors (LIAs) on sequential chunks of similar workloads. Each LIA is trained with a small batch of samples drawn from the chunk, instead of the whole chunk (i.e., lightweight). It can provide direct index recommendations for other workloads in the same chunk (Section 3.2). BALANCE achieves comparable indexing results to the near-optimal Extend [32] with less than 0.6% inference runtime.
Second, BALANCE presents a policy transfer mechanism to adapt the current LIA from previously trained LIAs based on workload similarities. Thus, knowledge learned from distant workloads is transferred without introducing noise or fusing dissimilar workloads, and the training efficiency of reinforcement learning is improved (Section 5). BALANCE improves SWIRL by 10.03% while reducing the training overhead by 35.70% on average.
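One minimal way to realize similarity-guided transfer is to warm-start the new LIA from the historical LIA whose chunk representation is closest to the current one. The sketch below is ours, under our own assumptions (names `pick_source_policy`, `history`, and cosine similarity are illustrative); it is not the paper's actual transfer mechanism:

```python
import numpy as np

def pick_source_policy(current_repr, history):
    """Choose the historical LIA whose chunk representation is most
    similar to the current chunk (cosine similarity), and return its
    parameters as the initialization for the new LIA.
    `history` is a list of (chunk_repr, policy_params) pairs."""
    def cos(a, b):
        # small epsilon guards against zero-norm representations
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    best_repr, best_params = max(history, key=lambda h: cos(current_repr, h[0]))
    return best_params
```

A refinement consistent with the paper's stated goal (avoiding noise from dissimilar workloads) would be to fall back to random initialization when even the best similarity is below a threshold, rather than always transferring.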
Third, BALANCE employs a self-supervised contrastive learning method before training LIAs to obtain off-the-shelf workload representations. Thus, the workload representations are obtained more efficiently, without actually implementing the time-costly reinforcement learning trials. Furthermore, the workload representations reveal key characteristics of the workloads related to indexing performance and enable LIAs to produce more reliable and accurate index recommendations (Section 4). An ablation study on BALANCE shows that, compared with workload representations extracted from query text [31], the self-supervised workload representation can reduce workload execution cost by up to 6.6%.
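To illustrate the general idea of contrastive representation learning, a generic InfoNCE-style loss is sketched below: two views of the same workload form a positive pair, while embeddings of other workloads serve as negatives. This is an assumed textbook objective for exposition only; the paper's exact loss and augmentations may differ:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor workload embedding: pull the
    positive (another view of the same workload) close, push the
    negatives (embeddings of other workloads) away.
    All embeddings are assumed L2-normalized."""
    sims = [np.dot(anchor, positive) / temperature]
    sims += [np.dot(anchor, n) / temperature for n in negatives]
    logits = np.array(sims)
    # cross-entropy with the positive pair at index 0
    return float(np.log(np.exp(logits).sum()) - logits[0])
```

Minimizing this loss over many workload pairs yields embeddings in which similar workloads cluster together, which is exactly the property the chunking and policy-transfer steps rely on.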
2 RELATED WORK
2.1 Index Advisor
Some heuristic-based index advisors reduce a comprehensive set of initial index candidates step by step [3, 37]. These methods often lead to excessively long runtime because many iterations are required to satisfy the specified budget [16]. Other works add indexes iteratively to an empty set, where the indexes can be single-column indexes [12] or multi-column indexes [4, 6, 32, 36].
Recently, learning-based methods based on reinforcement learning [35] have shown great potential in both efficiency and accuracy. Their difference mainly lies in state representations, which can be (1) workload-independent [27, 41], e.g., treating potential index combinations as tree nodes and fetching index configurations via Monte Carlo Tree Search (MCTS); (2) query level [18], e.g., including the frequency of queries; (3) column level [30, 31], e.g., incorporating a selectivity vector of each attribute; or (4) plan level [17], e.g., incorporating query operators parsed from the execution plan.
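As a concrete illustration of option (3), a column-level state could be assembled as follows. This is a toy encoding with made-up feature names, not the representation used by any cited advisor:

```python
def column_state(selectivity, access_freq):
    """Toy column-level state: one (column, selectivity, frequency)
    triple per candidate column, in a fixed column order so the RL
    agent always sees features at stable positions.
    `selectivity` maps column -> selectivity in [0, 1];
    `access_freq` maps column -> how often the workload touches it."""
    return [(col, sel, access_freq.get(col, 0))
            for col, sel in sorted(selectivity.items())]
```

Real advisors add many more features per column (e.g., data width, current index membership), but the fixed-position layout is the key design choice that lets one network score arbitrary workloads.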
As per [16], heuristic methods are shown to work poorly for large databases and complex workloads, because such methods cannot balance between high inference efficiency and high index quality. As per [17], Extend [32] produces the best workload execution cost, and SWIRL [17] makes a comparable workload cost reduction while significantly reducing inference time and training overhead.
2.2 Transfer Reinforcement Learning
Reinforcement learning faces the problem of sparse feedback and sample inefficiency, especially in high-complexity state and action