
The distant variety pertains to situations where the workload
undergoes significant evolution over the long term. As shown in
Figure 1, workloads from hour 9 to hour 16 access columns
𝑐2, 𝑐3, 𝑐6, 𝑐8, which is a different set of columns than hours 1–8. Distant variety
is also observed in real scenarios, and it can result from a shift in
user intent or workload evolution. For example, Bustracker queries
during weekends are more activity-related, such as users exploring
bus schedules and alternative routes for leisure activities. In the
case of App-X [34], queries during the market closing time are more
technology-based, such as the securities company staff performing
transaction monitoring of specific stock categories. Moreover,
workload shift is observed [23] in MOOC² whenever new features
are released because more queries exploring and utilizing new
functionalities emerge.
The continual similarity and distant variety raise two questions
for learning-based index advisors.
First, existing learning-based index advisors [17, 18, 31] often
overlook continual similarity and need to implement expensive
trials to fit each workload. A natural question arises: can we train the
IA with a small batch of samples and use it to directly predict indexes
for similar workloads in a certain period of time? Since workloads
in adjacent hours have common query patterns and demands, they
would benefit from employing similar index selection strategies.
Here, we emphasize that directly recommending indexes for similar
workloads is non-trivial because the indexing policy is based on
accurate workload representation that captures the subtle, index-
aware differences in similar workloads.
Second, when workloads evolve, existing learning-based index
advisors utilize only the index trials on the current workload and
discard potentially valuable historical index selection experiences
on distant workloads in the past. Can we improve the learning effi-
ciency and reduce the training overhead of index advisors by fully
exploiting historical samples on past workloads? Since machine learn-
ing models can improve themselves through training experiences,
it is plausible that with appropriate treatments, past experiences
can give the index advisor a better initialization on the current
workload, and the index advisor will converge to the optimal index
selection faster. Nonetheless, integrating historical experiences into
the current training process is challenging because of the distant
variety. Clearly, the index selection strategies need to adapt and
undergo considerable changes to cater to the varied requirements of
distant workloads. The indexing policy must be carefully designed
to transfer knowledge from past experiences without negatively
impacting the current workload.
Most RL-based index advisors take a long training time to achieve
sufficiently good index recommendations, primarily because they
overlook the importance of continual similarity and distant variety
in their training process. For example, SWIRL [17] demonstrates
state-of-the-art effectiveness for dynamic workloads. Yet, it still
requires a considerable training duration on the TPC-H 10GB
dataset. By effectively leveraging the continual similarity and
distant variety, as we will show shortly, we only need 3.3 min of
training duration to achieve the same indexing performance.
²https://www.mooc.org/
Our Contributions. To address the above issues, we propose
BALANCE: a transfer RL-based index advisor for dynamic work-
loads in real scenarios.
First, BALANCE builds separate Lightweight Index Advisors
(LIAs) on sequential chunks of similar workloads. Each LIA is
trained with a small batch of samples drawn from the chunk, in-
stead of the whole chunk (i.e., lightweight). It can provide direct
index recommendations for other workloads in the same chunk
(Section 3.2). BALANCE achieves comparable indexing results to
the near-optimal Extend [32] with less than 0.6% inference runtime.
Second, BALANCE presents a policy transfer mechanism to
adapt the current LIA from previously trained LIAs based on work-
load similarities. Thus, knowledge learned from distant workloads
is transferred without introducing noise or fusing dissimilar work-
loads, and the training efficiency of reinforcement learning is im-
proved (Section 5). BALANCE improves SWIRL by 10.03% while
reducing the training overhead by 35.70% on average.
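One simple way to realize similarity-based transfer is to initialize the new LIA's policy parameters as a similarity-weighted blend of past LIAs, so that dissimilar (distant) workloads receive near-zero weight. The sketch below is a minimal illustration under that assumption; `transfer_init`, the softmax weighting, and the flat parameter vectors are hypothetical, not BALANCE's actual mechanism (see Section 5).

```python
import math

def _cos(u, v):
    # Cosine similarity between two workload representation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def transfer_init(new_repr, past_reprs, past_weights, tau=1.0):
    """Blend previously trained LIA parameters into an initialization
    for the new chunk's LIA, weighted by workload similarity."""
    sims = [_cos(new_repr, r) for r in past_reprs]
    # Softmax over similarities: distant workloads contribute little,
    # which limits negative transfer.
    exps = [math.exp(s / tau) for s in sims]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(past_weights[0])
    return [sum(a * w[i] for a, w in zip(alphas, past_weights))
            for i in range(dim)]
```

Training then starts from this blended initialization instead of random weights, which is what makes faster convergence on the current workload plausible.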
Third, BALANCE employs a self-supervised contrastive learning
method before training LIAs to obtain off-the-shelf workload repre-
sentations. Thus, the workload representations are obtained more
efficiently without actually implementing the time-costly reinforce-
ment learning trials. Furthermore, the workload representations
reveal key characteristics of the workloads related to indexing per-
formance and enable LIAs to produce more reliable and accurate
index recommendations (Section 4). An ablation study on BALANCE
shows that, compared with workload representations extracted
from query text [31], the self-supervised workload representation
can reduce workload execution cost by up to 6.6%.
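As background for the contrastive idea, a generic InfoNCE-style objective pulls two views of the same workload together in representation space and pushes other workloads away. The sketch below illustrates that standard loss only; it is not BALANCE's exact objective, and `info_nce` and its inputs are assumed names.

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """Generic InfoNCE contrastive loss: 'positive' is an embedding of
    another view of the same workload as 'anchor'; 'negatives' are
    embeddings of different workloads. Lower loss means the anchor is
    closer to its positive than to the negatives."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    pos = math.exp(cos(anchor, positive) / tau)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Minimizing such a loss requires only workload data, not index trials, which is why representations can be obtained before any reinforcement learning runs.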
2 RELATED WORK
2.1 Index Advisor
Some heuristic-based index advisors reduce a comprehensive set of
initial index candidates step by step [3, 37]. These methods often
lead to excessively long runtime because many iterations are re-
quired to satisfy the specified budget [16]. Other works add indexes
iteratively to an empty set, where the indexes can be single-column
indexes [12] or multi-column indexes [4, 6, 32, 36].
Recently, learning-based methods based on reinforcement learn-
ing [35] have shown great potential in both efficiency and accuracy.
Their difference mainly lies in state representations, which can be (1)
workload-independent [27, 41], e.g., treating potential index combi-
nations as tree nodes and fetching index configurations via Monte
Carlo Tree Search (MCTS); (2) query level [18], e.g., including the
frequency of queries; (3) column level [30, 31], e.g., incorporating
a selectivity vector of each attribute; or (4) plan level [17], e.g.,
incorporating query operators parsed from the execution plan.
As per [16], heuristic methods are shown to work poorly for large
databases and complex workloads because such methods cannot
balance between high inference efficiency and high index quality.
As per [17], Extend [32] produces the best workload execution cost,
and SWIRL [17] makes comparable workload cost reduction while
significantly reducing inference time and training overhead.
2.2 Transfer Reinforcement Learning
Reinforcement learning faces the problem of sparse feedback and
sample inefficiency, especially in high-complexity state and action