ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases

Xinyi Zhang∗†‡
Peking University & Alibaba Group
zhang_xinyi@pku.edu.cn

Hong Wu∗‡
Alibaba Group
hong.wu@alibaba-inc.com

Zhuo Chang‡§
Alibaba Group & Peking University
z.chang@pku.edu.cn

Shuowei Jin‡
Alibaba Group
shuowei.jsw@alibaba-inc.com

Jian Tan‡
Alibaba Group
j.tan@alibaba-inc.com

Feifei Li‡
Alibaba Group
lifeifei@alibaba-inc.com

Tieying Zhang‡
Alibaba Group
tieying.zhang@alibaba-inc.com

Bin Cui†§¶
Peking University
bin.cui@pku.edu.cn

ABSTRACT

Modern database management systems (DBMS) contain tens to hundreds of critical performance tuning knobs that determine the system runtime behaviors. To reduce the total cost of ownership, cloud database providers invest considerable effort in automatically optimizing the resource utilization by tuning these knobs. There are two challenges. First, the tuning system should always abide by the service level agreement (SLA) while optimizing the resource utilization, which imposes strict constraints on the tuning process. Second, the tuning time should be reasonably short, since time-consuming tuning is not practical for production and online troubleshooting.

In this paper, we design ResTune to automatically optimize the resource utilization without violating SLA constraints on the throughput and latency requirements. ResTune leverages the tuning experience from historical tasks and transfers the accumulated knowledge to accelerate the tuning process of new tasks. The prior knowledge from historical tuning tasks is represented through an ensemble model. The model learns the similarity between the historical workloads and the target, which significantly reduces the tuning time by a meta-learning based approach. ResTune can efficiently handle different workloads and various hardware environments. We perform evaluations using benchmarks and real world workloads on different types of resources. The results show that, compared with the manually tuned configurations, ResTune reduces CPU utilization, I/O, and memory by 65%, 87%, and 39% on average, respectively. Compared with the state-of-the-art methods, ResTune finds better configurations with up to ∼18× speedups.

∗ Xinyi Zhang and Hong Wu contribute equally to this paper.

† Center for Data Science, Peking University & National Engineering Laboratory for Big Data Analysis and Applications

‡ Database and Storage Laboratory, Damo Academy, Alibaba Group

§ School of EECS & Key Laboratory of High Confidence Software Technologies, Peking University

¶ Institute of Computational Social Science, Peking University (Qingdao)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SIGMOD ’21, June 20–25, 2021, Virtual Event, China

ACM ISBN 978-1-4503-8343-1/21/06...$15.00

https://doi.org/10.1145/3448016.3457291

CCS CONCEPTS

• Information systems → Autonomous database administration; • Computing methodologies → Machine learning.

KEYWORDS

resource; tuning; cloud database; service level agreement

ACM Reference Format:

Xinyi Zhang, Hong Wu, Zhuo Chang, Shuowei Jin, Jian Tan, Feifei Li, Tieying Zhang, and Bin Cui. 2021. ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD ’21), June 20–25, 2021, Virtual Event, China. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3448016.3457291

1 INTRODUCTION

Tuning conguration knobs of modern database management sys-

tems (DBMS) is critical for system performance, albeit challenging.

Dierent knobs directly aect the running database performance

and jointly determine the quality of service and the resource uti-

lization of DBMS. As a common practice, to apply an appropriate

conguration for a given workload, database administrators (DBAs)

are responsible for tuning these knobs based on experience. How-

ever, in a cloud environment, manually tuning possibly tens to

hundreds of controlling knobs do not guarantee the performance

across various workloads and could not scale. Therefore, automatic

tuning becomes an appealing feature for cloud providers.

On one hand, optimizing the system performance (e.g., throughput, latency) is critical to improving users’ experience. On the other hand, controlling the resource utilization is a necessity from the cloud provider’s perspective, for the following reasons. First, one of the goals of using cloud databases is to reduce the Total Cost of Ownership (TCO). Maintaining a low cost is an important economic factor to attract users, which calls for utilizing the available computing resources more efficiently. Second, optimizing computing resources such as CPU, memory, and I/O helps troubleshoot performance bugs that cause unnecessarily high utilization. High resource utilization often leads to unpredictable system hangs and resource contentions in a shared or multi-tenant environment [ ]. For example, high CPU utilization is a frequent issue that affects the availability of cloud databases [ ]. Third, the throughput of real workloads is often bounded by the request rate determined by the clients. Thus, the request rates do not necessarily reach the processing capacity of DBMS. For these common application scenarios, squeezing more throughput from the capacity is not the goal. Meanwhile, controlling resource utilization is more valuable for end-users, which can help them to choose appropriate cloud instance types and to further avoid over-provisioning.

Research Data Management Track Paper

Figure 1: TPS and CPU Usage for Real Workload with 2 Knobs (two panels, Throughput (txn/sec) and CPU Utilization (%), each plotted over the knobs sync_spin_loops and table_open_cache)

One challenge of tuning configuration knobs is to reduce resource utilization while still guaranteeing the Service Level Agreement (SLA), e.g., without violating the throughput and latency requirements. Figure 1 plots the throughput along with CPU usage on a real workload with 2 controlling knobs, i.e., the number of open tables and the number of times a thread waits for the mutex to be freed before suspending. The result shows that, even though a wide range of configurations have different CPU usages, they experience the same throughput. As mentioned earlier, the throughput of real workloads is often bounded by the user request rate. Therefore, there are opportunities to optimize resource utilization without sacrificing the SLA. Most existing database tuning methods [ ] mainly focus on improving the throughput and latency without optimizing the resource usage and SLA simultaneously. For example, iTuned [ ] and OtterTune [ ] use Gaussian Processes to tune knobs to achieve only high throughputs. CDBTune [ ] and QTune [ ] use the reinforcement learning approach to train a policy model to recommend good knobs, which, however, takes a long time to learn the model [23].
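The observation from Figure 1 suggests a simple selection rule: among configurations that meet the SLA, prefer the one with the lowest resource usage. The sketch below illustrates that rule over already-measured samples. The `Observation` type, the `best_feasible` helper, the knob values, and the measurements are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Observation:
    knobs: Dict[str, int]   # knob name -> value
    throughput: float       # txn/sec
    latency_ms: float       # average latency in milliseconds
    cpu_util: float         # CPU utilization in percent


def best_feasible(observations: List[Observation],
                  min_throughput: float,
                  max_latency_ms: float) -> Optional[Observation]:
    """Return the SLA-satisfying observation with the lowest CPU utilization."""
    feasible = [o for o in observations
                if o.throughput >= min_throughput
                and o.latency_ms <= max_latency_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o.cpu_util)


# Hypothetical measurements: the first two meet the SLA with the same
# throughput but different CPU usage; the third violates the SLA.
samples = [
    Observation({"table_open_cache": 1413, "innodb_sync_spin_loops": 1724},
                10000.0, 8.0, 70.0),
    Observation({"table_open_cache": 5650, "innodb_sync_spin_loops": 5172},
                10000.0, 8.5, 45.0),
    Observation({"table_open_cache": 9886, "innodb_sync_spin_loops": 8620},
                7000.0, 30.0, 20.0),
]
best = best_feasible(samples, min_throughput=9500.0, max_latency_ms=10.0)
print(best)
```

The third sample has the lowest CPU usage but is rejected because it misses both the throughput and the latency requirement, which is the reason the problem has to be treated as constrained rather than as plain minimization.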

MySQL knob: table_open_cache (the number of open tables)

MySQL knob: innodb_sync_spin_loops (the number of times a thread waits for the mutex to be freed before suspending)

The other challenge is to satisfy the natural constraint imposed by real applications, which often limit the acceptable tuning time. Tuning systems replay the workload repeatedly to learn the model iteratively, and the replay time dominates the tuning process. The state-of-the-art systems [ ] take hundreds to thousands of iterations to find an ideal configuration. For typical benchmarks that assume the transaction statistics do not change over time, the replay time can be set to 3-5 minutes [ ]. But for real workloads, we observe that the replay for each iteration takes at least 5 minutes to adapt to different types of transactions. This could cause the total tuning time for real workloads to last for a few days. This issue is more pronounced when considering that tuning itself requires computing resources such as DBMS copies to replay on the user side (Section 4). Thus, the tuning time should be minimized. In addition, tuning DBMS systems, e.g., reducing the high resource utilization, can be used for online performance troubleshooting. High utilization could have a severe impact on system availability. From this point of view, the tuning time should match the typical system recovery time, which is often from a few minutes to 1 hour [ ]. To accelerate the tuning process by reducing the budget to tens of iterations, ResTune utilizes the historical data collected from tuning other tasks and transfers the experience into tuning new tasks. This requires the tuning algorithm to efficiently and effectively represent useful knowledge from historical tuning data.

Our Approach. Different from previous works that only consider the throughput and latency, in this paper, we define the resource-oriented tuning problem that aims to find the configurations to minimize the resource usage without sacrificing the throughput and latency. We formulate it as a constrained optimization problem and propose ResTune, a constraint-aware database tuning system boosted by meta-learning. ResTune is a tool provided by the cloud providers, which aims to reduce the Total Cost of Ownership for its end users. It optimizes the resource utilization for a given workload by imposing constraints on the performance requirements. ResTune models both the objective function and the constraints using Gaussian processes to recommend configurations with optimized resource utilization while guaranteeing the SLA. To improve the efficiency of ResTune, we use meta-learning, which is the method of systematically learning from meta-data to accomplish new tasks [ ]. A novel meta-learning pipeline is proposed to use multiple models (base-learners) to represent prior knowledge and an ensemble model (meta-learner) to combine and effectively utilize the experiences. The meta-learner measures the usefulness of base-learners to the target workload through meta-features and model predictions. In this way, ResTune could accordingly make use of existing data and accelerate the tuning process. Furthermore, our approach can transfer the knowledge over different workloads and heterogeneous hardware environments.
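To make the idea of modeling both the objective and the constraints concrete, the sketch below scores a candidate configuration with constrained expected improvement, a common acquisition function for constrained Bayesian optimization. It is an assumption that this matches ResTune's choice, since this section does not name the acquisition function. The inputs are the predictive mean and standard deviation that a Gaussian process (or any probabilistic surrogate) would return; all function names and numbers are hypothetical.

```python
import math


def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Expected improvement for minimization over the best feasible value."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * norm_cdf(z) + sigma * norm_pdf(z)


def prob_at_least(mu: float, sigma: float, threshold: float) -> float:
    """P(value >= threshold) for a Gaussian prediction."""
    if sigma <= 0.0:
        return 1.0 if mu >= threshold else 0.0
    return norm_cdf((mu - threshold) / sigma)


def constrained_ei(res, tps, lat, best_res, min_tps, max_lat) -> float:
    """res, tps, lat are (mean, std) predictions for resource usage,
    throughput, and latency of one candidate configuration."""
    ei = expected_improvement(res[0], res[1], best_res)
    p_tps = prob_at_least(tps[0], tps[1], min_tps)    # throughput >= SLA
    p_lat = prob_at_least(-lat[0], lat[1], -max_lat)  # latency <= SLA
    return ei * p_tps * p_lat


# Hypothetical predictions: "safe" improves CPU modestly and meets the SLA;
# "risky" promises lower CPU but is predicted to miss the throughput target.
safe = constrained_ei((40.0, 5.0), (1000.0, 20.0), (10.0, 1.0),
                      best_res=50.0, min_tps=900.0, max_lat=20.0)
risky = constrained_ei((20.0, 5.0), (800.0, 20.0), (10.0, 1.0),
                       best_res=50.0, min_tps=900.0, max_lat=20.0)
print(safe, risky)
```

Multiplying the improvement by the probability of feasibility is what steers the search away from configurations that save resources only by violating the throughput or latency requirement.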

Specically, we make the following contributions:

•

To deal with the challenges in real DBMS scenarios, we formu-

late the resource-oriented conguration tuning problem as a

constrained Bayesian Optimization problem.

•

To accelerate the tuning process within an acceptable time inter-

val, a meta-learning strategy is proposed to extract experience

from past tasks. Unlike previous studies, our approach uses rel-

ative rankings rather than absolute distances to measure the

similarity between workloads. It can better transfer knowledge

across dierent hardware environments and achieve fast tuning

and ecient adaptation. To the best of our knowledge, this is the

rst attempt to boost constrained Bayesian Optimization with

meta-learning for tuning DBMS.

•

We implement the proposed method and evaluate on standard

benchmarks and real workloads. Compared with the manual con-

gurations provided by the DBAs, ResTune reduces 65% of CPU

utilization, 87% of I/O, and 39% of memory on average. Compared

with the state-of-the-art DBMS tuning systems, ResTune nds

better congurations with up to ∼ 18× speedups.

The remainder of the paper is organized as follows. Section 2 provides the related work and Section 3 formally defines the
