2020-6_Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases _Minghua Ma.pdf

Solo

14页

3次

2022-01-27

免费下载

Diagnosing Root Causes of Intermittent Slow Queries

in Cloud Databases

Minghua Ma

∗1,2

, Zheng Yin

, Shenglin Zhang

†3

, Sheng Wang

, Christopher Zheng

, Xinhao Jiang

Hanwen Hu

, Cheng Luo

, Yilin Li

, Nengjun Qiu

, Feifei Li

, Changcheng Chen

and Dan Pei

Tsinghua University, {mmh16, luo-c18, liyilin16}@mails.tsinghua.edu.cn, peidan@tsinghua.edu.cn

Alibaba Group, {yinzheng.yz, sh.wang, nengjun.qnj, lifeifei}@alibaba-inc.com

Nankai University, zhangsl@nankai.edu.cn

ABSTRACT

With the growing market of cloud databases, careful detection and

elimination of slow queries are of great importance to service sta-

bility. Previous studies focus on optimizing the slow queries that

result from internal reasons (e.g., poorly-written SQLs). In this

work, we discover a different set of slow queries which might be

more hazardous to database users than other slow queries. We name

such queries Intermittent Slow Queries (iSQs), because they usu-

ally result from intermittent performance issues that are external

(e.g., at database or machine levels). Diagnosing root causes of

iSQs is a tough but very valuable task.

This paper presents iSQUAD, Intermittent Slow QUery Anomaly

Diagnoser, a framework that can diagnose the root causes of iSQs

with a loose requirement for human intervention. Due to the com-

plexity of this issue, a machine learning approach comes to light

naturally to draw the interconnection between iSQs and root causes,

but it faces challenges in terms of versatility, labeling overhead and

interpretability. To tackle these challenges, we design four com-

ponents, i.e., Anomaly Extraction, Dependency Cleansing, Type-

Oriented Pattern Integration Clustering (TOPIC) and Bayesian Case

Model. iSQUAD consists of an ofﬂine clustering & explanation

stage and an online root cause diagnosis & update stage. DBAs

need to label each iSQ cluster only once at the ofﬂine stage un-

less a new type of iSQs emerges at the online stage. Our evalu-

ations on real-world datasets from Alibaba OLTP Database show

that iSQUAD achieves an iSQ root cause diagnosis average F1-

score of 80.4%, and outperforms existing diagnostic tools in terms

of accuracy and efﬁciency.

PVLDB Reference Format:

M. Ma, Z. Yin, S. Zhang, S. Wang, C. Zheng, X Jiang, H. Hu, C. Luo, Y. Li,

N. Qiu, F. Li, C. Chen and D. Pei. Diagnosing Root Causes of Intermittent

Slow Queries in Cloud Databases. PVLDB, 13(8): 1176-1189, 2020.

DOI: https://doi.org/10.14778/3389133.3389136

Work was done while the author was interning at Alibaba Group.

†

Work was done while the author was a visiting scholar at Alibaba

Group.

This work is licensed under the Creative Commons Attribution-

NonCommercial-NoDerivatives 4.0 International License. To view a copy

of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For

any use beyond those covered by this license, obtain permission by emailing

info@vldb.org. Copyright is held by the owner/author(s). Publication rights

licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 13, No. 8

ISSN 2150-8097.

DOI: https://doi.org/10.14778/3389133.3389136

1. INTRODUCTION

The growing cloud database services, such as Amazon Rela-

tional Database Service, Azure SQL Database, Google Cloud SQL

and Alibaba OLTP Database, are critical infrastructures that sup-

port daily operations and businesses of enterprises. Service inter-

ruptions or performance hiccups in databases can lead to severe rev-

enue loss and brand damage. Therefore, databases are always under

constant monitoring, where the detection and elimination of slow

queries are of great importance to service stability. Most database

systems, such as MySQL, Oracle, SQL Server, automatically log

detailed information of those queries whose completion time is

over a user-deﬁned threshold [7, 37, 43], i.e., slow queries. Some

slow queries result from internal reasons, such as nature of com-

plexity, lack of indexes and poorly-written SQL statements, which

can be automatically analyzed and optimized [13,32,34,42]. Many

other slow queries, however, result from intermittent performance

issues that are external (e.g., at database or machine levels), and we

name them Intermittent Slow Queries (iSQs).

Usually, iSQs are the cardinal symptom of performance issues or

even failures in cloud databases. As iSQs are intermittent, service

developers and customers expect them to be responsive as normal,

where sudden increases of latency have huge impacts. For exam-

ple, during web browsing, an iSQ may lead to unexpected web page

loading delay. It has been reported that every 0.1s of loading de-

lay would cost Amazon 1% in sales, and every 0.5s of additional

load delay for Google search results would led to a 20% drop in

trafﬁc [30]. We obtain several performance issue records carefully

noted by DBAs of Alibaba OLTP Database in a year span: when a

performance issue occurs, a burst of iSQs lasts for minutes. As a

matter of fact, manually diagnosing root causes of iSQs takes tens

of minutes, which is both time consuming and error-prone.

Diagnosing root causes of iSQs gets crucial and challenging in

cloud. First, iSQ occurrences become increasingly common. Mul-

tiple database instances may reside on the same physical machines

for better utilization, which in turn can cause inter-database re-

source contentions. Second, root causes of iSQs vary greatly. In-

frastructures of cloud databases are more complex than those of

on-premise databases [29], making it harder for DBAs to diagnose

root causes. Precisely, this complexity can be triggered by instance

migrations, expansions, storage decoupling, etc. Third, massive

database instances in cloud make iSQs great in population. For ex-

ample, tens of thousands of iSQs are generated in Alibaba OLTP

Database per day. In addition, roughly 83% of enterprise work-

loads are forecasted to be in the cloud by 2020 [12]. This trend

makes it critical to efﬁciently diagnose the root causes of iSQs.

In this work, we aim to diagnose root causes of iSQs in cloud

databases with minimal human intervention. We learn about symp-

1176

toms and root causes from failure records noted by DBAs of Al-

ibaba OLTP Database, and we underscore four observations:

1) DBAs need to scan hundreds of Key Performance Indicators

(KPIs) to ﬁnd out performance issue symptoms. These KPIs are

classiﬁed by DBAs to eight types corresponding to different root

causes (as summarized in Table 1). Traditional root cause analysis

(RCA) [2,6, 9, 18], however, does not have the capability of specif-

ically distinguishing multiple types of KPI symptoms to diagnose

the root causes of iSQs. For instance, by using system monitoring

data, i.e., single KPI alone (or a single type of KPIs), we usually

cannot pinpoint iSQs’ root causes [10].

2) Performance issue symptoms mainly include different patterns

of KPIs. We summarize three sets of symmetric KPI patterns, i.e.,

spike up or down, level shift up or down, and void. We observe

that even if two iSQs have the identical set of anomalous KPIs (but

with distinct anomaly behaviors), their root causes can differ. Thus,

purely based on detecting KPI anomalies as normal or abnormal we

cannot precisely diagnose iSQs’ root causes [6, 45].

3) One anomalous KPI is usually accompanied by another one

or more anomalous KPIs. Certain KPIs are highly correlated [24],

and rapid fault propagation in databases renders them anomalous

almost simultaneously. We observe that the way in which a KPI

anomaly propagates can be either unidirectional or bidirectional.

4) Similar symptoms are correlated to the same root cause. In

each category of root causes, KPI symptoms of performance issues

are similar to each other’s. For instance, KPIs in the same type can

substitute each other, but their anomaly categories remain constant.

Nevertheless, it is infeasible to enumerate and verify all possible

causalities between anomalous KPIs and root causes [36].

As a result, iSQs with various KPI ﬂuctuation patterns appear to

have complex relationships with diverse root causes. To discover

and untangle such relationships, we have made efforts to explore

machine learning (ML) based approaches, but have encountered

many challenges during this process. First, anomalous KPIs need

to be properly detected when an iSQ occurs. Traditional anomaly

detection methods recognize only anomalies themselves, but not

anomaly types (i.e., KPI ﬂuctuation changes such as spike up or

down, level shift up or down). The availability of such information

is vital to ensure high accuracy of subsequent diagnoses. Second,

based on detected KPI ﬂuctuation patterns, the root cause of that

iSQ has to be identiﬁed from numbers of candidates. Standard su-

pervised learning methods are not suitable for such diagnoses be-

cause the case-by-case labeling of root causes is prohibitive. An

iSQ can trigger many anomalous KPIs and lead to tremendous in-

vestigation, taking hours of DBAs’ labor. Third, though unsuper-

vised learning (e.g., clustering) is an eligible approach to easing

the labeling task for DBAs, it only retains limited efﬁcacy to in-

spect every cluster. It is known to be hard to make clusters that are

both intuitive (or interpretable) to DBAs and accurate [26].

To address the aforementioned challenges, we design iSQUAD

(Intermittent Slow QUery Anomaly Diagnoser), a comprehensive

framework for iSQ root cause diagnoses with a loose requirement

for human intervention. In detail, we adopt Anomaly Extraction

and Dependency Cleansing in place of traditional anomaly detec-

tion approaches to tackle the ﬁrst challenge of anomaly diversity.

For labeling overhead reduction, Type-Oriented Pattern Integra-

tion Clustering (TOPIC) is proposed to cluster iSQs of the same

root causes together, considering both KPIs and anomaly types.

In this way, DBAs only need to explore one representative root

cause in each cluster rather than label numbers of them individu-

ally. For clustering interpretability, we take advantage of Bayesian

Case Model to extract a case-based representation for each cluster,

which is easier for DBAs to investigate. In a nutshell, iSQUAD

consists of two stages: an ofﬂine clustering & explanation stage

and an online root cause diagnosis & update stage. The ofﬂine

stage is run ﬁrst to obtain the clusters and root causes, which are

then used by the online stage for future diagnoses. DBAs only need

to label each iSQ cluster once, unless a new type of iSQs emerges.

By using iSQUAD, we signiﬁcantly reduce the burden of iSQ root

cause diagnoses for DBAs on cloud database platforms.

The key contributions of our work are as follows:

• We identify the problem of Intermittent Slow Queries in cloud

databases, and design a scalable framework called iSQUAD that

provides accurate and efﬁcient root cause diagnosis of iSQs. It

adopts machine learning techniques, while overcomes the inher-

ent obstacles in terms of versatility, labeling overhead and inter-

pretability.

• We apply Anomaly Extraction of KPIs in place of anomaly de-

tection to distinguish anomaly types. A novel clustering algo-

rithm TOPIC is proposed to reduce the labeling overheads.

• To the best of our knowledge, we are the ﬁrst to apply and inte-

grate case-based reasoning via the Bayesian Case Model [23] in

database domain and to introduce the case-subspace representa-

tions to DBAs for labeling.

• We conduct extensive experiments for iSQUAD’s evaluation and

demonstrate that our method achieves an average F1-score of

80.4%, i.e., 49.2% higher than that of the previous technique.

Furthermore, we have deployed a prototype of iSQUAD in a

real-world cloud database service. iSQUAD helps DBAs diag-

nose all ten root causes of several hundred iSQs in 80 minutes,

which is approximately thirty times faster than traditional case-

by-case diagnosis.

The rest of this paper is organized as follows: §2 describes iSQs,

the motivation and challenges of their root cause diagnoses. §3

overviews our framework, iSQUAD. §4 discusses detailed ML tech-

niques in iSQUAD that build comprehensive clustering models. §5

shows our experimental results. §6 presents a case study in a real-

world cloud database and our future work. §7 reviews the related

work, and §8 concludes the paper.

2. BACKGROUND AND MOTIVATION

In this section, we ﬁrst introduce background on iSQs. Then,

we conduct an empirical study from database performance issue

records to gain some insights. Finally, we present three key chal-

lenges in diagnosing the root causes of iSQs.

2.1 Background

Alibaba OLTP Database. Alibaba OLTP Database (in short as Al-

ibaba Database) is a multi-tenant DBPaaS supporting a number of

ﬁrst-party services including Taobao (customer-to-customer online

retail service), Tmall (business-to-consumer online retail service),

DingTalk (enterprise collaboration service), Cainiao (logistics ser-

vice), etc. This database houses over one hundred thousand ac-

tively running instances across tens of geographical regions. To

monitor the compliance with SLAs (Service-Level Agreements),

the database is equipped with a measurement system [9] that con-

tinuously collects logs and KPIs (Key Performance Indicators).

Intermittent Slow Queries (iSQs). Most database systems, such

as MySQL, Oracle, SQL Server, automatically record query time

of each query execution [7, 37, 43]. The query time is the time be-

tween when an SQL query is submitted to, and when its results are

returned by, the database. We formally deﬁne Intermittent Slow

Queries (iSQs) as follows. For a SQL query Q, its t

occurrence

(whose observed execution time is X

) is an iSQ if and only if

> z and P (X

> z) < , where 1 ≤ t, i ≤ T (T is the total

1177

of 14

免费下载

关注

评论