暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
2020-6_Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases _Minghua Ma.pdf
67
14页
3次
2022-01-27
免费下载
Diagnosing Root Causes of Intermittent Slow Queries
in Cloud Databases
Minghua Ma
1,2
, Zheng Yin
2
, Shenglin Zhang
3
, Sheng Wang
2
, Christopher Zheng
1
, Xinhao Jiang
1
,
Hanwen Hu
1
, Cheng Luo
1
, Yilin Li
1
, Nengjun Qiu
2
, Feifei Li
2
, Changcheng Chen
2
and Dan Pei
1
1
Tsinghua University, {mmh16, luo-c18, liyilin16}@mails.tsinghua.edu.cn, peidan@tsinghua.edu.cn
2
Alibaba Group, {yinzheng.yz, sh.wang, nengjun.qnj, lifeifei}@alibaba-inc.com
3
Nankai University, zhangsl@nankai.edu.cn
ABSTRACT
With the growing market of cloud databases, careful detection and
elimination of slow queries are of great importance to service sta-
bility. Previous studies focus on optimizing the slow queries that
result from internal reasons (e.g., poorly-written SQLs). In this
work, we discover a different set of slow queries which might be
more hazardous to database users than other slow queries. We name
such queries Intermittent Slow Queries (iSQs), because they usu-
ally result from intermittent performance issues that are external
(e.g., at database or machine levels). Diagnosing root causes of
iSQs is a tough but very valuable task.
This paper presents iSQUAD, Intermittent Slow QUery Anomaly
Diagnoser, a framework that can diagnose the root causes of iSQs
with a loose requirement for human intervention. Due to the com-
plexity of this issue, a machine learning approach comes to light
naturally to draw the interconnection between iSQs and root causes,
but it faces challenges in terms of versatility, labeling overhead and
interpretability. To tackle these challenges, we design four com-
ponents, i.e., Anomaly Extraction, Dependency Cleansing, Type-
Oriented Pattern Integration Clustering (TOPIC) and Bayesian Case
Model. iSQUAD consists of an offline clustering & explanation
stage and an online root cause diagnosis & update stage. DBAs
need to label each iSQ cluster only once at the offline stage un-
less a new type of iSQs emerges at the online stage. Our evalu-
ations on real-world datasets from Alibaba OLTP Database show
that iSQUAD achieves an iSQ root cause diagnosis average F1-
score of 80.4%, and outperforms existing diagnostic tools in terms
of accuracy and efficiency.
PVLDB Reference Format:
M. Ma, Z. Yin, S. Zhang, S. Wang, C. Zheng, X Jiang, H. Hu, C. Luo, Y. Li,
N. Qiu, F. Li, C. Chen and D. Pei. Diagnosing Root Causes of Intermittent
Slow Queries in Cloud Databases. PVLDB, 13(8): 1176-1189, 2020.
DOI: https://doi.org/10.14778/3389133.3389136
*
Work was done while the author was interning at Alibaba Group.
Work was done while the author was a visiting scholar at Alibaba
Group.
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For
any use beyond those covered by this license, obtain permission by emailing
info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 8
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3389133.3389136
1. INTRODUCTION
The growing cloud database services, such as Amazon Rela-
tional Database Service, Azure SQL Database, Google Cloud SQL
and Alibaba OLTP Database, are critical infrastructures that sup-
port daily operations and businesses of enterprises. Service inter-
ruptions or performance hiccups in databases can lead to severe rev-
enue loss and brand damage. Therefore, databases are always under
constant monitoring, where the detection and elimination of slow
queries are of great importance to service stability. Most database
systems, such as MySQL, Oracle, SQL Server, automatically log
detailed information of those queries whose completion time is
over a user-defined threshold [7, 37, 43], i.e., slow queries. Some
slow queries result from internal reasons, such as nature of com-
plexity, lack of indexes and poorly-written SQL statements, which
can be automatically analyzed and optimized [13,32,34,42]. Many
other slow queries, however, result from intermittent performance
issues that are external (e.g., at database or machine levels), and we
name them Intermittent Slow Queries (iSQs).
Usually, iSQs are the cardinal symptom of performance issues or
even failures in cloud databases. As iSQs are intermittent, service
developers and customers expect them to be responsive as normal,
where sudden increases of latency have huge impacts. For exam-
ple, during web browsing, an iSQ may lead to unexpected web page
loading delay. It has been reported that every 0.1s of loading de-
lay would cost Amazon 1% in sales, and every 0.5s of additional
load delay for Google search results would led to a 20% drop in
traffic [30]. We obtain several performance issue records carefully
noted by DBAs of Alibaba OLTP Database in a year span: when a
performance issue occurs, a burst of iSQs lasts for minutes. As a
matter of fact, manually diagnosing root causes of iSQs takes tens
of minutes, which is both time consuming and error-prone.
Diagnosing root causes of iSQs gets crucial and challenging in
cloud. First, iSQ occurrences become increasingly common. Mul-
tiple database instances may reside on the same physical machines
for better utilization, which in turn can cause inter-database re-
source contentions. Second, root causes of iSQs vary greatly. In-
frastructures of cloud databases are more complex than those of
on-premise databases [29], making it harder for DBAs to diagnose
root causes. Precisely, this complexity can be triggered by instance
migrations, expansions, storage decoupling, etc. Third, massive
database instances in cloud make iSQs great in population. For ex-
ample, tens of thousands of iSQs are generated in Alibaba OLTP
Database per day. In addition, roughly 83% of enterprise work-
loads are forecasted to be in the cloud by 2020 [12]. This trend
makes it critical to efficiently diagnose the root causes of iSQs.
In this work, we aim to diagnose root causes of iSQs in cloud
databases with minimal human intervention. We learn about symp-
1176
toms and root causes from failure records noted by DBAs of Al-
ibaba OLTP Database, and we underscore four observations:
1) DBAs need to scan hundreds of Key Performance Indicators
(KPIs) to find out performance issue symptoms. These KPIs are
classified by DBAs to eight types corresponding to different root
causes (as summarized in Table 1). Traditional root cause analysis
(RCA) [2,6, 9, 18], however, does not have the capability of specif-
ically distinguishing multiple types of KPI symptoms to diagnose
the root causes of iSQs. For instance, by using system monitoring
data, i.e., single KPI alone (or a single type of KPIs), we usually
cannot pinpoint iSQs’ root causes [10].
2) Performance issue symptoms mainly include different patterns
of KPIs. We summarize three sets of symmetric KPI patterns, i.e.,
spike up or down, level shift up or down, and void. We observe
that even if two iSQs have the identical set of anomalous KPIs (but
with distinct anomaly behaviors), their root causes can differ. Thus,
purely based on detecting KPI anomalies as normal or abnormal we
cannot precisely diagnose iSQs’ root causes [6, 45].
3) One anomalous KPI is usually accompanied by another one
or more anomalous KPIs. Certain KPIs are highly correlated [24],
and rapid fault propagation in databases renders them anomalous
almost simultaneously. We observe that the way in which a KPI
anomaly propagates can be either unidirectional or bidirectional.
4) Similar symptoms are correlated to the same root cause. In
each category of root causes, KPI symptoms of performance issues
are similar to each other’s. For instance, KPIs in the same type can
substitute each other, but their anomaly categories remain constant.
Nevertheless, it is infeasible to enumerate and verify all possible
causalities between anomalous KPIs and root causes [36].
As a result, iSQs with various KPI fluctuation patterns appear to
have complex relationships with diverse root causes. To discover
and untangle such relationships, we have made efforts to explore
machine learning (ML) based approaches, but have encountered
many challenges during this process. First, anomalous KPIs need
to be properly detected when an iSQ occurs. Traditional anomaly
detection methods recognize only anomalies themselves, but not
anomaly types (i.e., KPI fluctuation changes such as spike up or
down, level shift up or down). The availability of such information
is vital to ensure high accuracy of subsequent diagnoses. Second,
based on detected KPI fluctuation patterns, the root cause of that
iSQ has to be identified from numbers of candidates. Standard su-
pervised learning methods are not suitable for such diagnoses be-
cause the case-by-case labeling of root causes is prohibitive. An
iSQ can trigger many anomalous KPIs and lead to tremendous in-
vestigation, taking hours of DBAs’ labor. Third, though unsuper-
vised learning (e.g., clustering) is an eligible approach to easing
the labeling task for DBAs, it only retains limited efficacy to in-
spect every cluster. It is known to be hard to make clusters that are
both intuitive (or interpretable) to DBAs and accurate [26].
To address the aforementioned challenges, we design iSQUAD
(Intermittent Slow QUery Anomaly Diagnoser), a comprehensive
framework for iSQ root cause diagnoses with a loose requirement
for human intervention. In detail, we adopt Anomaly Extraction
and Dependency Cleansing in place of traditional anomaly detec-
tion approaches to tackle the first challenge of anomaly diversity.
For labeling overhead reduction, Type-Oriented Pattern Integra-
tion Clustering (TOPIC) is proposed to cluster iSQs of the same
root causes together, considering both KPIs and anomaly types.
In this way, DBAs only need to explore one representative root
cause in each cluster rather than label numbers of them individu-
ally. For clustering interpretability, we take advantage of Bayesian
Case Model to extract a case-based representation for each cluster,
which is easier for DBAs to investigate. In a nutshell, iSQUAD
consists of two stages: an offline clustering & explanation stage
and an online root cause diagnosis & update stage. The offline
stage is run first to obtain the clusters and root causes, which are
then used by the online stage for future diagnoses. DBAs only need
to label each iSQ cluster once, unless a new type of iSQs emerges.
By using iSQUAD, we significantly reduce the burden of iSQ root
cause diagnoses for DBAs on cloud database platforms.
The key contributions of our work are as follows:
We identify the problem of Intermittent Slow Queries in cloud
databases, and design a scalable framework called iSQUAD that
provides accurate and efficient root cause diagnosis of iSQs. It
adopts machine learning techniques, while overcomes the inher-
ent obstacles in terms of versatility, labeling overhead and inter-
pretability.
We apply Anomaly Extraction of KPIs in place of anomaly de-
tection to distinguish anomaly types. A novel clustering algo-
rithm TOPIC is proposed to reduce the labeling overheads.
To the best of our knowledge, we are the first to apply and inte-
grate case-based reasoning via the Bayesian Case Model [23] in
database domain and to introduce the case-subspace representa-
tions to DBAs for labeling.
We conduct extensive experiments for iSQUAD’s evaluation and
demonstrate that our method achieves an average F1-score of
80.4%, i.e., 49.2% higher than that of the previous technique.
Furthermore, we have deployed a prototype of iSQUAD in a
real-world cloud database service. iSQUAD helps DBAs diag-
nose all ten root causes of several hundred iSQs in 80 minutes,
which is approximately thirty times faster than traditional case-
by-case diagnosis.
The rest of this paper is organized as follows: §2 describes iSQs,
the motivation and challenges of their root cause diagnoses. §3
overviews our framework, iSQUAD. §4 discusses detailed ML tech-
niques in iSQUAD that build comprehensive clustering models. §5
shows our experimental results. §6 presents a case study in a real-
world cloud database and our future work. §7 reviews the related
work, and §8 concludes the paper.
2. BACKGROUND AND MOTIVATION
In this section, we first introduce background on iSQs. Then,
we conduct an empirical study from database performance issue
records to gain some insights. Finally, we present three key chal-
lenges in diagnosing the root causes of iSQs.
2.1 Background
Alibaba OLTP Database. Alibaba OLTP Database (in short as Al-
ibaba Database) is a multi-tenant DBPaaS supporting a number of
first-party services including Taobao (customer-to-customer online
retail service), Tmall (business-to-consumer online retail service),
DingTalk (enterprise collaboration service), Cainiao (logistics ser-
vice), etc. This database houses over one hundred thousand ac-
tively running instances across tens of geographical regions. To
monitor the compliance with SLAs (Service-Level Agreements),
the database is equipped with a measurement system [9] that con-
tinuously collects logs and KPIs (Key Performance Indicators).
Intermittent Slow Queries (iSQs). Most database systems, such
as MySQL, Oracle, SQL Server, automatically record query time
of each query execution [7, 37, 43]. The query time is the time be-
tween when an SQL query is submitted to, and when its results are
returned by, the database. We formally define Intermittent Slow
Queries (iSQs) as follows. For a SQL query Q, its t
th
occurrence
Q
t
(whose observed execution time is X
t
) is an iSQ if and only if
X
t
> z and P (X
i
> z) < , where 1 t, i T (T is the total
1177
of 14
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜