toms and root causes from failure records noted by DBAs of Al-
ibaba OLTP Database, and we underscore four observations:
1) DBAs need to scan hundreds of Key Performance Indicators
(KPIs) to find out performance issue symptoms. These KPIs are
classified by DBAs into eight types corresponding to different root
causes (as summarized in Table 1). Traditional root cause analysis
(RCA) [2, 6, 9, 18], however, cannot specifically distinguish
multiple types of KPI symptoms to diagnose
the root causes of iSQs. For instance, by using system monitoring
data, i.e., a single KPI alone (or a single type of KPIs), we usually
cannot pinpoint iSQs’ root causes [10].
2) Performance issue symptoms mainly include different patterns
of KPIs. We summarize three sets of symmetric KPI patterns, i.e.,
spike up or down, level shift up or down, and void. We observe
that even if two iSQs have an identical set of anomalous KPIs (but
with distinct anomaly behaviors), their root causes can differ. Thus,
we cannot precisely diagnose iSQs’ root causes purely by classifying
KPIs as normal or abnormal [6, 45].
3) One anomalous KPI is usually accompanied by one or more
other anomalous KPIs. Certain KPIs are highly correlated [24],
and rapid fault propagation in databases renders them anomalous
almost simultaneously. We observe that the way in which a KPI
anomaly propagates can be either unidirectional or bidirectional.
4) Similar symptoms are correlated with the same root cause. In
each category of root causes, KPI symptoms of performance issues
are similar to one another. For instance, KPIs of the same type can
substitute for each other, but their anomaly categories remain constant.
Nevertheless, it is infeasible to enumerate and verify all possible
causalities between anomalous KPIs and root causes [36].
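To make the anomaly categories from observation 2 concrete, the following is a minimal sketch of how one KPI window might be labeled as a spike, a level shift, or a void. It is not the Anomaly Extraction method of this paper: the function name `classify_kpi`, the z-score rule, and the parameters `z_thresh` and `shift_ratio` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def classify_kpi(history, window, z_thresh=3.0, shift_ratio=0.8):
    """Label `window` against `history` with one anomaly type.

    history: samples from a normal period (no missing values assumed).
    window:  recent samples; None marks a missing sample.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    # Void: the KPI reported nothing at all during the window.
    present = [v for v in window if v is not None]
    if not present:
        return "void"

    mu = mean(history)
    sigma = pstdev(history) or 1e-9  # avoid division by zero on flat history
    z = [(v - mu) / sigma for v in present]

    # Level shift: most of the window sits beyond the threshold on one side.
    if sum(s > z_thresh for s in z) >= shift_ratio * len(z):
        return "level_shift_up"
    if sum(s < -z_thresh for s in z) >= shift_ratio * len(z):
        return "level_shift_down"

    # Spike: only a few samples deviate; the largest one decides the direction.
    peak = max(z, key=abs)
    if abs(peak) > z_thresh:
        return "spike_up" if peak > 0 else "spike_down"
    return "normal"


history = [10, 11, 9, 10, 10, 11, 9, 10]
print(classify_kpi(history, [10, 10, 30, 10, 10]))
print(classify_kpi(history, [30, 31, 29, 30, 30]))
print(classify_kpi(history, [None] * 5))
```

Two iSQs whose anomalous KPI sets are identical can still yield different label tuples under such a scheme, which is the distinction a binary normal/abnormal detector discards.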
As a result, iSQs with various KPI fluctuation patterns appear to
have complex relationships with diverse root causes. To discover
and untangle such relationships, we have made efforts to explore
machine learning (ML) based approaches, but have encountered
many challenges during this process. First, anomalous KPIs need
to be properly detected when an iSQ occurs. Traditional anomaly
detection methods recognize only anomalies themselves, but not
anomaly types (i.e., KPI fluctuation changes such as spike up or
down, level shift up or down). The availability of such information
is vital to ensure high accuracy of subsequent diagnoses. Second,
based on detected KPI fluctuation patterns, the root cause of that
iSQ has to be identified from a number of candidates. Standard
supervised learning methods are not suitable for such diagnoses
because the case-by-case labeling of root causes is prohibitive. An
iSQ can trigger many anomalous KPIs and lead to a lengthy
investigation that takes hours of DBAs’ labor. Third, though
unsupervised learning (e.g., clustering) is a viable approach to easing
the labeling task for DBAs, its efficacy is limited because DBAs must
still inspect every cluster. It is known to be hard to make clusters that are
both intuitive (or interpretable) to DBAs and accurate [26].
To address the aforementioned challenges, we design iSQUAD
(Intermittent Slow QUery Anomaly Diagnoser), a comprehensive
framework for iSQ root cause diagnoses with a loose requirement
for human intervention. In detail, we adopt Anomaly Extraction
and Dependency Cleansing in place of traditional anomaly detec-
tion approaches to tackle the first challenge of anomaly diversity.
For labeling overhead reduction, Type-Oriented Pattern Integra-
tion Clustering (TOPIC) is proposed to cluster iSQs of the same
root causes together, considering both KPIs and anomaly types.
In this way, DBAs only need to explore one representative root
cause in each cluster rather than label many of them individually.
For clustering interpretability, we take advantage of the Bayesian
Case Model to extract a case-based representation for each cluster,
which is easier for DBAs to investigate. In a nutshell, iSQUAD
consists of two stages: an offline clustering & explanation stage
and an online root cause diagnosis & update stage. The offline
stage is run first to obtain the clusters and root causes, which are
then used by the online stage for future diagnoses. DBAs only need
to label each iSQ cluster once, unless a new type of iSQs emerges.
By using iSQUAD, we significantly reduce the burden of iSQ root
cause diagnoses for DBAs on cloud database platforms.
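The two-stage workflow described above can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not TOPIC or the Bayesian Case Model: clusters are formed by exact pattern match, the online stage uses a plain per-KPI match ratio, and the function names, the `min_similarity` threshold, and the root cause labels are all hypothetical.

```python
from collections import defaultdict


def offline_stage(isq_patterns):
    """Group historical iSQs whose KPI anomaly patterns are identical.

    isq_patterns maps an iSQ id to a tuple of anomaly types, one per KPI.
    Exact-match grouping is a stand-in for a real clustering algorithm.
    """
    clusters = defaultdict(list)
    for isq_id, pattern in isq_patterns.items():
        clusters[pattern].append(isq_id)
    return dict(clusters)


def similarity(a, b):
    """Fraction of KPIs whose anomaly type matches in both patterns."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def online_stage(pattern, labeled_clusters, min_similarity=0.6):
    """Return the root cause of the closest labeled cluster.

    Returns None when nothing is similar enough, which signals a possibly
    new iSQ type that a DBA would need to label.
    """
    if not labeled_clusters:
        return None
    best = max(labeled_clusters, key=lambda p: similarity(pattern, p))
    if similarity(pattern, best) < min_similarity:
        return None
    return labeled_clusters[best]


# Offline: cluster past iSQs, then a DBA labels each cluster once.
past = {
    "isq1": ("spike_up", "normal", "spike_up"),
    "isq2": ("spike_up", "normal", "spike_up"),
    "isq3": ("normal", "level_shift_up", "normal"),
}
clusters = offline_stage(past)
labeled = {  # hypothetical labels supplied by a DBA
    ("spike_up", "normal", "spike_up"): "hypothetical cause A",
    ("normal", "level_shift_up", "normal"): "hypothetical cause B",
}

# Online: diagnose a new iSQ by matching it to a labeled cluster.
print(online_stage(("spike_up", "normal", "level_shift_up"), labeled))
```

In this sketch three historical iSQs require only two labels, which illustrates why labeling per cluster rather than per iSQ reduces DBA effort.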
The key contributions of our work are as follows:
• We identify the problem of Intermittent Slow Queries in cloud
databases, and design a scalable framework called iSQUAD that
provides accurate and efficient root cause diagnosis of iSQs. It
adopts machine learning techniques while overcoming the inherent
obstacles in terms of versatility, labeling overhead, and
interpretability.
• We apply Anomaly Extraction of KPIs in place of anomaly de-
tection to distinguish anomaly types. A novel clustering algo-
rithm TOPIC is proposed to reduce the labeling overheads.
• To the best of our knowledge, we are the first to apply and
integrate case-based reasoning via the Bayesian Case Model [23] in
the database domain and to introduce case-subspace
representations to DBAs for labeling.
• We conduct extensive experiments for iSQUAD’s evaluation and
demonstrate that our method achieves an average F1-score of
80.4%, i.e., 49.2% higher than that of the previous technique.
Furthermore, we have deployed a prototype of iSQUAD in a
real-world cloud database service. iSQUAD helps DBAs diag-
nose all ten root causes of several hundred iSQs in 80 minutes,
which is approximately thirty times faster than traditional case-
by-case diagnosis.
The rest of this paper is organized as follows: §2 describes iSQs
and the motivation for and challenges of their root cause diagnoses. §3
gives an overview of our framework, iSQUAD. §4 discusses the detailed ML
techniques in iSQUAD that build comprehensive clustering models. §5
shows our experimental results. §6 presents a case study in a
real-world cloud database and our future work. §7 reviews the related
work, and §8 concludes the paper.
2. BACKGROUND AND MOTIVATION
In this section, we first introduce background on iSQs. Then,
we conduct an empirical study of database performance issue
records to gain some insights. Finally, we present three key
challenges in diagnosing the root causes of iSQs.
2.1 Background
Alibaba OLTP Database. Alibaba OLTP Database (Alibaba Database
for short) is a multi-tenant DBPaaS supporting a number of
first-party services including Taobao (customer-to-customer online
retail service), Tmall (business-to-consumer online retail service),
DingTalk (enterprise collaboration service), Cainiao (logistics ser-
vice), etc. This database houses over one hundred thousand ac-
tively running instances across tens of geographical regions. To
monitor the compliance with SLAs (Service-Level Agreements),
the database is equipped with a measurement system [9] that con-
tinuously collects logs and KPIs (Key Performance Indicators).
Intermittent Slow Queries (iSQs). Most database systems, such
as MySQL, Oracle, and SQL Server, automatically record the query time
of each query execution [7, 37, 43]. The query time is the time
between when an SQL query is submitted to the database and when its
results are returned. We formally define Intermittent Slow
Queries (iSQs) as follows. For a SQL query $Q$, its $t$-th occurrence
$Q_t$ (whose observed execution time is $X_t$) is an iSQ if and only if
$X_t > z$ and $P(X_i > z) < \epsilon$, where $1 \le t, i \le T$ ($T$ is the total