BALANCE: Bayesian Linear Aribution for Root Cause
Localization
CHAOYU CHEN, Ant Group, China
HANG YU, Ant Group, China
ZHICHAO LEI, Ant Group, China
JIANGUO LI, Ant Group, China
SHAOKANG REN, Ant Group, China
TINGKAI ZHANG, Ant Group, China
SILIN HU, Ant Group, China
JIANCHAO WANG, Ant Group, China
WENHUI SHI, OceanBase, China
Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations,
as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional
localization or graph-based root cause localization. This paper opens up the possibilities of exploiting
the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose
BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA
through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of
the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian
multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a
forward manner while promoting sparsity and concurrently paying attention to the correlation between the
candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each
candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are
multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthetic dataset as well as
three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis
for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of
accuracy with the least amount of running time, and achieves at least 6% higher accuracy than SOTA
methods for real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and
the online results further advocate its usage for real-time diagnosis in distributed data systems.
CCS Concepts: • Software and its engineering; • Information systems → Autonomous database
administration; • Computing methodologies → Feature selection; Regularization.
Both authors contributed equally to this work.
Corresponding author.
Code is available at https://github.com/ant-research/BayesianLinearAttributionForRootCauseLocalization_BALANCE.
Authors' addresses: Chaoyu Chen, Ant Group, China, chris.ccy@antgroup.com; Hang Yu, Ant Group, China,
hyu.hugo@antgroup.com; Zhichao Lei, Ant Group, China, leizhichao.lzc@antgroup.com; Jianguo Li, Ant Group, China,
lijg.zero@antgroup.com; Shaokang Ren, Ant Group, China, renshaokang.rsk@antgroup.com; Tingkai Zhang, Ant Group, China,
tingkai.ztk@antgroup.com; Silin Hu, Ant Group, China, husilin.hsl@antgroup.com; Jianchao Wang, Ant Group, China,
luli.wjc@antgroup.com; Wenhui Shi, OceanBase, China, yushun.swh@oceanbase.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2836-6573/2023/5-ART95 $15.00
https://doi.org/10.1145/3588949
Proc. ACM Manag. Data, Vol. 1, No. 1, Article 95. Publication date: May 2023.
Additional Key Words and Phrases: Root Cause Analysis, Bayesian Method, Bad SQLs, Faults Diagnosis,
Distributed System, Attribution Analysis, Explainable AI
ACM Reference Format:
Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang,
and Wenhui Shi. 2023. BALANCE: Bayesian Linear Attribution for Root Cause Localization. Proc. ACM Manag.
Data 1, 1, Article 95 (May 2023), 26 pages. https://doi.org/10.1145/3588949
1 INTRODUCTION
System faults and incidents can have a tremendous influence on distributed data systems, which
are widely adopted in modern information technology (IT) and financial companies, since they
may lead to system outages, incur astounding financial losses, and jeopardize customer
trust [21]. Forbes has reported that IT downtime costs an estimated $26.5 billion in lost revenue
every year, not to mention indirect expenses, including lost customers and
references. Thus, it is imperative to conduct fast and precise fault diagnosis and recovery before
faults become service-impacting. A central task in fault diagnosis and recovery is root cause analysis
(RCA), which bridges the gap between fault detection and recovery [11, 13].
Currently, the task of RCA is mainly accomplished by site reliability engineers (SREs) with rich
operation experience. Unfortunately, such manual work becomes prohibitively slow as the scale
and complexity of the architecture grow and as system metrics and events behave in dynamic and
unpredictable ways, and so it fails to meet the requirement of efficiency. Indeed,
as mentioned in [19], it can take as long as several hours of manual work to diagnose the root
causes of intermittent slow queries in distributed database systems. This has sparked considerable
research efforts toward designing automated RCA algorithms based on machine learning so as to
save time and, ultimately, money.
Literature on RCA algorithms can be broadly divided into two categories. The first one focuses on
multidimensional root cause localization [5, 32, 47], which seeks to explain the abnormal behavior
of additive key performance indicators (KPIs) by identifying the fault-indicating combinations
of their corresponding multi-dimensional attributes. The success of these algorithms relies on two
assumptions: 1) the value of the KPI in each dimension equals the sum of the values of its attributes
and 2) all the KPIs and their attributes can be monitored. However, these two assumptions can be
too restrictive in real-world problems, and a more practical setting is to attribute the anomalies to
root cause candidates without additive assumptions while allowing for missing data. The second
category revolves around graph-based RCA algorithms [14, 24, 38, 39]. These
approaches typically first construct a causal graph based on tracing service calls or causal discovery
algorithms [46] and then find the root cause node via rule-based traversal or random walk. A
major impediment to the application of tracing graphs and rule-based traversal is that they are system
invasive and typically require arduous work to enumerate all traces and rules. As an alternative,
causal discovery methods are employed to learn the graph structure, as in [39]. Unfortunately,
causal discovery methods suffer from both high computational and sample complexity [46]. In
consequence, they can be distressingly slow for large graphs and may lead to inaccurate results
when the number of observations for all metrics in the graph is small. Moreover, once the graph is
obtained, the random walk methods are heuristic and might fail to converge to the root cause when the
number of random walks is not sufficiently large.
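To make the first assumption of multidimensional localization concrete, the following minimal Python sketch checks whether a KPI is additive over the attribute values of one dimension. The function name `is_additive`, the tolerance, and the example KPIs (request counts and mean latency by region) are hypothetical illustrations and are not taken from the cited works. An unmonitored attribute is represented by `None`, which corresponds to a violation of the second assumption.

```python
from typing import Dict, Optional


def is_additive(total: float,
                parts: Dict[str, Optional[float]],
                tol: float = 1e-9) -> bool:
    """Return True only if every attribute value is observed and the values sum to the total."""
    if any(value is None for value in parts.values()):
        # Assumption 2 is violated: at least one attribute is not monitored.
        return False
    # Assumption 1: the KPI equals the sum of its attribute values.
    return abs(sum(parts.values()) - total) <= tol


# Hypothetical additive KPI: total request count split by region.
requests_by_region = {"east": 120.0, "west": 80.0}
print(is_additive(200.0, requests_by_region))  # True

# Hypothetical non-additive KPI: overall mean latency is not the sum of per-region latencies.
latency_by_region = {"east": 30.0, "west": 50.0}
print(is_additive(38.0, latency_by_region))  # False
```

KPIs such as averages, ratios, or percentiles fail this check even when all attributes are observed, which is one way to see why the additive setting can be too restrictive in practice.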
In this paper, we explore alternatives and recast the RCA problem as a feature attribution
problem [16]. To the best of our knowledge, we are among the first to analyze the root cause
through the lens of attribution. As a commonly used tool in explainable AI (XAI), attribution
methods assign attribution scores to input features, the absolute value of which represents their
importance to the model prediction or performance [16]. Analogously, we aim to find the root