BALANCE: Bayesian Linear Aribution for Root Cause

Localization

CHAOYU CHEN∗, Ant Group, China

HANG YU∗, Ant Group, China

ZHICHAO LEI, Ant Group, China

JIANGUO LI†, Ant Group, China

SHAOKANG REN, Ant Group, China

TINGKAI ZHANG, Ant Group, China

SILIN HU, Ant Group, China

JIANCHAO WANG, Ant Group, China

WENHUI SHI, OceanBase, China

Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target key performance indicators (KPIs) by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting sparsity and concurrently paying attention to the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthetic dataset as well as three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the least amount of running time, and achieves accuracy at least 6% higher than SOTA methods on the real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further advocate its usage for real-time diagnosis in distributed data systems.
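The forward, backward, and merge steps described above can be illustrated with a minimal Python sketch. It is not the BMFS model or the authors' attribution score: the weights are hypothetical stand-ins for what a sparsity-promoting linear regression might learn, the candidate names (`sql_a`, `sql_b`, `sql_c`) and KPI names are invented, and the merge rule (summing absolute scores across KPIs) is an assumption made only for illustration. What the sketch does rely on is a standard property of linear predictors: the change in the prediction between a baseline input and an observed input decomposes exactly into per-feature terms.

```python
# Illustrative sketch only: the weights, baselines, and merge rule are assumptions
# for demonstration, not the BMFS model or the paper's exact attribution scores.

def attribute(weights, baseline, observed):
    """Backward step: score each candidate by its contribution to the KPI shift.

    For a linear predictor y = sum_i w_i * x_i + b, the change in prediction
    between the baseline and the observed input equals the sum of the terms
    w_i * (x_i - baseline_i), so each term is a natural per-candidate score.
    """
    return {name: weights[name] * (observed[name] - baseline[name]) for name in weights}


def merge(per_kpi_scores):
    """Merge step (assumed rule): sum absolute scores across KPIs, then rank."""
    total = {}
    for scores in per_kpi_scores.values():
        for name, score in scores.items():
            total[name] = total.get(name, 0.0) + abs(score)
    return sorted(total, key=total.get, reverse=True)


# Hypothetical sparse weights, one linear model per target KPI.
weights = {
    "latency": {"sql_a": 0.0, "sql_b": 2.0, "sql_c": 0.5},
    "cpu": {"sql_a": 0.1, "sql_b": 1.0, "sql_c": 0.0},
}
baseline = {"sql_a": 10.0, "sql_b": 5.0, "sql_c": 8.0}   # normal-period values
observed = {"sql_a": 12.0, "sql_b": 25.0, "sql_c": 9.0}  # anomalous-period values

scores = {kpi: attribute(w, baseline, observed) for kpi, w in weights.items()}
ranking = merge(scores)
print(ranking)  # candidates ordered from most to least suspicious
```

In this toy setting the candidate whose metric moved the most along a heavily weighted direction ranks first, which is the intuition behind explaining a KPI anomaly by the behavior of its candidate root causes.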

CCS Concepts: • Software and its engineering; • Information systems → Autonomous database administration; • Computing methodologies → Feature selection; Regularization.

∗Both authors contributed equally to this work.

†Corresponding author.

Code is available at https://github.com/ant-research/BayesianLinearAttributionForRootCauseLocalization_BALANCE.

Authors’ addresses: Chaoyu Chen, Ant Group, China, chris.ccy@antgroup.com; Hang Yu, Ant Group, China, hyu.hugo@antgroup.com; Zhichao Lei, Ant Group, China, leizhichao.lzc@antgroup.com; Jianguo Li, Ant Group, China, lijg.zero@antgroup.com; Shaokang Ren, Ant Group, China, renshaokang.rsk@antgroup.com; Tingkai Zhang, Ant Group, China, tingkai.ztk@antgroup.com; Silin Hu, Ant Group, China, husilin.hsl@antgroup.com; Jianchao Wang, Ant Group, China, luli.wjc@antgroup.com; Wenhui Shi, OceanBase, China, yushun.swh@oceanbase.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

2836-6573/2023/5-ART95 $15.00

https://doi.org/10.1145/3588949

Proc. ACM Manag. Data, Vol. 1, No. 1, Article 95. Publication date: May 2023.

Additional Key Words and Phrases: Root Cause Analysis, Bayesian Method, Bad SQLs, Faults Diagnosis, Distributed System, Attribution Analysis, Explainable AI

ACM Reference Format:

Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang, and Wenhui Shi. 2023. BALANCE: Bayesian Linear Attribution for Root Cause Localization. Proc. ACM Manag. Data 1, 1, Article 95 (May 2023), 26 pages. https://doi.org/10.1145/3588949

1 INTRODUCTION

System faults and incidents can have a tremendous influence on distributed data systems, which are widely adopted in modern information technology (IT) and financial companies, since they may lead to system outages, incur astounding financial loss, and jeopardize customer trust [ ]. Forbes has reported that IT downtime costs an estimated $26.5 billion in lost revenue every year, not to mention indirect expenses such as lost customers and references. Thus, it is imperative to conduct fast and precise fault diagnosis and recovery before faults become service-impacting. A central task in fault diagnosis and recovery is root cause analysis (RCA), which bridges the gap between fault detection and recovery [11, 13].

Currently, the task of RCA is mainly accomplished by site reliability engineers (SREs) with rich operation experience. Unfortunately, such manual work becomes prohibitively slow as the scale and complexity of the architecture grow and as system metrics and events behave in dynamic and unpredictable ways, and thus falls short of the efficiency requirement. Indeed, as mentioned in [ ], it can take as long as several hours of manual work to diagnose the root causes of intermittent slow queries in distributed database systems. This has sparked considerable research efforts toward designing automated RCA algorithms based on machine learning so as to save time and, ultimately, money.

Literature on RCA algorithms can be broadly divided into two categories. The first focuses on multidimensional root cause localization [ ], which seeks to explain the abnormal behavior of additive key performance indicators (KPIs) by identifying the fault-indicating combinations of their corresponding multi-dimensional attributes. The success of these algorithms relies on two assumptions: 1) the value of the KPI in each dimension equals the sum of the values of its attributes and 2) all the KPIs and their attributes can be monitored. However, these two assumptions can be too restrictive in real-world problems, and a more practical setting is to attribute the anomalies to root cause candidates without additive assumptions while allowing for missing data. The second category revolves around graph-based RCA algorithms [ ]. These approaches typically first construct a causal graph based on tracing service calls or causal discovery algorithms [ ] and then find the root cause node via rule-based traversing or random walk. A major impediment to the application of tracing graphs and rule-based traversing is that they are system-invasive and typically require arduous work to enumerate all traces and rules. As an alternative, causal discovery methods are employed to learn the graph structure as in [ ]. Unfortunately, causal discovery methods suffer from both high computational and sample complexity [ ]; in consequence, they can be distressingly slow for large graphs and may lead to inaccurate results when the number of observations for all metrics in the graph is small. Moreover, once the graph is obtained, the random walk methods are heuristic and might fail to converge to the root cause when the number of random walks is not sufficiently large.
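To make the first (additive) assumption of multidimensional localization concrete, the following Python sketch checks it on a toy KPI. The dimensions (province, ISP), their attribute values, and the counts are hypothetical and chosen only for illustration; they do not come from the paper's datasets.

```python
# Hypothetical leaf-level records: (province, isp) -> KPI value, e.g. failed requests.
records = {
    ("Beijing", "ISP1"): 10,
    ("Beijing", "ISP2"): 5,
    ("Shanghai", "ISP1"): 20,
    ("Shanghai", "ISP2"): 15,
}


def aggregate(records, dim):
    """Sum the KPI over all other dimensions, keeping the dimension at index `dim`."""
    totals = {}
    for key, value in records.items():
        totals[key[dim]] = totals.get(key[dim], 0) + value
    return totals


by_province = aggregate(records, 0)
by_isp = aggregate(records, 1)
total = sum(records.values())

# Additivity: within every dimension, the attribute values sum to the overall KPI.
assert sum(by_province.values()) == sum(by_isp.values()) == total
```

A ratio-type KPI such as average latency or success rate does not satisfy this check, and neither does an additive KPI when some leaf records are missing, which is why the two assumptions can be restrictive in practice.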

In this paper, we explore alternatives and recast the RCA problem as a feature attribution problem [ ]. To the best of our knowledge, we are among the first to analyze the root cause through the lens of attribution. As a commonly used tool in explainable AI (XAI), attribution methods assign attribution scores to input features, the absolute value of which represents their importance to the model prediction or performance [ ]. Analogously, we aim to find the root
