95:2 Chaoyu Chen et al.
Additional Key Words and Phrases: Root Cause Analysis, Bayesian Method, Bad SQLs, Faults Diagnosis,
Distributed System, Attribution Analysis, Explainable AI
ACM Reference Format:
Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang,
and Wenhui Shi. 2023. BALANCE: Bayesian Linear Attribution for Root Cause Localization. Proc. ACM Manag.
Data 1, 1, Article 95 (May 2023), 26 pages. https://doi.org/10.1145/3588949
1 INTRODUCTION
System faults and incidents can have a tremendous influence on distributed data systems, which
are widely adopted in modern information technology (IT) and financial companies, since they
may lead to system outages, incur astounding financial loss, and jeopardize customer
trust [21]. It has been reported by Forbes that every year IT downtime costs an estimated $26.5
billion in lost revenue alone, not to mention the indirect expense, including lost customers and
references. Thus, it is imperative to conduct fast and precise fault diagnosis and recovery before
faults become service-impacting. A central task in fault diagnosis and recovery is root cause analysis
(RCA), which bridges the gap between fault detection and recovery [11, 13].
Currently, the task of RCA is mainly accomplished by site reliability engineers (SREs) with rich
operation experience. Unfortunately, such manual work becomes prohibitively slow as the scale and
complexity of the architecture increase and as the system metrics and events behave in a dynamic
and unpredictable manner, and it therefore fails to meet the requirement of efficiency. Indeed,
as mentioned in [19], it can take as long as several hours of manual work to diagnose the root
causes of intermittent slow queries in distributed database systems. This has sparked considerable
research efforts toward designing automated RCA algorithms based on machine learning so as to
save time and ultimately money.
Literature on RCA algorithms can be broadly divided into two categories. The first one focuses on
multidimensional root cause localization [5, 32, 47], which seeks to explain the abnormal behavior
of the additive key performance indicators (KPIs) by identifying the fault-indicating combinations
of their corresponding multi-dimensional attributes. The success of these algorithms relies on two
assumptions: 1) the value of the KPI in each dimension equals the sum of the values of its attributes
and 2) all the KPIs and their attributes can be monitored. However, these two assumptions can be
too restrictive in real-world problems, and a more practical setting is to attribute the anomalies to
root cause candidates without additive assumptions while allowing for missing data. On the other
hand, the second category revolves around graph-based RCA algorithms [14, 24, 38, 39]. These
approaches typically first construct a causal graph based on tracing service calls or causal discovery
algorithms [46] and then find the root cause node via rule-based traversing or random walk. A
major impediment to the application of tracing graphs and rule-based traversing is that they are
system-invasive and typically require arduous work to enumerate all traces and rules. As an alternative,
causal discovery methods are employed to learn the graph structure as in [39]. Unfortunately, the
causal discovery methods suffer from both high computational and sample complexity [46], and in
consequence, they can be distressingly slow for large graphs and may lead to inaccurate results
when the number of observations for all metrics in the graph is small. After obtaining the graph,
the random walk methods are heuristic and might fail to converge to the root cause when the
number of random walks is not sufficiently large.
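To make the two assumptions behind multidimensional root cause localization concrete, the following minimal Python sketch checks additivity on a toy KPI. The records, the dimension names (province, device), and the helper functions `aggregate` and `is_additive` are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from collections import defaultdict

# Hypothetical fine-grained records: (province, device, failed_requests).
records = [
    ("Beijing", "ios", 12),
    ("Beijing", "android", 30),
    ("Shanghai", "ios", 8),
    ("Shanghai", "android", 50),
]


def aggregate(rows, dim):
    """Sum the KPI over all rows that share the same value in dimension `dim`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim]] += row[-1]
    return dict(totals)


def is_additive(rows, dim, overall):
    """Assumption 1: the overall KPI equals the sum over the attribute values of `dim`."""
    return sum(aggregate(rows, dim).values()) == overall


overall_kpi = sum(r[-1] for r in records)
by_province = aggregate(records, 0)
by_device = aggregate(records, 1)

# With every attribute combination monitored, each dimension adds up to the overall KPI.
full_ok = is_additive(records, 0, overall_kpi) and is_additive(records, 1, overall_kpi)

# Drop one record to mimic an unmonitored attribute combination (assumption 2 violated):
# the per-device sums no longer add up to the overall KPI.
partial = records[:-1]
partial_ok = is_additive(partial, 1, overall_kpi)

print(by_province, by_device, full_ok, partial_ok)
```

In this toy setting the check passes only while every attribute combination is observed, which is the situation the two assumptions require; once a record is missing, the decomposition breaks down.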
In this paper, we explore alternatives and recast the RCA problem as a feature attribution
problem [16]. To the best of our knowledge, we are among the first to analyze the root cause
through the lens of attribution. As a commonly used tool in explainable AI (XAI), attribution
methods assign attribution scores to input features, the absolute value of which represents their
importance to the model prediction or performance [16]. Analogously, we aim to find the root
Proc. ACM Manag. Data, Vol. 1, No. 1, Article 95. Publication date: May 2023.