
TABLE I
CHARACTERISTICS OF DIFFERENT ANOMALY DETECTION METHODS. THE METHOD WE PROPOSE IS DBCATCHER.
                            FFT   SR      SR-CNN  OmniAnomaly  JumpStarter  DBCatcher
Detection performance       Low   Low     Medium  High         High         High
Detection efficiency        Low   Medium  High    Medium       Medium       High
Threshold auto-adjustment   Low   Low     Low     Low          Low          High
Workload adaptability       Low   Low     Medium  Medium       Medium       High
actions per second. Therefore, detection efficiency is directly
correlated to the quantity and scope of transactions affected
by anomalies. Existing anomaly detection methods (e.g.,
OmniAnomaly, JumpStarter) require extensive data points (e.g.,
one data point per 5 seconds) to learn time series variation
features, so they identify anomalies too slowly.
(3) Threshold auto-adjustment. Most existing studies require
multiple thresholds for anomaly detection and these thresholds
are frequently a crucial part of the ultimate evaluation of
database state [3], [17]. Setting these thresholds under dif-
ferent workloads is non-trivial even for experienced DBAs,
as inappropriate thresholds can have side effects on detection
performance and efficiency.
(4) Workload adaptability. Existing methods are particularly
susceptible to workload variations. When the workload varies, the
performance of previously trained machine-learning methods
(e.g., SR-CNN, OmniAnomaly, and JumpStarter) can plummet [16].
Statistics-based methods (e.g., FFT, SR) also struggle under
varied workloads because their thresholds are difficult to adjust.
The conclusions in Table I indicate that the mainstream
anomaly detection methods are not applicable to cloud
databases. By carefully analyzing the multivariate time series
in Tencent Cloud Database¹, we find the following differences
between cloud database time series and the time series of
mainstream anomaly detection service objects (e.g., online
services): (1) Cloud database multivariate time series change
more frequently and with greater magnitude than other online
service time series. Figure 1 shows an example of a burst
increase in “CPU utilization” caused by an increase in “Request
Per Second” in cloud database time series. (2)
Cloud database time series have complex variation patterns.
Other online service time series are commonly characterized
by periodic variations at daily, hourly, or similar time granularity
[15], [18], [19], while cloud database time series include
many irregular time series. (3) Complex functionality of
cloud databases, such as data synchronization and consistency
maintenance, can cause complex abnormal issues [4], [20],
[21]. These complex abnormal issues generate a wide variety
of time series trends (e.g., concept drift, spike) [4], [22].
¹ Tencent Inc. is the largest social network service company in China; its
cloud databases provide data storage and management services for massive
numbers of users, supporting various applications such as social networks,
games, e-commerce, and finance.
The above analysis and findings prompted us to investigate
anomaly detection methods applicable to cloud databases.
In this paper, we find correlations among trends in the
same KPIs across databases within the same unit (see §II-B
for details). This phenomenon is called Unit Key Performance
Indicator Correlation (UKPIC). Based on the UKPIC
phenomenon, we propose an efficient cloud database online
anomaly detection system called DBCatcher.
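The UKPIC idea can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure, which is a stand-in chosen for illustration and not necessarily the correlation measurement DBCatcher uses. The function names, database names, and sample values are hypothetical. The sketch scores each database by how well its KPI trend agrees with its peers in the same unit and reports the least-correlated one.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def least_correlated(kpi_by_db):
    """Return (database, mean correlation with its peers) for the database
    whose KPI trend agrees least with the other databases in the unit."""
    scores = {}
    for name, series in kpi_by_db.items():
        peers = [s for other, s in kpi_by_db.items() if other != name]
        scores[name] = sum(pearson(series, p) for p in peers) / len(peers)
    worst = min(scores, key=scores.get)
    return worst, scores[worst]

# Hypothetical "CPU utilization" samples for three databases in one unit.
unit = {
    "db1": [10, 12, 15, 20, 18, 14],
    "db2": [11, 13, 16, 21, 19, 15],
    "db3": [10, 12, 15, 9, 7, 30],  # trend diverges from its peers
}
name, score = least_correlated(unit)
print(name, round(score, 2))
```

In this toy unit, db1 and db2 follow the same trend, so the database that diverges (db3) receives the lowest peer-correlation score. A real detector would still need a window over which to compute the correlation and a threshold for deciding when a low score counts as an anomaly.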
The key contributions in this paper are as follows:
• We propose an efficient time series correlation measurement
method, based on the UKPIC phenomenon, that promptly
identifies KPIs with abnormal variation trends.
• We propose a flexible time window observation mecha-
nism that considerably enhances the performance of cloud
database anomaly detection by reducing the detrimental
impact of temporal fluctuations in KPI changes.
• We propose an adaptive threshold learning policy that ad-
dresses the challenge of auto-adjusting thresholds under
varying workloads.
• We conduct experiments under real-world and synthetic
workloads. Experimental results show that DBCatcher
improves F-Measure by 8.3%, 8.8%, and 9.2% over
state-of-the-art methods and accelerates detection
by 2.5× to 3×. For a 100M dataset, corresponding to
120 hours of KPI data points, DBCatcher takes only
42 seconds, which is an acceptable time overhead for
online detection.
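For readers unfamiliar with the metric reported above, the following sketch computes F-Measure from detection counts. It assumes the common F1 definition (the harmonic mean of precision and recall, with beta = 1); the text here does not state which variant the experiments use. The function name and the counts are hypothetical.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 80 anomalies caught, 10 false alarms, 20 missed.
print(f_measure(tp=80, fp=10, fn=20))
```

Because the score combines precision and recall, a detector cannot raise it simply by flagging more points (which hurts precision) or fewer points (which hurts recall).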
II. BACKGROUND AND PRELIMINARY STUDY
In this section, we first give the necessary background and
in-depth analysis of anomaly detection in cloud databases,
including the architecture of cloud databases, the unit key
performance indicator correlation phenomenon, and abnormal
time series trends. Then we present the challenges of detecting
anomalies with correlation measurement methods.
A. Cloud Database Architecture
Figure 2 shows the general architecture of cloud databases.
As shown in the figure, a database cluster contains multiple
units; each unit deploys a load balance module and multiple
databases, and each database has one primary instance and
multiple replica instances. SQL requests from upper-level
applications are distributed to different database units by the
global transactions manager, then processed by the load balance
module and forwarded to different databases in the same unit.
Read requests are handled equally by the different instances of
each database. Write requests are first processed by the primary
instance, and the data is then synchronized to the remaining
replica instances. When