
significantly distort the SD statistics but not MAD, as shown in Figure 1(b). The K-sigma rule [18] with $K = 3$ detects elements outside the ranges of solid lines in the distribution as outliers, in Figure 1(d). Such ranges of solid lines can be determined either by MEAN and SD (in red) or by MEDIAN and MAD (in green). As shown in Figure 1(d), the ranges determined by MEDIAN and MAD are more robust and closer to those over the clean data in Figure 1(c). The outlier detection by MEDIAN and MAD is thus more accurate [6]. For instance, the outlier at time 19:00 with value 125, outside the range of green lines, will be detected by MEDIAN and MAD, but not by MEAN and SD in red lines. ■
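To make the detection rule concrete: with the K-sigma rule, an element is flagged when it falls outside $[\text{center} - K \cdot \text{spread},\ \text{center} + K \cdot \text{spread}]$, where (center, spread) is either (MEAN, SD) or (MEDIAN, MAD). A minimal sketch in plain Python (the function name and the unscaled MAD are our assumptions for illustration, not the paper's code):

```python
import statistics

def k_sigma_outliers(data, k=3.0, robust=True):
    """Flag elements outside [center - k*spread, center + k*spread]."""
    if robust:
        center = statistics.median(data)                           # MEDIAN
        spread = statistics.median(abs(x - center) for x in data)  # MAD
    else:
        center = statistics.mean(data)                             # MEAN
        spread = statistics.pstdev(data)                           # SD
    return [x for x in data if abs(x - center) > k * spread]
```

On the series of Example 1.1, the robust variant keeps the range tight around the clean data, so the value 125 exceeds $K \cdot \text{MAD}$ and is flagged, while the MEAN/SD range, inflated by the outlier itself, misses it.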
Unfortunately, computing the exact MAD over a set of $n$ elements is costly, since it is defined on the median. Unlike SD, which can be efficiently calculated in constant space, the MAD computation needs to find the median of the absolute deviations from the dataset's median, i.e., to find a median twice. Such an exact MAD algorithm has $O(n)$ space cost, and its time cost depends heavily on the data size, which is not affordable for large-scale data. Therefore, approximate MAD is often considered.
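To illustrate the asymmetry this paragraph describes, here is a minimal sketch (our own illustration in plain Python, not the paper's Algorithm 3): SD admits a one-pass, constant-space computation via Welford's method, while exact MAD must buffer all $n$ elements because the second median pass needs every deviation.

```python
import statistics

def running_sd(stream):
    """Welford's online algorithm: SD over a stream in O(1) space."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return (m2 / n) ** 0.5 if n else 0.0

def exact_mad(data):
    """Exact MAD: O(n) space, since all deviations must be materialized."""
    med = statistics.median(data)                          # first median
    return statistics.median(abs(x - med) for x in data)   # second median
```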
Example 1.2. To demonstrate the sufficiency and superiority of approximate MAD compared with exact MAD, we run the outlier detection application in Example 1.1 over a larger data size with up to $10^7$ elements. Figure 2 compares the results of approximate MAD (by Algorithm 2 in Section 3.2) and exact MAD (by Algorithm 3 in the full version technical report [1]).
(1) Comparable effectiveness. Figure 2(a) with $K = 7$ shows that the F1-score of outlier detection with approximate MAD is almost the same as that using exact MAD. That is, it is sufficient to use approximate MAD in this application.
(2) Lower time cost. Figure 2(b) presents the corresponding time costs of computing approximate MAD and exact MAD in batch mode. While their effectiveness in outlier detection is almost the same, as shown in Figure 2(a), the time cost of computing approximate MAD is only about 1/5 that of exact MAD over $10^7$ elements.
(3) Lower space cost. Not only is the time cost saved; as illustrated in Figure 2(c), the corresponding space cost of computing approximate MAD is significantly lower, almost constant. Given the almost identical effectiveness but much lower time and space costs, it is sufficient to use approximate MAD in practice, and the costly exact MAD is not necessary.
(4) Streaming computation. As presented in Section 3.2, the algorithm for computing approximate MAD can be naturally adapted to incremental computation over streaming data. In contrast, exact MAD needs to scan the entire data, since the set of absolute deviations must be reconstructed whenever the median changes. Therefore, as illustrated in Figure 2(d), the time cost of computing approximate MAD is constant in the streaming mode, while that of exact MAD increases heavily. This demonstrates again the necessity of studying approximate MAD.
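The obstacle to incremental exact MAD is that a single new element can move the median, instantly invalidating every stored deviation. A toy illustration with made-up numbers:

```python
import statistics

window = [10, 11, 12, 13]
med = statistics.median(window)                   # 11.5
devs = sorted(abs(x - med) for x in window)       # [0.5, 0.5, 1.5, 1.5]

window.append(30)                                 # one arrival...
new_med = statistics.median(window)               # ...moves the median to 12
# Every stored |x - median| is now stale; exact MAD must rebuild them all:
devs = sorted(abs(x - new_med) for x in window)   # [0, 1, 1, 2, 18]
mad = statistics.median(devs)                     # 1
```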
Similarly, the superiority of approximate MAD over exact MAD is also observed in another application, Shewhart control charts, in Figure 11 in Section 5.4. ■
Figure 2: Superiority of approximate MAD over exact MAD. (a) Approximate MAD and exact MAD show almost the same F1-score of outlier detection under various data sizes. The time costs in (b) and space costs in (c) of computing approximate MAD are significantly lower than those of exact MAD, in batch mode. (d) The approximate MAD can also be incrementally computed in a streaming mode.

To find approximate MAD, a natural idea is to find the approximate median of the absolute deviations from the dataset's approximate median, by calling the existing methods of approximating a median [15, 21] twice. With two such median approximations, however, the error introduced by this simple approximation relative to the exact MAD is theoretically unknown and practically ineffective (see a detailed discussion in Section 6.1).
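For concreteness, this naive baseline is a few lines in Spark using the built-in approxQuantile, which later serves as a comparison point for our implementations. A sketch only; the input path, column name, and error 0.01 are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("naive-mad").getOrCreate()
df = spark.read.parquet("data.parquet")  # hypothetical input with a "value" column

eps = 0.01  # relative error of each quantile approximation

# First pass: approximate median of the values.
med = df.approxQuantile("value", [0.5], eps)[0]

# Second pass: approximate median of absolute deviations from that median.
dev = df.select(F.abs(F.col("value") - F.lit(med)).alias("dev"))
approx_mad = dev.approxQuantile("dev", [0.5], eps)[0]
```

As noted above, the composed error of the two approximations is not bounded by eps in any obvious way, which is exactly what motivates our approach.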
To the best of our knowledge, this is the first study on approximating MAD with bounded error. Our contributions include:
• an approximate MAD algorithm with an explicit error bound, which is fully mergeable (see the interface sketch after this list) and needs only a 1-pass scan of the data;
• a solution by calling the proposed algorithm twice, which is parallelizable and guaranteed to find an $(\epsilon, 1)$-accurate MAD;
• a pruning strategy to skip buckets in the sketch of the second scan, significantly reducing the space cost;
• a distributed implementation in Apache Spark and a streaming implementation in Structured Streaming, both outperforming the simple MAD approximation using the built-in approxQuantile for the approximate median.
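Mergeability here has the usual meaning: sketches built independently over data partitions can be combined into one sketch summarizing their union, which is what enables the distributed and streaming implementations. A hypothetical interface illustrating the contract (not the paper's Algorithm 2, whose internals appear in Section 3):

```python
class MADSketch:
    """Hypothetical contract of a fully mergeable MAD sketch."""

    def update(self, x: float) -> None:
        """Absorb one element; with merge(), enables a 1-pass scan."""
        ...

    def merge(self, other: "MADSketch") -> "MADSketch":
        """Combine two sketches so the result summarizes the union of
        their inputs, e.g., one sketch per Spark partition."""
        ...

    def query(self) -> float:
        """Return the approximate MAD with a bounded error."""
        ...
```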
The paper is organized as follows. In Section 2, we first give the formal definition of MAD and the accuracy measure of approximate MAD. In Section 3, we present the approximate MAD algorithm and analyze its error bound as well as its complexity. In Section 4, we illustrate the solution for approximate MAD with the desired error guarantees, together with the bucket pruning strategy. In Section 5, we compare our solution with several baselines in experiments, showing the superiority of our proposal in accuracy and resource consumption. Finally, we discuss related work in Section 6 and conclude in Section 7.
Table 1 lists the frequently used notations.
2 FRAMEWORK
In this section, we review the MAD problem together with an exact
computation algorithm, and introduce approximate MAD as well
as the framework of our approximation solution.
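For reference, the definition the introduction appeals to, formalized for a multiset $X = \{x_1, \ldots, x_n\}$:
\[
\mathrm{MAD}(X) \;=\; \operatorname*{median}_{1 \le i \le n} \Bigl|\, x_i - \operatorname*{median}_{1 \le j \le n} x_j \,\Bigr|.
\]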