
728 T. Xia et al.
transforming them to a more concise form, e.g., the small
difference between two integers. In this study, we propose
to explore the features specialized for encoding time series,
guided by intuitions such as values usually not changing
significantly over time, i.e., small deltas.
Different data features lead to different encoding perfor-
mances. In this paper, we present a comparative analysis of
time series data encoding techniques in Apache IoTDB, an
open-source time series database developed in our prelim-
inary studies [53]. Since the decoding process is often the
reverse of the corresponding encoding, we mainly focus on
the encoding part and omit the similar decoding part. Our
major contributions are summarized as follows.
(1) We summarize time series data features that may affect
the performance of encoding in Sect. 2. Intuitively, the
scale of values is an important factor in storage. Likewise,
when storing the delta between two consecutive
values, the scale of the deltas becomes a key issue. The numbers
of value repeats and increases are also essential to some
encoding ideas. The results of recent feature extraction work are
also employed as data features.
(2) We present a qualitative analysis of encoding effectiveness
with regard to various data features in Sect. 3. It
covers classical algorithms such as RLE and also recent
proposals like BUFF [41]. While no algorithm wins across all
the data features, TS_2DIFF performs well in a number
of cases. The cases where TS_2DIFF shows worse
results, such as repeat rate 1 in Fig. 24, may be less
frequent and less significant in practice.
(3) We devise a benchmark for time series data encoding.
It consists of (a) data generators for simulating various
data features, (b) several real-world datasets, public or
collected by our industrial partners, (c) metrics such as
compression ratio (space cost after encoding and compressing
divided by original space cost). In particular,
multiple features could vary at the same time in the gen-
erator, such as large values but small deltas, so that the
distinct cases favored by different algorithms could be
illustrated. Moreover, it supports both numerical and text
values.
(4) We conduct an extensive experimental evaluation in
Sect. 6. The quantitative analysis generally verifies the
aforesaid qualitative analysis of encoding performance
with regard to various data features.
(5) We propose TSEC, a machine learning tool to recom-
mend encoding algorithms by data features in Sect. 7.
We choose the best classifier for TSEC from popular
ones, and compare TSEC with others.
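To make the compression ratio metric in contribution (3) concrete, the following is a minimal sketch, not the benchmark's actual code. It assumes a hypothetical generator (`generate_series`) that produces large values with small deltas, a plain delta encoding, 64-bit integer storage as the original space cost, and `zlib` as a stand-in general-purpose compressor. All function names here are illustrative only.

```python
import random
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store each consecutive difference."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


def pack_int64(values):
    """Serialize integers as little-endian signed 64-bit words."""
    return struct.pack(f"<{len(values)}q", *values)


def compression_ratio(original: bytes, stored: bytes) -> float:
    """Space cost after encoding and compressing, divided by original space cost."""
    return len(stored) / len(original)


def generate_series(n, base, max_step, seed=0):
    """Hypothetical generator: start at `base`, step by at most `max_step`."""
    rng = random.Random(seed)
    values = [base]
    for _ in range(n - 1):
        values.append(values[-1] + rng.randint(-max_step, max_step))
    return values


# Large values but small deltas, as in the generator example in the text.
series = generate_series(n=1000, base=1_000_000_000, max_step=3)
original = pack_int64(series)

raw_ratio = compression_ratio(original, zlib.compress(original))
delta_ratio = compression_ratio(
    original, zlib.compress(pack_int64(delta_encode(series)))
)
print(f"raw + zlib: {raw_ratio:.3f}, delta + zlib: {delta_ratio:.3f}")
```

A lower ratio means less space after encoding and compressing. Varying `base` and `max_step` independently is one way to explore cases where value scale and delta scale differ.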
Finally, we discuss related work in Sect. 8 and outline
some future directions in Sect. 9 based on the analysis.
The source code of encoding algorithms has been deployed
in the GitHub repository of Apache IoTDB [22]. The experiment-related
code and data are available in [23].
Fig. 1 Example of real data with distinct features on (a) large/small
scale, (b) large/small delta, (c) vast/rare repeats and (d) vast/rare increases,
affecting encoding performance
2 Data features
To analyze how the encoding algorithms perform on
different data, and consequently to recommend proper encoding
algorithms, we select several features that may affect
the performance of encoding, including scale, delta, repeat
and increase. Figure 1 presents some typical examples of (a)
large/small scale, (b) large/small delta, (c) vast/rare repeats
and (d) vast/rare increases.
These features reflect the characteristics of data values in
terms of size, change, repetition and trend. For simplicity,
we use $TS = [v_1, v_2, \ldots, v_n]$ to denote the value list of
time series. In the following subsections, we will define and
analyze each feature in detail.
2.1 Scale for numerical data
The scale of data is one of the most important factors in
storage. In general, the larger the values are, the more bits
we need to encode them. The scale of data also affects the
performance of different compression methods. For example,
some methods need to store a header for each value, which
requires more bits for larger values.
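As a rough illustration of why larger values need more bits, the sketch below computes the minimum bit width of non-negative integers and the cost of a simple fixed-width packing of a block. The helper names are hypothetical, and the sketch ignores sign handling and any per-value header; it is not one of the encoders studied in the paper.

```python
def bits_needed(value: int) -> int:
    """Minimum number of bits for a non-negative integer (at least 1)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return max(1, value.bit_length())


def block_width(values) -> int:
    """Fixed width for a block: every value uses the width of the largest one."""
    return max(bits_needed(v) for v in values)


def packed_bits(values) -> int:
    """Total payload bits when the whole block is packed at one fixed width."""
    return block_width(values) * len(values)


print(bits_needed(5))          # 3
print(bits_needed(300))        # 9
print(bits_needed(1_000_000))  # 20

block = [1, 2, 3, 300]
print(block_width(block))      # 9: one large value widens the whole block
print(packed_bits(block))      # 36
```

In this simplified scheme a single large value raises the cost of every value in the block, which is one reason the scale of the data matters.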
To this end, we employ the mean, standard deviation and
spread (maximum minus minimum) of the values in time
series TS, denoted by Mean(TS), Std(TS), and Spread(TS)
to represent the scale features.
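The three scale features can be computed directly from the value list. The sketch below is illustrative: the function name `scale_features` is hypothetical, and it assumes the population standard deviation, since the text does not say whether the population or sample form is used.

```python
import statistics


def scale_features(ts):
    """Return Mean(TS), Std(TS) and Spread(TS) for a non-empty value list."""
    return {
        "mean": statistics.fmean(ts),
        # Assumption: population standard deviation (divides by n).
        "std": statistics.pstdev(ts),
        # Spread is the maximum minus the minimum.
        "spread": max(ts) - min(ts),
    }


features = scale_features([2, 4, 4, 4, 5, 5, 7, 9])
print(features)  # mean 5.0, std 2.0, spread 7
```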
In addition to the traditional features on scale, we consider
another SB_BinaryStats_mean_longstretch1(TS), abbrevi-