
Fig. 2: The encoded series ξ of regression residuals and the corresponding
bitwidth series b(ξ) in Example 1, for time series x in Figure 1(a) and (b)
points with both similar time and value in the series. For
example, we exchange the red points to the new positions
such that the significantly fluctuating values originally in
Figure 1(b) become more stable for regression in Figure 1(f).
Meanwhile, the corresponding nearby timestamps in Figure
1(e) are similar as well. As shown, the newly introduced
timestamp fluctuation is not as significant as that removed in
value. It is also less fluctuating than Figure 1(c), where the
points are ordered by value. In this sense, the overall residuals
of time and value are reduced in regression encoding.
B. Solution
Note that timestamps are crucial in IoT applications. Hence,
in regression encoding, both timestamps and values are en-
coded and stored, in the same order of data points. For data
points ordered by timestamps in Figure 1(a)(b), while the
timestamp storage is effective, the corresponding values have
large fluctuations and thus large auto-regression residuals. On
the other hand, for data points ordered by values in Figure
1(c)(d), the value storage is efficient, whereas the correspond-
ing timestamps suffer large fluctuations and again large auto-
regression residuals. To balance the trade-off, we propose to
find a better ordering of data points (by Algorithm 1) that may
have smaller residuals of auto-regression on timestamps and
values, respectively, and consequently the total storage usage.
Since both timestamps and values are encoded and stored,
they can be naturally decoded, respectively, by adding the
auto-regression model predictions with the stored residuals.
The decoded data points of timestamps and values are then
ordered again on timestamps, recovering exactly the original
time series by Algorithm 2. Therefore, REGER is lossless and
will not destroy the downstream tasks.
C. Contribution
Our major contributions in this study are as follows.
TABLE I: Notations
Notation Description
x a time series
n the number of values in a block of time series
q the number of values in a pack of time series
x
(t)
, x
(v)
timestamps and values of time series x
f
(t)
, f
(v)
regression models for timestamp and value
ǫ residual series of x
ξ encoded series o f ǫ
b(ξ) bitwidth series of ξ
x
A
x reordered by order A
x
α→β
reorder x by moving α-th point to position β
g(j ) points affected by moving a point from α
h(j ) points affected by moving a point to j
H (ξ
g(j )
),H (ξ
h(j )
) residual cost of g(j ) and h(j ) points in ξ
H (ξ
[1,n]\g(j )\h(j )
) residual cost of unchanged points in ξ
p the order of the autoregressive model
(1) We formalize the problem of time series reordering
for regression encoding, in Section II. Finding the optimal
reordering of time series for regression encoding, however,
is costly and not affordable during the extremely fast data
ingestion, especially in t he IoT scenarios.
(2) We propose an efficient heuristic algorithm that moves
one point at a time, leading to lower residuals, in Section III.
Rather than recomputing from scratch, an efficient incremental
strategy is devised for updating residuals after point moving.
(3) Remarkably, the proposed method, REGression Encod-
ing with Reordering (REGER), has now become an encoding
method in an open-source time series database, Apache IoTDB
[18], [34]. We present the system deployment in Section IV.
(4) We conduct extensive experiments over various real-
world datasets. The results demonstrate that the reordered time
series has better compression ratio than the natural time order.
The code of REGER method has been included in the
system repository of Apache IoTDB [6]. The experiment
related code and data are available in [7] for reproducibility.
Table I lists the frequently used notations.
II. ENCODING METHOD
Since large fluctuations like abnormal behaviors often ap-
pear in t ime series, the residuals will be very l arge for regres-
sion encoding. It leads to more bit widths for storing residuals,
higher space cost. Thus, we propose a method that reorders
the time series, which could reduce the bit widths of residuals.
In the following, we introduce regression encoding in Section
II-B and time s eries reordering problem in Section II-C.
A. Preliminary
Consider a time series of n data points. It usually corre-
sponds to a block in the database management system for data
encoding, as introduced in Section IV below.
Definition 1 (Time Series): A time series is denoted by two
sequences of time and value,
x = (x
(t)
, x
(v)
),
where x
(t)
contains n timestamps, and x
(v)
has n values.
x
(t)
= (x
(t)
1
, x
(t)
2
, . . . , x
(t)
n
), x
(v)
= (x
(v)
1
, x
(v)
2
, . . . , x
(v)
n
).
2
评论