
urgent and challenging, with the contradiction of dependencies
between complete data and imputed values. For instance, if
only those complete values in Figure 1 are involved in training
imputation models, they could be insufficient to represent
statistics of missing values $x_2[A_3]$, $x_7[A_1]$ and $x_7[A_2]$.
B. Solution
Considering the aforesaid challenges and limitations of
existing techniques, we study the collaborative imputation with
convergence guarantee to impute missing values in multivari-
ate time series. The benefits are in two aspects.
(1) Both dependent and determinant missing values can
be collaboratively imputed with the convergence guarantee,
rather than filled independently. For instance, as shown in
Figure 1, $x_7[A_1]$ and $x_7[A_2]$ are collaboratively imputed for 50
rounds until convergence, with the fillings becoming more
accurate as the rounds increase. Notably, to improve
the imputation efficiency, we further consider the parallel
computation strategies for missing cells involved in disjoint
dependency models. For instance, the red and blue boxes in
Figure 1 represent two distinct imputation contexts that can
be processed in parallel: the observations within the red
box form the imputation context used to impute $x_2[A_3]$, and
the observations within the blue box form the imputation
contexts used to fill the missing values $x_7[A_1]$ and $x_7[A_2]$.
(2) Both missing values and imputation models can be col-
laboratively optimized with the certified convergence, instead
of using fixed models. For instance, not only are the complete
values used for model training, but the incomplete
cells $x_2[A_3]$, $x_7[A_1]$ and $x_7[A_2]$ are also collaboratively involved
in the model optimization according to their fillings. That is, our
work removes the assumption of existing studies that
accurate imputation models can be trained over complete
data only. Moreover, the algorithm is also extended into a parallel
version, where models are asynchronously updated.
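The parallel strategy mentioned in (1) can be illustrated with a minimal sketch. It assumes, purely for illustration, that each missing row i has a symmetric row window [i - omega, i + omega] and that two contexts are independent when their windows share no row. The names group_disjoint_contexts, impute_group and run_parallel are hypothetical and are not taken from the paper; impute_group is only a placeholder for whatever per-group imputation routine is used.

```python
from concurrent.futures import ThreadPoolExecutor


def group_disjoint_contexts(missing_rows, omega):
    """Group missing-row indices whose windows [i - omega, i + omega] overlap.

    Assumption: overlap of row windows is used as a stand-in for
    "connected" imputation contexts.
    """
    groups = []
    for i in sorted(set(missing_rows)):
        # The new window starts at i - omega; the previous group's last
        # window ends at groups[-1][-1] + omega.
        if groups and i - omega <= groups[-1][-1] + omega:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def impute_group(rows):
    """Hypothetical placeholder for a per-group imputation routine."""
    return {"rows": rows, "status": "imputed"}


def run_parallel(missing_rows, omega):
    """Process each disjoint group of contexts in its own worker."""
    groups = group_disjoint_contexts(missing_rows, omega)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(impute_group, groups))


# Rows of the missing cells x_2[A_3], x_7[A_1], x_7[A_2] with omega = 1.
results = run_parallel([2, 7, 7], omega=1)
```

Under these assumptions, rows 2 and 7 fall into separate groups, mirroring the red and blue boxes, while two missing rows that are close enough for their windows to touch would be kept in one group and handled together.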
Example 1: Figure 1 shows some example data from the
AirQuality dataset. Given the temporal window size $\omega = 1$, we
establish imputation contexts for the missing cells $x_2[A_3]$, $x_7[A_1]$
and $x_7[A_2]$, as marked in the red and blue boxes, respectively. To
impute missing values in each box, the most successful and
easy-to-use vector autoregressive (VAR) models [7] can be
employed to capture the dependencies between temporal data
within each designated box, since they are usually used to
model the relationships between adjacent data. The star lines
with different sizes show our collaborative imputation results
after different rounds, where fillings are initialized by recent
attribute values. As for the single missing cell $x_2[A_3]$, both
VAR models and our work could obtain a near-optimal filling.
Unfortunately, when there are multiple missing cells $x_7[A_1]$,
$x_7[A_2]$ in both determinant and dependent attributes, VAR
can no longer be applied directly. In contrast, our collaborative
imputation with the convergence guarantee could gradually
compute accurate fillings for them until convergence.
The example illustrates that, with the convergence guaran-
teed collaborative imputation, we can optimize both imputa-
tion values and models for the multivariate time series.
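The idea of updating fillings and models together can be sketched with a simplified alternating loop. This is not the algorithm of the paper and carries no convergence guarantee: it merely refits a first-order VAR by least squares on the currently filled data, re-predicts the missing cells from the previous tuple, and stops after a capped number of rounds or when the fillings stop changing. The function name collaborative_impute, the parameters max_rounds and tol, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def collaborative_impute(X, max_rounds=50, tol=1e-6):
    """Alternate between refitting a VAR(1) model and refilling missing cells."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    n, m = X.shape

    # Initialise every missing cell with the most recent observed value
    # of its attribute (column mean if the series starts with a gap).
    for j in range(m):
        last = np.nanmean(X[:, j])
        for i in range(n):
            if missing[i, j]:
                filled[i, j] = last
            else:
                last = filled[i, j]

    for round_no in range(1, max_rounds + 1):
        # Model step: fit x_t ~ [x_{t-1}, 1] @ W on the current fillings.
        past = np.hstack([filled[:-1], np.ones((n - 1, 1))])
        W, *_ = np.linalg.lstsq(past, filled[1:], rcond=None)

        # Filling step: overwrite only the missing cells in rows 1..n-1.
        # Missing cells in row 0 keep their initial filling.
        pred = past @ W
        updated = filled.copy()
        mask = missing[1:]
        updated[1:][mask] = pred[mask]

        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled, round_no


# Toy series with one single gap and one tuple missing two attributes.
X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [1.2, np.nan],
    [1.3, 2.3],
    [np.nan, np.nan],
    [1.5, 2.5],
])
filled, rounds = collaborative_impute(X)
```

Because the model is refit on data that includes the current fillings, each round changes both the model and the filled values, which is the interplay the example describes; observed cells are never modified.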
Fig. 2. The complete imputation context $C_{41}$ and the incomplete $C_{72}$
C. Contributions
Our main technical contributions are as follows.
1) We formalize the likelihood of the imputed multivariate
time series w.r.t. dependency models in Section II. The statis-
tically explainable collaborative imputation by the likelihood
is then derived, which demonstrates the rationale of our work.
2) We design a sequential collaborative imputation algorithm
that maximizes the likelihood with a convergence
guarantee (Proposition 1) in Section IV-B. It is further extended
into a parallel version, based on whether the imputation
contexts are connected, in Section IV-C; for fixed updates, the
parallel version is guaranteed to return the same result as the
sequential algorithm (Proposition 3).
3) We improve the algorithm with dynamic models to meet
the challenge from many missing values in Section V, whose
convergence is also ensured (Proposition 4). For efficiency,
the algorithm is also improved into a parallel version, with
the convergence guarantee (Propositions 6 and 7) w.r.t. both
fillings and dynamic models.
4) We consider the optimizations of our algorithms in
Section VI, including the streaming imputation for real-time
scenarios and the adaptive parameter determination strategy.
Various real incomplete datasets are employed in experi-
ments in Section VII, verifying the superiority of our work.
II. FOUNDATIONS
In this section, we first formalize the imputation contexts
and dependency models. The likelihood is then studied for fill-
ings w.r.t. the models. The collaborative imputation for various
missing values, which is statistically explainable referring to
the maximum likelihood estimation, is formally studied.
A. Imputation Contexts and Dependencies
Consider an incomplete multivariate time series $I = \{x_1, \ldots, x_n\}$
over schema $R = (A_1, \ldots, A_m)$, with a timestamp $t_i$ for each
tuple $x_i \in I$. Each $x_i \in I$ contains a collection of cells
$\{x_i[A_1], \ldots, x_i[A_m]\}$, where $x_i[A_j]$, or simply $x_{ij}$,
denotes the value of attribute $A_j$ in the $i$-th tuple. The null
cell on attribute $A_j$ at $t_i$ is $x_{ij} = \bot$. A filling $I'$ of $I$ is also an
instance of $R$ such that existing non-null cells do not change.
To impute missing values in $x_i \in I$, we may refer to
nearby tuples within a window of size $\omega$, e.g.,
$\{x_{lj} \mid i - \omega \le l < i + \omega, 1 \le j \le m\}$, because
multivariate time series may exhibit both intratemporal and
intertemporal dependencies. For
instance, in Figure 1, there is the intertemporal dependency
on temperature data over time, as well as the intratemporal
dependency between NOx, humidity and temperature attribute
values. Inspired by the existing study [57], which sets a window
size to repair time series data, we establish the imputation
context for each cell $x_{ij}$ to abstract such dependencies.
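The window of tuples referred to above can be sketched as a simple slice. This is only a minimal illustration of the set of tuples with i - ω ≤ l < i + ω, clipped to the bounds of the series; it is not the formal imputation context, whose definition follows. The function name imputation_window is hypothetical, the toy array is made up, and row indices are 0-based here, unlike the 1-based tuple indices in the text.

```python
import numpy as np


def imputation_window(X, i, omega):
    """Return the rows l with i - omega <= l < i + omega (0-based, clipped)."""
    n = X.shape[0]
    lo = max(0, i - omega)
    hi = min(n, i + omega)
    return X[lo:hi]


# Toy series with 8 tuples over 3 attributes and one null cell.
X = np.arange(24, dtype=float).reshape(8, 3)
X[6, 0] = np.nan

# Window around the tuple at 0-based index 6 with omega = 1.
ctx = imputation_window(X, 6, 1)
```

The slice keeps all m attributes of each selected tuple, so both the intertemporal neighbours of the cell and the intratemporal values in the same tuple are available to whatever model is trained on it.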