
Published as a conference paper at ICLR 2017
TEMPORAL ENSEMBLING FOR SEMI-SUPERVISED LEARNING
Samuli Laine, NVIDIA, slaine@nvidia.com
Timo Aila, NVIDIA, taila@nvidia.com
ABSTRACT
In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions. This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training. Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the (non-augmented) classification error rate from 18.44% to 7.05% in SVHN with 500 labels and from 18.63% to 16.55% in CIFAR-10 with 4000 labels, and further to 5.12% and 12.16% by enabling the standard augmentations. We additionally obtain a clear improvement in CIFAR-100 classification accuracy by using random images from the Tiny Images dataset as unlabeled extra inputs during training. Finally, we demonstrate good tolerance to incorrect labels.
1 INTRODUCTION
It has long been known that an ensemble of multiple neural networks generally yields better predictions than a single network in the ensemble. This effect has also been indirectly exploited when training a single network through dropout (Srivastava et al., 2014), dropconnect (Wan et al., 2013), or stochastic depth (Huang et al., 2016) regularization methods, and in swapout networks (Singh et al., 2016), where training always focuses on a particular subset of the network, and thus the complete network can be seen as an implicit ensemble of such trained sub-networks. We extend this idea by forming ensemble predictions during training, using the outputs of a single network on different training epochs and under different regularization and input augmentation conditions. Our training still operates on a single network, but the predictions made on different epochs correspond to an ensemble prediction of a large number of individual sub-networks because of dropout regularization.
This ensemble prediction can be exploited for semi-supervised learning where only a small portion of training data is labeled. If we compare the ensemble prediction to the current output of the network being trained, the ensemble prediction is likely to be closer to the correct, unknown labels of the unlabeled inputs. Therefore the labels inferred this way can be used as training targets for the unlabeled inputs. Our method relies heavily on dropout regularization and versatile input augmentation. Indeed, without either, there would be much less reason to place confidence in whatever labels are inferred for the unlabeled training data.
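To make this training target concrete, the following is a minimal sketch, in PyTorch-style Python, of a combined loss in which labeled inputs contribute standard cross-entropy and all inputs are additionally pulled towards an ensemble prediction. This is an illustrative sketch, not our reference implementation; the names model, augment, ensemble_target, and the ramp-up weight w_t are placeholders.

import torch
import torch.nn.functional as F

def self_ensembling_loss(model, x, y, labeled_mask, ensemble_target, w_t, augment):
    # Stochastic augmentation and dropout make each forward pass behave like a
    # different member of an implicit ensemble.
    z = model(augment(x))  # current network outputs (logits)

    # Supervised term: cross-entropy on the labeled subset only.
    sup = F.cross_entropy(z[labeled_mask], y[labeled_mask])

    # Unsupervised term: squared difference between the current class
    # probabilities and the (fixed) ensemble target, for all inputs.
    unsup = F.mse_loss(F.softmax(z, dim=1), ensemble_target.detach())

    # w_t ramps up from zero so the unsupervised term does not dominate
    # before the ensemble targets become meaningful.
    return sup + w_t * unsup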
We describe two ways to implement self-ensembling: the Π-model and temporal ensembling. Both approaches surpass prior state-of-the-art results in semi-supervised learning by a considerable margin. We furthermore observe that self-ensembling improves the classification accuracy in fully labeled cases as well, and provides tolerance against incorrect labels.
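As a further illustration of the second variant, the ensemble targets can be accumulated over training epochs with an exponential moving average and corrected for their startup bias. The sketch below assumes this formulation; the momentum value 0.6 and the bias correction follow the defaults described later in the paper, while the function and variable names are placeholders of our own.

import numpy as np

def update_ensemble_targets(Z, z_epoch, epoch, alpha=0.6):
    # Z: running ensemble of predictions, one row per training sample.
    # z_epoch: class probabilities produced by the network during this epoch.
    Z = alpha * Z + (1.0 - alpha) * z_epoch
    # Startup bias correction (analogous to Adam): without it the targets
    # are biased towards zero during the first epochs (epoch is 0-based here).
    targets = Z / (1.0 - alpha ** (epoch + 1))
    return Z, targets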
The recently introduced transform/stability loss of Sajjadi et al. (2016b) is based on the same principle as our work, and the Π-model can be seen as a special case of it. The Π-model can also be seen as a simplification of the Γ-model of the ladder network by Rasmus et al. (2015), a previously presented network architecture for semi-supervised learning. Our temporal ensembling method has connections to the bootstrapping method of Reed et al. (2014) targeted for training with noisy labels.