
[Figure 2: Visualization of the feature maps. (a) Two images in Pascal VOC 2007. (b) The feature maps of some conv5 filters (#175, #55, #66, #118). The arrows indicate the strongest responses and their corresponding positions in the images. (c) The ImageNet images that have the strongest responses of the corresponding filters. The green rectangles mark the receptive fields of the strongest responses.]
In this paper, we introduce a spatial pyramid pool-
ing (SPP) [14], [15] layer to remove the fixed-size
constraint of the network. Specifically, we add an
SPP layer on top of the last convolutional layer. The
SPP layer pools the features and generates fixed-
length outputs, which are then fed into the fully-
connected layers (or other classifiers). In other words,
we perform some information “aggregation” at a
deeper stage of the network hierarchy (between con-
volutional layers and fully-connected layers) to avoid
the need for cropping or warping at the beginning.
Figure 1 (bottom) shows the change of the network
architecture by introducing the SPP layer. We call the
new network structure SPP-net.
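As a concrete sketch, the pooling step described above can be written in a few lines. The function below is an illustrative NumPy implementation, not the paper's exact code: it max-pools a C x H x W feature map into each level of a small pyramid (the bin boundaries here use one common floor/ceil partitioning), so the output length depends only on the number of channels and the number of bins, never on H or W.

```python
import numpy as np

def spp_pool(fmap, levels=(4, 2, 1)):
    """Spatial pyramid pooling sketch: max-pool a C x H x W feature map
    into an n x n grid of bins for each pyramid level, then concatenate.
    Output length is C * sum(n*n for n in levels), independent of H, W."""
    C, H, W = fmap.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # floor/ceil bin boundaries so the bins tile the whole map
                h0, h1 = (i * H) // n, -((-(i + 1) * H) // n)
                w0, w1 = (j * W) // n, -((-(j + 1) * W) // n)
                pooled.append(fmap[:, h0:h1, w0:w1].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Because the bin grid adapts to H and W, feature maps of any spatial size produce a vector of the same fixed length, which is exactly what the fully-connected layers require.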
We believe that aggregation at a deeper stage is
more physiologically sound and more compatible
with the hierarchical information processing in our
brains. When an object comes into our field of view,
it is more reasonable that our brains consider it as
a whole instead of cropping it into several “views”
at the beginning. Similarly, it is unlikely that our
brains distort all object candidates into fixed-size re-
gions for detecting/locating them. It is more likely
that our brains handle arbitrarily-shaped objects at
some deeper layers, by aggregating the already deeply
processed information from the previous layers.
Spatial pyramid pooling [14], [15] (popularly
known as spatial pyramid matching or SPM [15]), as
an extension of the Bag-of-Words (BoW) model [16],
is one of the most successful methods in computer
vision. It partitions the image into divisions from
finer to coarser levels, and aggregates local features
in them. SPP has long been a key component in the
leading and competition-winning systems for classi-
fication (e.g., [17], [18], [19]) and detection (e.g., [20])
before the recent prevalence of CNNs. Nevertheless,
SPP has not been considered in the context of CNNs.
We note that SPP has several remarkable properties
for deep CNNs: 1) SPP is able to generate a fixed-
length output regardless of the input size, while the
sliding window pooling used in the previous deep
networks [3] cannot; 2) SPP uses multi-level spatial
bins, while the sliding window pooling uses only
a single window size. Multi-level pooling has been
shown to be robust to object deformations [15]; 3) SPP
can pool features extracted at variable scales thanks
to the flexibility of input scales. Through experiments
we show that all these factors elevate the recognition
accuracy of deep networks.
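The fixed output length follows from simple arithmetic. As a quick check, assuming the 4-level pyramid {6x6, 3x3, 2x2, 1x1} and 256 conv5 filters used in the paper's ImageNet experiments:

```python
# Output dimensionality of SPP: (number of filters) x (total bins),
# regardless of the input image size.
levels = (6, 3, 2, 1)                # each level pools into an n x n grid
bins = sum(n * n for n in levels)    # 36 + 9 + 4 + 1 = 50 bins per filter
filters = 256                        # conv5 filters in the networks used
output_dim = filters * bins          # 256 x 50 = 12800
```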
SPP-net not only makes it possible to generate rep-
resentations from arbitrarily sized images/windows
for testing, but also allows us to feed images with
varying sizes or scales during training. Training with
variable-size images increases scale-invariance and
reduces over-fitting. We develop a simple multi-size
training method. For a single network to accept
variable input sizes, we approximate it by multiple
networks that share all parameters, while each of
these networks is trained using a fixed input size. In
each epoch we train the network with a given input
size, and switch to another input size for the next
epoch. Experiments show that this multi-size training
converges just as the traditional single-size training
does, and leads to better testing accuracy.
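The epoch-wise switching can be sketched as a simple schedule. The sizes 224 and 180 below are the two training resolutions used in the paper; `train_one_epoch` is a hypothetical placeholder for a full training pass over the shared-parameter network.

```python
# Multi-size training schedule sketch: one fixed input size per epoch,
# cycling through the available sizes. All network parameters are
# shared across sizes; only the SPP bin geometry changes.
SIZES = (224, 180)

def size_schedule(num_epochs, sizes=SIZES):
    """Return the input size used in each epoch."""
    return [sizes[e % len(sizes)] for e in range(num_epochs)]

# for epoch, size in enumerate(size_schedule(10)):
#     train_one_epoch(model, input_size=size)  # hypothetical trainer
```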
The advantages of SPP are orthogonal to the specific
CNN designs. In a series of controlled experiments on
the ImageNet 2012 dataset, we demonstrate that SPP
improves four different CNN architectures in existing
publications [3], [4], [5] (or their modifications), over
the no-SPP counterparts. These architectures have
various filter numbers/sizes, strides, depths, or other
designs. It is thus reasonable for us to conjecture
that SPP should improve more sophisticated (deeper
and larger) convolutional architectures. SPP-net also
shows state-of-the-art classification results on Cal-
tech101 [21] and Pascal VOC 2007 [22] using only a
single full-image representation and no fine-tuning.
SPP-net also shows great strength in object detec-
tion. In the leading object detection method R-CNN
[7], the features from candidate windows are extracted
via deep convolutional networks. This method shows
remarkable detection accuracy on both the VOC and