
izes the components to similar classes and enables conditional execution for faster inference.
2. Related Work
Sparsity. A common approach to improving DNN efficiency is to enforce sparsity in network connectivity [21, 29, 4, 13]. This can be achieved via manual design of new DNN modules (Inception [27], SqueezeNet [16], MobileNet [13]) or via automated techniques that identify and remove the least important connections from a dense network [11, 10]. In either case, determination of the network topology is a static preprocessing step, and all connections are evaluated at the time of inference.
A complementary optimization is to employ conditional execution at the time of inference to exploit sparsity in activations (skipping computation and memory accesses for model weights when activations are known to be zero). While attractive for reducing the energy consumption of DNN accelerators [9], fine-grained, per-element sparsity is difficult to exploit on CPUs and GPUs, which rely heavily on wide vector processing for performance. The subtask specialization we exploit in HydraNets can be viewed as a mechanism for designing and training a network architecture that, through coarse-grained conditional execution, is able to more effectively exploit dynamic sparsity.
Cascades. Cascades [28] are a common form of conditional model execution that reduces inference cost (on average) by quickly terminating model evaluation on inputs that are easy to process (“early out”). Region proposal based models [24] for object detection are a canonical example of cascades in DNNs. Recent work has shown that integrating cascades into deep network architectures [8, 14] can improve the accuracy vs. cost trade-off of state-of-the-art architectures, where the later stages in a cascade specialize for difficult problem instances. The HydraNet approach of specializing network components for different subtasks is orthogonal and complementary to the benefits of cascades.
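To make the early-out idea concrete, the following is a minimal sketch of a two-stage cascade; it is not the method of [28] or of the cascaded architectures cited above, and `cheap_model`, `full_model`, and the confidence threshold are illustrative placeholders.

```python
import torch

def cascade_predict(x, cheap_model, full_model, threshold=0.9):
    """Run the cheap stage; invoke the full model only for inputs whose
    top-1 confidence falls below the threshold (the "early out")."""
    probs = torch.softmax(cheap_model(x), dim=-1)     # [batch, num_classes]
    conf, pred = probs.max(dim=-1)
    hard = conf < threshold                           # inputs the cheap stage is unsure about
    if hard.any():
        pred[hard] = full_model(x[hard]).argmax(dim=-1)
    return pred
```

The average cost depends on how many inputs clear the threshold, which is why cascades help most when the majority of inputs are easy.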
Mixture of experts. The idea of specializing components of a model for different subtasks is related to mixture-of-experts models, where the experts are specialized for different inputs or tasks. Recent work on training very large DNNs for language modeling [26] has used conditional execution to evaluate only a small fraction of the experts for each training instance. One of the key aspects addressed in [26] is the design of the mechanism for choosing which experts to evaluate and the trade-offs in network architecture needed to maintain computational efficiency. These design choices are tailored to recurrent models and cannot be directly applied to state-of-the-art image classification models, which are feed-forward convolutional networks.
Hierarchical classification. Categories in ImageNet [6, 25] are organized into a semantic hierarchy using an external knowledge base. The hierarchy can be used to first predict the super class and only perform fine-grained classification within that super class [5, 7]. HDCNN [30] is a hierarchical image classification architecture that is similar in spirit to our approach. HDCNN and Ioannou et al. [20, 17] improve accuracy with a significant increase in cost relative to the baseline architecture. In contrast, HydraNets on ImageNet improve top-1 accuracy by 1.18–2.5% with the same inference cost as the corresponding baseline architectures.
Both HDCNN and Ioannou et al. model the routing weights as continuous variables that are used to linearly combine the outputs of multiple experts. Jointly learning the routing weights with the experts is similar to LEARN in Table 8 and performs poorly due to optimization difficulties (collapse and poor utilization of experts). HDCNN uses complex multi-stage training to mitigate these optimization issues and provides robustness to routing error by overlapping the classes handled by each expert. HydraNets instead use binary weights for the experts during both training and inference by dropping out all but the top-k experts. This enables joint training of all HydraNet components while allowing flexible usage of experts.
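For concreteness, the following is a minimal sketch of this kind of top-k hard gating, assuming gate scores of shape [batch, num_experts]; the binarization and the averaging combiner shown here are illustrative assumptions, not the exact formulation used in our experiments.

```python
import torch

def topk_binary_gate(gate_logits, k):
    """Binarize gate scores: keep the top-k experts per example, drop the rest."""
    idx = gate_logits.topk(k, dim=-1).indices   # [batch, k] selected expert indices
    mask = torch.zeros_like(gate_logits)        # [batch, num_experts]
    mask.scatter_(-1, idx, 1.0)                 # exactly k ones per row
    return mask

def combine(expert_outputs, mask, k):
    """Average the features of the k selected experts.
    expert_outputs: [batch, num_experts, feat_dim]; mask: [batch, num_experts]."""
    return (mask.unsqueeze(-1) * expert_outputs).sum(dim=1) / k
```

Because the mask is binary, the non-selected experts contribute nothing and need not be evaluated at inference time.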
Architectural structures similar to HydraNet [1] have been used for learning the partition of categories into disjoint subsets. Our main contribution is a gating mechanism that reduces inference cost by dynamically choosing which components of the network to evaluate at runtime. Recent work [22, 23] has explored directly incorporating inference cost into the optimization and training methods for jointly learning the routing and the network features. In contrast to the complex training regimes required for such joint learning, our approach enables a simple and effective training strategy, which we comprehensively evaluate on both the ImageNet and CIFAR-100 datasets.
3. HydraNet Architecture Template
The HydraNet template, shown in Figure 1, has four major components (a minimal code sketch follows the list):
• Branches that are specialized for computing features on visually similar classes. We view computing features relevant to a subset of the network inputs as a subtask of the larger classification task.
• A stem that computes features used by all branches and in deciding which subtasks to perform for an input.
• A gating mechanism that decides which branches to execute at inference time using features from the stem.
• A combiner that aggregates features from multiple branches to make the final predictions.
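The sketch below shows how the four components fit together in a PyTorch-style forward pass; the specific layer shapes, the linear gate, and the averaging combiner are illustrative assumptions rather than the architecture used in our experiments.

```python
import torch
import torch.nn as nn

class HydraNetSketch(nn.Module):
    def __init__(self, feat_dim=256, num_branches=8, num_classes=1000, k=2):
        super().__init__()
        self.k = k
        # Stem: shared features used by the gate and all branches.
        self.stem = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Gate: scores each branch given the stem features.
        self.gate = nn.Linear(feat_dim, num_branches)
        # Branches: one specialized feature extractor per class subset.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
            for _ in range(num_branches)])
        # Combiner: aggregates branch features and makes the final prediction.
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        s = self.stem(x)                                   # [batch, feat_dim]
        top = self.gate(s).topk(self.k, dim=-1).indices    # branches to execute
        outs = []
        for b in range(x.size(0)):                         # per-example routing
            feats = [self.branches[i](s[b]) for i in top[b].tolist()]
            outs.append(torch.stack(feats).mean(0))        # average the k branches
        return self.classifier(torch.stack(outs))
```

At inference only the k selected branches are evaluated per input, which is where the cost savings come from; a batched implementation would group examples by selected branch rather than looping per example as in this sketch.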
Realizing the HydraNet template requires partitioning the classes into visually similar groups that the branches specialize for, an accurate and cost-effective gating mechanism for choosing branches to execute given an input, and