
MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving
Marvin Teichmann¹²³, Michael Weber², Marius Zöllner², Roberto Cipolla³ and Raquel Urtasun¹⁴
¹ Department of Computer Science, University of Toronto
² FZI Research Center for Information Technology, Karlsruhe
³ Department of Engineering, University of Cambridge
⁴ Uber Advanced Technologies Group
marvin.teichmann@googlemail.com, Michael.Weber@fzi.de,
zoellner@fzi.de, rc10001@cam.ac.uk, urtasun@cs.toronto.edu
Abstract— While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real-time applications such as autonomous driving. Towards this goal, we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks. Our approach is very simple, can be trained end-to-end and performs extremely well on the challenging KITTI dataset. Our approach is also very efficient, allowing us to perform inference at more than 23 frames per second.
Training scripts and trained weights to reproduce our results can be found here: https://github.com/MarvinTeichmann/MultiNet
I. INTRODUCTION
Current advances in the field of computer vision have made clear that visual perception is going to play a key role in the development of self-driving cars. This is mostly due to the deep learning revolution, which began with the introduction of AlexNet in 2012 [29]. Since then, the accuracy of new approaches has been increasing at a vertiginous rate, driven by the availability of more data, increased computational power and algorithmic developments. The current trend is to create deeper networks with as many layers as possible [22].
While performance is already extremely high, when deal-
ing with real-world applications, running time becomes im-
portant. New hardware accelerators as well as compression,
reduced precision and distillation methods have been ex-
ploited to speed up current networks.
In this paper we take an alternative approach and design
a network architecture that can very efficiently perform
classification, detection and semantic segmentation simulta-
neously. This is done by incorporating all three tasks into a
unified encoder-decoder architecture. We name our approach
MultiNet.
The encoder is a deep CNN, producing rich features that are shared among all tasks. These features are then utilized by task-specific decoders, which produce their outputs in real time. In particular, the detection decoder combines the fast regression design introduced in YOLO [45] with the size-adjusting ROI-align of Faster R-CNN [17] and Mask R-CNN [21], achieving a better speed-accuracy trade-off.
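The shared-encoder, multi-decoder layout described above can be sketched in PyTorch. This is an illustrative toy, not the authors' implementation: all layer sizes, module names and output dimensions here are assumptions for exposition only.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Toy sketch: one shared encoder feeding three task-specific decoders.
    Layer sizes are illustrative, not the MultiNet configuration."""
    def __init__(self, num_classes=2, num_det=6, num_seg=2):
        super().__init__()
        # Shared encoder: a small CNN standing in for a deep backbone
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Classification decoder: global pooling + linear layer
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        # Detection decoder: per-cell regression in the spirit of YOLO
        self.det_head = nn.Conv2d(64, num_det, 1)
        # Segmentation decoder: 1x1 conv + upsampling to input resolution
        self.seg_head = nn.Sequential(
            nn.Conv2d(64, num_seg, 1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False))

    def forward(self, x):
        feat = self.encoder(x)  # computed once, shared by all three tasks
        return self.cls_head(feat), self.det_head(feat), self.seg_head(feat)

net = MultiTaskNet()
cls_out, det_out, seg_out = net(torch.randn(1, 3, 64, 64))
```

The key design point is in `forward`: the encoder runs once per image, so the marginal cost of each additional task is only its lightweight decoder.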
Fig. 1: Our goal: Solving street classification, vehicle detec-
tion and road segmentation in one forward pass.
We demonstrate the effectiveness of our approach on the challenging KITTI benchmark [15] and show state-of-the-art performance in road segmentation. Importantly, our ROI-align implementation can significantly improve detection performance without requiring an explicit proposal generation network. This gives our decoder a significant speed advantage compared to Faster R-CNN [46]. Our approach is able to benefit from sharing computations, allowing us to perform inference in less than 45 ms for all tasks.
II. RELATED WORK
In this section we review current approaches to the
tasks that MultiNet tackles, i.e., detection, classification and
semantic segmentation. We focus our attention on deep
learning based approaches.
a) Classification: After the development of AlexNet
[29], most modern approaches to image classification utilize
deep learning. Residual networks [22] constitute the state-of-the-art, as they make it possible to train very deep networks without vanishing or exploding gradients. In the context
of road classification, deep neural networks are also widely
employed [37]. Sensor fusion has also been exploited in this
context [50]. In this paper we use classification to guide other
semantic tasks, i.e., segmentation and detection.
b) Detection: Traditional deep learning approaches to object detection follow a two-step process, where region proposals [31], [25], [24] are first generated and then scored
using a convolutional network [18], [46]. Additional perfor-
mance improvements can be gained by using convolutional
neural networks (CNNs) for the proposal generation step
[10], [46] or by reasoning in 3D [6], [5]. Recently, several
2018 IEEE Intelligent Vehicles Symposium (IV)
Changshu, Suzhou, China, June 26-30, 2018