Among such approaches are [27], who use features combined
with a classification process for vehicle description, and [28],
who use an attention-inspired model to suppress less impor-
tant image regions and obtain the moving foreground region.
On the other hand, motion-based approaches ([19], [26],
[10]) aim to work with optical flow and other geometric
constraints to determine varying motion patterns from IMOs.
We aim to provide an approach that combines appearance
cues of a car, obtained by a CNN-based detection, with
motion cues from optical flow based on the epipolar
geometry, in order to determine the presence of an
independently moving object and track it. In this respect our
approach shares some similarities with [16] who use two
separate CNNs to determine visual odometry and object lo-
calisation and fuse their results to obtain object localisations
and also with [25] who use CNNs to obtain a rigidity score
for each object and combine this with motion cues from
optical flow. In contrast to their approach, we additionally
use a series of hypothesis tests on keypoints lying on patches
that have previously been identified as belonging to a vehicle,
in a manner similar to [23], to track the regarded vehicle
through consecutive frames. Bai et al. [1] identify IMOs by
estimating dense optical flow for each IMO candidate.
In contrast, our approach utilizes sparse keypoints to
distinguish static objects from IMOs.
III. FRAMEWORK OVERVIEW
The overall flow of the proposed method is presented
in figure 1 (top image). It builds on a monocular visual
odometry framework, the propagation-based tracking (PbT)
scheme [7]. An important principle of PbT is that the relative
pose for a new frame n + 1 is predicted using the car
ego-dynamics, and that a refined relative pose is computed
only on the basis of keypoints that have been tracked at least
twice, that is: keypoints which have already passed a stringent
test of belonging to the static environment. All keypoints,
including the new ones generated in sparsely covered areas
of a new frame, are tracked in an epipolar-guided manner, as
discussed in more detail in section III-A.
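The propagation step can be sketched as follows; the constant-velocity ego-dynamics model and the 4 × 4 homogeneous pose representation are assumptions of this sketch, not necessarily the exact model used in PbT:

```python
import numpy as np

def predict_relative_pose(T_rel_prev):
    """Constant-velocity prediction: initialize the relative pose
    n -> n+1 with the last inter-frame motion (n-1 -> n).
    Illustrative only; the actual ego-dynamics model may differ."""
    return T_rel_prev.copy()

def propagate(T_world_n, T_rel):
    """Chain 4x4 homogeneous poses to obtain the pose of frame n+1."""
    return T_world_n @ T_rel
```

The prediction only seeds the epipolar-guided matching; the refined relative pose is then re-estimated from the twice-tracked static keypoints.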
In the present scheme, we detect IMO candidate patches
for each new frame using the instance segmentation scheme
of van den Brand et al. [22]. This results in M new IMO
candidate patches for each new frame n + 1. The generation
of the CNN-based IMO candidate patches is discussed in
section III-B.
All IMO candidate patches in image n are classified into
one of three states: static, IMO, or undetermined.
Keypoints that are located in IMO candidate patches are con-
sidered for pose estimation only if they have been classified
as static.
When a new frame n + 1 comes in, it is processed by the
CNN in order to determine the IMO candidate patches. At
the same moment, we have a set of old tracked keypoints
in previous frame n. Some of those are already confirmed
as static as they have been tracked at least twice and
have shown 3D consistency (see section IV-A). Others have
been tracked just once (from frame n − 1 to n) and are only
candidates for being considered as static.
Subsequently, the relative pose between frame n and
n + 1 is determined from all keypoints which are considered
to be static in frame n in the standard manner used in
propagation-based tracking. With the resulting relative pose
n → n + 1, individual keypoints can then be checked
for conformance with the epipolar
geometry. By accumulating the checks from all keypoints
on a patch, this leads to the classification of that patch into
IMO, static, or undetermined. This IMO detection
procedure is described in detail in section IV-A and section
IV-B.
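The per-keypoint epipolar check and its accumulation over a patch can be sketched as follows; the Sampson distance, the voting thresholds, and the fundamental-matrix interface are assumptions of this sketch, not the exact tests of sections IV-A and IV-B:

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """First-order geometric (Sampson) distance of a correspondence
    x1 <-> x2 (homogeneous 3-vectors) to the epipolar constraint
    x2^T F x1 = 0."""
    Fx1 = F @ x1
    Ftx2 = F.T @ x2
    num = (x2 @ F @ x1) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den

def classify_patch(F, matches, thresh=1.0, min_ratio=0.8, min_pts=5):
    """Accumulate per-keypoint epipolar checks over one patch.
    All thresholds are illustrative, not those of the paper."""
    if len(matches) < min_pts:
        return "undetermined"
    inliers = sum(sampson_distance(F, x1, x2) < thresh
                  for x1, x2 in matches)
    ratio = inliers / len(matches)
    if ratio >= min_ratio:
        return "static"          # keypoints conform to epipolar geometry
    if ratio <= 1 - min_ratio:
        return "IMO"             # keypoints consistently violate it
    return "undetermined"
```

Patches with too few tracked keypoints, or with an ambiguous inlier ratio, remain undetermined rather than being forced into either class.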
For each of the new IMO candidate patches in frame
n + 1, an association with the existing patches in frame n
must be performed in order to propagate the motion state
(static/IMO) over time. The association can be made
appearance-based, that is: by comparing size and texture of
the patches, or tracking-based, that is: by checking matching
residuals between keypoints inside of the patches. The inter-
frame car patch association is discussed in section V.
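A minimal sketch of the appearance-based variant, using only bounding-box overlap (the actual association also compares texture and keypoint matching residuals; the IoU threshold is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def associate_patches(patches_n, patches_n1, min_iou=0.3):
    """Greedily link each patch in frame n+1 to an unused patch in
    frame n with the highest overlap, so the motion state
    (static/IMO) can be propagated along the link."""
    links, used = {}, set()
    for j, b1 in enumerate(patches_n1):
        best, best_iou = None, min_iou
        for i, b0 in enumerate(patches_n):
            if i in used:
                continue
            o = iou(b0, b1)
            if o > best_iou:
                best, best_iou = i, o
        if best is not None:
            links[j] = best
            used.add(best)
    return links
```

Unlinked patches in frame n + 1 start without an inherited motion state and must be classified anew.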
A. PbT framework
As mentioned above, we build our scheme on the monocular visual
odometry framework using propagation-based tracking (PbT)
proposed in [5], [6]. The principles of keypoint generation and
tracking are used in a similar way here, so we give some
details in the following. The egomotion of the ego-car is
estimated using keypoints which have been confirmed to
be static. These keypoints are the combination of keypoints
outside CNN-based car masks and keypoints from car masks
that have been classified as static cars. In addition, PbT with
its epipolar constraint is able to propagate the static label of
a car mask over subsequent frames as long as the keypoints
inside that car mask are successfully tracked.
We find keypoints inside each IMO candidate patch by
choosing both corner and edgel points, as proposed in [18].
This approach is an extension of the good-feature-to-track
(GFTT) detector initially proposed by Shi and Tomasi [20].
As the matching and tracking processes used in the present
paper are guided by the epipolar geometry, patches which
have a local structure with only one dominant orientation
(e.g. lines and straight edges) can be matched as long as the
dominant orientation is sufficiently well inclined relative to
the epipolar line under consideration.
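This inclination condition can be sketched as follows; the 20° threshold and the gradient-based orientation representation are illustrative assumptions:

```python
import numpy as np

def epipolar_inclination(grad_dir, line):
    """Angle (radians) between a keypoint's edge direction and the
    epipolar line l = (a, b, c); an edgel can be localized along
    the line only if its edge direction is sufficiently inclined."""
    # Direction vector of the line ax + by + c = 0 is (b, -a).
    line_dir = np.array([line[1], -line[0]])
    line_dir = line_dir / np.linalg.norm(line_dir)
    # The edge direction is orthogonal to the image gradient.
    edge_dir = np.array([-grad_dir[1], grad_dir[0]])
    edge_dir = edge_dir / np.linalg.norm(edge_dir)
    cosang = abs(edge_dir @ line_dir)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def is_matchable(grad_dir, line, min_angle_deg=20.0):
    """Accept an edgel only if its edge direction is inclined at
    least min_angle_deg to the epipolar line (threshold assumed)."""
    return epipolar_inclination(grad_dir, line) >= np.deg2rad(min_angle_deg)
```

An edge running parallel to the epipolar line leaves the match position unconstrained along the line and is therefore rejected.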
Each keypoint is represented by a 15 × 15 patch centered
on the keypoint, and we use a 2D Gaussian filter of
the same size as the patch as a masking weight W for
each patch. In order to track the keypoint across subsequent
frames, we employ an iterative matching which minimizes
the photometric error between the patch correspondences.
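A minimal sketch of the masked photometric error; the Gaussian σ is an assumption, as the text only specifies that the mask has the same size as the patch:

```python
import numpy as np

def gaussian_mask(size=15, sigma=3.0):
    """2D Gaussian weight W of the same size as the keypoint patch
    (sigma is an assumed value)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def weighted_photometric_error(patch_a, patch_b, W):
    """Masked sum of squared differences between two corresponding
    patches; the quantity an LK-style iteration would minimize."""
    diff = patch_a.astype(float) - patch_b.astype(float)
    return float(np.sum(W * diff ** 2))
```

The Gaussian weighting down-weights pixels far from the keypoint center, so the match is dominated by the local structure at the keypoint itself.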
When a new keypoint is matched for the first time, we
initialize the matching with motion prior information
as proposed in [4], followed by a Lucas-Kanade-like
optimization to find the final match. The matching results of
a keypoint on consecutive frames form a keypoint trajectory.
A keypoint is finally accepted and used for pose estimation