Among such approaches are [27], who use features combined
with a classification process for vehicle description, and [28],
who use an attention-inspired model to suppress less impor-
tant image regions and obtain the moving foreground region.
On the other hand, motion-based approaches ([19], [26],
[10]) aim to work with optical flow and other geometric
constraints to determine varying motion patterns from IMOs.
We aim to provide an approach that combines appearance
cues of a car, obtained by a CNN-based detection, with
motion cues from optical flow based on the epipolar
geometry, in order to determine the presence of an
independently moving object and track it. In this respect our
approach shares some similarities with [16] who use two
separate CNNs to determine visual odometry and object lo-
calisation and fuse their results to obtain object localisations
and also with [25] who use CNNs to obtain a rigidity score
for each object and combine this with motion cues from
optical flow. In contrast to their approach, we additionally
use a series of hypothesis tests on keypoints lying on patches
that have previously been identified as belonging to a vehicle,
in a manner similar to [23], to track the regarded vehicle
through consecutive frames. Bai et al. [1] identify IMOs by
estimating dense optical flow for each IMO candidate.
In contrast, our approach utilizes sparse keypoints to
distinguish static objects from IMOs.
III. FRAMEWORK OVERVIEW
The overall flow of the proposed method is presented
in figure 1 (top image). It builds on a monocular visual
odometry framework, the propagation-based tracking (PbT)
scheme [7]. An important principle of PbT is that the relative
pose for a new frame n + 1 is predicted using the car
ego-dynamics, and that a refined relative pose is computed
only on the basis of keypoints that have been tracked at least
twice, that is: keypoints which have already passed a stringent
test of belonging to the static environment. All keypoints,
including the new ones generated in sparsely covered areas
of a new frame, are tracked in an epipolar-guided manner, as
discussed in more detail in section III-A.
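The propagation step can be sketched as follows; the constant-velocity ego-dynamics model and the 4 × 4 homogeneous pose representation are assumptions of this sketch, not necessarily the exact model used in PbT:

```python
import numpy as np

def predict_relative_pose(T_rel_prev):
    """Constant-velocity prediction: initialize the relative pose
    n -> n+1 with the last inter-frame motion (n-1 -> n).
    Illustrative only; the actual ego-dynamics model may differ."""
    return T_rel_prev.copy()

def propagate(T_world_n, T_rel):
    """Chain 4x4 homogeneous poses to obtain the pose of frame n+1."""
    return T_world_n @ T_rel
```

The prediction only seeds the epipolar-guided matching; the refined relative pose is then re-estimated from the twice-tracked static keypoints.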
In the present scheme, we detect IMO candidate patches
for each new frame using the instance segmentation scheme
of van den Brand et al. [22]. This results in M new IMO
candidate patches for each new frame n + 1. The generation
of the CNN-based IMO candidate patches is discussed in
section III-B.
All IMO candidate patches in image n are classified into
one of three states: static, IMO, or undetermined.
Keypoints that are located in IMO candidate patches are con-
sidered for pose estimation only if they have been classified
as static.
When a new frame n + 1 comes in, it is processed by the
CNN in order to determine the IMO candidate patches. At
the same moment, we have a set of old tracked keypoints
in previous frame n. Some of those are already confirmed
as static as they have been tracked at least twice and
have shown 3D consistency (see section IV-A). Others have
been tracked just once (from frame n − 1 to n) and are only
candidates for being considered as static.
Subsequently, the relative pose between frame n and
n + 1 is determined from all keypoints which are considered
to be static in frame n in the standard manner used in
propagation-based tracking. With the resulting relative pose
n → n + 1, individual keypoints can then be checked
for conformance with the epipolar
geometry. By accumulating the checks from all keypoints
on a patch, this leads to the classification of that patch into
IMO, static, or undetermined. This IMO detection
procedure is described in detail in section IV-A and section
IV-B.
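The per-keypoint epipolar check and its accumulation over a patch can be sketched as follows; the Sampson distance, the voting thresholds, and the fundamental-matrix interface are assumptions of this sketch, not the exact tests of sections IV-A and IV-B:

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """First-order geometric (Sampson) distance of a correspondence
    x1 <-> x2 (homogeneous 3-vectors) to the epipolar constraint
    x2^T F x1 = 0."""
    Fx1 = F @ x1
    Ftx2 = F.T @ x2
    num = (x2 @ F @ x1) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den

def classify_patch(F, matches, thresh=1.0, min_ratio=0.8, min_pts=5):
    """Accumulate per-keypoint epipolar checks over one patch.
    All thresholds are illustrative, not those of the paper."""
    if len(matches) < min_pts:
        return "undetermined"
    inliers = sum(sampson_distance(F, x1, x2) < thresh
                  for x1, x2 in matches)
    ratio = inliers / len(matches)
    if ratio >= min_ratio:
        return "static"          # keypoints conform to epipolar geometry
    if ratio <= 1 - min_ratio:
        return "IMO"             # keypoints consistently violate it
    return "undetermined"
```

Patches with too few tracked keypoints, or with an ambiguous inlier ratio, remain undetermined rather than being forced into either class.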
For each of the new IMO candidate patches in frame
n + 1, an association with the existing patches in frame n
must be performed in order to propagate the motion state
(static/IMO) over time. The association can be made
appearance-based, that is: by comparing size and texture of
the patches, or tracking-based, that is: by checking matching
residuals between keypoints inside of the patches. The inter-
frame car patch association is discussed in section V.
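A minimal sketch of the appearance-based variant, using only bounding-box overlap (the actual association also compares texture and keypoint matching residuals; the IoU threshold is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def associate_patches(patches_n, patches_n1, min_iou=0.3):
    """Greedily link each patch in frame n+1 to an unused patch in
    frame n with the highest overlap, so the motion state
    (static/IMO) can be propagated along the link."""
    links, used = {}, set()
    for j, b1 in enumerate(patches_n1):
        best, best_iou = None, min_iou
        for i, b0 in enumerate(patches_n):
            if i in used:
                continue
            o = iou(b0, b1)
            if o > best_iou:
                best, best_iou = i, o
        if best is not None:
            links[j] = best
            used.add(best)
    return links
```

Unlinked patches in frame n + 1 start without an inherited motion state and must be classified anew.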
A. PbT framework
As mentioned above, we build our scheme on the monocular visual
odometry framework using propagation-based tracking (PbT)
proposed in [5], [6]. The principles of keypoint generation and
tracking are used in a similar way here, so we give some
details in the following. The egomotion of the ego-car is
estimated using keypoints which have been confirmed to
be static. These keypoints are the combination of keypoints
outside CNN-based car masks and keypoints from car masks
that have been classified as static cars. In addition, PbT with
its epipolar constraint is able to propagate the static label of
a car mask over subsequent frames as long as the keypoints
inside that car mask are successfully tracked.
We find keypoints inside each IMO candidate patch by
choosing both corner and edgel points, as proposed in [18].
This approach is an extension of the good-feature-to-track
(GFTT) detector initially proposed by Shi and Tomasi [20].
As the matching and tracking processes used in the present
paper are guided by the epipolar geometry, patches which
have a local structure with only one dominant orientation
(e.g. lines and straight edges) can be matched as long as the
dominant orientation is sufficiently well inclined relative to
the epipolar line under consideration.
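This inclination condition can be sketched as follows; the 20° threshold and the gradient-based orientation representation are illustrative assumptions:

```python
import numpy as np

def epipolar_inclination(grad_dir, line):
    """Angle (radians) between a keypoint's edge direction and the
    epipolar line l = (a, b, c); an edgel can be localized along
    the line only if its edge direction is sufficiently inclined."""
    # Direction vector of the line ax + by + c = 0 is (b, -a).
    line_dir = np.array([line[1], -line[0]])
    line_dir = line_dir / np.linalg.norm(line_dir)
    # The edge direction is orthogonal to the image gradient.
    edge_dir = np.array([-grad_dir[1], grad_dir[0]])
    edge_dir = edge_dir / np.linalg.norm(edge_dir)
    cosang = abs(edge_dir @ line_dir)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def is_matchable(grad_dir, line, min_angle_deg=20.0):
    """Accept an edgel only if its edge direction is inclined at
    least min_angle_deg to the epipolar line (threshold assumed)."""
    return epipolar_inclination(grad_dir, line) >= np.deg2rad(min_angle_deg)
```

An edge running parallel to the epipolar line leaves the match position unconstrained along the line and is therefore rejected.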
Each keypoint is represented by a 15 × 15 patch centered
on the keypoint, and we use a 2D Gaussian filter of
the same size as the patch as a masking weight W for
each patch. In order to track the keypoint across subsequent
frames, we employ an iterative matching which minimizes
the photometric error between the patch correspondences.
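A minimal sketch of the masked photometric error; the Gaussian σ is an assumption, as the text only specifies that the mask has the same size as the patch:

```python
import numpy as np

def gaussian_mask(size=15, sigma=3.0):
    """2D Gaussian weight W of the same size as the keypoint patch
    (sigma is an assumed value)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def weighted_photometric_error(patch_a, patch_b, W):
    """Masked sum of squared differences between two corresponding
    patches; the quantity an LK-style iteration would minimize."""
    diff = patch_a.astype(float) - patch_b.astype(float)
    return float(np.sum(W * diff ** 2))
```

The Gaussian weighting down-weights pixels far from the keypoint center, so the match is dominated by the local structure at the keypoint itself.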
When a new keypoint is matched for the first time, we
initialize the matching with motion prior information
as proposed in [4], followed by a Lucas-Kanade-like
optimization to find the final match. The matching results of
a keypoint on consecutive frames form a keypoint trajectory.
A keypoint is finally accepted and used for pose estimation