Object Class Recognition by Unsupervised Scale-Invariant Learning

R. Fergus¹   P. Perona²   A. Zisserman¹

¹ Dept. of Engineering Science, University of Oxford,
  Parks Road, Oxford OX1 3PJ, U.K.
  {fergus,az}@robots.ox.ac.uk

² Dept. of Electrical Engineering, California Institute of Technology,
  MC 136-93, Pasadena, CA 91125, U.S.A.
  perona@vision.caltech.edu
Abstract
We present a method to learn and recognize object class
models from unlabeled and unsegmented cluttered scenes
in a scale invariant manner. Objects are modeled as flexi-
ble constellations of parts. A probabilistic representation is
used for all aspects of the object: shape, appearance, occlu-
sion and relative scale. An entropy-based feature detector
is used to select regions and their scale within the image. In
learning the parameters of the scale-invariant object model
are estimated. This is done using expectation-maximization
in a maximum-likelihood setting. In recognition, this model
is used in a Bayesian manner to classify images. The flex-
ible nature of the model is demonstrated by excellent re-
sults over a range of datasets including geometrically con-
strained classes (e.g. faces, cars) and flexible objects (such
as animals).
1. Introduction
Representation, detection and learning are the main issues
that need to be tackled in designing a visual system for rec-
ognizing object categories. The first challenge is coming
up with models that can capture the ‘essence’ of a cate-
gory, i.e. what is common to the objects that belong to it,
and yet are flexible enough to accommodate object vari-
ability (e.g. presence/absence of distinctive parts such as
mustache and glasses, variability in overall shape, changing
appearance due to lighting conditions, viewpoint etc). The
challenge of detection is defining metrics and inventing al-
gorithms that are suitable for matching models to images
efficiently in the presence of occlusion and clutter. Learn-
ing is the ultimate challenge. If we wish to be able to design
visual systems that can recognize, say, 10,000 object cate-
gories, then effortless learning is a crucial step. This means
that the training sets should be small and that the operator-
assisted steps that are required (e.g. elimination of clutter
in the background of the object, scale normalization of the
training sample) should be reduced to a minimum or elimi-
nated.
The problem of describing and recognizing categories,
as opposed to specific objects (e.g. [6, 9, 11]), has re-
cently gained some attention in the machine vision litera-
ture [1, 2, 3, 4, 13, 14, 19] with an emphasis on the de-
tection of faces [12, 15, 16]. There is broad agreement
on the issue of representation: object categories are represented as collections of features, or parts; each part has a distinctive appearance and spatial position. Different au-
thors vary widely on the details: the number of parts they
envision (from a few to thousands of parts), how these parts
are detected and represented, how their position is repre-
sented, whether the variability in part appearance and posi-
tion is represented explicitly or is implicit in the details of
the matching algorithm. The issue of learning is perhaps the
least well understood. Most authors rely on manual steps to
eliminate background clutter and normalize the pose of the
training examples. Recognition often proceeds by an ex-
haustive search over image position and scale.
We focus our attention on the probabilistic approach pro-
posed by Burl et al. [4] which models objects as random
constellations of parts. This approach presents several ad-
vantages: the model explicitly accounts for shape variations
and for the randomness in the presence/absence of features
due to occlusion and detector errors. It accounts explicitly
for image clutter. It yields principled and efficient detection
methods. Weber et al. [18, 19] proposed a maximum like-
lihood unsupervised learning algorithm for the constella-
tion model which successfully learns object categories from
cluttered data with minimal human intervention. We pro-
pose here a number of substantial improvements to the con-
stellation model and to its maximum likelihood learning al-
gorithm. First: while Burl et al. and Weber et al. model shape variability explicitly, they do not model the variabil-
ity of appearance. We extend their model to take this aspect
into account. Second, appearance here is learnt simultane-
ously with shape, whereas in their work the appearance of a
part is fixed before shape learning. Third: they use correla-
tion to detect their parts. We substitute their front end with
an interest operator, which detects regions and their scale in
the manner of [8, 10]. Fourthly, Weber et al. did not ex-
periment extensively with scale-invariant learning, most of
their training sets are collected in such a way that the scale is
approximately normalized. We extend their learning algo-
rithm so that new object categories may be learnt efficiently,
without supervision, from training sets where the object ex-
amples have large variability in scale. A final contribution
is experimenting with a number of new image datasets to
validate the overall approach over several object categories.
Example images from these datasets are shown in figure 1.
2. Approach
Our approach to modeling object classes follows on from
the work of Weber et al. [17, 18, 19]. An object model
consists of a number of parts. Each part has an appear-
ance, relative scale, and can be occluded or not. Shape
is represented by the mutual position of the parts. The en-
tire model is generative and probabilistic, so appearance,
scale, shape and occlusion are all modeled by probabil-
ity density functions, which here are Gaussians. The pro-
cess of learning an object category is one of first detecting
regions and their scales, and then estimating the parame-
ters of the above densities from these regions, such that the
model gives a maximum-likelihood description of the train-
ing data. Recognition is performed on a query image by
again first detecting regions and their scales, and then eval-
uating the regions in a Bayesian manner, using the model
parameters estimated in the learning.
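As a concrete illustration of this parameterization, the sketch below groups the densities just listed into a single structure. The field names, array shapes, and the appearance dimension k are our assumptions for exposition, not the paper's notation; the exact parameterization follows [17].

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ConstellationModel:
    """Hypothetical container for the Gaussian densities of a P-part model."""
    shape_mean: np.ndarray   # mean joint part locations, shape (2P,)
    shape_cov: np.ndarray    # full covariance over joint locations, (2P, 2P)
    app_mean: np.ndarray     # per-part appearance means, (P, k)
    app_var: np.ndarray      # per-part diagonal appearance variances, (P, k)
    scale_mean: np.ndarray   # per-part mean relative scale, (P,)
    scale_var: np.ndarray    # per-part relative-scale variance, (P,)
    occl_prob: np.ndarray    # probability table over part-occlusion patterns
```

Under this reading, learning amounts to estimating these fields by expectation-maximization, and recognition evaluates detected regions against them.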
The model, region detector, and representation of ap-
pearance are described in detail in the following subsec-
tions.
2.1. Model structure
The model is best explained by first considering recogni-
tion. We have learnt a generative object class model, with
P parts and parameters θ. We are then presented with a
new image and we must decide if it contains an instance of
our object class or not. In this query image we have identi-
fied N interesting features with locations X, scales S, and
appearances A. We now make a Bayesian decision, R:
$$
R = \frac{p(\mathrm{Object} \mid X, S, A)}{p(\mathrm{No\ object} \mid X, S, A)}
  = \frac{p(X, S, A \mid \mathrm{Object})\, p(\mathrm{Object})}{p(X, S, A \mid \mathrm{No\ object})\, p(\mathrm{No\ object})}
  \approx \frac{p(X, S, A \mid \theta)\, p(\mathrm{Object})}{p(X, S, A \mid \theta_{bg})\, p(\mathrm{No\ object})}
$$
The last line is an approximation since we will only use a
single value for θ (the maximum-likelihood value) rather
than integrating over p(θ) as we strictly should. Likewise,
we assume that all non-object images can also be modeled
by a background with a single set of parameters $\theta_{bg}$. The
ratio of the priors may be estimated from the training set or
set by hand (usually to 1). Our decision requires the calcu-
lation of the ratio of the two likelihood functions. In order
to do this, the likelihoods may be factored as follows:
$$
p(X, S, A \mid \theta) = \sum_{h \in H} p(X, S, A, h \mid \theta)
= \sum_{h \in H} \underbrace{p(A \mid X, S, h)}_{\text{Appearance}}\,
\underbrace{p(X \mid S, h)}_{\text{Shape}}\,
\underbrace{p(S \mid h)}_{\text{Rel. Scale}}\,
\underbrace{p(h \mid \theta)}_{\text{Other}}
$$
Since our model only has P (typically 3-7) parts but there
are N (up to 30) features in the image, we introduce an in-
dexing variable h which we call a hypothesis. h is a vector
of length P, where each entry is between 0 and N, which al-
locates a particular feature to a model part. The unallocated
features are assumed to be part of the background, with 0 in-
dicating the part is unavailable (e.g. because of occlusion).
The set H is all valid allocations of features to the parts;
consequently $|H|$ is $O(N^P)$.
In the following we sketch the form for likelihood ra-
tios of each of the above factored terms. Space prevents
a full derivation being included, but the full expressions
follow from the methods of [17]. It will be helpful to de-
fine the following notation: $d = \mathrm{sign}(h)$ (a binary vector giving the state of occlusion for each part), $n = N - \mathrm{sum}(d)$ (the number of background features under the current hypothesis), and $f = \mathrm{sum}(d)$, the number of foreground features.
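To make the hypothesis machinery concrete, here is a small enumeration sketch (illustrative code, not the authors'); it additionally assumes that no feature may be allocated to two parts, which the paper's "valid allocations" implies:

```python
from itertools import product
import numpy as np

def hypotheses(P, N):
    """Yield each valid h: a length-P vector with entries in {0, ..., N},
    where 0 marks an occluded part and no feature serves two parts
    (the distinctness constraint is our assumption)."""
    for h in product(range(N + 1), repeat=P):
        assigned = [i for i in h if i > 0]
        if len(assigned) == len(set(assigned)):  # features must be distinct
            yield np.array(h)

P, N = 3, 5
H = list(hypotheses(P, N))
print(len(H))               # 136 hypotheses here; |H| grows as O(N^P)

h = H[1]                    # e.g. h = [0, 0, 1]: first two parts occluded
d = (h > 0).astype(int)     # d = sign(h): per-part occlusion state
f = int(d.sum())            # f = sum(d): number of foreground features
n = N - f                   # n = N - sum(d): number of background features
```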
Appearance. Each feature’s appearance is represented
as a point in some appearance space, defined below. Each
part $p$ has a Gaussian density within this space, with mean and covariance parameters $\theta_p^{\mathrm{app}} = \{c_p, V_p\}$, which is independent of other parts' densities. The background model has parameters $\theta_{bg}^{\mathrm{app}} = \{c_{bg}, V_{bg}\}$. Both $V_p$ and $V_{bg}$ are
assumed to be diagonal. Each feature selected by the hy-
pothesis is evaluated under the appropriate part density. All
features not selected by the hypothesis are evaluated under
the background density. The ratio reduces to:
$$
\frac{p(A \mid X, S, h)}{p(A \mid X, S, h_{bg})}
= \prod_{p=1}^{P} \left( \frac{G(A(h_p) \mid c_p, V_p)}{G(A(h_p) \mid c_{bg}, V_{bg})} \right)^{d_p}
$$
where $G$ is the Gaussian distribution, and $d_p$ is the $p$-th entry of the vector $d$, i.e. $d_p = d(p)$. So the appearance of each feature in the hypothesis is evaluated under foreground and background densities and the ratio taken. If the part is occluded, the ratio is 1 ($d_p = 0$).
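As an illustration, this ratio can be evaluated in log space as sketched below, assuming the diagonal Gaussians stated above; the helper names and array layout are ours, not the authors':

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (var is the diagonal)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_appearance_ratio(A, h, c, V, c_bg, V_bg):
    """log of prod_p [G(A(h_p)|c_p, V_p) / G(A(h_p)|c_bg, V_bg)]^{d_p}.

    A          -- (N, k) appearance vectors of the detected features
    h          -- (P,) hypothesis; h[p] == 0 means part p is occluded
    c, V       -- (P, k) per-part means and diagonal variances
    c_bg, V_bg -- (k,) background mean and diagonal variance
    """
    total = 0.0
    for p, idx in enumerate(h):
        if idx == 0:        # occluded part: d_p = 0, so its ratio term is 1
            continue
        a = A[idx - 1]      # feature allocated to part p (features 1-indexed in h)
        total += log_gauss_diag(a, c[p], V[p]) - log_gauss_diag(a, c_bg, V_bg)
    return total
```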