
into account. Second, appearance here is learnt simultane-
ously with shape, whereas in their work the appearance of a
part is fixed before shape learning. Third, they use correlation
to detect their parts. We substitute their front end with
an interest operator, which detects regions and their scale in
the manner of [8, 10]. Fourth, Weber et al. did not experiment
extensively with scale-invariant learning; most of
their training sets are collected in such a way that the scale is
approximately normalized. We extend their learning algo-
rithm so that new object categories may be learnt efficiently,
without supervision, from training sets where the object ex-
amples have large variability in scale. A final contribution
is experimenting with a number of new image datasets to
validate the overall approach over several object categories.
Example images from these datasets are shown in figure 1.
2. Approach
Our approach to modeling object classes follows on from
the work of Weber et al. [17, 18, 19]. An object model
consists of a number of parts. Each part has an appear-
ance, relative scale and can be occluded or not. Shape
is represented by the mutual position of the parts. The en-
tire model is generative and probabilistic, so appearance,
scale, shape and occlusion are all modeled by probabil-
ity density functions, which here are Gaussians. The pro-
cess of learning an object category is one of first detecting
regions and their scales, and then estimating the parame-
ters of the above densities from these regions, such that the
model gives a maximum-likelihood description of the train-
ing data. Recognition is performed on a query image by
again first detecting regions and their scales, and then eval-
uating the regions in a Bayesian manner, using the model
parameters estimated in the learning.
The model, region detector, and representation of ap-
pearance are described in detail in the following subsec-
tions.
2.1. Model structure
The model is best explained by first considering recogni-
tion. We have learnt a generative object class model, with
P parts and parameters θ. We are then presented with a
new image and we must decide if it contains an instance of
our object class or not. In this query image we have identi-
fied N interesting features with locations X, scales S, and
appearances A. We now make a Bayesian decision, R:
\[
R \;=\; \frac{p(\text{Object} \mid X, S, A)}{p(\text{No object} \mid X, S, A)}
\;=\; \frac{p(X, S, A \mid \text{Object})\, p(\text{Object})}{p(X, S, A \mid \text{No object})\, p(\text{No object})}
\;\approx\; \frac{p(X, S, A \mid \theta)\, p(\text{Object})}{p(X, S, A \mid \theta_{\text{bg}})\, p(\text{No object})}
\]
The last line is an approximation since we will only use a
single value for θ (the maximum-likelihood value) rather
than integrating over p(θ) as we strictly should. Likewise,
we assume that all non-object images can also be modeled
by a background with a single set of parameters $\theta_{\text{bg}}$. The
ratio of the priors may be estimated from the training set or
set by hand (usually to 1). Our decision requires the calcu-
lation of the ratio of the two likelihood functions. In order
to do this, the likelihoods may be factored as follows:
\[
p(X, S, A \mid \theta) \;=\; \sum_{h \in H} p(X, S, A, h \mid \theta)
\;=\; \sum_{h \in H}
\underbrace{p(A \mid X, S, h, \theta)}_{\text{Appearance}} \,
\underbrace{p(X \mid S, h, \theta)}_{\text{Shape}} \,
\underbrace{p(S \mid h, \theta)}_{\text{Rel. Scale}} \,
\underbrace{p(h \mid \theta)}_{\text{Other}}
\]
Since our model only has P (typically 3-7) parts but there
are N (up to 30) features in the image, we introduce an
indexing variable h which we call a hypothesis. h is a vector
of length P whose p-th entry, an integer between 0 and N,
allocates a particular feature to model part p. The unallocated
features are assumed to be part of the background, with 0
indicating the part is unavailable (e.g. because of occlusion).
The set H is all valid allocations of features to the parts;
consequently $|H|$ is $O(N^P)$.
In the following we sketch the form for likelihood ratios
of each of the above factored terms. Space prevents
a full derivation being included, but the full expressions
follow from the methods of [17]. It will be helpful to define
the following notation: $d = \mathrm{sign}(h)$ (a binary
vector giving the state of occlusion for each part),
$n = N - \mathrm{sum}(d)$ (the number of background features
under the current hypothesis), and $f = \mathrm{sum}(d)$ (the
number of foreground features).
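In code, this bookkeeping is one line per quantity; a small NumPy sketch with variable names of our own choosing:

```python
import numpy as np

N = 10                      # features detected in the image
h = np.array([4, 0, 7])     # example hypothesis: part 2 occluded

d = (h > 0).astype(int)     # d = sign(h), occlusion state of each part
f = int(d.sum())            # f = sum(d), number of foreground features
n = N - f                   # n = N - sum(d), number of background features
print(d, f, n)              # [1 0 1] 2 8
```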
Appearance. Each feature's appearance is represented
as a point in some appearance space, defined below. Each
part p has a Gaussian density within this space, with mean
and covariance parameters $\theta_p^{\text{app}} = \{c_p, V_p\}$,
which is independent of other parts' densities. The background
model has parameters $\theta_{\text{bg}}^{\text{app}} =
\{c_{\text{bg}}, V_{\text{bg}}\}$. Both $V_p$ and $V_{\text{bg}}$
are assumed to be diagonal. Each feature selected by the
hypothesis is evaluated under the appropriate part density. All
features not selected by the hypothesis are evaluated under
the background density. The ratio reduces to:
\[
\frac{p(A \mid X, S, h, \theta)}{p(A \mid X, S, h, \theta_{\text{bg}})}
\;=\; \prod_{p=1}^{P}
\left( \frac{G(A(h_p) \mid c_p, V_p)}{G(A(h_p) \mid c_{\text{bg}}, V_{\text{bg}})} \right)^{d_p}
\]
where G is the Gaussian distribution, and $d_p$ is the p-th entry
of the vector d, i.e. $d_p = d(p)$. So the appearance of each
feature in the hypothesis is evaluated under foreground and
background densities and the ratio taken. If the part is
occluded, the ratio is 1 ($d_p = 0$).