
into account. Second, appearance here is learnt simultane-
ously with shape, whereas in their work the appearance of a
part is fixed before shape learning. Third, they use correlation
to detect their parts. We substitute their front end with
an interest operator, which detects regions and their scale in
the manner of [8, 10]. Fourth, Weber et al. did not experiment
extensively with scale-invariant learning; most of
their training sets are collected in such a way that the scale is
approximately normalized. We extend their learning algo-
rithm so that new object categories may be learnt efficiently,
without supervision, from training sets where the object ex-
amples have large variability in scale. A final contribution
is experimenting with a number of new image datasets to
validate the overall approach over several object categories.
Example images from these datasets are shown in figure 1.
2. Approach
Our approach to modeling object classes follows on from
the work of Weber et al. [17, 18, 19]. An object model
consists of a number of parts. Each part has an appear-
ance, relative scale and can be occluded or not. Shape
is represented by the mutual position of the parts. The en-
tire model is generative and probabilistic, so appearance,
scale, shape and occlusion are all modeled by probabil-
ity density functions, which here are Gaussians. The pro-
cess of learning an object category is one of first detecting
regions and their scales, and then estimating the parame-
ters of the above densities from these regions, such that the
model gives a maximum-likelihood description of the train-
ing data. Recognition is performed on a query image by
again first detecting regions and their scales, and then eval-
uating the regions in a Bayesian manner, using the model
parameters estimated in the learning.
The model, region detector, and representation of ap-
pearance are described in detail in the following subsec-
tions.
2.1. Model structure
The model is best explained by first considering recogni-
tion. We have learnt a generative object class model, with
P parts and parameters θ. We are then presented with a
new image and we must decide if it contains an instance of
our object class or not. In this query image we have identi-
fied N interesting features with locations X, scales S, and
appearances A. We now make a Bayesian decision, R:
\[
R \;=\; \frac{p(\text{Object} \mid X, S, A)}{p(\text{No object} \mid X, S, A)}
\;=\; \frac{p(X, S, A \mid \text{Object})\, p(\text{Object})}{p(X, S, A \mid \text{No object})\, p(\text{No object})}
\;\approx\; \frac{p(X, S, A \mid \theta)\, p(\text{Object})}{p(X, S, A \mid \theta_{\text{bg}})\, p(\text{No object})}
\]
The last line is an approximation since we will only use a
single value for θ (the maximum-likelihood value) rather
than integrating over p(θ) as we strictly should. Likewise,
we assume that all non-object images can also be modeled
by a background with a single set of parameters $\theta_{\text{bg}}$. The
ratio of the priors may be estimated from the training set or
set by hand (usually to 1). Our decision requires the calcu-
lation of the ratio of the two likelihood functions. In order
to do this, the likelihoods may be factored as follows:
\[
p(X, S, A \mid \theta) \;=\; \sum_{h \in H} p(X, S, A, h \mid \theta)
\;=\; \sum_{h \in H}
\underbrace{p(A \mid X, S, h, \theta)}_{\text{Appearance}} \,
\underbrace{p(X \mid S, h, \theta)}_{\text{Shape}} \,
\underbrace{p(S \mid h, \theta)}_{\text{Rel. Scale}} \,
\underbrace{p(h \mid \theta)}_{\text{Other}}
\]
Since our model only has P (typically 3-7) parts but there
are N (up to 30) features in the image, we introduce an
indexing variable h which we call a hypothesis. h is a vector
of length P whose p-th entry, an integer between 0 and N,
allocates a particular feature to model part p. The unallocated
features are assumed to be part of the background, with 0
indicating the part is unavailable (e.g. because of occlusion).
The set H is all valid allocations of features to the parts;
consequently $|H|$ is $O(N^P)$.
In the following we sketch the form for likelihood ratios
of each of the above factored terms. Space prevents
a full derivation being included, but the full expressions
follow from the methods of [17]. It will be helpful to define
the following notation: $d = \mathrm{sign}(h)$ (a binary
vector giving the state of occlusion for each part),
$n = N - \mathrm{sum}(d)$ (the number of background features
under the current hypothesis), and $f = \mathrm{sum}(d)$ (the
number of foreground features).
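In code, this bookkeeping is one line per quantity; a small NumPy sketch with variable names of our own choosing:

```python
import numpy as np

N = 10                      # features detected in the image
h = np.array([4, 0, 7])     # example hypothesis: part 2 occluded

d = (h > 0).astype(int)     # d = sign(h), occlusion state of each part
f = int(d.sum())            # f = sum(d), number of foreground features
n = N - f                   # n = N - sum(d), number of background features
print(d, f, n)              # [1 0 1] 2 8
```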
Appearance. Each feature's appearance is represented
as a point in some appearance space, defined below. Each
part p has a Gaussian density within this space, with mean
and covariance parameters $\theta_p^{\text{app}} = \{c_p, V_p\}$,
which is independent of other parts' densities. The background
model has parameters $\theta_{\text{bg}}^{\text{app}} =
\{c_{\text{bg}}, V_{\text{bg}}\}$. Both $V_p$ and $V_{\text{bg}}$
are assumed to be diagonal. Each feature selected by the
hypothesis is evaluated under the appropriate part density. All
features not selected by the hypothesis are evaluated under
the background density. The ratio reduces to:
\[
\frac{p(A \mid X, S, h, \theta)}{p(A \mid X, S, h, \theta_{\text{bg}})}
\;=\; \prod_{p=1}^{P}
\left( \frac{G(A(h_p) \mid c_p, V_p)}{G(A(h_p) \mid c_{\text{bg}}, V_{\text{bg}})} \right)^{d_p}
\]
where G is the Gaussian distribution, and $d_p$ is the p-th entry
of the vector d, i.e. $d_p = d(p)$. So the appearance of each
feature in the hypothesis is evaluated under foreground and
background densities and the ratio taken. If the part is
occluded, the ratio is 1 ($d_p = 0$).