Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen¹    Li-Jia Li²    Li Fei-Fei²    Abhinav Gupta¹
¹Carnegie Mellon University    ²Google
Abstract
We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems, which lack the capability to reason beyond stacks of convolutions. The framework consists of two core modules: a local module that uses spatial memory [4] to store previous beliefs with parallel updates; and a global graph-reasoning module. Our graph module has three components: a) a knowledge graph, where we represent classes as nodes and build edges to encode different types of semantic relationships between them; b) a region graph of the current image, where regions in the image are nodes and spatial relationships between these regions are edges; c) an assignment graph that assigns regions to classes. Both the local module and the global module roll out iteratively and cross-feed predictions to each other to refine estimates. The final predictions are made by combining the best of both modules with an attention mechanism. We show strong performance over plain ConvNets, e.g. achieving an 8.4% absolute improvement on ADE [55] measured by per-class average precision. Analysis also shows that the framework is resilient to missing regions for reasoning.
1. Introduction
In recent years, we have made significant advances in standard recognition tasks such as image classification [16], detection [37] or segmentation [3]. Most of these gains are a result of using feed-forward, end-to-end learned ConvNet models. Unlike humans, for whom visual reasoning about space and semantics is crucial [1], our current visual systems lack any context reasoning beyond convolutions with large receptive fields. A critical question, therefore, is how we can incorporate both spatial and semantic reasoning as we build next-generation vision systems.
Our goal is to build a system that can not only extract and utilize a hierarchy of convolutional features, but also improve its estimates via spatial and semantic relationships. But what are spatial and semantic relationships, and how can they be used to improve recognition? Take a look at Fig. 1. An example of spatial reasoning (top-left) would be: if three regions out of four in a line are “window”, then the fourth is also likely to be “window”. An example of semantic reasoning (bottom-right) would be to recognize “school bus” even if we have seen few or no examples of it – just given examples of “bus” and knowing their connections. Finally, an example of spatial-semantic reasoning could be: recognition of a “car” on the road should help in recognizing the “person” inside “driving” the “car”.

Figure 1. Current recognition systems lack the reasoning power beyond convolutions with large receptive fields, whereas humans can explore the rich space of spatial and semantic relationships for reasoning: e.g. inferring the fourth “window” even with occlusion, or the “person” who drives the “car”. To close this gap, we present a generic framework that also uses relationships to iteratively reason and build up estimates.
A key recipe for reasoning with relationships is to iteratively build up estimates. Recently, there have been efforts to incorporate such reasoning via top-down modules [38, 48] or using explicit memories [51, 32]. In the case of top-down modules, high-level features, which carry class-based information, can be used in conjunction with low-level features to improve recognition performance.
An alternative architecture is to use explicit memory. For example, Chen & Gupta [4] perform sequential object detection, where a spatial memory is used to store previously detected objects, leveraging the power of ConvNets to extract dense context patterns that benefit follow-up detections.
However, there are two problems with these approaches: a) both use stacks of convolutions to perform local pixel-level reasoning [11], which lacks the global reasoning power that would allow regions far apart to directly communicate information; b) more importantly, both assume there are enough examples of relationships in the training data for the model to learn them from scratch, but since relationships grow exponentially with an increasing number of classes, there is not always enough data. A lot of semantic reasoning requires learning from few or no examples [14]. Therefore, we need ways to exploit additional structured information for visual reasoning.
In this paper, we put forward a generic framework for both spatial and semantic reasoning. Different from current approaches that rely solely on convolutions, our framework can also learn from structured information in the form of knowledge bases [5, 56] for visual recognition. The core of our algorithm consists of two modules. The local module, based on spatial memory [4], performs pixel-level reasoning using ConvNets; we make major improvements in efficiency through parallel memory updates. Additionally, we introduce a global module for reasoning beyond local regions. In the global module, reasoning is based on a graph structure with three components: a) a knowledge graph, where we represent classes as nodes and build edges to encode different types of semantic relationships; b) a region graph of the current image, where regions in the image are nodes and spatial relationships between these regions are edges; c) an assignment graph that assigns regions to classes. Taking advantage of such a structure, we develop a reasoning module specifically designed to pass information on this graph. Both the local module and the global module roll out iteratively and cross-feed predictions to each other in order to refine estimates. Note that local and global reasoning are not isolated: a good image understanding is usually a compromise between background knowledge learned a priori and image-specific observations. Therefore, our full pipeline joins forces of the two modules through an attention [3] mechanism, allowing the model to rely on the most relevant features when making the final predictions.
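
To make the graph structure above concrete, here is a minimal sketch, in PyTorch, of one round of message passing over the three graphs. This is not the authors' code: the function graph_reasoning_step, the fixed adjacency matrices A_spatial, A_knowledge, A_assign, and the single projections W_r, W_c are illustrative assumptions; the paper's module learns edge-type-specific propagation and runs for several iterations.

    import torch
    import torch.nn.functional as F

    def graph_reasoning_step(region_feats, class_embed,
                             A_spatial, A_knowledge, A_assign,
                             W_r, W_c):
        # region_feats: (R, D) features of image regions (region-graph nodes)
        # class_embed:  (C, D) features of classes (knowledge-graph nodes)
        # A_spatial:    (R, R) region-graph adjacency (spatial relationships)
        # A_knowledge:  (C, C) knowledge-graph adjacency (semantic relationships)
        # A_assign:     (R, C) soft assignment of regions to classes
        # W_r, W_c:     (D, D) learned projections for the two message types
        # Regions exchange information along spatial edges.
        region_msg = A_spatial @ region_feats @ W_r
        # Classes propagate along semantic edges, then their knowledge flows
        # back to regions through the assignment graph.
        class_msg = A_assign @ (A_knowledge @ class_embed @ W_c)
        # A residual update keeps the original visual evidence in the loop.
        return F.relu(region_feats + region_msg + class_msg)

Even in this simplified form, the assignment graph is what lets semantic knowledge about classes (e.g. that a “school bus” is a kind of “bus”) reach regions that spatial convolutions alone would never connect.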
We show strong performance over plain ConvNets using our framework. For example, we achieve an 8.4% absolute improvement on ADE [55] measured by per-class average precision, whereas simply making the network deeper helps by only about 1%.
2. Related Work
Visual Knowledge Base. While the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale – the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc. – deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels via crowd-sourcing, attempts have also been made to accumulate structured knowledge (e.g. relationships [5], n-grams [10]) automatically from the web. However, these works fixate on building knowledge bases rather than using knowledge for reasoning. Our framework, while more general, is along the line of research that applies visual knowledge bases to end tasks, such as affordances [56], image classification [32], or question answering [49].
Context Modeling. Modeling context, or the interplay between scenes, objects and parts, is one of the central problems in computer vision. While various previous works (e.g. scene-level reasoning [46], attributes [13, 36], structured prediction [24, 9, 47], relationship graphs [21, 31, 52]) have approached this problem from different angles, the breakthrough came from the idea of feature learning with ConvNets [16]. On the surface, such models hardly use any explicit context module for reasoning, but it is generally accepted that ConvNets are extremely effective at aggregating local pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments, such as top-down modules [50, 29, 43], pairwise modules [40], iterative feedback [48, 34, 2], attention [53], and memory [51, 4], are motivated to leverage this power and depend on variants of convolutions for reasoning. Our work takes an important next step beyond those approaches in that it also incorporates learning from structured visual knowledge bases directly, to reason with spatial and semantic relationships.
Relational Reasoning. The earliest forms of reasoning in artificial intelligence date back to symbolic approaches [33], where relations between abstract symbols are defined by the language of mathematics and logic, and reasoning takes place by deduction, abduction [18], etc. However, symbols need to be grounded [15] before such systems are practically useful. Modern approaches, such as the path ranking algorithm [26], rely on statistical learning to extract useful patterns and perform relational reasoning on structured knowledge bases. In this active research area, recent works have also applied neural networks to graph-structured data [42, 17, 27, 23, 35, 7, 32], or attempted to regularize the outputs of networks with relationships [8] and knowledge bases [20]. However, we believe that for visual data, reasoning should be both local and global: discarding the two-dimensional image structure is neither efficient nor effective for tasks that involve regions.
3. Reasoning Framework
In this section we build up our reasoning framework. Besides the plain predictions p_0 from a ConvNet, it consists of two core modules that reason to predict. The first, a local module, uses a spatial memory to store previous beliefs with parallel updates, and still falls within the regime of convolution-based reasoning (Sec. 3.1). Beyond convolutions, we present our key contribution – a global module that reasons directly between regions and classes represented as nodes
in a graph (Sec. 3.2). Both modules build up estimates iteratively, cross-feeding predictions to each other to refine them.
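
As a rough illustration of this roll-out, the sketch below (again in PyTorch) shows how the two modules could cross-feed predictions and be fused with attention. The callables local_step, global_step, and attend are hypothetical placeholders for the modules defined in the rest of this section; the actual update rules differ.

    import torch

    def iterative_reasoning(p0, features, local_step, global_step,
                            attend, num_iters=3):
        # p0: (R, C) plain per-region class predictions from the ConvNet
        p_local, p_global = p0, p0
        for _ in range(num_iters):
            # Each module refines its beliefs conditioned on the other
            # module's predictions from the previous iteration.
            p_local = local_step(features, p_local, p_global)
            p_global = global_step(features, p_global, p_local)
        # A learned attention gate decides, per prediction, how much to
        # trust local versus global reasoning.
        w = torch.sigmoid(attend(p_local, p_global))
        return w * p_local + (1.0 - w) * p_global

The gate realizes the compromise noted in the introduction: per prediction, the model weighs image-specific local evidence against knowledge-driven global estimates.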