Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen¹    Li-Jia Li²    Li Fei-Fei²    Abhinav Gupta¹
¹Carnegie Mellon University    ²Google
Abstract
We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems, which lack the capability to reason beyond stacks of convolutions. The framework consists of two core modules: a local module that uses spatial memory [4] to store previous beliefs with parallel updates; and a global graph-reasoning module. Our graph module has three components: a) a knowledge graph, where we represent classes as nodes and build edges to encode different types of semantic relationships between them; b) a region graph of the current image, where regions in the image are nodes and spatial relationships between these regions are edges; c) an assignment graph that assigns regions to classes. Both the local module and the global module roll out iteratively and cross-feed predictions to each other to refine estimates. The final predictions are made by combining the best of both modules with an attention mechanism. We show strong performance over plain ConvNets, e.g. achieving an 8.4% absolute improvement on ADE [55] measured by per-class average precision. Analysis also shows that the framework is resilient to missing regions for reasoning.
1. Introduction
In recent years, we have made significant advances in standard recognition tasks such as image classification [16], detection [37] or segmentation [3]. Most of these gains are a result of using feed-forward, end-to-end learned ConvNet models. Unlike humans, for whom visual reasoning about space and semantics is crucial [1], our current visual systems lack any context reasoning beyond convolutions with large receptive fields. A critical question, therefore, is how we can incorporate both spatial and semantic reasoning as we build next-generation vision systems.
Our goal is to build a system that can not only extract and utilize a hierarchy of convolutional features, but also improve its estimates via spatial and semantic relationships. But what are spatial and semantic relationships, and how can they be used to improve recognition? Take a look at Fig. 1. An example of spatial reasoning (top-left) would be: if three regions out of four in a line are “window”, then the fourth is also likely to be “window”. An example of semantic reasoning (bottom-right) would be to recognize “school bus” even if we have seen few or no examples of it – just given examples of “bus” and knowing their connections. Finally, an example of spatial-semantic reasoning could be: recognition of a “car” on the road should help in recognizing the “person” inside “driving” the “car”.

Figure 1. Current recognition systems lack the reasoning power beyond convolutions with large receptive fields, whereas humans can explore the rich space of spatial and semantic relationships for reasoning: e.g. inferring the fourth “window” even with occlusion, or the “person” who drives the “car”. To close this gap, we present a generic framework that also uses relationships to iteratively reason and build up estimates.
A key recipe for reasoning with relationships is to iteratively build up estimates. Recently, there have been efforts to incorporate such reasoning via top-down modules [38, 48] or using explicit memories [51, 32]. In the case of top-down modules, high-level features, which carry class-based information, can be used in conjunction with low-level features to improve recognition performance.
An alternative architecture is to use explicit memory. For example, Chen & Gupta [4] perform sequential object detection, where a spatial memory is used to store previously detected objects, leveraging the power of ConvNets to extract dense context patterns that benefit follow-up detections.
However, there are two problems with these approaches: a) both use stacks of convolutions to perform local pixel-level reasoning [11], which lacks the global reasoning power that would allow regions far apart to directly communicate information; b) more importantly, both assume there are enough examples of relationships in the training data for the model to learn them from scratch, but since relationships grow exponentially with an increasing number of classes, there is not always enough data. A lot of semantic reasoning requires learning from few or no examples [14]. Therefore, we need ways to exploit additional structured information for visual reasoning.
In this paper, we put forward a generic framework for both spatial and semantic reasoning. Different from current approaches that rely solely on convolutions, our framework can also learn from structured information in the form of knowledge bases [5, 56] for visual recognition. The core of our algorithm consists of two modules. The local module, based on spatial memory [4], performs pixel-level reasoning using ConvNets; we make major improvements in efficiency through parallel memory updates. Additionally, we introduce a global module for reasoning beyond local regions. In the global module, reasoning is based on a graph structure with three components: a) a knowledge graph, where we represent classes as nodes and build edges to encode different types of semantic relationships; b) a region graph of the current image, where regions in the image are nodes and spatial relationships between these regions are edges; c) an assignment graph that assigns regions to classes. Taking advantage of such a structure, we develop a reasoning module specifically designed to pass information on this graph. Both the local module and the global module roll out iteratively and cross-feed predictions to each other in order to refine estimates. Note that local and global reasoning are not isolated: a good image understanding is usually a compromise between background knowledge learned a priori and image-specific observations. Therefore, our full pipeline joins forces of the two modules through an attention [3] mechanism, allowing the model to rely on the most relevant features when making the final predictions.
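
To make the graph structure above concrete, here is a minimal sketch, in PyTorch, of one round of message passing over the three graphs. This is not the authors' code: the function graph_reasoning_step, the fixed adjacency matrices A_spatial, A_knowledge, A_assign, and the single projections W_r, W_c are illustrative assumptions; the paper's module learns edge-type-specific propagation and runs for several iterations.

    import torch
    import torch.nn.functional as F

    def graph_reasoning_step(region_feats, class_embed,
                             A_spatial, A_knowledge, A_assign,
                             W_r, W_c):
        # region_feats: (R, D) features of image regions (region-graph nodes)
        # class_embed:  (C, D) features of classes (knowledge-graph nodes)
        # A_spatial:    (R, R) region-graph adjacency (spatial relationships)
        # A_knowledge:  (C, C) knowledge-graph adjacency (semantic relationships)
        # A_assign:     (R, C) soft assignment of regions to classes
        # W_r, W_c:     (D, D) learned projections for the two message types
        # Regions exchange information along spatial edges.
        region_msg = A_spatial @ region_feats @ W_r
        # Classes propagate along semantic edges, then their knowledge flows
        # back to regions through the assignment graph.
        class_msg = A_assign @ (A_knowledge @ class_embed @ W_c)
        # A residual update keeps the original visual evidence in the loop.
        return F.relu(region_feats + region_msg + class_msg)

Even in this simplified form, the assignment graph is what lets semantic knowledge about classes (e.g. that a “school bus” is a kind of “bus”) reach regions that spatial convolutions alone would never connect.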
We show strong performance over plain ConvNets using our framework. For example, we achieve an 8.4% absolute improvement on ADE [55] measured by per-class average precision, whereas simply making the network deeper helps by only about 1%.
2. Related Work
Visual Knowledge Base. While the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale – the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc. – deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels via crowd-sourcing, attempts have also been made to accumulate structured knowledge (e.g. relationships [5], n-grams [10]) automatically from the web. However, these works fixate on building knowledge bases rather than using knowledge for reasoning. Our framework, while more general, is along the line of research that applies visual knowledge bases to end tasks, such as affordances [56], image classification [32], or question answering [49].
Context Modeling. Modeling context, or the interplay between scenes, objects and parts, is one of the central problems in computer vision. While various previous works (e.g. scene-level reasoning [46], attributes [13, 36], structured prediction [24, 9, 47], relationship graphs [21, 31, 52]) have approached this problem from different angles, the breakthrough came from the idea of feature learning with ConvNets [16]. On the surface, such models hardly use any explicit context module for reasoning, but it is generally accepted that ConvNets are extremely effective at aggregating local pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments, such as top-down modules [50, 29, 43], pairwise modules [40], iterative feedback [48, 34, 2], attention [53], and memory [51, 4], are motivated to leverage this power and depend on variants of convolutions for reasoning. Our work takes an important next step beyond those approaches in that it also incorporates learning from structured visual knowledge bases directly, to reason with spatial and semantic relationships.
Relational Reasoning. The earliest forms of reasoning in artificial intelligence date back to symbolic approaches [33], where relations between abstract symbols are defined by the language of mathematics and logic, and reasoning takes place by deduction, abduction [18], etc. However, symbols need to be grounded [15] before such systems are practically useful. Modern approaches, such as the path ranking algorithm [26], rely on statistical learning to extract useful patterns and perform relational reasoning on structured knowledge bases. In this active research area, recent works have also applied neural networks to graph-structured data [42, 17, 27, 23, 35, 7, 32], or attempted to regularize the outputs of networks with relationships [8] and knowledge bases [20]. However, we believe that for visual data, reasoning should be both local and global: discarding the two-dimensional image structure is neither efficient nor effective for tasks that involve regions.
3. Reasoning Framework
In this section we build up our reasoning framework. Besides the plain predictions p_0 from a ConvNet, it consists of two core modules that reason to predict. The first, a local module, uses a spatial memory to store previous beliefs with parallel updates, and still falls within the regime of convolution-based reasoning (Sec. 3.1). Beyond convolutions, we present our key contribution – a global module that reasons directly between regions and classes represented as nodes
in a graph (Sec. 3.2). Both modules build up estimates iteratively, cross-feeding predictions to each other to refine them.
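
As a rough illustration of this roll-out, the sketch below (again in PyTorch) shows how the two modules could cross-feed predictions and be fused with attention. The callables local_step, global_step, and attend are hypothetical placeholders for the modules defined in the rest of this section; the actual update rules differ.

    import torch

    def iterative_reasoning(p0, features, local_step, global_step,
                            attend, num_iters=3):
        # p0: (R, C) plain per-region class predictions from the ConvNet
        p_local, p_global = p0, p0
        for _ in range(num_iters):
            # Each module refines its beliefs conditioned on the other
            # module's predictions from the previous iteration.
            p_local = local_step(features, p_local, p_global)
            p_global = global_step(features, p_global, p_local)
        # A learned attention gate decides, per prediction, how much to
        # trust local versus global reasoning.
        w = torch.sigmoid(attend(p_local, p_global))
        return w * p_local + (1.0 - w) * p_global

The gate realizes the compromise noted in the introduction: per prediction, the model weighs image-specific local evidence against knowledge-driven global estimates.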