
results in producing a bounding box in the least amount of time. We train an IAD agent to select the type of action based on previous experience in annotating images. Our method automatically adapts to the difficulty of the image, the strength of the detector, the desired quality of the boxes, and other factors. This is achieved by modeling the episode duration as a function of problem properties. We consider two alternative ways to do this: either a) by predicting whether a proposed box will be positively or negatively verified (Sec. 4.1), or b) by directly predicting the episode duration (Sec. 4.2).
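As a purely illustrative sketch of these two alternatives, the snippet below shows an agent choosing its next action. The time constants and the predictors are hypothetical stand-ins for the learned models of Sec. 4.1 and Sec. 4.2, not our actual implementation.

```python
# Illustrative sketch only: the constants and predictors below are
# hypothetical stand-ins for the learned models of Sec. 4.1 and 4.2.

T_VERIFY = 1.8   # assumed seconds per verification question
T_DRAW = 7.0     # assumed seconds per manual box drawing [32]

def choose_by_acceptance(p_accept: float) -> str:
    """Strategy (a): given the predicted probability that the proposed
    box is verified positively, compare the expected time of asking a
    verification question first against drawing the box directly."""
    expected_verify_first = T_VERIFY + (1.0 - p_accept) * T_DRAW
    return "verify" if expected_verify_first < T_DRAW else "draw"

def choose_by_duration(predict_duration) -> str:
    """Strategy (b): directly regress the remaining episode duration
    for each available action and greedily pick the cheaper one."""
    return min(("verify", "draw"), key=predict_duration)
```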
We evaluate IAD by annotating bounding boxes in the PASCAL VOC 2007 dataset [15] in several scenarios: a) with various desired quality levels; b) with detectors of varying strength; and c) with two ways to draw bounding boxes, including a recent method which takes only 7s per box [32]. In all scenarios our experiments demonstrate that, thanks to its adaptive behavior, IAD speeds up box annotation compared to manual box drawing alone or box verification series alone. Moreover, it outperforms any fixed combination of the two in most scenarios. Finally, we demonstrate that IAD learns useful strategies in a complex, realistic scenario where the detector is continuously improved as the amount of training data grows. Our IAD code is publicly available at https://github.com/google/intelligent_annotation_dialogs.
2. Related work
Drawing bounding boxes
Fully supervised object detectors are trained on data with manually annotated bounding boxes, which is costly. The reference box drawing interface [43] used to annotate ILSVRC [38] requires 25.5s for drawing one box. A recent, more efficient interface reduces this cost to 7.0s without compromising quality [32]. We consider both interfaces in this paper.
Training object detectors from image-level labels
Weakly Supervised Object Localization (WSOL) [6, 13, 14, 16, 21, 53] methods train object detectors from images labeled only as containing certain object classes, but without bounding boxes. This avoids the cost of box annotation, but leads to considerably weaker detectors than their fully supervised counterparts [6, 16, 21, 53]. To produce better object detectors, extra human annotation is required.
Other forms of weak supervision
Several works aim to reduce the annotation cost of manual box drawing by using approximate forms of annotation. Eye-tracking is used for training object detectors [31] and for action recognition in videos [26]. Point-clicks are used to derive bounding box annotations in images [33] and video [27], and to train semantic segmentation models [2, 3, 50]. Other works train semantic segmentation models using scribbles [25, 51].
In this paper we build on box verification series [34], where boxes are iteratively proposed by an object detector and verified by a human annotator. Experiments show that humans can perform box verification reliably (Fig. 6 of [34]). Moreover, the Open Images dataset [23] contains 2.5 million boxes annotated in this manner, demonstrating that it can be done at scale.
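For concreteness, a minimal sketch of one such verification series is shown below; the detector proposals, the annotator's yes/no answers, and the fall-back to manual drawing are hypothetical callables, not the actual interface of [34].

```python
# Minimal sketch of a box verification series; all callables are
# hypothetical placeholders for the detector and annotation interface.

def verification_series(proposals, ask_human, draw_box, max_questions=5):
    """Ask yes/no questions about detector proposals, best-ranked first.

    proposals:     candidate boxes sorted by detector confidence
    ask_human:     callable(box) -> bool, the annotator's verification
    draw_box:      callable() -> box, manual drawing as a fall-back
    max_questions: budget of verification questions per object
    """
    for box in proposals[:max_questions]:
        if ask_human(box):   # proposal accepted: annotation is done
            return box
    return draw_box()        # all proposals rejected: draw manually
```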
Interactive annotation
Several works use human-machine collaboration to efficiently produce annotations. These works address interactive segmentation [8, 37, 12, 18, 17, 30], attribute-based fine-grained image classification [10, 35, 7, 49], and interactive video annotation [48]. Branson et al. [9] transform different types of location information (e.g. parts, bounding boxes, segmentations) into each other with corrections from an annotator. These works follow a predefined annotation protocol, whereas we explore algorithms that can automatically select questions, adapting to the input image, the desired quality of the annotation, and other factors.
The closest work to ours [39] proposes human-machine collaboration for bounding box annotation. Given a repertoire of questions, the problem is modeled as a Markov decision process. Our work differs in several respects. (1) While Russakovsky et al. [39] optimize the expected precision of annotations over the whole dataset, our method delivers quality guarantees on each individual box. (2) Our approach of Sec. 4.1 is mediated by predicting the probability that a proposed box will be accepted by an annotator. Based on this, we provide a provably optimal strategy which minimizes the expected annotation time. (3) Our reinforcement learning approach of Sec. 4.2 learns a direct mapping from measurable properties to annotation time, while avoiding any explicit modelling of the task. (4) Finally, we address a scenario where the detector is iteratively updated (Sec. 5.3), as opposed to keeping it fixed.
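To make point (2) concrete, consider a simplified one-step version of the problem under the assumption of fixed costs: a verification question takes $t_v$ seconds and manual drawing takes $t_d$ seconds (the model of Sec. 4.1 is more general). If the detector's proposed box is accepted with probability $p$, the expected times of the two possible first actions are
\[
\mathbb{E}[T_{\text{verify first}}] = t_v + (1-p)\,t_d, \qquad \mathbb{E}[T_{\text{draw}}] = t_d,
\]
so asking the verification question first is optimal in expectation precisely when $p > t_v / t_d$. With the illustrative costs assumed in the sketch above ($t_v = 1.8$s, $t_d = 7$s), verification pays off whenever the predicted acceptance probability exceeds roughly $0.26$.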
Active learning (AL)
In active learning the goal is to train a model while asking humans to annotate the unlabeled examples expected to improve model accuracy the most. It is used in computer vision to train whole-image classifiers [20, 22], object class detectors [47, 52], and semantic segmentation models [41, 45, 46]. While the goal of AL is to select a subset of the data to be annotated, this paper aims at minimizing the time to annotate each example.
Reinforcement learning
Reinforcement learning (RL) traditionally aims at learning policies that allow autonomous agents to act in interactive environments. It has a long tradition in, e.g., robotics [1, 28, 24]. In computer vision, it has mainly been used for active vision tasks [11, 4, 19], such as learning a policy for the spatial exploration of large or panoramic images. Our use of RL differs from this, as we learn a policy for image annotation, not for image analysis. The learned policy enables the system to dynamically choose the annotation mechanism by which to interact with the user.
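To suggest what learning such a policy could look like in practice, here is a hedged, tabular Q-learning sketch in which the reward is the negative time spent, so that maximizing return minimizes total annotation time. The state representation, discretization, and hyperparameters are assumptions for illustration; the actual formulation of Sec. 4.2 may differ.

```python
# Hedged sketch: tabular Q-learning over annotation actions. States,
# rewards, and hyperparameters are illustrative assumptions only.
import random
from collections import defaultdict

ACTIONS = ("verify", "draw")
ALPHA, GAMMA, EPS = 0.1, 1.0, 0.1

Q = defaultdict(float)  # (state, action) -> estimated return

def select_action(state):
    """Epsilon-greedy action selection over the two annotation actions."""
    if random.random() < EPS:                 # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, elapsed_seconds, next_state, done):
    """Reward is the negative time spent on the action, so maximizing
    return minimizes the total annotation time of the episode."""
    target = -elapsed_seconds
    if not done:
        target += GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```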