II. PREVIOUS WORK
Our work draws inspiration from recent work on one-shot imitation learning. Duan et al. [7], for example, use simulation to train a network that watches a single user demonstration and replicates it with a robot. The method leverages a
special neural network architecture that extensively uses soft-
attention in combination with memory. During an extensive
training phase in a simulated environment, the network learns
to correctly repeat a demonstrated block stacking task. The
complexity of the architecture, in particular the attention and memory mechanisms, supports robustness when repeating the demonstration, e.g., allowing the robot to repeat a failed step.
However, the intermediate representations are not designed
for interpretability. As argued in [11], the ability to generate
human-interpretable representations is crucial for modularity
and stronger generalization, and thus it is a main focus of
our work.
Another closely related work by Denil et al. [5] learns
programmable agents capable of executing readable pro-
grams. They consider reinforcement learning in the context
of a simulated manipulator reaching task. This work draws
parallels to the third component of our architecture (the
execution network), which translates a human-readable plan
into a closed-loop sequence of robotic actions. Further, our
approach of decomposing the system is similar in spirit to
the modular networks of [6].
These prior approaches operate on a low-dimensional
representation of the objects in the environment and train
in simulation. Like Duan et al. [7], we acquire a label-free
low-dimensional representation of the world by leveraging
simulation-to-reality transfer. We use the simple but effective
principle of domain randomization [27] for transferring a
representation learned entirely in simulation. This approach
has been successfully applied in several robotic learning
applications, including the aforementioned work in demon-
stration learning [7], as well as grasp prediction [27] and
visuomotor control [16]. Building on prior work, we acquire
a more detailed description of the objects in a scene using
object part inference inspired by Cao et al. [3], allowing the
extraction of interpretable intermediate representations and
inference of additional object parameters, such as orientation.
Further, we make predictions in image space, so that robust
transfer to the real world requires only determining the
camera’s extrinsic parameters, rather than needing to develop
a simulated world to match the real environment for training.
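To illustrate the domain randomization principle used here, the sketch below perturbs nuisance parameters of a synthetic scene before each training image is rendered; the renderer interface, parameter names, and ranges are hypothetical assumptions rather than the settings used in this work.

```python
import random

def randomized_scene_params():
    """Sample nuisance parameters for one synthetic training image.
    Fields and ranges are illustrative, not the paper's actual settings."""
    return {
        "light_intensity": random.uniform(0.2, 2.0),
        "light_direction": [random.uniform(-1.0, 1.0) for _ in range(3)],
        "background_texture": random.choice(["noise", "checker", "photo"]),
        "object_hue_shift": random.uniform(-0.5, 0.5),
        "camera_jitter": [random.gauss(0.0, 0.02) for _ in range(3)],
    }

# Hypothetical usage: render many randomized views so that a network trained
# entirely in simulation learns features that transfer to real images.
# for _ in range(num_images):
#     image, labels = simulator.render(randomized_scene_params())
#     dataset.append((image, labels))
```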
Related work in imitation learning trains agents via
demonstrations. These methods typically focus on learning a
single complex task, e.g., steering a car based on human
demonstrations, instead of learning how to perform one-
shot replication in a multi-task scenario, e.g., repeating a
specific demonstrated block stacking sequence. Behavior
cloning [22], [23], [24] treats learning from demonstration as
a supervised learning problem, teaching an agent to exactly
replicate the behavior of an expert by learning a function
from the observed state to the next expert action. This approach may suffer as errors accumulate in the agent's actions, eventually leading to states not encountered by the expert.
Inverse reinforcement learning [15], [19], [1] mitigates this
problem by estimating a reward function to explain the
behavior of the expert and training a policy with the learned
reward mapping. It typically requires running an expensive
reinforcement learning training step in the inner loop of
optimization or, alternatively, applying generative adversarial
networks [14], [13] or guided cost learning [10].
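To make the contrast concrete, behavior cloning reduces to ordinary supervised regression from observed states to expert actions, as in the minimal sketch below; the network sizes, dimensions, and data are illustrative assumptions, not any of the cited methods.

```python
import torch
import torch.nn as nn

# Minimal behavior-cloning sketch: supervised regression from states to
# expert actions. Dimensions and architecture are illustrative only.
state_dim, action_dim = 10, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def bc_update(states, expert_actions):
    """One gradient step matching the policy's output to the expert's actions."""
    optimizer.zero_grad()
    loss = loss_fn(policy(states), expert_actions)
    loss.backward()
    optimizer.step()
    return loss.item()
```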
The conjunction of language and vision for environment
understanding has a long history. Early work by Wino-
grad [30] explored the use of language for a human to guide
and interpret interactions between a computerized agent and
a simulated 3D environment. Models have been trained to perform automatic image captioning [28], video captioning [25], visual question answering [12], and understanding of and reasoning about visual relationships [18], [21], [17], all of which interact with a visual scene through natural language.
Recent work has studied the grounding of natural language
instructions in the context of robotics [20]. Natural language
utterances and the robot’s visual observations are grounded
in a symbolic space and the associated grounding graph,
allowing the robot to infer the specific actions required to
follow a subsequent verbal command.
Neural task programming (NTP) [31], a concurrent work
to ours, achieves one-shot imitation learning with an on-
line hierarchical task decomposition. An RNN-based NTP
model processes a demonstration to predict the next sub-
task (e.g., “pick-and-place”) and a relevant sub-sequence of
the demonstration, which are recursively input to the NTP
model. The base of the hierarchy is made up of primitive
sub-tasks (e.g., “grip”, “move”, or “release”), and recursive
sub-task prediction is made with the current observed state as
input, allowing closed-loop control. Like our work, the NTP
model provides a human-readable program, but unlike our
approach, the NTP program is produced during execution,
not before.
III. METHOD
An overview of our system is shown in Fig. 2. A camera
acquires a live video feed of a scene, and the positions and
relationships of objects in the scene are inferred in real time
by a pair of neural networks. The resulting percepts are fed
to another network that generates a plan to explain how to
recreate those percepts. Finally, an execution network reads
the plan and generates actions for the robot, taking into
account the current state of the world in order to ensure
robustness to external disturbances.
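The data flow described above can be summarized by the following sketch; all class and method names (camera, perception, planner, executor, robot) are hypothetical placeholders rather than the actual implementation.

```python
# Minimal sketch of the inference pipeline described above.
# All interfaces are hypothetical placeholders.

def imitate_once(camera, perception, planner, executor, robot):
    """Observe a demonstrated scene once, then plan and execute its recreation."""
    # 1. Perception: infer object poses and relationships from a single frame.
    demo_percepts = perception.infer(camera.capture())

    # 2. Planning: produce a human-readable plan that recreates those percepts.
    plan = planner.generate(demo_percepts)

    # 3. Execution: closed-loop control; the state of the world is re-estimated
    #    before every action for robustness to external disturbances.
    for step in plan:
        done = False
        while not done:
            current_percepts = perception.infer(camera.capture())
            action, done = executor.act(step, current_percepts)
            robot.apply(action)
    return plan
```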
A. Perception networks
Given a single image, our perception networks infer the lo-
cations of objects in the scene and their relationships. These
networks perform object detection with pose estimation, as
well as relationship inference.
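For concreteness, one way to organize the percepts produced by these networks is sketched below; the field names and types are illustrative assumptions, not the representation used in this work.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectPercept:
    """Hypothetical per-object output of the detection/pose network."""
    class_label: str                                  # e.g., "cube"
    position: Tuple[float, float, float]              # estimated 3D location
    orientation: Tuple[float, float, float, float]    # quaternion

@dataclass
class ScenePercept:
    """Hypothetical full output of the perception networks for one image."""
    objects: List[ObjectPercept] = field(default_factory=list)
    # Pairwise relationships, e.g., (subject index, "on_top_of", object index).
    relationships: List[Tuple[int, str, int]] = field(default_factory=list)
```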
1) Image-centric domain randomization: Each object of
interest in this work is modeled by its bounding cuboid
consisting of up to seven visible vertices and one hidden
vertex. Rather than directly mapping from images to object