Grasp2Vec: Learning Object Representations from Self-Supervised Grasping
Abstract
Robotic learning algorithms based on reinforcement, self-supervision,
and imitation can acquire end-to-end controllers from raw sensory
inputs such as images. These end-to-end controllers acquire perception
systems that are tailored to the task, picking up on the cues that are
most useful for the task at hand. However, to learn generalizable
robotic skills, we might prefer more structured image representations,
such as ones encoding the persistence of objects and their identities.
In this paper, we study a specific instance of this problem: acquiring
object representations through a robot's autonomous interaction with its
environment.
Our representation learning method is based on object persistence:
when a robot picks up an object and ``subtracts'' it from the scene,
its representation of the scene should change in a predictable way. We
can use this observation to formulate a simple condition that an
object-centric representation should satisfy: the features
corresponding to a scene should be approximately equal to the feature
values for the same scene after an object has been removed, plus the
feature value for that object.
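As a sketch, the ``subtraction'' idea can be written in symbols. The
notation here is assumed for illustration only: $\phi_s$ is a scene
embedding, $\phi_o$ is an object embedding, $s_{\mathrm{pre}}$ and
$s_{\mathrm{post}}$ are the scene before and after the grasp, and $o$
is the grasped object. Under these assumptions, the condition reads
\[
  \phi_s(s_{\mathrm{pre}}) - \phi_s(s_{\mathrm{post}}) \approx \phi_o(o),
\]
that is, the change in the scene features caused by removing an object
should match the features of that object.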