Robotic learning algorithms based on reinforcement, self-supervision, and imitation can acquire end-to-end controllers from raw sensory inputs such as images. These end-to-end controllers acquire perception systems tailored to the task, picking up on the cues that are most useful for it. However, to learn generalizable robotic skills, we might prefer more structured image representations, such as ones that encode the persistence of objects and their identities. In this paper, we study a specific instance of this problem: acquiring object representations through a robot's autonomous interaction with its environment.
Our representation learning method is based on object persistence: when a robot picks up an object and ``subtracts'' it from the scene, its representation of the scene should change in a predictable way. We can use this observation to formulate a simple condition that an object-centric representation should satisfy: the features of a scene, minus the features of the same scene after an object has been removed, should be approximately equal to the features of that object.
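In symbols, this condition can be sketched as follows. The notation is assumed here for illustration only: $\phi_s$ denotes a scene embedding, $\phi_o$ an object embedding, $s_{\mathrm{pre}}$ and $s_{\mathrm{post}}$ the scene images before and after the object is removed, and $o$ an image of the removed object.
\[
\phi_s(s_{\mathrm{pre}}) - \phi_s(s_{\mathrm{post}}) \approx \phi_o(o)
\]
Equivalently, $\phi_s(s_{\mathrm{pre}}) \approx \phi_s(s_{\mathrm{post}}) + \phi_o(o)$: the scene before removal is represented as the scene after removal plus the removed object.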