In robotic applications we often face the challenge of detecting instances of objects for which we have no trained models and little or no labeled data. In this paper we propose to use self-supervisory signals, generated without human supervision by a robot exploring an environment, to learn a representation of the novel object instances present in that environment. We demonstrate the utility of this representation in two ways. First, we can automatically discover objects by clustering in this representation space; each resulting cluster contains examples of a single instance seen from various viewpoints and at various scales. Second, given a small number of labeled images, we can efficiently learn detectors for these labels. In the few-shot regime these detectors achieve a substantially higher mAP of XX compared to off-the-shelf standard detectors trained on the same limited data. Thus, self-supervision yields efficient and performant object discovery and detection at little or no human labeling cost.