To operate efficiently in previously unseen environments, robots must be able to build a map – an internal representation of the environment – even from a small number of observations. But how should that map be represented, and which information should be stored in it, to enable downstream tasks such as localization? Classic approaches use a fixed map representation with strong spatial structure, such as voxels or point clouds, which makes them applicable to a wide range of robotic tasks. Data-driven approaches, on the other hand, can learn rich and robust representations by optimizing them directly for a downstream task. Eslami et al., for example, learn to construct representations of simulated environments from a few images, which allows them to generate images from novel viewpoints. The challenge for learning in complex environments is choosing suitable priors that enable generalization when only a limited amount of training data is available. A desirable approach would combine the best of both worlds: retain the spatial structure of the classic approaches while leveraging the power of deep neural networks to learn a flexible and effective map representation for the downstream task. In this paper, we explore how structure and learning can be combined in the context of a sparse visual localization task.