Room-Across-Room (RxR) is the first multilingual dataset for Vision-and-Language Navigation (VLN). It contains 126,069 human-annotated navigation instructions in three typologically diverse languages — English, Hindi and Telugu. Each instruction describes a path through a photorealistic simulator populated with indoor environments from the Matterport3D dataset, which includes 3D captures of homes, offices and public buildings.
In addition to navigation instructions and paths, RxR also includes a new type of multimodal annotation called a pose trace. Inspired by the mouse traces captured in the Localized Narratives dataset, pose traces provide dense groundings between language, vision and movement in a rich 3D setting. To generate navigation instructions, we ask guide annotators to move along a path in the simulator while narrating the path based on the surroundings. The pose trace is a record of everything the guide sees along the path, time-aligned with the words in the navigation instructions. These traces are then paired with pose traces from follower annotators, who are tasked with following the intended path by listening to the guide’s audio, thereby validating the quality of the navigation instructions. Pose traces implicitly capture notions of landmark selection and visual saliency, and represent a play-by-play account of how to solve the navigation instruction generation task (for guides) and the navigation instruction following task (for followers).
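As a rough illustration of what "time-aligned" means here, the sketch below pairs each narrated word with the most recent pose recorded at or before the moment the word was spoken. It is not the RxR file format or loading code: the `Pose` and `TimedWord` classes, their field names, and the `align_words_to_poses` helper are all hypothetical and chosen only for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pose:
    """Hypothetical pose sample: when it was recorded and where the camera pointed."""
    time: float       # seconds since the start of the recording
    pano_id: str      # identifier of the panorama the annotator is standing in
    heading: float    # camera heading in radians
    elevation: float  # camera elevation in radians


@dataclass
class TimedWord:
    """Hypothetical transcribed word with the time at which it starts."""
    word: str
    start: float      # seconds since the start of the recording


def align_words_to_poses(
    words: List[TimedWord], poses: List[Pose]
) -> List[Tuple[str, Optional[Pose]]]:
    """Pair each word with the latest pose recorded at or before the word's start.

    A word spoken before the first pose sample is paired with None.
    """
    ordered = sorted(poses, key=lambda p: p.time)
    times = [p.time for p in ordered]
    aligned = []
    for w in words:
        # bisect_right counts the poses with time <= w.start,
        # so the index just before it is the most recent pose.
        i = bisect_right(times, w.start) - 1
        aligned.append((w.word, ordered[i] if i >= 0 else None))
    return aligned


if __name__ == "__main__":
    # Made-up values, for illustration only.
    poses = [
        Pose(time=0.0, pano_id="pano_a", heading=0.0, elevation=0.0),
        Pose(time=2.0, pano_id="pano_b", heading=1.5, elevation=0.0),
    ]
    words = [TimedWord("walk", 0.5), TimedWord("left", 2.3)]
    for word, pose in align_words_to_poses(words, poses):
        print(word, pose.pano_id if pose else None)
```

A nearest-earlier lookup is only one possible alignment rule; interpolating between neighbouring poses or aggregating all poses within a word's duration would be equally reasonable under different assumptions.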
To track progress towards agents that can navigate complex human environments in response to spoken or written commands, we have also launched the RxR Challenge, a competition that encourages the machine learning community to train and evaluate their own instruction-following agents on RxR instructions. PanGEA, the annotation tool we developed to collect RxR, is open-sourced on GitHub.