Google Research

ALFRED-L: Investigating the Role of Language for Action Learning in Interactive Visual Environments

  • Arjun R. Akula
  • Spandana Gella
  • Aishwarya Padmakumar
  • Mahdi Namazifar
  • Mohit Bansal
  • Jesse Thomason
  • Dilek Hakkani-Tur
Conference on Empirical Methods in Natural Language Processing (EMNLP) (2022)


Embodied Vision and Language Task Completion requires an embodied agent to interpret natural language instructions and egocentric visual observations in order to navigate through and interact with environments. In this work, we examine ALFRED, a challenging benchmark for embodied task completion, with the goal of gaining insight into how effectively models utilize language. We find evidence that sequence-to-sequence and transformer-based models trained on this benchmark are not sufficiently sensitive to changes in input language instructions. Next, we construct a new test split, ALFRED-L, to test whether ALFRED models can generalize to task structures that are not seen during training but intuitively require the same types of language understanding as ALFRED. Evaluation of existing models on ALFRED-L suggests that (a) models are overly reliant on the sequence in which objects are visited in typical ALFRED trajectories and fail to adapt to modifications of this sequence, and (b) models trained with additional augmented trajectories adapt relatively better to such changes in input language instructions.
