ALFRED-L: Investigating the Role of Language for Action Learning in Interactive Visual Environments

Spandana Gella
Aishwarya Padmakumar
Mahdi Namazifar
Mohit Bansal
Jesse Thomason
Dilek Hakkani-Tur
Conference on Empirical Methods in Natural Language Processing (EMNLP) (2022)

Embodied Vision and Language Task Completion requires an embodied agent to interpret natural language instructions and egocentric visual observations in order to navigate through and interact with environments. In this work, we examine ALFRED, a challenging benchmark for embodied task completion, with the goal of gaining insight into how effectively models utilize language. We find evidence that sequence-to-sequence and transformer-based models trained on this benchmark are not sufficiently sensitive to changes in input language instructions. Next, we construct a new test split, ALFRED-L, to test whether ALFRED models can generalize to task structures that are not seen during training but that intuitively require the same types of language understanding as ALFRED. Evaluation of existing models on ALFRED-L suggests that (a) models are overly reliant on the sequence in which objects are visited in typical ALFRED trajectories and fail to adapt to modifications of this sequence, and (b) models trained with additional augmented trajectories adapt relatively better to such changes in input language instructions.