Robust Visual Reasoning via Language Guided Neural Module Networks

Arjun R. Akula; Varun Jampani; Beer Changpinyo; Song-Chun Zhu

NeurIPS (2021)

Abstract

Neural module networks (NMNs) are a popular approach for solving multi-modal tasks such as visual question answering (VQA) and visual referring expression recognition (REF). A key limitation in prior implementations of NMNs is that the neural modules do not capture the association between the visual input and the relevant neighbourhood context of the textual input. This limits their generalizability. For instance, NMNs fail to understand new concepts such as "yellow sphere to the left" even when it is a combination of known concepts from the training data: "blue sphere", "yellow cube", and "metallic cube to the left". In this paper, we address this limitation by introducing a language-guided adaptive convolution layer (LG-Conv) into NMNs, in which the filter weights of convolutions are explicitly multiplied with a spatially varying language-guided kernel. Our model allows the neural module to adaptively co-attend over potential objects of interest from the visual and textual inputs. Extensive experiments on VQA and REF tasks demonstrate the effectiveness of our approach. Additionally, we propose a new challenging out-of-distribution test split for the REF task, which we call C3-Ref+, for explicitly evaluating the NMN's ability to generalize well to adversarial perturbations and unseen combinations of known concepts. Experiments on C3-Ref+ further demonstrate the generalization capabilities of our approach.
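To make the idea of multiplying filter weights by a spatially varying language-guided kernel more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the exact form of the kernel, so the function name `lg_conv`, the channel-wise gating of visual features by a text embedding, and the Gaussian kernel on guidance-feature distance are all assumptions chosen for illustration.

```python
import numpy as np


def lg_conv(visual, text, weight):
    """Illustrative language-guided adaptive convolution (hypothetical form).

    visual: (C_in, H, W) visual feature map
    text:   (C_in,) text embedding, assumed projected to the visual channel size
    weight: (C_out, C_in, k, k) ordinary convolution filter, k odd
    """
    c_out, c_in, k, _ = weight.shape
    _, h, w = visual.shape
    pad = k // 2

    # Assumption: guidance features are visual features gated channel-wise by text.
    guide = visual * text[:, None, None]

    # Zero-pad so the output keeps the input's spatial size.
    v_pad = np.pad(visual, ((0, 0), (pad, pad), (pad, pad)))
    g_pad = np.pad(guide, ((0, 0), (pad, pad), (pad, pad)))

    out = np.zeros((c_out, h, w))
    for dy in range(k):
        for dx in range(k):
            v_shift = v_pad[:, dy:dy + h, dx:dx + w]
            g_shift = g_pad[:, dy:dy + h, dx:dx + w]
            # Spatially varying kernel value for this filter tap at every pixel:
            # close to 1 where the neighbour's guidance feature matches the centre's.
            kern = np.exp(-0.5 * ((guide - g_shift) ** 2).sum(axis=0))  # (H, W)
            # Filter tap multiplied by the kernel, then accumulated over taps.
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], v_shift * kern)
    return out


# Usage: random inputs, 2 input channels, 3 output channels, 3x3 filter.
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4, 5))
embedding = rng.normal(size=2)
filters = rng.normal(size=(3, 2, 3, 3))
response = lg_conv(features, embedding, filters)  # shape (3, 4, 5)
```

In this sketch, an all-zero text embedding makes every kernel value 1, so the layer falls back to a plain zero-padded convolution. A non-zero embedding down-weights neighbours whose text-gated features differ from those of the centre pixel.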
