Efficient Adaptive Image-Language Learning for Visual Question Answering

CVPR Workshop on Transformers for Vision (T4V) (2022)

Abstract

We present a novel, efficient image-language learning model for multi-task visual question answering that works at a fraction of the computational cost of comparable models. New compact features are learned adaptively, according to the data, to jointly represent the image and language modalities. Our method outperforms state-of-the-art multi-task approaches on SNLI-VE and GQA, and is competitive on VQA2.0. The model is highly efficient, using 7-10x fewer GFLOPs, and scales well to input images more than twice the original size.
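The abstract does not specify how the compact joint features are computed. Purely as an illustration of the general idea, and not of this paper's actual architecture, the sketch below shows one common way to fuse two modalities into a small, fixed number of learned tokens via cross-attention; every name (CompactJointFeatures, num_tokens, etc.) is hypothetical.

```python
# Hypothetical sketch: a small set of learned "bottleneck" tokens
# cross-attends over concatenated image and text features, producing
# a compact joint representation. All names are illustrative; this is
# NOT the paper's architecture.
import torch
import torch.nn as nn

class CompactJointFeatures(nn.Module):
    def __init__(self, dim: int = 256, num_tokens: int = 16, num_heads: int = 4):
        super().__init__()
        # Learned queries that summarize both modalities into a few tokens.
        self.bottleneck = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N_img, dim); text_feats: (B, N_txt, dim)
        joint = torch.cat([image_feats, text_feats], dim=1)
        queries = self.bottleneck.unsqueeze(0).expand(joint.size(0), -1, -1)
        fused, _ = self.attn(queries, joint, joint)  # (B, num_tokens, dim)
        return self.norm(fused)

# Usage: downstream layers then operate on num_tokens << N_img + N_txt tokens,
# which is where FLOP savings typically come from in designs of this kind.
model = CompactJointFeatures()
img = torch.randn(2, 196, 256)   # e.g. 14x14 patch features
txt = torch.randn(2, 32, 256)    # e.g. question token features
out = model(img, txt)            # torch.Size([2, 16, 256])
```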
