Answer-Me: Multi-Task Open-Vocabulary Learning for Visual Question-Answering
Abstract
We present Answer-Me, a task-aware multi-task framework that unifies multiple question-answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous work that relies on contrastive or generative captioning training, we propose a novel and simple recipe for pretraining a joint vision-language model; the recipe is itself multi-task and uses the entire architecture end-to-end. Our results, obtained in the challenging open-vocabulary generative setting, show state-of-the-art performance, zero-shot generalization, and robustness to forgetting.