Joint Adaptive Representations for Image-Language Learning

Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

Image-language transformer models have achieved tremendous success, but at a high computational cost. We propose a joint adaptive image-language representation learning approach that adaptively and iteratively fuses the multi-modal features. The approach consistently reduces model cost and size, allows the model to scale without a large increase in FLOPs or memory, and outperforms larger and much more expensive models. Trained on only 40M examples and using 39 GFLOPs, our model outperforms models many times larger, some of which reach 800 GFLOPs.