Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Abstract
Image-language transformer models have achieved tremendous success, but they come at high computational cost. We propose a joint adaptive image-language representation learning method that adaptively and iteratively fuses multi-modal features. This consistently reduces model cost and size, allows the model to scale without a large increase in FLOPs or memory, and outperforms bigger, much more expensive models. With only 40M training examples and 39 GFLOPs, our model outperforms models many times its size, some reaching 800 GFLOPs.
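To make the idea of adaptive, iterative multi-modal fusion concrete, here is a minimal sketch in PyTorch. It is not the paper's actual architecture: the module names (`FusionBlock`, `IterativeFusion`), the per-token sigmoid gate, and the choice of three fusion iterations are all illustrative assumptions; the sketch shows only the general pattern of repeatedly cross-attending text tokens to image tokens with a learned gate controlling how much fused signal is mixed in.

```python
# A hedged sketch of adaptive, iterative image-language fusion.
# Module/parameter names and the gating scheme are assumptions for
# illustration, not the design described in the paper.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """One fusion step: text tokens cross-attend to image tokens,
    and a learned per-token gate decides how much fused signal to keep."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate in [0, 1] per text token -- the "adaptive" part of the fusion.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, text, image):
        fused, _ = self.cross_attn(self.norm(text), image, image)
        g = self.gate(text)           # (B, T, 1)
        return text + g * fused       # gated residual fusion


class IterativeFusion(nn.Module):
    """Applies the fusion step several times -- the "iterative" part."""

    def __init__(self, dim, num_iters=3):
        super().__init__()
        self.blocks = nn.ModuleList(FusionBlock(dim) for _ in range(num_iters))

    def forward(self, text, image):
        for blk in self.blocks:
            text = blk(text, image)
        return text


# Usage with random features standing in for encoder outputs.
text = torch.randn(2, 16, 256)    # (batch, text tokens, dim)
image = torch.randn(2, 49, 256)   # (batch, image patches, dim)
out = IterativeFusion(dim=256)(text, image)
print(out.shape)                  # torch.Size([2, 16, 256])
```

Because fusion reuses lightweight gated cross-attention blocks rather than widening the backbone, a design along these lines can grow in depth or modality coverage without a proportional jump in FLOPs or parameters, which matches the scaling behavior the abstract claims.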