Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Jialin Wu; Xia Hu; Yaqing Wang; Bo Pang; Radu Soricut

Computer Vision and Pattern Recognition (2024)

Abstract

Specialized large multimodal models (LMMs) have exhibited remarkable performance across numerous tasks; however, generalist LMMs suffer from performance degradation when trained on a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) models help instruction tuning; however, for LMMs with around O(50-100B) parameters, the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use.
We propose Omni-SMoLA, which softly mixes many multimodal low-rank experts into large models without introducing a significant number of new parameters compared with conventional MoE models. The core idea is that the large model provides a foundational backbone, while different lightweight experts learn specialized knowledge residually. Extensive experiments demonstrate that the SMoLA approach improves generalist performance across a broad range of visual question answering and captioning tasks, achieving new state-of-the-art generalist performance that matches or outperforms single specialized LMM baselines.
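To make the core idea concrete, the following is a minimal sketch of a soft mixture of low-rank experts applied to a single linear layer. The abstract does not specify the routing function, the layers the experts attach to, or the ranks used, so everything here is an assumption for illustration: softmax gating over all experts with no top-k selection, one expert set per layer, and the hypothetical names smola_linear, extra_params, router, down, and up. It is not the authors' implementation.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    exps = [math.exp(t - top) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def smola_linear(x, w, experts, router):
    """Backbone output plus a softly weighted sum of low-rank residuals.

    x: input vector (d_in); w: backbone weights (d_out x d_in);
    experts: list of (down, up) pairs, down is (r x d_in), up is (d_out x r);
    router: (num_experts x d_in) matrix producing one logit per expert.
    """
    base = matvec(w, x)
    # Assumed routing: softmax over all experts, so every expert contributes.
    gates = softmax(matvec(router, x))
    out = list(base)
    for gate, (down, up) in zip(gates, experts):
        residual = matvec(up, matvec(down, x))  # rank-r update
        out = [o + gate * r for o, r in zip(out, residual)]
    return out, gates


def extra_params(d_in, d_out, rank, num_experts):
    """Added parameters for low-rank experts (plus router) vs. full copies."""
    low_rank = num_experts * rank * (d_in + d_out) + num_experts * d_in
    full_copies = num_experts * d_in * d_out
    return low_rank, full_copies


rng = random.Random(0)
d_in, d_out, rank, num_experts = 8, 6, 2, 4  # toy sizes, chosen arbitrarily
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
w = rand_matrix(d_out, d_in, rng)
experts = [(rand_matrix(rank, d_in, rng), rand_matrix(d_out, rank, rng))
           for _ in range(num_experts)]
router = rand_matrix(num_experts, d_in, rng)

y, gates = smola_linear(x, w, experts, router)
low_rank, full_copies = extra_params(4096, 4096, 4, 16)
print(len(y), round(sum(gates), 6), low_rank, full_copies)
```

Under these assumptions, each expert adds rank * (d_in + d_out) parameters per layer rather than a full d_in * d_out copy, which is why many experts can be mixed at a small fraction of the cost of replicating the layer. The layer width, rank, and expert count passed to extra_params are illustrative values, not figures from the paper.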
