HyperGrid Transformers: Towards A Single Model for Multiple Tasks
Abstract
Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a separate model for every task. This approach incurs a higher overall parameter cost, along with greater engineering overhead for serving multiple models. Learning a single multi-task model that performs well across all tasks is a challenging yet attractive proposition. In this paper, we propose \textsc{HyperGrid}, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections, which help specialize regions of the weight matrices for different tasks. To construct the proposed hyper-projection, our method learns the interaction and composition between a global state and a local task-specific state. We apply \textsc{HyperGrid} to the current state-of-the-art T5 model, yielding strong gains across the GLUE and SuperGLUE benchmarks when trained in a single-model multi-task setup. Our method helps bridge the gap between single-task fine-tuning and single-model multi-task approaches.
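To make the grid-wise gating idea concrete, below is a minimal sketch of one plausible reading of the abstract: a shared "global" vector and a per-task "local" vector are composed into a coarse grid of gates, which is tiled up to the shape of a shared weight matrix so that different tasks activate different regions of the same weights. All names, shapes, and the choice of an outer product followed by a sigmoid are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class HyperGridLinear(nn.Module):
    """Sketch of a grid-wise gated linear layer (assumed formulation).

    A global vector (shared across tasks) and a local, task-specific
    vector are composed via an outer product and a sigmoid into a
    coarse grid of gates. The grid is tiled to the full weight shape
    and used to modulate a shared weight matrix, so different tasks
    specialize different regions of the same weights.
    """

    def __init__(self, d_in, d_out, num_tasks, grid_rows=8, grid_cols=8):
        super().__init__()
        assert d_out % grid_rows == 0 and d_in % grid_cols == 0
        self.linear = nn.Linear(d_in, d_out)                        # shared weights
        self.global_vec = nn.Parameter(torch.zeros(grid_rows))      # global state
        self.local_vecs = nn.Embedding(num_tasks, grid_cols)        # local, per-task state
        self.block = (d_out // grid_rows, d_in // grid_cols)

    def forward(self, x, task_id):
        local = self.local_vecs(task_id)                            # (grid_cols,)
        grid = torch.sigmoid(torch.outer(self.global_vec, local))   # (grid_rows, grid_cols)
        # Tile the coarse grid so each cell gates a block of the weight matrix.
        gate = grid.repeat_interleave(self.block[0], dim=0) \
                   .repeat_interleave(self.block[1], dim=1)         # (d_out, d_in)
        gated_weight = self.linear.weight * gate
        return x @ gated_weight.t() + self.linear.bias


# Hypothetical usage: route a batch through the layer for task 3.
layer = HyperGridLinear(d_in=512, d_out=2048, num_tasks=8)
y = layer(torch.randn(4, 512), task_id=torch.tensor(3))
```

Because the grid is low-dimensional relative to the weight matrix, the per-task overhead in this sketch is only one small embedding row per task, which is consistent with the abstract's goal of serving many tasks from a single model.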