Google Research

M&M Mix: A Multimodal Multiview Transformer Ensemble

University of Bristol


This report describes the approach behind our submission to the 2022 Epic-Kitchens Action Recognition Challenge from team Google Research Grenoble. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission is an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. It achieved 52.8% Top-1 accuracy on action classes on the test set, which is 4.1% higher than last year's winning entry.
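The abstract does not say how the ensemble members are combined. As a minimal sketch of one common choice, assuming each model emits per-class logits for a clip and that predictions are fused by a weighted average of softmax probabilities, the idea might look like the following. The function names, the weighting scheme, and the toy logits are all illustrative assumptions, not the submission's actual procedure.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Fuse several models by averaging their softmax probabilities.

    per_model_logits: one list of class logits per model (hypothetical input).
    weights: optional per-model weights; defaults to a uniform average.
    Returns the Top-1 class index and the fused probability vector.
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    num_classes = len(per_model_logits[0])
    fused = [0.0] * num_classes
    for w, logits in zip(weights, per_model_logits):
        for i, p in enumerate(softmax(logits)):
            fused[i] += w * p
    top1 = max(range(num_classes), key=fused.__getitem__)
    return top1, fused


# Toy example: three hypothetical models scoring three action classes.
model_logits = [
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.3],
    [1.0, 3.0, 0.2],
]
top1, fused = ensemble_predict(model_logits)
```

Averaging probabilities rather than raw logits keeps models with different logit scales from dominating the vote, which can matter when the members differ in backbone size or input modality.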
