M&M Mix: A Multimodal Multiview Transformer Ensemble

Xuehan Xiong; Anurag Arnab; Arsha Nagrani; Cordelia Schmid

University of Bristol

Abstract

This report describes the approach behind our submission to the 2022 Epic-Kitchens Action Recognition Challenge from team Google Research Grenoble. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission is an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on action classes on the test set, which is 4.1% higher than last year's winning entry.
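The abstract does not say how the ensemble members are combined. As an illustration only, the sketch below shows one common way to ensemble classifiers: average each model's softmax probabilities and take the top-scoring class. The function names `softmax` and `ensemble_predict`, the equal weighting, and the toy logits are assumptions made for this example and are not taken from the report.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average per-model class probabilities (equal weights, an assumption).

    per_model_logits: one list of logits per model, all of the same length.
    Returns (top-1 class index, averaged probabilities).
    """
    if not per_model_logits:
        raise ValueError("need logits from at least one model")
    num_classes = len(per_model_logits[0])
    if any(len(l) != num_classes for l in per_model_logits):
        raise ValueError("all models must score the same set of classes")

    probs = [softmax(l) for l in per_model_logits]
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    top1 = max(range(num_classes), key=avg.__getitem__)
    return top1, avg


# Toy example: three hypothetical models scoring three classes.
toy_logits = [
    [2.0, 1.0, 0.0],  # model A prefers class 0
    [0.0, 3.0, 0.0],  # model B prefers class 1
    [0.5, 2.5, 0.0],  # model C prefers class 1
]
top1, avg = ensemble_predict(toy_logits)
```

In this toy input two of the three models favor class 1 with high confidence, so the averaged distribution also selects class 1 even though model A disagrees. A real submission could instead weight models unequally or average logits rather than probabilities; the report's abstract does not specify which.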

Research Areas

Machine perception
