Jump to Content
Arsha Nagrani

Arsha Nagrani

I joined Google as a Research Scientist in 2020, after completing my PhD in Computer Vision at the University of Oxford, UK. My research is focused on multimodal machine learning techniques for video understanding, including using sound and text to learn better representations. Check out my homepage for more about me and my research.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix. View details
    Preview abstract Speech emotion recognition (SER) studies typically rely on costly motion-labeled speech for training, making scaling methods to to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER that enables one to use those unlabeled data by generating weak emotion labels via pre-trained large language models, which are then used for weakly-supervised learning. For weak label generation, we utilize a textual entailment approach that selects an emotion label with the highest entailment score, given a transcript extracted from speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech. View details
    UnLoc: a unified framework for video localization tasks
    Xuehan Xiong
    Anurag Arnab
    Zhonghao Wang
    Weina Ge
    International Conference on Computer Vision (2023)
    Preview abstract We adapt large-scale image-text pretrained models such as CLIP for temporal localization tasks in untrimmed videos, which is still a relatively unexplored task. We do so by designing a new approach called UnLoc, which uses a pretrained image and text tower, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables zero-shot Moment Retrieval, TAL and action segmentation with a single stage model, without the need for action proposals or representation masking. Unlike specialised models, we achieve state of the art results on three different localization tasks with a unified approach - in some cases outperforming previous works by large margins. View details
    Preview abstract Recent video and language pretraining frameworks lack the ability to generate sentences, and are limited in transferring to generative tasks such as multimodal video captioning. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled instructional videos where the pretrained model is effectively transferred to video captioning tasks. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of the captions in the unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective -- we generate future utterances given the present mulitmodal context, and also the present utterance given future observations. We use this objective to train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for video captioning on four standard benchmarks, as well as on other video understanding tasks such as VideoQA, video retrieval and action classification. View details
    Preview abstract This report describes the approach behind our submission to the 2022 Epic-Kitchens Action Recognition Challenge from team Google Research Grenoble. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M\&M) models varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1% higher than last year’s winning entry. View details
    Preview abstract YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing nstructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark. View details
    Learning Audio-Video Modalities from Image Captions
    Paul Hongsuck Seo
    Anja Hauth
    Santiago Manen
    European Conference on Computer Vision (2022)
    Preview abstract There has been a recent explosion of large-scale image-text datasets, as images with alt-text captions can be easily obtained online.Obtaining large-scale, high quality data for video in the form of text-video and text-audio pairs however, is more challenging. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformer based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state of the art results for the task of audio retrieval. View details
    AVATAR: Unconstrained Audiovisual Speech Recognition
    Valentin Gabeur
    Paul Hongsuck Seo
    Karteek Alahari
    Interspeech (2022)
    Preview abstract Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all other prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions. View details
    Masking Modalities for Cross-modal Video Retrieval
    Valentin Gabeur
    Karteek Alahari
    Winter Conference on Applications of Computer Vision (WACV) (2022) (to appear)
    Preview abstract Pre-training on large scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our `modality masking' pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets. View details
    Preview abstract Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks. A common approach for building multimodal models is to simply combine multiple of these modality-specific architectures using late-stage fusion of final representations or predictions (\textit{`late-fusion'}). Instead, we propose a new architecture that learns to model both unimodal and cross-modal information at earlier stages, without imposing any modality specific priors. We investigate two pathways for the exchange of cross-modal information, \textit{vertical attention} (by restricting crossmodal fusion to certain layers) and \textit{horizontal attention}, via the use of `fusion bottleneck' tokens, that encourage the model to extract and exchange relevant information between modalities in an efficient manner. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released. View details
    Preview abstract We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (eg, sampling frames with temporal shifts), which may lead to poor performance on downstream tasks which violate these invariances (fine-grained video action recognition that would benefit from temporal information). To overcome this limitation, we propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning. We show that representations learned by our method encode valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks. View details
    Preview abstract While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations. Unfortunately, a major challenge for incorporating visual context into conversational dialogue is the lack of large-scale labeled datasets. We provide a solution in the form of a new visually conditioned Future Utterance Prediction task. Our task involves predicting the next utterance in a video, using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone. Further, we demonstrate that our model trained for this task on unlabelled videos achieves state-of-the-art performance on a number of downstream VideoQA benchmarks such as MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. View details
    Recognizing Multimodal Entailment (tutorial at ACL 2021)
    Afsaneh Hajiamin Shirazi
    Blaž Bratanič
    Christina Liu
    Gabriel Fedrigo Barcik
    Georg Fritz Osang
    Jared Frank
    Lucas Smaira
    Ricardo Abasolo Marino
    Roma Patel
    Vaiva Imbrasaite
    (2021) (to appear)
    Preview abstract How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web. With ever-larger amounts of unstructured and limited labels, organizing and reconciling information from different sources and modalities is a central challenge in machine learning. This cutting-edge tutorial aims to introduce the multimodal entailment task, which can be useful for detecting semantic alignments when a single modality alone does not suffice for a whole content understanding. Starting with a brief overview of natural language processing, computer vision, structured data and neural graph learning, we lay the foundations for the multimodal sections to follow. We then discuss recent multimodal learning literature covering visual, audio and language streams, and explore case studies focusing on tasks which require fine-grained understanding of visual and linguistic semantics question answering, veracity and hatred classification. Finally, we introduce a new dataset for recognizing multimodal entailment, exploring it in a hands-on collaborative section. Overall, this tutorial gives an overview of multimodal learning, introduces a multimodal entailment dataset, and encourages future research in the topic. View details
    Preview abstract Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind. A major contributing factor has been the prohibitive cost of annotating videos frame-by-frame. In this paper, we present a spatio-temporal action recognition that is trained with only video-level labels, which are significantly easier to annotate, and can even be mined automatically (subject to some label noise). Our method leverages per-frame person detectors which have been trained on large image datasets within a Multiple Instance Learning framework. We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid using a novel probabilistic variant of MIL. Furthermore, we report the first weakly-supervised results on the AVA dataset and state-of-the-art results among weakly-supervised methods on UCF101-24. View details
    Preview abstract Can we guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie scripts describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a speech to action classifier on 1k movie scripts downloaded from IMSDb and show that such a classifier performs well for certain classes, and when applied to the speech segments of a large \textit{unlabelled} movie corpus (288k videos, 188M speech segments), provides weak labels for over 800k video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single labelled action example. View details
    No Results Found