Arsha Nagrani

Arsha Nagrani

I joined Google as a Research Scientist in 2020, after completing my PhD in Computer Vision at the University of Oxford, UK. My research is focused on multimodal machine learning techniques for video understanding, including using sound and text to learn better representations. Check out my homepage for more about me and my research.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix. View details
    UnLoc: a unified framework for video localization tasks
    Shen Yan
    Xuehan Xiong
    Anurag Arnab
    Zhonghao Wang
    Weina Ge
    International Conference on Computer Vision (2023)
    Preview abstract We adapt large-scale image-text pretrained models such as CLIP for temporal localization tasks in untrimmed videos, which is still a relatively unexplored task. We do so by designing a new approach called UnLoc, which uses a pretrained image and text tower, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables zero-shot Moment Retrieval, TAL and action segmentation with a single stage model, without the need for action proposals or representation masking. Unlike specialised models, we achieve state of the art results on three different localization tasks with a unified approach - in some cases outperforming previous works by large margins. View details
    Preview abstract Speech emotion recognition (SER) studies typically rely on costly motion-labeled speech for training, making scaling methods to to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER that enables one to use those unlabeled data by generating weak emotion labels via pre-trained large language models, which are then used for weakly-supervised learning. For weak label generation, we utilize a textual entailment approach that selects an emotion label with the highest entailment score, given a transcript extracted from speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech. View details
    Preview abstract YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing nstructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark. View details
    Preview abstract Recent video and language pretraining frameworks lack the ability to generate sentences, and are limited in transferring to generative tasks such as multimodal video captioning. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled instructional videos where the pretrained model is effectively transferred to video captioning tasks. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of the captions in the unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective -- we generate future utterances given the present mulitmodal context, and also the present utterance given future observations. We use this objective to train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for video captioning on four standard benchmarks, as well as on other video understanding tasks such as VideoQA, video retrieval and action classification. View details
    Learning Audio-Video Modalities from Image Captions
    Paul Hongsuck Seo
    Anja Hauth
    Santiago Manen
    European Conference on Computer Vision (2022)
    Preview abstract There has been a recent explosion of large-scale image-text datasets, as images with alt-text captions can be easily obtained online.Obtaining large-scale, high quality data for video in the form of text-video and text-audio pairs however, is more challenging. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformer based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state of the art results for the task of audio retrieval. View details
    Preview abstract This report describes the approach behind our submission to the 2022 Epic-Kitchens Action Recognition Challenge from team Google Research Grenoble. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M\&M) models varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1% higher than last year’s winning entry. View details
    AVATAR: Unconstrained Audiovisual Speech Recognition
    Valentin Gabeur
    Paul Hongsuck Seo
    Karteek Alahari
    Interspeech (2022)
    Preview abstract Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all other prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions. View details
    Masking Modalities for Cross-modal Video Retrieval
    Valentin Gabeur
    Karteek Alahari
    Winter Conference on Applications of Computer Vision (WACV) (2022) (to appear)
    Preview abstract Pre-training on large scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our `modality masking' pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets. View details
    Preview abstract We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (eg, sampling frames with temporal shifts), which may lead to poor performance on downstream tasks which violate these invariances (fine-grained video action recognition that would benefit from temporal information). To overcome this limitation, we propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning. We show that representations learned by our method encode valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks. View details