Mostafa Dehghani

I'm a Research Scientist at Google Brain, where I work on machine learning, in particular deep learning. My areas of interest include self-supervised learning, generative models, training giant models, and sequence modeling. Before Google, I did my PhD at the University of Amsterdam. My PhD research focused on improving the process of learning with imperfect supervision. I explored ideas around injecting inductive biases into algorithms, incorporating prior knowledge, and meta-learning the properties of the data using the data itself, in order to help learning algorithms learn better from noisy and/or limited data. You can learn more about me at mostafadehghani.com.

Authored Publications
    Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives – two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. Finally, we show that UL2 20B works well with chain-of-thought prompting and reasoning tasks, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. We publicly release Flax-based T5X model checkpoints for the 20B model.
    Dual PatchNorm
    Transactions on Machine Learning Research (2023) (to appear)
    We discover that simply placing two LayerNorms, one before and one after the patch embedding layer, leads to improvements over well-tuned ViT models. In particular, this outperforms an exhaustive search for alternative LayerNorm placement strategies in the transformer block itself.
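The placement described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`layer_norm`, `linear`, `dual_patchnorm_embed`), the toy patch values, and the projection weights are all hypothetical, and the LayerNorm here omits the learned scale and shift for brevity.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (near) unit variance.
    Simplification: no learned scale/shift parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight):
    """Project vector x (length d_in) with a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]

def dual_patchnorm_embed(patches, weight):
    """LayerNorm -> patch embedding -> LayerNorm, applied to each flattened patch."""
    return [layer_norm(linear(layer_norm(p), weight)) for p in patches]

# Two toy flattened patches (4 values each) and a hypothetical 4x3 projection.
patches = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.0, 20.0, 40.0]]
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.5, 0.0, 1.0]]
embedded = dual_patchnorm_embed(patches, weight)
```

In this sketch the first normalization removes per-patch differences in scale before the projection, and the second normalizes the embedded tokens before they would enter the transformer blocks.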
    DSI++: Updating Transformer Memory with New Documents
    Yi Tay
    Jinfeng Rao
    Emma Strubell
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
    Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by +21.1% over competitive baselines for NQ and requires 6 times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
    The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
    In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
    Retrieval Enhanced Machine Learning
    Hamed Zamani
    SIGIR 2022: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Perspectives Track)
    Information access systems have supported people during tasks across a variety of domains. In this perspective paper, we advocate for broadening the scope of information access research to include machines. We believe that machine learning can be substantially advanced by developing a research program around retrieval as a core algorithmic method. This paper describes how core principles of indexing, representation, retrieval, and relevance can extend supervised learning algorithms. It proposes a generic retrieval-enhanced machine learning (REML) framework and describes challenges in and opportunities introduced by implementing REML. We also discuss different optimization approaches for training REML models and review a number of case studies that are simplified and special implementations of the proposed framework. The research agenda introduced in this paper will smooth the path towards developing machine learning models with better scalability, sustainability, effectiveness, and interpretability.
    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
    Ashish Teku Vaswani
    Dani Yogatama
    Hyung Won Chung
    Jinfeng Rao
    Liam B. Fedus
    Samira Abnar
    Sharan Narang
    Yi Tay
    ICLR (2022)
    Kaplan et al. argue that the performance of a Transformer model strongly depends on the model size, but only weakly on the model shape. Our work empirically confirms their results for upstream training, but then reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g. depth and width). We find that widely adopted models including T5-base, T5-large and T5-XL/XXL (Raffel et al. 2019) are inefficient on a compute-performance Pareto curve. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster. We conclude by demonstrating that our improved scaling protocol also holds in other domains.
    Recent developments in large-scale machine learning have created a tempting picture suggesting that by scaling up data, model size and training time properly, one can obtain a model that can be used successfully in few-shot settings in all downstream tasks. In this work, we investigate this premise empirically and provide a strong case against it. In particular, we consider the image recognition task with large-scale models (Vision Transformers) trained on the largest scale of available data (JFT). We show that as we improve the performance of the upstream task, either by scaling up or through hyper-parameter and architectural choices, the performance of many downstream tasks eventually plateaus. We showcase an even more extreme scenario where upstream and downstream performance contradict each other, i.e., in order to have better downstream performance, we need to hurt upstream accuracy. We delve deeper into understanding the reasons that give rise to these phenomena by designing interventions and investigating different components of the models, which gives us crude yet useful insights into the mechanisms behind these observations.
    Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which encourages ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.
    Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub (https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
    Recent advances in Transformer-based large language models (LLMs) have achieved significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a method for dynamically allocating different amounts of compute per example and per generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse generation tasks, we demonstrate the efficacy of our method in reliably reducing compute while maintaining high performance.
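The per-token early-exit idea in the abstract above can be sketched as follows. This is an illustrative toy, not the CALM method itself: it assumes the top softmax probability as the confidence measure and a fixed threshold, and it takes precomputed per-layer logits as input, whereas a real decoder would skip computing the remaining layers after exiting. The names `softmax`, `early_exit_predict`, and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(per_layer_logits, threshold=0.9):
    """Return (token_id, layers_used).

    Walks the layers from shallow to deep and stops at the first layer whose
    top softmax probability reaches the threshold (assumed confidence measure).
    Falls back to the last layer if no layer is confident enough.
    """
    if not per_layer_logits:
        raise ValueError("need logits from at least one layer")
    best, depth = 0, 0
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            break
    return best, depth

# An "easy" token is confident after one layer; a "hard" token needs all three.
easy = [[5.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
hard = [[0.2, 0.1, 0.0], [1.0, 0.5, 0.0], [0.0, 4.0, 0.0]]
easy_result = early_exit_predict(easy)
hard_result = early_exit_predict(hard)
```

Raising the threshold trades compute for caution: more tokens run through deeper layers before a prediction is accepted.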
    In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that invertible flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate with a proof that this claim does not hold for integer discrete flows due to the embedding of data with finite support into the countably infinite integer lattice. Furthermore, we zoom in on the effect of gradient bias due to the straight-through estimator in integer discrete flows, and demonstrate that its influence is highly dependent on architecture choices and less prominent than previously thought. Finally, we show how different modifications to the architecture improve the performance of this model class for lossless compression.
    In the era of pretrained language models, transformers are the de facto choice of model architectures. While recent works have shown promise in entirely convolution-based architectures, these CNN-based models have not been widely adopted or evaluated under the pretrain-finetune paradigm. In the context of language models, are convolutional models competitive when pretrained? This paper investigates this research question and presents several interesting findings. Across a set of extensive experiments, our findings show that CNN-based pretrained models are highly competitive and outperform Transformer-based pretrained models in certain scenarios, albeit with caveats. Overall, the findings of this paper should implore the broader academic community to perhaps not conflate pretraining advances with architectural advances, and indicate that both sets of techniques could be studied in isolation.
    TokenLearner: Adaptive Space-Time Tokenization for Videos
    Michael Ryoo
    Anurag Arnab
    Conference on Neural Information Processing Systems (NeurIPS) (2021)
    In this paper, we present an approach for representation learning from videos. Instead of relying on hand-designed splitting strategies to obtain space-time tokens from videos, our approach learns to mine important tokens in video frames. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise interactions between such tokens over a longer temporal horizon. We introduce a vector transformer to capture such pairwise space-time relations, and a technique to fuse the transformed tokens while learning their spatio-temporal patterns. The proposed approach is designed with the intention to allow the tokenizer to adaptively react to input video frames containing diverse visual content, and then to have the vector transformer and subsequent modules learn the underlying spatio-temporal interactions and long-range dependencies in video inputs. We show the effectiveness of the proposed approach over challenging video classification datasets, outperforming the state-of-the-art, despite using much less compute. We further conduct extensive ablation experiments to study the method.
    Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable performance to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative performance amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.
    While the Transformer architecture has become the de facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision tasks, attention is usually either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks, while keeping their overall structure in place. We show that this reliance on ConvNets is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-10, etc.), these transformers attain excellent accuracy, matching or outperforming the best convolutional networks while requiring substantially less computational resources to train.
    This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based attention (Choromanski et al., 2020), low-rank attention (Wang et al., 2020) and/or Big Bird (Zaheer et al., 2020) as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet achieves considerable improvements not only when equipped with sequence-based (1D) Transformers but also on image recognition (finetuning and few-shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena.
    Having the right inductive biases can be crucial in many tasks or scenarios where data or computing resources are a limiting factor, or where training data is not perfectly representative of the conditions at test time. However, defining, designing and efficiently adapting inductive biases is not necessarily straightforward. In this paper, we explore the power of knowledge distillation for transferring the effect of inductive biases from one model to another. We consider families of models with different inductive biases, LSTMs vs. Transformers and CNNs vs. MLPs, in the context of tasks and scenarios where having the right inductive biases is critical. We study how the effect of inductive biases is transferred through knowledge distillation, in terms of not only performance but also different aspects of converged solutions.
    Weather forecasting is a long-standing scientific challenge with direct social and economic impact. The task is suitable for deep neural networks due to vast amounts of continuously collected data and a rich spatial and temporal structure that presents long-range dependencies. We introduce MetNet, a neural network that forecasts precipitation up to 8 hours into the future at the high spatial resolution of 1 km and at the temporal resolution of 2 minutes with a latency in the order of seconds. MetNet takes as input radar and satellite data and forecast lead time and produces a probabilistic precipitation map. The architecture uses axial self-attention to aggregate the global context from a large input patch corresponding to a million square kilometers. We evaluate the performance of MetNet at various precipitation thresholds and find that MetNet outperforms Numerical Weather Prediction at forecasts of up to 7 to 8 hours on the scale of the continental United States.
    Universal Transformers
    Stephan Gouws
    Jakob Uszkoreit
    Lukasz Kaiser
    ICLR (2019)
    Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.
    Fidelity-Weighted Learning
    Arash Mehrjou
    Stephan Gouws
    Jaap Kamps
    Bernhard Scholkopf
    ICLR (2018)
    Learning meaningful and useful task-dependent data representations requires many training instances, but training labels are expensive to obtain and may be of varying quality. This creates a fundamental quality-versus-quantity trade-off in the learning process. Do we learn from the small amount of high-quality data or the potentially large amount of weakly-labeled data (obtained from heuristics or crowd-sourcing, etc.)? We argue that if we could somehow know and take the label-quality into account when learning the data representation, we could get the best of both worlds. To this end, we propose "fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis according to the posterior confidence of the label-quality estimated by a teacher. Both student and teacher are learned from the data. We evaluate FWL on two real-world tasks in information retrieval and natural language processing where we outperform state-of-the-art alternative semi-supervised methods, indicating that our approach makes better use of the label information and results in better task-dependent data representations.
    In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network (the learner) and a confidence network (the meta-learner). The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model.
    The use of weak or noisy signals, such as the output of heuristic methods or user click-through data, for training deep neural networks is increasing, in particular for tasks where an adequate amount of data with true labels is not available. In a semi-supervised setting, we can use a large set of data with weak labels to pretrain a neural network and fine-tune the parameters with a small amount of data with true labels. However, these two independent stages do not leverage the full capacity of clean information from true labels during pretraining. In this paper, we propose a semi-supervised learning method where we train two neural networks in a multi-task fashion: a target network and a confidence network. The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to weight the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model. We evaluate our learning strategy on two different tasks: document ranking and sentiment classification. The results demonstrate that our approach not only enhances the performance compared to the baselines but also speeds up the learning process from weak labels.
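The gradient-weighting idea described in the abstract above can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not the paper's model: the target "network" is a one-parameter-pair linear regressor, and the confidence scores are supplied as fixed numbers in [0, 1] rather than produced by a trained confidence network. The function name `weighted_sgd_step` and the example data are hypothetical.

```python
def weighted_sgd_step(w, b, batch, confidences, lr=0.1):
    """One SGD step for y ~ w * x + b on squared error (0.5 * err**2),
    with each sample's gradient scaled by its confidence score.

    batch: list of (x, weak_label) pairs.
    confidences: one score in [0, 1] per sample (assumed given here).
    """
    grad_w = 0.0
    grad_b = 0.0
    for (x, weak_y), c in zip(batch, confidences):
        err = (w * x + b) - weak_y
        grad_w += c * err * x   # confidence scales the per-sample gradient
        grad_b += c * err
    n = len(batch)
    return w - lr * grad_w / n, b - lr * grad_b / n

# The second weak label is an outlier; a confidence of 0 removes its influence.
batch = [(1.0, 2.0), (1.0, -100.0)]
w_new, b_new = weighted_sgd_step(0.0, 0.0, batch, confidences=[1.0, 0.0])
```

With all confidences set to 1.0 this reduces to ordinary mini-batch SGD, which makes the effect of the weighting easy to compare.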
    Neural Ranking Models with Weak Supervision
    Hamed Zamani
    Jaap Kamps
    W. Bruce Croft
    Proceedings of The 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2017)
    Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision, natural language processing, and speech recognition tasks, such improvements have not generally been observed in ranking for information retrieval. The reason might be related to the complexity of the ranking problem, in the sense that it is not obvious how to learn from queries and documents when no supervised signal is available. Hence, in this paper, we propose to train a neural ranking model from a weak supervision signal, which is a training signal that can be obtained automatically without human labeling or any external resources (e.g., click data). To this aim, we use the output of a known unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and using different input representations (i.e., from encoding query-document pairs into dense/sparse vectors to using word embedding representation). We train our network on 5 million unique queries obtained from the publicly available AOL query logs and two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that feeding raw data to the networks and letting them learn representations for the input data leads to an impressive performance, with over 13% and 35% MAP improvements compared to the BM25 model on the Robust and the ClueWeb collections, respectively. Our findings suggest that neural ranking models can greatly benefit from large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
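As a rough illustration of how an unsupervised ranker can produce weak labels, the sketch below scores documents against a query with a common BM25 variant. The details are assumptions rather than the paper's setup: whitespace tokenization, the parameter values k1 = 1.2 and b = 0.75, the "+1 inside the log" IDF form, and the toy documents are all chosen here for illustration, and `bm25_scores` is a hypothetical helper name.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a standard BM25 variant.

    Uses lowercase whitespace tokenization and the non-negative IDF form
    log((N - df + 0.5) / (df + 0.5) + 1).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "neural ranking models for retrieval",
    "weather forecast for tomorrow",
    "ranking documents with neural networks",
]
# The scores can serve as weak relevance labels for (query, document) pairs.
weak_labels = bm25_scores("neural ranking", docs)
```

A neural ranker could then be trained on many such automatically labeled pairs, in point-wise form (regress the score) or pair-wise form (prefer the higher-scored document).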
    Users try to articulate their complex information needs during search sessions by reformulating their queries. In order to make this process more effective, search engines provide related queries to help users specify the information need in their search process. In this paper, we propose a customized sequence-to-sequence model for session-based query suggestion. In our model, we employ a query-aware attention mechanism to capture the structure of the session context. This enables us to control the scope of the session from which we infer the suggested next query, which helps not only handle the noisy data but also automatically detect session boundaries. Furthermore, we observe that based on user query reformulation behavior, a large portion of terms of a query in a session is retained from the previously submitted queries in the same session and consists of mostly infrequent or unseen terms that are usually not included in the vocabulary. We therefore empower the decoder of our model to access the source words from the session context during decoding by incorporating a copy mechanism. Moreover, we propose evaluation metrics to assess the quality of the generative models for query suggestion. We conduct an extensive set of experiments and analysis. The results suggest that our model outperforms the baselines in terms of both generating queries and scoring candidate queries for the task of query suggestion.