Donald Metzler

Donald Metzler is a Senior Staff Research Scientist at Google Inc. Prior to that, he was a Research Assistant Professor at the University of Southern California (USC) and a Senior Research Scientist at Yahoo!. He has served as the Program Chair of the WSDM, ICTIR, and OAIR conferences and sat on the editorial boards of all the major journals in his field. He has published over 100 research papers, has been awarded 9 patents, and is a co-author of "Search Engines: Information Retrieval in Practice". He currently leads a research group focused on a variety of problems at the intersection of machine learning, natural language processing, and information retrieval.
Authored Publications
    Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing between factual quoted spans---copied verbatim from given input sources---and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. In particular, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities.
    DSI++: Updating Transformer Memory with New Documents
    Yi Tay
    Jinfeng Rao
    Emma Strubell
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
    Preview abstract Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by +21.1% over competitive baselines for NQ and requires 6 times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence. View details
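The Hits@10 figure quoted in the DSI++ abstract is a top-k retrieval metric. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code. The function name `hits_at_k`, the docid strings, and the assumption of exactly one relevant document per query are all illustrative.

```python
def hits_at_k(ranked_docids_per_query, relevant_docid_per_query, k=10):
    """Fraction of queries whose relevant docid appears in the top-k ranking.

    Assumption (not from the source): each query has exactly one relevant docid.
    """
    if len(ranked_docids_per_query) != len(relevant_docid_per_query):
        raise ValueError("need one relevant docid per ranked list")
    if not ranked_docids_per_query:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_docids_per_query, relevant_docid_per_query)
        if relevant in ranked[:k]  # only the first k predictions count
    )
    return hits / len(ranked_docids_per_query)


# Hypothetical rankings for two queries.
rankings = [["d3", "d1", "d2"], ["d5", "d4", "d6"]]
relevant = ["d1", "d6"]
score_at_2 = hits_at_k(rankings, relevant, k=2)
score_at_3 = hits_at_k(rankings, relevant, k=3)
```

With these toy rankings, the second query's relevant document sits at rank 3, so it counts as a hit only once k reaches 3.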
    Preview abstract Recent work has shown that Large Language Models (LLMs) can effectively re-rank the outputs of BM25 retrieval. This is achieved zero-shot by including task-specific instructions. However, for tasks that require scoring instead of generation, few-shot prompting remains underexplored. In this work, we improve LLM-based re-ranking performance by including demonstrations in the prompt. We show that adding even a single demonstration makes a significant impact. Our detailed analysis investigates under which conditions demonstrations are the most helpful. We propose a novel difficulty-based demonstration selection strategy instead of using the commonly used approach of semantic similarity. Furthermore, we show that demonstrations helpful for ranking are also effective at question generation. We hope our research will facilitate further studies into both question generation and passage re-ranking. View details
    Preview abstract Popularized by the Differentiable Search Index, the emerging paradigm of Generative Retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus into the parameters of a single transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task consisting of 8.8M passages. After ablating for the most promising techniques, we then consider model scales up to 11B parameters. Along the way, we uncover several findings about scaling generative retrieval to millions of passages. Notably, the use of synthetic query generation as document representation is the only modeling technique critical to retrieval effectiveness. In addition, we find that the strongest performing architecture modifications from the literature at T5-Base initialization only perform well due to added parameters. Naively scaling to a comparable model size outperforms these proposed techniques. Finally, while model scale is necessary as corpus size increases, we find that given existing techniques, scaling model parameters past a certain point can be detrimental for retrieval effectiveness. This result might be counter-intuitive to the commonly held belief that model capacity is a limiting factor for scaling generative retrieval to larger corpora, and suggests the need for more fundamental improvements. In general, we believe that these findings will be highly valuable for the community to clarify the state of generative retrieval at scale and highlight the challenges currently facing the paradigm. View details
    Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives – two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. Finally, we show that UL2 20B works well with chain-of-thought prompting and reasoning tasks, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. We publicly release Flax-based T5X model checkpoints for the 20B model.
    Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to a quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be seen as a sequence of related segments (e.g., the sequence of sentences within a passage, or the hypothesis and premise in NLI). While attending across these segments is highly beneficial for many tasks, we hypothesize that this interaction can be delayed until later encoding stages. To this end, we introduce Layer-adjustable Interactions in Transformers (LAIT). Within LAIT, segmented inputs are first encoded independently, and then jointly. This partial two-tower architecture bridges the gap between a Dual Encoder's ability to pre-compute representations for segments and a fully self-attentive Transformer's capacity to model cross-segment attention. Also, LAIT can be introduced only when finetuning, effectively converting an existing pretrained Transformer into the hybrid of the two aforementioned architectures, and providing an intuitive control over the performance-efficiency tradeoff. Experimenting on a wide range of NLP tasks, we find LAIT to significantly improve efficiency while preserving accuracy.
    Emergent abilities of large language models
    Barret Zoph
    Colin Raffel
    Dani Yogatama
    Jason Wei
    Liam B. Fedus
    Maarten Paul Bosma
    Percy Liang
    Sebastian Borgeaud
    Tatsunori B. Hashimoto
    Yi Tay
    TMLR (2022)
    Preview abstract Scaling up language models has been shown to predictably confer a range of benefits such as improved performance and sample efficiency. This paper discusses an unpredictable phenomenon that we call emergent abilities of large language models. Such emergent abilities have close to random performance until evaluated on a model of sufficiently large scale, and hence their emergence cannot be predicted by extrapolating a scaling law based on small-scale models. The emergence of such abilities suggests that additional scaling could further expand the range of tasks that language models can perform. We discuss the implications of these phenomena and suggest directions for future research. View details
    Prompt-tuning is becoming a new paradigm for finetuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate prompts. We propose a novel architecture of HyperPrompt: prompt-based task-conditioned parameterization of self-attention in Transformers. We show that HyperPrompt is very competitive against strong multi-task learning baselines with only 1% of additional task-conditioning parameters. The prompts are end-to-end learnable via generation by a HyperNetwork. The additional parameters scale sub-linearly with the number of downstream tasks, which makes it very parameter efficient for multi-task learning. HyperPrompt allows the network to learn task-specific feature maps where the prompts serve as task global memories. Information sharing is enabled among tasks through the HyperNetwork to alleviate task conflicts during co-training. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient adapter variants including Prompt-Tuning on the Natural Language Understanding benchmarks GLUE and SuperGLUE across all the model sizes explored.
    Preview abstract We argue that current IR metrics, modeled on optimizing user experience, measure too narrow a portion of the IR space. If IR systems are weak, these metrics undersample or completely filter out the deeper documents that need improvement. If IR systems are relatively strong, these metrics undersample deeper relevant documents that could underpin even stronger IR systems, ones that could present content from tens or hundreds of relevant documents in a user-digestible hierarchy or text summary. We reanalyze over 70 TREC tracks from the past 28 years, showing that roughly half undersample top ranked documents and nearly all undersample tail documents. We show that in the 2020 Deep Learning tracks, neural systems were actually near-optimal at top-ranked documents, compared to only modest gains over BM25 on tail documents. Our analysis is based on a simple new systems-oriented metric, ’atomized search length’, which is capable of accurately and evenly measuring all relevant documents at any depth. View details
    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
    Ashish Teku Vaswani
    Dani Yogatama
    Hyung Won Chung
    Jinfeng Rao
    Liam B. Fedus
    Samira Abnar
    Sharan Narang
    Yi Tay
    ICLR (2022)
    Kaplan et al. argue that the performance of a Transformer model strongly depends on the model size, but only weakly on the model shape. Our work empirically confirms their results for upstream training, but then reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g. depth and width). We find that widely adopted models including T5-base, T5-large and T5-XL/XXL (Raffel et al. 2019) are inefficient on a compute-performance Pareto curve. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster. We conclude by demonstrating that our improved scaling protocol also holds in other domains.
    Preview abstract In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup. View details
    Preview abstract Recent advances in Transformer-based large language models (LLMs) achieved significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a method for dynamically allocating different amounts of compute per example and per generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse generation tasks, we demonstrate the efficacy of our method in reliably reducing compute while maintaining high performance. View details
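The per-token early-exit decision described in the CALM abstract can be illustrated with a small sketch. This is an illustration under stated assumptions, not the method from the paper: the confidence measure (gap between the two largest softmax probabilities), the fixed `threshold`, and the names `softmax` and `early_exit_layer` are all hypothetical choices made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_layer(per_layer_logits, threshold):
    """Return (layer index, confidence) of the first layer that is confident enough.

    per_layer_logits: one logit vector per decoder layer for a single token
    (assumed to have at least two vocabulary entries). Falls back to the last
    layer when no intermediate layer reaches the threshold.
    """
    if not per_layer_logits:
        raise ValueError("need logits from at least one layer")
    confidence = 0.0
    for layer, logits in enumerate(per_layer_logits):
        probs = sorted(softmax(logits), reverse=True)
        confidence = probs[0] - probs[1]  # assumed confidence measure
        if confidence >= threshold:
            return layer, confidence
    return len(per_layer_logits) - 1, confidence


# Hypothetical logits that grow more peaked with depth.
toy_logits = [[0.1, 0.0, 0.0], [2.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
easy_exit, _ = early_exit_layer(toy_logits, threshold=0.5)
hard_exit, _ = early_exit_layer(toy_logits, threshold=0.99)
```

A lower threshold lets the toy token exit at an intermediate layer, while a stricter one pushes it through the full stack. The sketch does not address the sequence-level constraints or missing hidden states that the abstract lists as challenges.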
    Preview abstract Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs. While early work identified certain biases in NLI models, recent advancements in modeling and datasets demonstrated promising performance. In this work, we further explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on. First, we analyze the robustness of these models to longer and out-of-domain inputs. Then, we develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset. Interestingly, we find NLI scores to provide strong retrieval signals, leading to more relevant evidence extractions compared to common similarity-based methods. Finally, we go further and investigate whole document clusters to identify both discrepancies and consensus among sources. In a test case, we find real inconsistencies between Wikipedia pages in different languages about the same topic. View details
    Stochastic Retrieval-Conditioned Reranking
    Hamed Zamani
    The ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR) 2022
    The multi-stage cascaded architecture has been adopted by many search engines for efficient and effective retrieval. This architecture consists of a stack of retrieval and reranking models in which efficient retrieval models are followed by effective (neural) learning-to-rank models. The optimization of these learning-to-rank models is only loosely connected to the early-stage retrieval models; in many cases they are trained in isolation from them. This paper draws theoretical connections between the early-stage retrieval and late-stage reranking models by deriving expected reranking performance conditioned on the early-stage retrieval results. Our findings shed light on the optimization of both retrieval and reranking models. As a result, we also introduce a novel loss function for training reranking models that leads to significant improvement on multiple public benchmarks.
    A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
    Alyssa Whitlock Lees
    Yi Tay
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)
    On the world wide web, toxic content detectors are a crucial line of defense against potentially hateful and offensive messages. As such, building highly effective classifiers that enable a safer internet is an important research area. Moreover, the web is a highly multilingual, cross-cultural community that develops its own lingo over time. As such, developing models that can be effective across a diverse range of languages, usages, and styles is crucial. In this paper, we present Jigsaw Perspective API’s new generation of toxic content classifiers, which take a step towards this unified vision. At the heart of the approach is a single multilingual token-free Charformer model that is applicable across languages, domains, and tasks. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings. We additionally outline the techniques employed to make such a byte-level model efficient and feasible for productionization. Through extensive experiments on multilingual toxic comment classification benchmarks derived from real API traffic and evaluation on an array of code-switching, covert toxicity, emoji-based hate, human-readable obfuscation, distribution shift, and bias evaluation settings, we show that our proposed approach outperforms strong baselines. Finally, we present our findings from deploying this system in production, and discuss our observed benefits over traditional approaches.
    State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws, i.e., running the model on all query-document pairs at inference-time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model in the form of document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results that are comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models.
    Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
    Preview abstract Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data. However, such methods are domain-specific and little has been done to leverage this technique on real-world tabular datasets. We propose SCARF, a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69 real-world, tabular classification datasets from the OpenML-CC18 benchmark, SCARF not only improves classification accuracy in the fully-supervised setting but does so also in the presence of label noise and in the semi-supervised setting where only a fraction of the available training data is labeled. We show that SCARF complements existing strategies and outperforms alternatives like autoencoders. We conduct comprehensive ablations, detailing the importance of a range of factors. View details
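The view construction mentioned in the SCARF abstract ("corrupting a random subset of features") can be sketched as follows. This is a minimal illustration, not the authors' implementation: replacing each corrupted feature with the value from a randomly chosen row of the dataset is an assumption made here, as are the name `make_corrupted_view`, the `corruption_rate` default, and the toy data.

```python
import random


def make_corrupted_view(row, dataset, corruption_rate=0.6, rng=None):
    """Build one corrupted view of `row` for contrastive pre-training.

    A random subset of feature positions is selected, and each selected
    feature is overwritten with the same feature taken from a randomly
    chosen row of `dataset` (assumed corruption scheme).
    Returns the view and the set of corrupted positions.
    """
    rng = rng or random.Random()
    n_features = len(row)
    n_corrupt = int(corruption_rate * n_features)
    corrupted = set(rng.sample(range(n_features), n_corrupt))
    view = list(row)  # copy so the original row is untouched
    for j in corrupted:
        donor = rng.choice(dataset)
        view[j] = donor[j]
    return view, corrupted


# Hypothetical tabular data: three rows, four features.
data = [[1, 10, 100, 1000], [2, 20, 200, 2000], [3, 30, 300, 3000]]
view, corrupted = make_corrupted_view(data[0], data, corruption_rate=0.5,
                                      rng=random.Random(0))
```

In a contrastive setup, the original row and its corrupted view would form a positive pair, while other rows in the batch act as negatives; the encoder and loss are outside the scope of this sketch.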
    Preview abstract State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end. View details
    Search and Discovery in Personal Email Collections (Tutorial Proposal)
    Proceedings of the 15th ACM International Conference on Web Search and Data Mining (2022), 1617–1619
    Preview abstract Email has been an essential communication medium for many years. As a result, the information accumulated in our mailboxes has become valuable for all of our personal and professional activities. For years, researchers have developed interfaces, models, and algorithms to facilitate email search, discovery, and organization. This tutorial brings together these diverse research directions and provides both a historical background, as well as a high-level overview of the recent advances in the field. In particular, we lay out all of the components needed in the design of email search engines, including user interfaces, indexing, document and query understanding, retrieval, ranking, evaluation, and data privacy. The tutorial also goes beyond search, presenting recent work on intelligent task assistance in email and a number of interesting future directions. View details
    Preview abstract Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training. View details
    Retrieval Enhanced Machine Learning
    Hamed Zamani
    SIGIR 2022: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Perspectives Track)
    Preview abstract Information access systems have supported people during tasks across a variety of domains. In this perspective paper, we advocate for broadening the scope of information access research to include machines. We believe that machine learning can be substantially advanced by developing a research program around retrieval as a core algorithmic method. This paper describes how core principles of indexing, representation, retrieval, and relevance can extend supervised learning algorithms. It proposes a generic retrieval-enhanced machine learning (REML) framework and describes challenges in and opportunities introduced by implementing REML. We also discuss different optimization approaches for training REML models and review a number of case studies that are simplified and special implementations of the proposed framework. The research agenda introduced in this paper will smooth the path towards developing machine learning models with better scalability, sustainability, effectiveness, and interpretability. View details
    Covid Vaccine Search Classification with Pretrained Transformers and Dense Feature Memory
    Yi Tay
    Chaitanya Kamath
    Shailesh Bavadekar
    Evgeniy Gabrilovich
    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022)
    With the devastating outbreak of COVID-19, vaccines are one of the crucial lines of defense against mass infection in this global pandemic. Given the protection they provide, vaccines are becoming mandatory in certain social and professional settings. This paper presents a classification model for detecting COVID-19 vaccination related search queries, a machine learning model that is used to generate search insights for COVID-19 vaccinations. The proposed method combines and leverages advancements from modern state-of-the-art (SOTA) natural language understanding (NLU) techniques such as pretrained Transformers with traditional dense features. We propose a novel approach of considering dense features as memory tokens that the model can attend to. We show that this new modeling approach enables a significant improvement on the Vaccine Search Insights (VSI) task, improving over a strong, well-established gradient-boosting baseline by a relative +15% in F1 score and +14% in precision.
    In the era of pretrained language models, Transformers are the de facto choice of model architecture. While recent works have shown promise in entirely convolution-based architectures, these CNN-based models have not been widely adopted or evaluated under the pretrain-finetune paradigm. In the context of language models, are convolutional models competitive when pretrained? This paper investigates this research question and presents several interesting findings. Across a set of extensive experiments, our findings show that CNN-based pretrained models are highly competitive and outperform Transformer-based pretrained models in certain scenarios, albeit with caveats. Overall, the findings of this paper suggest that the broader academic community should perhaps not conflate pretraining advances with architectural advances, and that both sets of techniques could be studied in isolation.
    Large generative language models such as GPT-2 are well-known not only for their ability to generate highly realistic text but also for their utility in common downstream tasks. However, how and in what settings one can best leverage these powerful language models is still a nascent research question. In this work, we explore their use in predicting "language quality", a notion of coherence and understandability of text. Our key finding is that, when trained in a self-discriminating fashion, large language models emerge as unsupervised predictors for such language quality. This enables fast bootstrapping of quality indicators in a low-resource setting. We conduct extensive qualitative and quantitative analysis over 500 million web articles, the largest-scale study conducted on this topic.
    Natural language grammar has two major classes: dependency grammar, which models one-to-one correspondences between words, and constituency grammar, which models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on inducing one class of grammars, we introduce a novel model, StructFormer, that can induce dependency and constituency structure at the same time. In order to achieve this, we propose a new self-attention mechanism with novel hierarchical and dependency constraints. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling.
    Preview abstract When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead. Classical information retrieval systems do not answer information needs directly, but instead provide references to (hopefully authoritative) answers. Successful question answering systems offer a limited corpus created on-demand by human experts, which is neither timely nor scalable. Pre-trained language models, by contrast, are capable of directly generating prose that may be responsive to an information need, but at present they are dilettantes rather than domain experts -- they do not have a true understanding of the world, they are prone to hallucinating, and crucially they are incapable of justifying their utterances by referring to supporting documents in the corpus they were trained over. This paper examines how ideas from classical information retrieval and pre-trained language models can be synthesized and evolved into systems that truly deliver on the promise of domain expert advice. View details
    Preview abstract Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable performance to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative performance amongst many models. This paper proposes a systematic and unified benchmark, Long Range Arena (LRA), specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. View details
    Preview abstract Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections, which helps to specialize regions in weight matrices for different tasks. In order to construct the proposed hyper projection, our method learns the interactions and composition between a global state and a local task-specific state. We apply our proposed HyperGrid to the current state-of-the-art T5 model, yielding strong gains across the GLUE and SuperGLUE benchmarks when trained in a single-model multi-tasking setup. Our method helps to bridge the gap between single-task fine-tuning methods and single-model multi-tasking approaches. View details
    How Reliable are Model Diagnostics?
    Vamsi Aribandi
    Yi Tay
    ACL Findings 2021
    Preview abstract In the pursuit of a deeper understanding of a model's behaviour, there is recent impetus for developing suites of probes aimed at diagnosing models beyond simple metrics like accuracy or BLEU. This paper takes a step back and asks an important and timely question: how reliable are these diagnostics in providing insight into models and training setups? We critically examine three recent diagnostic tests for pre-trained language models, and find that likelihood-based and representation-based model diagnostics are not yet as reliable as previously assumed. Based on our empirical findings, we also formulate recommendations for practitioners and researchers. View details
    Preview abstract This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based (Choromanski et al., 2020), low-rank attention (Wang et al., 2020) and/or Big Bird (Zaheer et al., 2020) as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet achieves considerable improvements not only when equipped with sequence-based (1D) Transformers but also on image recognition (finetuning and few-shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena. View details
    Preview abstract We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our Sinkhorn Attention remains competitive with vanilla attention, consistently outperforming recently proposed efficient Transformer models such as Sparse Transformers, while retaining memory efficiency. View details
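The "latent permutations" mentioned in the abstract above are typically obtained by relaxing a permutation matrix to a doubly stochastic one. The following is a minimal sketch of plain Sinkhorn normalization (alternating row and column normalization), not the paper's Causal Sinkhorn Balancing or SortCut; the function name `sinkhorn_normalize` and the example scores are hypothetical.

```python
import math

def sinkhorn_normalize(scores, n_iters=50):
    """Turn a square score matrix into an approximately doubly stochastic one.

    Illustrative only: alternates row and column normalization of exp(scores).
    """
    n = len(scores)
    # Exponentiate so every entry is positive; subtract the max for stability.
    top = max(max(row) for row in scores)
    m = [[math.exp(v - top) for v in row] for row in scores]
    for _ in range(n_iters):
        # Make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

# Hypothetical block-to-block sorting scores.
example_scores = [[1.0, 0.2, -0.5], [0.3, 2.0, 0.1], [-1.0, 0.4, 1.5]]
soft_permutation = sinkhorn_normalize(example_scores)
```

Each row of `soft_permutation` can be read as a soft assignment of one source block to target positions; a hard permutation would have exactly one 1 per row and column.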
    Stabilizing Neural Search Ranking Models
    Ruilin Li
    Suming Jeremiah Chen
    The Web Conference 2020 (WWW)
    Preview abstract Neural search ranking models have been not only actively studied in academic research, but also widely adopted in real-world industrial applications. However, due to the high non-convexity and stochastic nature of neural model formulations, the obtained models are unstable in the sense that model predictions can vary a lot for two models trained with the same configuration. In practice, new features are continuously introduced and new model architectures are explored to improve model effectiveness. In these cases, the instability of neural models leads to unnecessary document ranking changes for a large portion of queries. Such changes not only lead to inconsistent user experience, but also add noise to online experimentation and can slow down model improvement cycles. How to stabilize neural search ranking models during model update is an important but largely unexplored problem. Motivated by trigger analysis, we suggest balancing the trade-off between performance improvement and the number of affected queries. Concretely, we formulate it as an optimization problem whose objective is to maximize the average effect over the affected queries. We propose two heuristic methods and one theory-guided stabilization method to solve the optimization problem. Our proposed methods are evaluated on two of the world's largest personal search services: Gmail search and Google Drive search. Empirical results show that our proposed methods are very effective in optimizing the proposed objective and are applicable to different model update scenarios. View details
    Separate And Attend in Personal Email Search
    Yu Meng
    Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM) (2020)
    Preview abstract In personal email search, user queries often impose different requirements on different aspects of the retrieved emails. For example, the query "my recent flight to the US" requires emails to be ranked based on both textual contents and recency of the email documents, while other queries such as "medical history" do not impose any constraints on the recency of the email. Recent deep learning-to-rank models for personal email search often directly concatenate dense numerical features with embedded sparse features (e.g., n-gram embeddings). In this paper, we first show with a set of experiments on synthetic datasets that direct concatenation of dense and sparse features does not lead to the optimal search performance of deep neural ranking models. To effectively incorporate both sparse and dense email features into personal email search ranking, we propose a novel neural model, sepattn. sepattn first builds two separate neural models to learn from sparse and dense features respectively, and then applies an attention mechanism at the prediction level to derive the final prediction from these two models. We conduct a comprehensive set of experiments on a large-scale email search dataset, and demonstrate that our sepattn model consistently improves the search quality over the baseline models. View details
    Parameter Tuning in Personal Search Systems
    Suming Jeremiah Chen
    13th ACM International Conference on Web Search and Data Mining (WSDM) (2020)
    Preview abstract Retrieval effectiveness in information retrieval systems is heavily dependent on how various parameters are tuned. One option to find these parameters is to run multiple online experiments and use a parameter sweep approach in order to optimize the search system. There are multiple downsides to this approach, mainly that it may lead to a poor experience for users. Another option is to do offline evaluation, which can act as a safeguard against potential quality issues. Offline evaluation requires a validation set of data that can be benchmarked against different parameter settings. However, for search over personal corpora, e.g. email and file search, it is impractical and often impossible to get a complete representative validation set, due to the inability to save raw queries and document information. In this work, we show how to do offline parameter tuning with only a partial validation set. In addition, we demonstrate how to do parameter tuning in the cases when we have complete knowledge of the internal implementation of the search system (white-box tuning), as well as the case where we have only partial knowledge (grey-box tuning). This has allowed us to do offline parameter tuning in a privacy-sensitive manner. View details
    Attribute-based Propensity for Unbiased Learning in Recommender Systems: Algorithm and Case Studies
    Suming Jeremiah Chen
    Yongwoo Noh
    Jingzheng Qin
    26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2020)
    Preview abstract Many modern recommender systems train their models based on a large amount of implicit user feedback data. Due to the inherent bias in this data (e.g., position bias), learning from it directly can lead to suboptimal models. Recently, unbiased learning was proposed to address such problems by leveraging counterfactual techniques like inverse propensity weighting (IPW). In these methods, propensity score estimation is usually limited to an item's display position in a single user interface (UI). In this paper, we generalize the traditional position bias model to an attribute-based propensity framework. Our methods estimate propensity scores based on offline data and allow propensity estimation across a broad range of implicit feedback scenarios, e.g., feedback beyond the recommender system UI. We demonstrate this by applying this framework to three real-world large-scale recommender systems in Google Drive that serve millions of users. For each system, we conduct both offline and online evaluation. Our results show that the proposed framework is able to significantly improve upon strong production baselines across a diverse range of recommendation item types (documents, people-document pairs, and queries), UI layouts (horizontal, vertical, and grid layouts), and underlying learning algorithms (gradient boosted decision trees and neural networks), all without the need to intervene and degrade the user experience. The proposed models have been deployed in the production systems with ease since no serving infrastructure change is needed. View details
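Inverse propensity weighting, as named in the abstract above, reweights each observed click by the inverse of the estimated probability that the item was examined. The sketch below illustrates the weighting step only, with propensities looked up by an attribute key; the function name `ipw_weights`, the `(layout, position)` attribute keys, the propensity values, and the clipping floor are all hypothetical and not taken from the paper.

```python
def ipw_weights(clicks, attributes, propensity_by_attribute, floor=0.05):
    """Return one training weight per logged impression.

    Clicked impressions get weight 1 / propensity (clipped below at `floor`
    to limit variance); non-clicked impressions get weight 0.
    """
    weights = []
    for clicked, attribute in zip(clicks, attributes):
        propensity = propensity_by_attribute[attribute]
        if not 0.0 < propensity <= 1.0:
            raise ValueError(f"propensity out of range for {attribute!r}")
        weights.append(1.0 / max(propensity, floor) if clicked else 0.0)
    return weights

# Hypothetical propensities keyed by (UI layout, display position).
propensities = {
    ("grid", 0): 1.0,
    ("grid", 1): 0.5,
    ("vertical", 3): 0.25,
}
clicks = [1, 0, 1, 1]
attributes = [("grid", 0), ("grid", 1), ("grid", 1), ("vertical", 3)]
weights = ipw_weights(clicks, attributes, propensities)
# weights == [1.0, 0.0, 2.0, 4.0]
```

Clicks on rarely examined slots receive larger weights, which is what compensates for the bias in the logged feedback.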
    Preview abstract Work in information retrieval has traditionally been focused on ranking and relevance: for a user's query, fetch some number of results, ordered by relevance to the user. However, the problem of determining how many results to return, i.e. how to optimally truncate the ranked result list, has received far less attention despite being of critical importance in a range of applications. Such truncation is a balancing act between the overall relevance, or usefulness, of the results and the user cost of processing more results. In this work, we propose Choppy, an assumption-free model for the ranked-list truncation problem based on the widely successful Transformer architecture in NLP. Needing nothing more than the relevance scores of the results, the model uses a powerful multi-head attention mechanism to directly optimize any user-defined target IR metric. We show Choppy improves upon recent, state-of-the-art baselines on Robust04. View details
    Preview abstract Recent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning to rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation. We study how to effectively and efficiently combine textual information from queries and documents with other useful but less prominent side information for learning to rank. We conduct synthetic experiments to show that: 1) neural networks are inefficient at learning the interaction between two prominent features (e.g., query and document embedding features) in the presence of other less prominent features; 2) direct application of a state-of-the-art method for higher-order feature generation is also inefficient at learning such important interactions. Based on the above observations, we propose a simple but effective matching cross network (MCN) method for learning to rank with side information. MCN conducts an element-wise multiplication matching of query and document embeddings and leverages a technique called latent cross to effectively learn the interaction between matching output and all side information. The approach is easy to implement, adds minimal parameters and latency overhead to standard neural ranking architectures, and can be used for efficient end-to-end training. We conduct extensive experiments using two of the world's largest personal search engines, Gmail and Google Drive search, and show that each proposed component adds meaningful gains against a strong production baseline with minimal latency overhead, thereby demonstrating the practical effectiveness and efficiency of the proposed approach. View details
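To make the truncation trade-off in the abstract above concrete, the sketch below computes the oracle cutoff that maximizes F1 when the binary relevance labels of a ranked list are known. This is not the Choppy model, which predicts a cutoff from relevance scores alone; the function name `best_f1_cutoff` and the example labels are hypothetical.

```python
def best_f1_cutoff(relevance):
    """Return (k, f1): the cutoff k that maximizes F1 when keeping the top k.

    `relevance` is a list of 0/1 labels in ranked order. Returns (0, 0.0)
    when the list contains no relevant result.
    """
    total_relevant = sum(relevance)
    best_k, best_f1 = 0, 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if hits == 0:
            continue  # F1 is zero until the first relevant result is kept.
        precision = hits / k
        recall = hits / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

# Hypothetical ranked list: relevant results at ranks 1, 2 and 4.
cutoff, score = best_f1_cutoff([1, 1, 0, 1, 0, 0])
# cutoff == 4, score == 6/7
```

Cutting after rank 2 gives perfect precision but misses a relevant result, while keeping all six results halves precision; the best cutoff balances the two.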
    Multitask Mixture of Sequential Experts for User Activity Streams
    Yicheng Cheng
    Jingzheng Qin
    26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2020)
    Preview abstract Multi-task deep learning has been an actively researched topic, and it has been used in many real-world systems for user activities and content recommendation. While most of the multi-task model architectures proposed to date focus on using non-sequential input features (e.g., query and context), input data is often sequential in real-world web application scenarios. For example, user behavior streams, such as user search logs in search systems, are naturally a temporal sequence. Modeling user sequential behaviors as explicit sequential representations can empower the multi-task model to incorporate temporal dependencies, thus predicting future user behavior more accurately. In this work, we study the challenging problem of how to model sequential user behavior in the neural multi-task learning settings. Our major contribution is a novel framework, Mixture of Sequential Experts (MoSE). It explicitly models sequential user behavior using Long Short-Term Memory (LSTM) in the state-of-the-art Multi-gate Mixture-of-Experts multi-task modeling framework. In experiments, we show the effectiveness of the MoSE architecture over seven alternative architectures on both synthetic and noisy real-world user data in Google Apps. We also demonstrate the effectiveness and flexibility of the MoSE architecture in a real-world decision making engine in Gmail, by trading off between search quality and resource costs. View details
    Improving Recommendation Quality at Google Drive
    Suming Jeremiah Chen
    Zachary Teal Wilson
    Brian Lee Calaci
    Ryan Lee Evans
    Sean Robert Abraham
    26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2020)
    Preview abstract Quick Access is a machine-learned system in Google Drive that predicts which files a user wants to open. Adding Quick Access recommendations to the Drive homepage cut the amount of time that users spend locating their files in half. Aggregated over the ~1 billion users of Drive, the time saved adds up to ~1000 work weeks every day. In this paper, we discuss both the challenges of iteratively improving the quality of a personal recommendation system as well as the variety of approaches that we took in order to improve this feature. We explored different deep network architectures, novel modeling techniques, additional data sources, and the effects of latency and biases in the UX. We share both pitfalls as well as successes in our attempts to improve this product, and also discuss how we scaled and managed the complexity of the system. We believe that these insights will be especially useful to those who are working with private corpora as well as those who are building a large-scale production recommendation system. View details
    Preview abstract This paper seeks to develop a deeper understanding of the fundamental properties of neural text generation models. Concretely, the study of artifacts that emerge in machine generated text as a result of modeling choices is a nascent research area. To this end, the extent and degree to which these artifacts surface in generated text is still unclear. In the spirit of better understanding generative text models and their artifacts, we propose the new task of distinguishing which of several variants of a given model generated some piece of text. Specifically, we conduct an extensive suite of diagnostic tests to observe whether modeling choices (e.g., sampling methods, top-k probabilities, model architectures, etc.) leave detectable artifacts in the text they generate. Our key finding, which is backed by a rigorous set of experiments, is that such artifacts are present and that different modeling choices can be inferred by looking at generated text alone. This suggests that neural text generators may actually be more sensitive to various modeling choices than previously thought. View details
    Multi-view Embedding-based Synonyms for Personal Search
    Hongbo Deng
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019), pp. 575-584
    Preview abstract Synonym expansion is a technique that adds related words to search queries, which may lead to more relevant documents being retrieved, thus improving recall. There is extensive prior work on synonym expansion for web search; however, very few studies have tackled its application for email search. Synonym expansion for private corpora like emails poses several unique research challenges. First, the emails are not shared across users, which precludes us from directly employing query-document bipartite graphs, which are standard in web search synonym expansion. Second, user search queries are of a personal nature, and may not be generalizable across users. Third, the size of the underlying corpora from which the synonyms may be mined is relatively small (i.e., a user's private email inbox) compared to the size of the web corpus. Therefore, in this paper, we propose a solution tailored to the challenges of synonym expansion for email search. We formulate it as a multi-view learning problem, and propose a novel embedding-based model that joins information from multiple sources to obtain the optimal synonym candidates. To demonstrate the effectiveness of the proposed technique, we evaluate our model using both explicit human ratings as well as a live experiment using the Gmail Search service, one of the world's largest email search engines. View details
    Preview abstract Spell correction is a must-have feature for any modern search engine in applications such as web or e-commerce search. Typical spell correction solutions used in production systems consist of large indexed lookup tables based on a global model trained across many users over a large scale web corpus or a query log. For search over personal corpora, such as email, this global solution is not sufficient, as it ignores the user's personal lexicon. Without personalization, global spelling fails to correct tail queries drawn from a user's own, often idiosyncratic, lexicon. Personalization using existing algorithms is difficult due to resource constraints and unavailability of sufficient data to build per-user models. In this work, we propose a simple and effective personalized spell correction solution that augments existing global solutions for search over private corpora. Our event-driven spell correction candidate generation method is specifically designed with personalization as the key construct. Our novel spell correction and query completion algorithms do not require complex model training and are highly efficient. The proposed solution has shown over 30% click-through rate gain on affected queries when evaluated against a range of strong commercial personal search baselines - Google's Gmail, Drive, and Calendar search production systems. View details
    Domain Adaptation for Enterprise Email Search
    Brandon Tran
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2019)
    Preview abstract In the enterprise email search setting, the same search engine often powers multiple enterprises from various industries: technology, education, manufacturing, etc. However, using the same global ranking model across different enterprises may result in suboptimal search quality, due to the corpora differences and distinct information needs. On the other hand, training an individual ranking model for each enterprise may be infeasible, especially for smaller institutions with limited data. To address this data challenge, in this paper we propose a domain adaptation approach that fine-tunes the global model to each individual enterprise. In particular, we propose a novel application of the Maximum Mean Discrepancy (MMD) approach to information retrieval, which attempts to bridge the gap between the global data distribution and the distribution arising from an individual enterprise. We conduct a comprehensive set of experiments on a large-scale email search engine, and demonstrate that the MMD approach consistently improves the search quality for multiple individual domains, both in comparison to the global ranking model, as well as several competitive domain adaptation baselines including adversarial learning methods. View details
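The Maximum Mean Discrepancy named in the abstract above measures how far apart two sample distributions are in a kernel feature space. The following is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian (RBF) kernel; how the paper integrates MMD into ranking-model training is not shown, and the function names, bandwidth, and sample values are hypothetical.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical feature vectors from the global data and from one enterprise.
global_sample = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2)]
enterprise_sample = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2)]

same = mmd_squared(global_sample, global_sample)
shifted = mmd_squared(global_sample, enterprise_sample)
```

Here `same` is zero up to rounding while `shifted` is clearly positive; a domain adaptation objective would add such a term as a penalty so that the two distributions are pulled together in representation space.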
    Revisiting Online Personal Search Metrics with the User in Mind
    Azin Ashkan
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019)
    Preview abstract Traditional online quality metrics are based on search and browsing signals, such as position and time of the click. Such metrics typically model all users' behavior in exactly the same manner. Modeling individuals' behavior in Web search may be challenging as the user's historical behavior may not always be available (e.g., if the user is not signed into a given service). However, in personal search, individual users issue queries over their personal corpus (e.g. emails, files, etc.) while they are logged into the service. This brings an opportunity to calibrate online quality metrics with respect to an individual's search habits. With this goal in mind, the current paper focuses on a user-centric evaluation framework for personal search by taking into account variability of search and browsing behavior across individuals. The main idea is to calibrate each interaction of a user with respect to their historical behavior and search habits. To formalize this, a characterization of online metrics is proposed according to the relevance signal of interest and how the signal contributes to the computation of the gain in a metric. The proposed framework introduces a variant of online metrics called pMetrics (short for personalized metrics) that are based on the average search habits of users for the relevance signal of interest. Through extensive online experiments on a large population of Gmail search users, we show that pMetrics are effective in terms of their sensitivity, robustness, and stability compared to their standard variants as well as baselines with different normalization factors. View details
    Combining Decision Trees and Neural Networks for Learning-to-Rank in Personal Search
    Pan Li
    25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2019)
    Preview abstract Decision Trees (DTs) like LambdaMART have been one of the most effective types of learning-to-rank algorithms in the past decade. They typically work well with hand-crafted dense features (e.g., BM25 scores). Recently, Neural Networks (NNs) have shown impressive results in leveraging sparse and complex features (e.g., query and document keywords) directly when a large amount of training data is available. While there is a large body of work on how to use NNs for semantic matching between queries and documents, relatively little work has been conducted to compare NNs with DTs for general learning-to-rank tasks, where dense features are also available and DTs can achieve state-of-the-art performance. In this paper, we study how to combine DTs and NNs to effectively bring the benefits from both sides in the learning-to-rank setting. Specifically, we focus our study on personal search where clicks are used as the primary labels with unbiased learning-to-rank algorithms and a significantly large amount of training data is easily available. Our combination methods are based on ensemble learning. We design 12 variants and compare them based on two aspects, ranking effectiveness and ease-of-deployment, using two of the largest personal search services: Gmail search and Google Drive search. We show that direct application of existing ensemble methods cannot achieve both aspects. We thus design a novel method that uses NNs to compensate DTs via boosting. We show that such a method is not only easier to deploy, but also gives comparable or better ranking accuracy. View details
    Multi-Task Learning for Personal Search Ranking with Query Clustering
    Jiaming Shen
    Proceedings of ACM Conference on Information and Knowledge Management (CIKM) (2018)
    Preview abstract User needs vary significantly across different tasks, and therefore their queries will also vary significantly in their expressiveness and semantics. Many studies have been proposed to model such query diversity by obtaining query types and building query-dependent ranking models. To obtain query types, these studies typically require either a labeled query dataset or clicks from multiple users aggregated over the same document. These techniques, however, are not applicable when manual query labeling is not viable, and aggregated clicks are unavailable due to the private nature of the document collection, e.g., in personal search scenarios. Therefore, in this paper, we study the problem of how to obtain query type in an unsupervised fashion and how to leverage this information using query-dependent ranking models in personal search. We first develop a hierarchical clustering algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine query types. Then, we propose three query-dependent ranking models, including two neural models that leverage query type information as additional features, and one novel multi-task neural model that is trained to simultaneously rank documents and predict query types. We evaluate our ranking models using the click data collected from one of the world’s largest personal search engines. The experiments demonstrate that the proposed multi-task model can significantly outperform the baseline neural models, which either do not incorporate query type information or just simply feed query type as an additional feature. To the best of our knowledge, this is the first successful application of query-dependent multi-task learning in personal search ranking. View details
    Position Bias Estimation for Unbiased Learning to Rank in Personal Search
    Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2018), pp. 610-618
    Preview abstract A well-known challenge in learning from click data is its inherent bias and most notably position bias. Traditional click models aim to extract the (query, document) relevance and the estimated bias is usually discarded after relevance is extracted. In contrast, the most recent work on unbiased learning-to-rank can effectively leverage the bias and thus focuses on estimating bias rather than relevance. Existing approaches use search result randomization over a small percentage of production traffic to estimate the position bias. This is not desired because result randomization can negatively impact users' search experience. In this paper, we compare different schemes for result randomization (i.e., RandTopN and RandPair) and show their negative effect in personal search. Then we study how to infer such bias from regular click data without relying on randomization. We propose a regression-based Expectation-Maximization (EM) algorithm that is based on a position bias click model and that can handle highly sparse clicks in personal search. We evaluate our EM algorithm and the extracted bias in the learning-to-rank setting. Our results show that it is promising to extract position bias from regular clicks without result randomization. The extracted bias can improve the learning-to-rank algorithms significantly. In addition, we compare the pointwise and pairwise learning-to-rank models. Our results show that pairwise models are more effective in leveraging the estimated bias. View details
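The position bias click model referenced in the abstract above assumes a click happens only when a result is both examined and relevant, so that P(click) = theta[position] * gamma[document]. The sketch below runs the standard EM updates for that model on a small click log. It is a simplified illustration: the paper's regression-based variant, which replaces the per-document relevance update with a learned regressor to cope with sparse clicks, is not implemented here, and the function name `position_bias_em` and the log data are hypothetical.

```python
from collections import defaultdict

def position_bias_em(logs, n_positions, n_iters=50):
    """Estimate examination (theta) and relevance (gamma) probabilities.

    `logs` is a list of (doc_id, position, clicked) tuples.
    """
    theta = [0.5] * n_positions
    gamma = defaultdict(lambda: 0.5)
    for _ in range(n_iters):
        exam_sum = [0.0] * n_positions
        exam_count = [0] * n_positions
        rel_sum = defaultdict(float)
        rel_count = defaultdict(int)
        # E-step: posterior of examination and relevance for each impression.
        for doc, pos, clicked in logs:
            t, g = theta[pos], gamma[doc]
            if clicked:
                p_exam, p_rel = 1.0, 1.0
            else:
                denom = max(1.0 - t * g, 1e-12)  # guard against division by 0
                p_exam = t * (1.0 - g) / denom
                p_rel = (1.0 - t) * g / denom
            exam_sum[pos] += p_exam
            exam_count[pos] += 1
            rel_sum[doc] += p_rel
            rel_count[doc] += 1
        # M-step: average the posteriors per position and per document.
        theta = [exam_sum[k] / exam_count[k] if exam_count[k] else theta[k]
                 for k in range(n_positions)]
        for doc in rel_sum:
            gamma[doc] = rel_sum[doc] / rel_count[doc]
    return theta, dict(gamma)

# Hypothetical log: each document shown 10 times at each of two positions.
def impressions(doc, pos, n_clicks, n_shown=10):
    return [(doc, pos, i < n_clicks) for i in range(n_shown)]

logs = (impressions("a", 0, 8) + impressions("b", 0, 4)
        + impressions("a", 1, 4) + impressions("b", 1, 2))
theta, gamma = position_bias_em(logs, n_positions=2)
```

Only the product of examination and relevance is identified by click data, so the estimates are meaningful up to a common scale; the useful output is the relative ordering, with the top position examined more often than the second.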
    Learning with Sparse and Biased Feedback for Personal Search
    Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI) (2018), pp. 5219-5223
    Personal search, including email, on-device, and personal media search, has recently attracted considerable attention from the information retrieval community. In this paper, we provide an overview of challenges and opportunities of learning with implicit user feedback (e.g., click data) in personal search. Implicit user feedback provides a convenient source of supervision for ranking models in personal search. This feedback, however, has two major drawbacks: it is highly sparse and biased due to the personal nature of queries and documents. We demonstrate how these drawbacks can be overcome, and empirically demonstrate the benefits of learning with implicit feedback in the context of a large-scale email search engine.
    Learning from User Interactions in Personal Search via Attribute Parameterization
    Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2017), pp. 791-800
    User interaction data (e.g., click data) has proven to be a powerful signal for learning-to-rank models in web search. However, such models require observing multiple interactions across many users for the same query-document pair to achieve statistically meaningful gains. Therefore, utilizing user interaction data for improving search over personal, rather than public, content is a challenging problem. First, the documents (e.g., emails or private files) are not shared across users. Second, user search queries are of a personal nature (e.g., [alice's address]) and may not generalize well across users. In this paper, we propose a solution to these challenges by projecting user queries and documents into a multi-dimensional space of fine-grained and semantically coherent attributes. We then introduce a novel parameterization technique to overcome sparsity in the multi-dimensional attribute space. Attribute parameterization enables effective usage of cross-user interactions for improving personal search quality, which is, to the best of our knowledge, the first such published result. Experiments with a dataset derived from interactions of users of one of the world's largest personal search engines demonstrate the effectiveness of the proposed attribute parameterization technique.
    Learning to Rank with Selection Bias in Personal Search
    Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2016), pp. 115-124
    Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can be easily collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance for individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair and this makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information needs, e.g., personal search. In this paper, we study the problem of how to leverage sparse click data in personal search and introduce a novel selection bias problem and address it in the learning-to-rank framework. This paper proposes a few bias estimation methods, including a novel query-dependent one that captures queries with similar results and can successfully deal with sparse data. We empirically demonstrate that learning-to-rank that accounts for query-dependent selection bias yields significant improvements in search effectiveness through online experiments with one of the world's largest personal search engines.
    Proceedings of the 2013 Conference on the Theory of Information Retrieval
    Oren Kurland
    Christina Lioma
    Birger Larsen
    Peter Ingwersen
    ACM (2013)
    These proceedings contain the refereed papers, posters and abstracts of keynotes, tutorials and panel discussion presented at the Fourth International Conference on the Theory of Information Retrieval (ICTIR13), held in Copenhagen, Denmark, during September 29-October 2, 2013.
    Effective query formulation with multiple information sources
    W. Bruce Croft
    WSDM (2012), pp. 443-452
    Structured Event Retrieval over Microblog Archives
    Congxing Cai
    Eduard H. Hovy
    HLT-NAACL (2012), pp. 646-655
    Data integration from open internet sources to combat sex trafficking of minors
    Hao Wang
    Congxing Cai
    Andrew Philpot
    Mark Latonero
    Eduard H. Hovy
    DG.O (2012), pp. 246-252
    Cross-corpus relevance projection
    Nima Asadi
    Jimmy J. Lin
    SIGIR (2011), pp. 1163-1164
    Pseudo test collections for learning web search ranking functions
    Nima Asadi
    Tamer Elsayed
    Jimmy J. Lin
    SIGIR (2011), pp. 1073-1082
    When close enough is good enough: approximate positional indexes for efficient ranked retrieval
    Tamer Elsayed
    Jimmy J. Lin
    CIKM (2011), pp. 1993-1996
    An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
    Eduard H. Hovy
    Chunliang Zhang
    ACL (Short Papers) (2011), pp. 546-551
    A cascade ranking model for efficient ranked retrieval
    Lidan Wang
    Jimmy J. Lin
    SIGIR (2011), pp. 105-114
    Parameterized concept weighting in verbose queries
    W. Bruce Croft
    SIGIR (2011), pp. 605-614
    Exploiting site-level information to improve web search
    Evgeniy Gabrilovich
    Vanja Josifovski
    George Mavromatis
    Jane Wang
    CIKM (2010), pp. 1393-1396
    Relevance and ranking in online dating systems
    Fernando Diaz
    Sihem Amer-Yahia
    SIGIR (2010), pp. 66-73
    UMD and USC/ISI: TREC 2010 Web Track Experiments with Ivory
    Tamer Elsayed
    Nima Asadi
    Lidan Wang
    Jimmy J. Lin
    TREC (2010)
    Improved latent concept expansion using hierarchical Markov random fields
    Hao Lang
    Bin Wang
    Jin-Tao Li
    CIKM (2010), pp. 249-258
    Measuring the reusability of test collections
    Ben Carterette
    Evgeniy Gabrilovich
    Vanja Josifovski
    WSDM (2010), pp. 231-240
    The anatomy of an ad: structured indexing and retrieval for sponsored search
    Evgeniy Gabrilovich
    Vanja Josifovski
    WWW (2010), pp. 101-110
    Learning concept importance using a weighted dependence model
    W. Bruce Croft
    WSDM (2010), pp. 31-40
    Ranking under temporal constraints
    Lidan Wang
    Jimmy J. Lin
    CIKM (2010), pp. 79-88
    Learning to efficiently rank
    Lidan Wang
    Jimmy J. Lin
    SIGIR (2010), pp. 138-145
    Building enriched document representations using aggregated anchor text
    Jasmine Novak
    Hang Cui
    Srihari Reddy
    SIGIR (2009), pp. 219-226
    Search Engine Adaptation by Feedback Control Adjustment for Time-sensitive Query
    Ruiqiang Zhang
    Yi Chang
    Zhaohui Zheng
    Jian-Yun Nie
    HLT-NAACL (Short Papers) (2009), pp. 165-168
    Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search
    Jimmy J. Lin
    Tamer Elsayed
    Lidan Wang
    TREC (2009)
    Semi-parametric and Non-parametric Term Weighting for Information Retrieval
    Hugo Zaragoza
    ICTIR (2009), pp. 42-53
    Improving search relevance for implicitly temporal queries
    Rosie Jones
    Fuchun Peng
    Ruiqiang Zhang
    SIGIR (2009), pp. 700-701
    Online expansion of rare queries for sponsored search
    Peter Ciccolo
    Evgeniy Gabrilovich
    Vanja Josifovski
    Lance Riedel
    Jeffrey Yuan
    WWW (2009), pp. 511-520
    Beyond bags of words: effectively modeling dependence and features in information retrieval
    SIGIR Forum, vol. 42 (2008), pp. 77
    To swing or not to swing: learning when (not) to advertise
    Marcus Fontoura
    Evgeniy Gabrilovich
    Vanja Josifovski
    Vanessa Murdock
    Vassilis Plachouras
    CIKM (2008), pp. 1003-1012
    Generalized inverse document frequency
    CIKM (2008), pp. 399-408
    A Statistical View of Binned Retrieval Models
    W. Bruce Croft
    ECIR (2008), pp. 175-186
    Linear feature-based models for information retrieval
    W. Bruce Croft
    Inf. Retr., vol. 10 (2007), pp. 257-274
    Latent concept expansion using Markov random fields
    W. Bruce Croft
    SIGIR (2007), pp. 311-318
    Using gradient descent to optimize language modeling smoothing parameters
    SIGIR (2007), pp. 687-688
    Pseudo-Aligned Multilingual Corpora
    Fernando Diaz
    IJCAI (2007), pp. 2727-2732
    Similarity Measures for Short Segments of Text
    Susan T. Dumais
    Christopher Meek
    ECIR (2007), pp. 16-27
    CIIR Experiments for TREC Legal 2007
    Howard R. Turtle
    TREC (2007)
    Automatic feature selection in the Markov random field model for information retrieval
    CIKM (2007), pp. 253-262
    Indri TREC Notebook 2006: Lessons Learned From Three Terabyte Tracks
    W. Bruce Croft
    TREC (2006)
    Estimation, sensitivity, and generalization in parameterized retrieval models
    CIKM (2006), pp. 812-813
    Beyond Bags of Words: Modeling Implicit User Preferences in Information Retrieval
    W. Bruce Croft
    AAAI (2006), pp. 1646-1649
    Improving the estimation of relevance models using large external corpora
    Fernando Diaz
    SIGIR (2006), pp. 154-161
    UMass Robust 2005: Using Mixtures of Relevance Models for Query Expansion
    Fernando Diaz
    W. Bruce Croft
    TREC (2005)
    A Markov random field model for term dependencies
    W. Bruce Croft
    SIGIR (2005), pp. 472-479
    Similarity measures for tracking information flow
    Yaniv Bernstein
    W. Bruce Croft
    Alistair Moffat
    Justin Zobel
    CIKM (2005), pp. 517-524
    The recap system for identifying information flow
    Yaniv Bernstein
    W. Bruce Croft
    Alistair Moffat
    Justin Zobel
    SIGIR (2005), pp. 678
    Analysis of Statistical Question Classification for Fact-Based Questions
    W. Bruce Croft
    Inf. Retr., vol. 8 (2005), pp. 481-504
    Indri at TREC 2005: Terabyte Track
    Yun Zhou
    W. Bruce Croft
    TREC (2005)
    An Inference Network Approach to Image Retrieval
    R. Manmatha
    CIVR (2004), pp. 42-50
    Indri at TREC 2004: Terabyte Track
    Howard R. Turtle
    W. Bruce Croft
    TREC (2004)
    Combining the language model and inference network approaches to retrieval
    W. Bruce Croft
    Inf. Process. Manage., vol. 40 (2004), pp. 735-750
    Formal multiple-Bernoulli models for language modeling
    Victor Lavrenko
    W. Bruce Croft
    SIGIR (2004), pp. 540-541