Zhen Qin

Zhen Qin

Zhen Qin is currently a staff machine learning engineer at Google DeepMind. He is working in the Research team leading several efforts on state-of-art large-scale search & recommendations algorithms and systems. He has published over 50 research papers, with >10 first author papers in top venues including PAMI, TIP, ICLR, NeurIPS, UAI, CVPR, KDD, WWW, RecSys, and other papers in ICML, WSDM, ACL, SIGIR, AAAI. His work has been transferred to Google.com, Youtube, GSuite (Gmail, Drive, Calendar), Ads, Chrome Web Store, Google Cloud AI, and Google Assistant. He received his B.E. degree from Beijing University of Posts and Telecommunications in 2010 and Ph.D. from University of California, Riverside 2015.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers?
    Minghan Li
    Jimmy Lin
    Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24)(2024)
    Preview abstract Query expansion has been widely used to improve the search results of first-stage retrievers, yet its influence on second-stage, crossencoder rankers remains under-explored. A recent study shows that current expansion techniques benefit weaker models but harm stronger rankers. In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve generalization of strong cross-encoder rankers? To answer this question, we first apply popular query expansion methods to different crossencoder rankers and verify the deteriorated zero-shot effectiveness. We identify two vital steps in the experiment: high-quality keyword generation and minimally-disruptive query modification. We show that it is possible to improve the generalization of a strong neural ranker, by generating keywords through a reasoning chain and aggregating the ranking results of each expanded query via selfconsistency, reciprocal rank weighting, and fusion. Experiments on BEIR and TREC Deep Learning 2019/2020 show that the nDCG@10 scores of both MonoT5 and RankT5 following these steps are improved, which points out a direction for applying query expansion to strong cross-encoder rankers. View details
    Preview abstract Zero-shot text rankers powered by recent LLMs achieve remarkable ranking performance by simply prompting. Existing prompts for pointwise LLM rankers mostly ask the model to choose from binary relevance labels like "Yes" and "No". However, the lack of intermediate relevance label options may cause the LLM to provide noisy or biased answers for documents that are partially relevant to the query. We propose to incorporate fine-grained relevance labels into the prompt for LLM rankers, enabling them to better differentiate among documents with different levels of relevance to the query and thus derive a more accurate ranking. We study two variants of the prompt template, coupled with different numbers of relevance levels. Our experiments on 8 BEIR data sets show that adding fine-grained relevance labels significantly improves the performance of LLM rankers. View details
    Preview abstract Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity. View details
    Preview abstract The distillation of ranking models has become an important topic in both academia and industry. In recent years, several advanced methods have been proposed to tackle this problem, often leveraging ranking information from teacher rankers that is absent in traditional classification settings. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide range of tasks and datasets make it difficult to assess or invigorate advances in this field. This paper first examines representative prior arts on ranking distillation, and raises three questions to be answered around methodology and reproducibility. To that end, we propose a systematic and unified benchmark, Ranking Distillation Suite (RD-Suite), which is a suite of tasks with 4 large realworld datasets, encompassing two major modalities (textual and numeric) and two applications (standard distillation and distillation transfer). RD-Suite consists of benchmark results that challenge some of the common wisdom in the field, and the release of datasets with teacher scores and evaluation scripts for future research. RD-Suite paves the way towards better understanding of ranking distillation, facilities more research in this direction, and presents new challenges. View details
    RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
    Jianmo Ni
    Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)(2023)
    Preview abstract Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but also can be fine-tuned with "pairwise" or "listwise" ranking losses to optimize ranking performance. Our experiments show that the proposed models with ranking losses can achieve substantial ranking performance gains on different public text ranking data sets. Moreover, ranking models fine-tuned with listwise ranking losses have better zero-shot ranking performance on out-of-domain data than models fine-tuned with classification losses. View details
    Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance
    Pratyush Kar
    Bing-Rong Lin
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management(2023)
    Preview abstract As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives are not necessarily compatible, which makes the trade-off less ideal for either of them. In this paper, we propose a practical regression compatible ranking (RCR) approach that achieves a better trade-off, where the two ranking and regression components are proved to be mutually aligned. Although the same idea applies to ranking with both binary and graded relevance, we mainly focus on binary labels in this paper. We evaluate the proposed approach on several public LTR benchmarks and show that it consistently achieves either best or competitive result in terms of both regression and ranking metrics, and significantly improves the Pareto frontiers in the context of multi-objective optimization. Furthermore, we evaluated the proposed approach on YouTube Search and found that it not only improved the ranking quality of the production pCTR model, but also brought gains to the click prediction accuracy. The proposed approach has been successfully deployed in the YouTube production system. View details
    Learning List-Level Domain-Invariant Representations for Ranking
    Ruicheng Xian
    Hamed Zamani
    Han Zhao
    37th Conference on Neural Information Processing Systems (NeurIPS 2023)
    Preview abstract Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment—learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking. View details
    Preview abstract Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods ignored - the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on relevance tower due to such a correlation. We then propose two methods to mitigate the negative confounding effects by better disentangling relevance and bias. Offline empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches. We conduct a live experiment on a popular web store for four weeks, and find a significant improvement in user clicks over the baseline, which ignores the negative confounding effect. View details
    Preview abstract Recent work has shown that Large Language Models (LLMs) can effectively re-rank the outputs of BM25 retrieval. This is achieved zero-shot by including task-specific instructions. However, for tasks that require scoring instead of generation, few-shot prompting remains underexplored. In this work, we improve LLM-based re-ranking performance by including demonstrations in the prompt. We show that adding even a single demonstration makes a significant impact. Our detailed analysis investigates under which conditions demonstrations are the most helpful. We propose a novel difficulty-based demonstration selection strategy instead of using the commonly used approach of semantic similarity. Furthermore, we show that demonstrations helpful for ranking are also effective at question generation. We hope our research will facilitate further studies into both question generation and passage re-ranking. View details
    Preview abstract State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end. View details
    Preview abstract Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. Given an instance, an MCC model computes a score for each class, all of which are used to sort the classes. The performance of a model is usually measured by Top-K Accuracy/Error (e.g. K=1 or 5). In this paper, we do not aim to propose new neural network architectures as most recent works do, but to show that it is promising to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as \emph{an instance class ranking problem}, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain, can be more informative than the existing Top-K metrics. We further demonstrate that the dominant neural MCC recipe can be transformed to a neural ranking pipeline. Based on such generalization, we show that it is intuitive to leverage techniques from the rich information retrieval literature to improve the MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone neural models show the value of our proposed framework. View details
    Preview abstract In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup. View details
    On the Value of Prior in Online Learning to Rank
    Branislav Kveton
    Ofer Meshi
    The 25th International Conference on Artificial Intelligence and Statistics(2022)
    Preview abstract This paper addresses the cold-start problem in online learning to rank (OLTR). We show both theoretically and empirically that priors improve the quality of ranked lists presented to users interactively based on user feedback. These priors can come in the form of unbiased estimates of the relevance of the ranked items, or more practically, can be obtained from offline-learned models. Our experiments show the effectiveness of priors in improving the short-term regret of tabular OLTR algorithms, based on Thompson sampling and BayesUCB. View details
    Preview abstract We explore a novel perspective of knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers. Unlike the existing ranking distillation work which pursues a good trade-off between performance and efficiency, SDR is able to significantly improve ranking performance of students over the teacher rankers without increasing model capacity. The key success factors of SDR, which differs from common distillation techniques for classification are: (1) an appropriate teacher score transformation function, and (2) a novel listwise distillation framework. Both techniques are specifically designed for ranking problems and are rarely studied in the existing knowledge distillation literature. Building upon the state-of-the-art neural ranking structure, SDR is able to push the limits of neural ranking performance above a recent rigorous benchmark study and significantly outperforms traditionally strong gradient boosted decision tree based models on 7 out of 9 key metrics, the first time in the literature. In addition to the strong empirical results, we give theoretical explanations on why listwise distillation is effective for neural rankers, and provide ablation studies to verify the necessity of the key factors in the SDR framework. View details
    On Optimizing Top-K Metrics for Neural Ranking Models
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval(2022), 2303–2307
    Preview abstract Top-K metrics such as NDCG@K are frequently used to evaluate ranking performance. The traditional tree-based models such as LambdaMART, which are based on Gradient Boosted Decision Trees (GBDT), are designed to optimize NDCG@K using the LambdaRank losses. Recently, there is a good amount of research interest on neural ranking models for learning-to-rank tasks. These models are fundamentally different from the decision tree models and behave differently with respect to different loss functions. For example, the most popular ranking losses used in neural models are the Softmax loss and the GumbelApproxNDCG loss. These losses do not connect to top-K metrics such as NDCG@K naturally. It remains a question on how to effectively optimize NDCG@K for neural ranking models. In this paper, we follow the LambdaLoss framework and design novel and theoretically sound losses for NDCG@K metrics, while the original LambdaLoss paper can only do so using an unsound heuristic. We study the new losses on the LETOR benchmark datasets and show that the new losses work better than other losses for neural ranking models. View details
    Scale Calibration of Deep Ranking Models
    28TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)(2022), pp. 4300-4309
    Preview abstract Learning-to-Rank (LTR) systems are ubiquitous in web applications nowadays. The existing literature mainly focuses on improving ranking performance by trying to generate the optimal order of candidate items. However, virtually all advanced ranking functions are not scale calibrated. For example, rankers have the freedom to add a constant to all item scores without changing their relative order. This property has resulted in several limitations in deploying advanced ranking methods in practice. On the one hand, it limits the use of effective ranking functions in important applications. For example, in ads ranking, predicted Click-Through Rate (pCTR) is used for ranking and is required to be calibrated for the downstream ads auction. This is a major reason that existing ads ranking methods use scale calibrated pointwise loss functions that may sacrifice ranking performance. On the other hand, popular ranking losses are translation-invariant. We rigorously show that, both theoretically and empirically, this property leads to training instability that may cause severe practical issues. In this paper, we study how to perform scale calibration of deep ranking models to address the above concerns. We design three different formulations to calibrate ranking models through calibrated ranking losses. Unlike existing post-processing methods, our calibration is performed during training, which can resolve the training instability issue without any additional processing. We conduct experiments on the standard LTR benchmark datasets and one of the largest sponsored search ads dataset from Google. Our results show that our proposed calibrated ranking losses can achieve nearly optimal results in terms of both ranking quality and score scale calibration. View details
    Rax: Composable Learning-to-Rank using JAX
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2022), 3051–3060
    Preview abstract Rax is a library for composable Learning-to-Rank (LTR) written entirely in JAX. The goal of Rax is to facilitate easy prototyping of LTR systems by leveraging the flexibility and simplicity of JAX. Rax provides a diverse set of popular ranking metrics and losses that integrate well with the rest of the JAX ecosystem. Furthermore, Rax implements a system of ranking-specific function transformations which allows fine-grained customization of ranking losses and metrics. Most notably Rax provides approx_t12n: a function transformation (t12n) that can transform any of our ranking metrics into an approximate and differentiable form that can be optimized. This provides a systematic way to directly optimize neural ranking models for ranking metrics that are not easily optimizable in other libraries. We empirically demonstrate the effectiveness of Rax by benchmarking neural models implemented using Flax and trained using Rax on two popular LTR benchmarks: WEB30K and Istella. Furthermore, we show that integrating ranking losses with T5, a large language model, can improve overall ranking performance on the MS MARCO passage ranking task. We are sharing the Rax library with the open source community as part of the larger JAX ecosystem at https://github.com/google/rax. View details
    Revisiting two tower models for unbiased learning to rank
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval(2022), 2410–2414
    Preview abstract Two-tower architecture (one tower to factorize out position-related bias) has now become a common technique in neural network ranking models for Unbiased Learning To Rank (ULTR). In these models, a neural network tower taking in all position related features is designed to model the biases, which are equivalent to the propensity scores used to define the unbiased ranking metrics. It works based on the assumptions that the user interaction (click) is conditioned on the user observation of a ranked item, and only the observation probability depends on the position. So if we factorize out the observation probability, we can then unbiased rank the items by their click rate conditioned on observation. The assumption appears sensible, and the additive two-tower models based on it have been widely implemented in ULTR. However, two-tower models may not always work and sometimes work even worse than the biased models, as the user may not always follow the same pattern. In this work, we stick to the plausible assumption about the user interaction, but we also consider the spectrum of different user behaviors. In this case, the assumption that the position related observation probability may not be able to get explicitly factorized out. We also study generic methods to treat this complexity and show these methods could outperform the simple additive debias models in offline experiments. View details
    Preview abstract State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws, i.e., running the model on all query-document pairs at inference-time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model using in the form of document to query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results that are comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models. View details
    Preview abstract In the era of pretrained language models, transformers are the defacto choice of model architectures. While recent works have shown promise in entirely convolutional based architectures, these CNN-based models have not been widely adopted or evaluated under the pretrain-finetune paradigm. In the context of language models, are convolutional models competitive when pretrained? This paper investigates this research question and presents several interesting findings. Across a set of extensive experiments, our findings show that CNN-based pretrained models are highly competitive and outperform Transformer-based pretrained models in certain scenarios, albeit with caveats. Overall, the findings of this paper should implore the broader academic community to perhaps not conflate pretraining advances with architectural advances and both set of techniques could be studied in isolation. View details
    Improving Cloud Storage Search with User Activity
    Proceedings of the 14th International Conference on Web Search and Data Mining (WSDM '21), ACM(2021)
    Preview abstract Cloud-based file storage platforms such as Google Drive are widely used as a means for storing, editing and sharing personal and organizational documents. In this paper, we improve search ranking quality for cloud storage platforms by utilizing user activity logs. Different from search logs, activity logs capture general document usage activity beyond search, such as opening, editing and sharing documents. We propose to automatically learn text embeddings that are effective for search ranking from activity logs. We develop a novel co-access signal, i.e., whether two documents were accessed by a user around the same time, to train deep semantic matching models that are useful for improving the search ranking quality. We confirm that activity-trained semantic matching models can improve ranking by conducting extensive offline experimentation using Google Drive search and activity logs. To the best of our knowledge, this is the first work to examine the benefits of leveraging document usage activity at large scale for cloud storage search; as such it can shed light on using such activity in scenarios where direct collection of search-specific interactions (e.g., query and click logs) may be expensive or infeasible. View details
    Preview abstract Despite the success of neural models in many major machine learning problems and recently published neural learning to rank (LTR) papers in top venues, the effectiveness of neural models on traditional LTR problems is still not widely acknowledged. We first validate the concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available tree-based implementation, which is sometimes ignored in recent neural LTR papers. We then investigate why existing neural LTR suffers by identifying several of its weaknesses. To that end, we propose a new neural LTR framework that mitigates these weaknesses, by borrowing ideas from several research fields. Our models are able to perform comparatively with the strong tree-based baseline, while outperforming recently published neural learning to rank methods by a large margin. Our results also serve as a benchmark for neural learning to rank models. View details
    Non-Clicks Mean Irrelevant? Propensity Ratio Scoring As a Correction
    Nan Wang
    Hongning Wang
    14th ACM International Conference on Web Search and Data Mining (WSDM)(2021)
    Preview abstract Recent advances in unbiased learning to rank (LTR) count on Inverse Propensity Scoring (IPS) to eliminate bias in implicit feedback. Though theoretically sound in correcting the bias introduced by treating clicked documents as relevant, IPS ignores the bias caused by (implicitly) treating non-clicked ones as irrelevant. In this work, we first rigorously prove that such use of click data leads to unnecessary pairwise comparisons between relevant documents, which prevent unbiased ranker optimization. Based on the proof, we derive a simple yet well justified new weighting scheme, called Propensity Ratio Scoring (PRS), which provides treatments on both clicks and non-clicks. Besides correcting the bias in clicks, PRS avoids relevant-relevant document comparisons in LTR training and enjoys a lower variability. Our extensive empirical evaluations confirm that PRS ensures a more effective use of click data and improved performance in both synthetic data from a set of LTR benchmarks, as well as in the real-world large-scale data from GMail search. View details
    Preview abstract Existing work on search result diversification typically falls into the "next document" paradigm, that is, selecting the next document based on the ones already chosen. A sequential process of selecting documents one-by-one is naturally modeled in learning-based approaches. However, such a process makes the learning difficult because there are an exponential number of ranking lists to consider. Sampling is usually used to reduce the computational complexity but this makes the learning less effective. In this paper, we propose a soft version of the "next document" paradigm in which we associate each document with an approximate rank, and thus the subtopics covered prior to a document can also be estimated. We show that we can derive differentiable diversification-aware losses, which are smooth approximation of diversity metrics like alpha-NDCG, based on these estimates. We further propose to optimize the losses in the learning-to-rank setting using neural distributed representations of queries and documents. Experiments are conducted on the public benchmark TREC datasets. By comparing with an extensive list of baseline methods, we show that our Diversification-Aware LEarning-TO-Rank (DALETOR) approaches outperform them by a large margin, while being much simpler during learning and inference. View details
    Preview abstract A well-known challenge in leveraging implicit user feedback like clicks to improve real-world search services and recommender systems is its inherent bias. Most existing click models are based on the examination hypothesis in user behaviors and differ in how to model such an examination bias. However, they are constrained by assuming a simple position-based bias or enforcing a sequential order in user examination behaviors. These assumptions are insufficient to capture complex real-world user behaviors and hardly generalize to modern user interfaces (UI) in web applications (e.g., results shown in a grid view). In this work, we propose a fully data-driven neural model for the examination bias, Cross-Positional Attention (XPA), which is more flexible in fitting complex user behaviors. Our model leverages the attention mechanism to effectively capture cross-positional interactions among displayed items and is applicable to arbitrary UIs. We employ XPA in a novel neural click model that can both predict clicks and estimate relevance. Our experiments on offline synthetic data sets show that XPA is robust among different click generation processes. We further apply XPA to a large-scale real-world recommender system, showing significantly better results than baselines in online A/B experiments that involve millions of users. This validates the necessity to model more complex user behaviors than those proposed in the literature. View details
    Preview abstract We describe how we built three recommendation products from scratch at Google Chrome Web Store, namely context-based recommendations, related extension recommendations, and personalized recommendations. Unlike most existing papers that focus on novel algorithms, this paper focuses on sharing practical experiences building large scale recommender systems under various real-world constraints, such as privacy constraints, data sparsity issues, highly skewed data distribution, and product design choices, such as user interface. We show how these constraints make standard approaches difficult to succeed in practice. We share success stories that turn very negative live metrics to very positive, by introducing 1) how we use interpretable neural models to bootstrap the systems, helps identifying pipeline issues, and paves way for more advanced models. 2) A new item-item based recommendation algorithm that works under highly skewed data distributions, and 3) how two products can help bootstrapping the third one, which significantly reduces development cycles and bypasses various real-world difficulties. All the explorations in this work are verified in live traffic on millions of users. We believe the findings in this work can help practitioners to bootstrap and build large-scale recommender systems. View details
    Ensemble Distillation for BERT-Based Ranking Models
    Shuguang Han
    Proceedings of the 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR ’21)
    Preview abstract Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and showed superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost the performance, but also reduce the performance variance. However, an ensemble of models is more costly because it needs computing resource and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, as well as multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and the TREC-COVID data set. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performances of distilled models are also more stable than models fine-tuned on original labeled data. The results reveal a promising direction to capitalize on the gains achieved by an ensemble of BERT-based ranking models. View details
    Preview abstract This paper proposes Omnidirectional Representations from Transformers (\textsc{OmniNet}). In OmniNet, instead of maintaing a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirection attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based \cite{choromanski2020rethinking}, low-rank attention \cite{wang2020linformer} and/or Big Bird \cite{zaheer2020big} as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet not only achieves considerable improvements when equipped with both sequence-based (1D) Transformers but also on image recognition (finetuning and few shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena. View details
    Preview abstract Recent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning to rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation. We study how to effectively and efficiently combine textual information from queries and documents with other useful but less prominent side information for learning to rank. We conduct synthetic experiments to show that: 1) neural networks are inefficient at learning the interaction between two prominent features (e.g., query and document embedding features) in the presence of other less prominent features; 2) direct application of a state-of-art method for higher-order feature generation is also inefficient at learning such important interactions. Based on the above observations, we propose a simple but effective matching cross network (MCN) method for learning to rank with side information. MCN conducts an element-wise multiplication matching of query and document embeddings and leverages a technique called latent cross to effectively learn the interaction between matching output and all side information. The approach is easy to implement, adds minimal parameters and latency overhead to standard neural ranking architectures, and can be used for efficient end-to-end training. We conduct extensive experiments using two of the world's largest personal search engines, Gmail and Google Drive search, and show that each proposed component adds meaningful gains against a strong production baseline with minimal latency overhead, thereby demonstrating the practical effectiveness and efficiency of the proposed approach. View details
    Multitask Mixture of Sequential Experts for User Activity Streams
    Yicheng Cheng
    Zhe Zhao
    Shirley Zhe Chen
    Jingzheng Qin
    26TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)(2020)
    Preview abstract Multi-task deep learning has been an actively researched topic, and it has been used in many real-world systems for user activities and content recommendation. While most of the multi-task model architectures proposed to date focus on using non-sequential input features (e,g. query and context), input data is often sequential in real-world web application scenarios. For example, user behavior streams, such as user search logs in search systems, are naturally a temporal sequence. Modeling user sequential behaviors as explicit sequential representations can empower the multi-task model to incorporate temporal dependencies, thus predicting future user behavior more accurately. In this work, we study the challenging problem of how to model sequential user behavior in the neural multi-task learning settings. Our major contribution is a novel framework, Mixture of Sequential Experts (MoSE). It explicitly models sequential user behavior using Long Short-Term Memory (LSTM) in the state-of-art Multi-gate Mixture-of-Expert multi-task modeling framework. In experiments, we show the effectiveness of the MoSE architecture over seven alternative architectures on both synthetic and noisy real-world user data in Google Apps. We also demonstrate the effectiveness and flexibility of the MoSE architecture in a real-world decision making engine in GMail, by trading off between search quality and resource costs. View details
    Improving Recommendation Quality at Google Drive
    Suming Jeremiah Chen
    Zachary Teal Wilson
    Brian Lee Calaci
    Ryan Lee Evans
    Sean Robert Abraham
    Mike Colagrosso
    26TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)(2020)
    Preview abstract Quick Access is a machine-learned system in Google Drive that predicts which files a user wants to open. Adding Quick Access recommendations to the Drive homepage cut the amount of time that users spend locating their files in half. Aggregated over the ~1 billion users of Drive, the time saved up adds up to ~1000 work weeks every day. In this paper, we discuss both the challenges of iteratively improving the quality of a personal recommendation system as well as the variety of approaches that we took in order to improve this feature. We explored different deep network architectures, novel modeling techniques, additional data sources, and the effects of latency and biases in the UX. We share both pitfalls as well as successes in our attempts to improve this product, and also discuss how we scaled and managed the complexity of the system. We believe that these insights will be especially useful to those who are working with private corpora as well as those who are building a large-scale production recommendation system. View details
    Attribute-based Propensity for Unbiased Learning in Recommender Systems: Algorithm and Case Studies
    Suming Jeremiah Chen
    Yongwoo Noh
    Jingzheng Qin
    26TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)(2020)
    Preview abstract Many modern recommender systems train their models based on a large amount of implicit user feedback data. Due to the inherent bias in this data (e.g., position bias), learning from it directly can lead to suboptimal models. Recently, unbiased learning was proposed to address such problems by leveraging counterfactual techniques like inverse propensity weighting (IPW). In these methods, propensity scores estimation is usually limited to item's display position in a single user interface (UI). In this paper, we generalize the traditional position bias model to an attribute-based propensity framework. Our methods estimate propensity scores based on offline data and allow propensity estimation across a broad range of implicit feedback scenarios, e.g., feedback beyond recommender system UI. We demonstrate this by applying this framework to three real-world large-scale recommender systems in Google Drive that serve millions of users. For each system, we conduct both offline and online evaluation. Our results show that the proposed framework is able to significantly improve upon strong production baselines across a diverse range of recommendation item types (documents, people-document pairs, and queries), UI layouts (horizontal, vertical, and grid layouts), and underlying learning algorithms (gradient boosted decision trees and neural networks), all without the need to intervene and degrade the user experience. The proposed models have been deployed in the production systems with ease since no serving infrastructure change is needed. View details
    Training Machine Learning Models With Causal Logic
    Ang Li
    Suming Jeremiah Chen
    Jingzheng Qin
    IID Workshop at the 2020 World Wide Web Conference(2020)
    Preview abstract Machine-learning (ML) models are ubiquitously used to make a variety of inferences, a common application being to predict and categorize user behavior. However, ML models often suffer from only being exposed to biased data -- for instance, a search ranking model that uses clicks to determine how to rank will suffer from position bias. The difficulty arises due to user feedback only being observed for one treatment and not existing counterfactually for other potential treatments. In this work, we discuss a real-world situation in which a binary classification model is used in production in order to make decisions about how to treat users. We introduce the model as well as the limitations of our modeling approach, and show that by using counterfactual selection criterion we can improve upon the current modeling process and do a better job classifying users. Following, we propose a causal modeling method in which we can take the existing data and use it to derive bounds that can be used for objective function modification in order to incorporate counterfactual learning into our training process. View details
    Parameter Tuning in Personal Search Systems
    Suming Jeremiah Chen
    13th ACM International Conference on Web Search and Data Mining (WSDM)(2020)
    Preview abstract Retrieval effectiveness in information retrieval systems is heavily dependent on how various parameters are tuned. One option to find these parameters is to run multiple online experiments and using a parameter sweep approach in order to optimize the search system. There are multiple downsides of this approach, mainly that it may lead to a poor experience for users. Another option is to do offline evaluation, which can act as a safeguard against potential quality issues. Offline evaluation requires a validation set of data that can be benchmarked against different parameter settings. However, for search over personal corpora, e.g. email and file search, it is impractical and often impossible to get a complete representative validation set, due to the inability to save raw queries and document information. In this work, we show how to do offline parameter tuning with only a partial validation set. In addition, we demonstrate how to do parameter tuning in the cases when we have complete knowledge of the internal implementation of the search system (white-box tuning), as well as the case where we have only partial knowledge (grey-box tuning). This has allowed us to do offline parameter tuning in a privacy-sensitive manner. View details
    Stabilizing Neural Search Ranking Models
    Ruilin Li
    Suming Jeremiah Chen
    The Web Conference 2020 (WWW)
    Preview abstract Neural search ranking models have been not only actively studied in academic research, but also widely adopted in real-world industrial applications. However, due to the high non-convexity and stochastic nature of neural model formulations, the obtained models are unstable in the sense that model predictions can vary a lot for two models trained with the same configuration. In practice, new features are continuously introduced and new model architectures are explored to improve model effectiveness. In these cases, the instability of neural models lead to unnecessary document ranking changes for a large portion of queries. Such changes not only lead to inconsistent user experience, but also add noise to online experimentation and can slow down model improvement cycles. How to stabilize neural search ranking models during model update is an important but largely unexplored problem. Motivated by trigger analysis, we suggest to balance the trade-off between performance improvement and the amount of affected queries. Concretely, we formulate it as an optimization problem with the objective as maximizing the average effect over the affected queries. We propose two heuristics and one theory-guided stabilization methods to solve the optimization problem. Our proposed methods are evaluated on two of the world's largest personal search services: Gmail search and Google Drive search. Empirical results show that our proposed methods are very effective in optimizing the proposed objective and are applicable to different model update scenarios. View details
    Preview abstract Spell correction is a must-have feature for any modern search engine in applications such as web or e-commerce search. Typical spell correction solutions used in production systems consist of large indexed lookup tables based on a global model trained across many users over a large scale web corpus or a query log. For search over personal corpora, such as email, this global solution is not sufficient, as it ignores the user's personal lexicon. Without personalization, global spelling fails to correct tail queries drawn from a user's own, often idiosyncratic, lexicon. Personalization using existing algorithms is difficult due to resource constraints and unavailability of sufficient data to build per-user models. In this work, we propose a simple and effective personalized spell correction solution that augments existing global solutions for search over private corpora. Our event driven spell correction candidate generation method is specifically designed with personalization as the key construct. Our novel spell correction and query completion algorithms do not require complex model training and is highly efficient. The proposed solution has shown over 30% click-through rate gain on affected queries when evaluated against a range of strong commercial personal search baselines - Google's Gmail, Drive, and Calendar search production systems. View details
    Combining Decision Trees and Neural Networks for Learning-to-Rank in Personal Search
    Pan Li
    25TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)(2019)
    Preview abstract Decision Trees (DTs) like LambdaMART have been one of the most effective types of learning-to-rank algorithms in the past decade. They typically work well with hand-crafted dense features (e.g., BM25 scores). Recently, Neural Networks (NNs) have shown impressive results in leveraging sparse and complex features (e.g., query and document keywords) directly when a large amount of training data is available. While there is a large chunk of work on how to use NNs for semantic matching between queries and documents, relatively less work has been conducted to compare NNs with DTs for general learning-to-rank tasks, where dense features are also available and DTs can achieve state-of-the-art performance. In this paper, we study how to combine DTs and NNs to effectively bring the benefits from both sides in the learning-to- rank setting. Specifically, we focus our study on personal search where clicks are used as the primary labels with unbiased learning- to-rank algorithms and a significantly large amount of training data is easily available. Our combination methods are based on ensemble learning. We design 12 variants and compare them based on two aspects, ranking effectiveness and ease-of-deployment, using two of the largest personal search services: Gmail search and Google Drive search. We show that direct application of existing ensemble methods can not achieve both aspects. We thus design a novel method that uses NNs to compensate DTs via boosting. We show that such a method is not only easier to deploy, but also gives comparable or better ranking accuracy. View details
    Multi-Task Learning for Personal Search Ranking with Query Clustering
    Jiaming Shen
    Proceedings of ACM Conference on Information and Knowledge Management (CIKM)(2018)
    Preview abstract User needs vary significantly across different tasks, and therefore their queries will also vary significantly in their expressiveness and semantics. Many studies have been proposed to model such query diversity by obtaining query types and building query-dependent ranking models. To obtain query types, these studies typically require either a labeled query dataset or clicks from multiple users aggregated over the same document. These techniques, however, are not applicable when manual query labeling is not viable, and aggregated clicks are unavailable due to the private nature of the document collection, e.g., in personal search scenarios. Therefore, in this paper, we study the problem of how to obtain query type in an unsupervised fashion and how to leverage this information using query-dependent ranking models in personal search. We first develop a hierarchical clustering algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine query types. Then, we propose three query-dependent ranking models, including two neural models that leverage query type information as additional features, and one novel multi-task neural model that is trained to simultaneously rank documents and predict query types. We evaluate our ranking models using the click data collected from one of the world’s largest personal search engines. The experiments demonstrate that the proposed multi-task model can significantly outperform the baseline neural models, which either do not incorporate query type information or just simply feed query type as an additional feature. To the best of our knowledge, this is the first successful application of query-dependent multi-task learning in personal search ranking. View details
    No Results Found