Honglei Zhuang
For more information, visit Honglei Zhuang's homepage.
Authored Publications
    Unbiased learning to rank (ULTR) studies the problem of mitigating various biases in implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization allows the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods have ignored: the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which possesses relevance information. We give both theoretical analysis and empirical results to show the negative effects of such a correlation on the relevance tower. We then propose two methods to mitigate these negative confounding effects by better disentangling relevance and bias. Offline empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches. We conducted a live experiment on a popular web store for four weeks and found a significant improvement in user clicks over the baseline, which ignores the negative confounding effect.
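    The additive two-tower factorization this abstract refers to can be illustrated with a minimal sketch (feature sizes, the linear "tower", and the hand-set position biases below are all hypothetical stand-ins for the real networks):

```python
# Minimal sketch of an additive two-tower click model (illustrative, not the paper's code).
# Click logit = relevance logit (from document features) + bias logit (from position).
import numpy as np

rng = np.random.default_rng(0)

def relevance_tower(doc_features, w):
    # Hypothetical linear "tower"; real systems use deep networks here.
    return doc_features @ w

def bias_tower(positions, position_bias):
    # One learned logit per display position.
    return position_bias[positions]

doc_features = rng.normal(size=(5, 8))                # 5 documents, 8 features
w = rng.normal(size=8)
position_bias = np.array([2.0, 1.0, 0.5, 0.2, 0.0])  # examination drops with rank
positions = np.arange(5)                              # set by the logging policy,
                                                      # hence correlated with relevance
click_logits = relevance_tower(doc_features, w) + bias_tower(positions, position_bias)
click_probs = 1 / (1 + np.exp(-click_logits))
print(click_probs)
```

    The confounding issue arises exactly in the `positions` line: because the logging policy placed more relevant documents higher, the bias tower's input carries relevance signal.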
    The distillation of ranking models has become an important topic in both academia and industry. In recent years, several advanced methods have been proposed to tackle this problem, often leveraging ranking information from teacher rankers that is absent in traditional classification settings. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide range of tasks and datasets makes it difficult to assess or invigorate advances in this field. This paper first examines representative prior art on ranking distillation and raises three questions to be answered around methodology and reproducibility. To that end, we propose a systematic and unified benchmark, Ranking Distillation Suite (RD-Suite), a suite of tasks with four large real-world datasets, encompassing two major modalities (textual and numeric) and two applications (standard distillation and distillation transfer). RD-Suite consists of benchmark results that challenge some of the common wisdom in the field, and the release of datasets with teacher scores and evaluation scripts for future research. RD-Suite paves the way towards a better understanding of ranking distillation, facilitates more research in this direction, and presents new challenges.
    Learning List-Level Domain-Invariant Representations for Ranking
    Ruicheng Xian
    Hamed Zamani
    Han Zhao
    37th Conference on Neural Information Processing Systems (NeurIPS 2023)
    Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied to classification and regression problems, its adoption in ranking problems is sporadic, and the few existing implementations lack theoretical justification. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that it implements what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems, where the data and the metrics are defined and computed on lists, not on the items by themselves. To close this discrepancy, we propose list-level alignment: learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.
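    A toy contrast makes the item-level vs. list-level distinction concrete. In this sketch (which uses a simple mean-based discrepancy, not the paper's actual alignment objective), the target lists contain exactly the same items as the source lists but reordered, so an item-level comparison sees no difference while a list-level one does:

```python
# Toy illustration of item-level vs. list-level alignment (not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
num_lists, list_size, dim = 200, 10, 4
src = rng.normal(size=(num_lists, list_size, dim))

# Target lists hold the same items, sorted by feature 0 within each list:
# the item-level marginal is identical, but the list structure differs.
order = np.argsort(src[..., 0], axis=1)[..., None]
tgt = np.take_along_axis(src, order, axis=1)

# Item-level: pool items from all lists together, discarding list structure.
item_gap = np.linalg.norm(src.reshape(-1, dim).mean(0) - tgt.reshape(-1, dim).mean(0))
# List-level: compare distributions of whole-list representations
# (here, each list represented by the concatenation of its items).
list_gap = np.linalg.norm(src.reshape(num_lists, -1).mean(0) - tgt.reshape(num_lists, -1).mean(0))
print(item_gap, list_gap)  # exactly 0 vs. clearly > 0
```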
    RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
    Jianmo Ni
    Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2023)
    Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but also can be fine-tuned with "pairwise" or "listwise" ranking losses to optimize ranking performance. Our experiments show that the proposed models with ranking losses can achieve substantial ranking performance gains on different public text ranking data sets. Moreover, ranking models fine-tuned with listwise ranking losses have better zero-shot ranking performance on out-of-domain data than models fine-tuned with classification losses.
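    To make the listwise idea concrete, here is a minimal sketch of a softmax-style listwise loss computed over all documents of one query; the exact losses used by RankT5 follow the paper:

```python
# Sketch of a listwise "softmax" ranking loss (illustrative; RankT5's exact
# loss definitions are in the paper). Assumes labels sum to a positive value.
import numpy as np

def listwise_softmax_loss(scores, labels):
    # scores: model scores for all documents of one query; labels: graded relevance.
    log_probs = scores - np.log(np.sum(np.exp(scores)))  # log-softmax over the list
    label_dist = labels / labels.sum()
    return -np.sum(label_dist * log_probs)

scores = np.array([2.0, 0.5, -1.0])  # one score per query-document pair
labels = np.array([1.0, 0.0, 0.0])   # the relevant document ranked first
print(listwise_softmax_loss(scores, labels))
```

    Unlike a pointwise classification loss, this loss compares the documents of a query against each other, which is what lets fine-tuning optimize the ranking directly.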
    Recent work has shown that Large Language Models (LLMs) can effectively re-rank the outputs of BM25 retrieval. This is achieved zero-shot by including task-specific instructions. However, for tasks that require scoring instead of generation, few-shot prompting remains underexplored. In this work, we improve LLM-based re-ranking performance by including demonstrations in the prompt. We show that adding even a single demonstration makes a significant impact. Our detailed analysis investigates under which conditions demonstrations are the most helpful. We propose a novel difficulty-based demonstration selection strategy instead of using the commonly used approach of semantic similarity. Furthermore, we show that demonstrations helpful for ranking are also effective at question generation. We hope our research will facilitate further studies into both question generation and passage re-ranking.
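    The prompt layout below is a hypothetical one-demonstration template for scoring-based re-ranking; the actual template, demonstration selection, and scoring call in the paper may differ:

```python
# Hypothetical one-shot prompt for LLM-based passage re-ranking (illustrative).
demonstration = (
    "Query: what causes tides\n"
    "Passage: Tides are caused by the gravitational pull of the moon...\n"
    "Relevant: Yes\n\n"
)
prompt = (
    "Decide whether the passage is relevant to the query.\n\n"
    + demonstration
    + "Query: {query}\n"
      "Passage: {passage}\n"
      "Relevant:"
)
# A scoring-based re-ranker would compare the model's likelihood of "Yes" vs.
# "No" for each candidate passage and sort the passages by that score.
print(prompt.format(query="how do vaccines work",
                    passage="Vaccines train the immune system..."))
```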
    Popularized by the Differentiable Search Index, the emerging paradigm of Generative Retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus into the parameters of a single transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task consisting of 8.8M passages. After ablating for the most promising techniques, we then consider model scales up to 11B parameters. Along the way, we uncover several findings about scaling generative retrieval to millions of passages. Notably, the use of synthetic query generation as document representation is the only modeling technique critical to retrieval effectiveness. In addition, we find that the strongest performing architecture modifications from the literature at T5-Base initialization only perform well due to added parameters. Naively scaling to a comparable model size outperforms these proposed techniques. Finally, while model scale is necessary as corpus size increases, we find that given existing techniques, scaling model parameters past a certain point can be detrimental for retrieval effectiveness. This result might be counter-intuitive to the commonly held belief that model capacity is a limiting factor for scaling generative retrieval to larger corpora, and suggests the need for more fundamental improvements. In general, we believe that these findings will be highly valuable for the community to clarify the state of generative retrieval at scale and highlight the challenges currently facing the paradigm.
    Revisiting Two-Tower Models for Unbiased Learning to Rank
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022), 2410–2414
    The two-tower architecture (with one tower to factorize out position-related bias) has become a common technique in neural network ranking models for Unbiased Learning To Rank (ULTR). In these models, a neural network tower taking in all position-related features is designed to model the biases, which are equivalent to the propensity scores used to define unbiased ranking metrics. It works based on the assumption that user interaction (a click) is conditioned on the user observing a ranked item, and that only the observation probability depends on the position. If we factorize out the observation probability, we can then rank the items without bias by their click rate conditioned on observation. The assumption appears sensible, and additive two-tower models based on it have been widely implemented in ULTR. However, two-tower models may not always work, and sometimes work even worse than biased models, as users may not always follow the same pattern. In this work, we stick to the plausible assumption about user interaction, but we also consider a spectrum of different user behaviors, in which case the position-related observation probability may not be explicitly factorizable. We study generic methods to treat this complexity and show that these methods can outperform simple additive debiasing models in offline experiments.
    Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training.
    Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. Given an instance, an MCC model computes a score for each class, all of which are used to sort the classes. The performance of a model is usually measured by Top-K Accuracy/Error (e.g., K=1 or 5). In this paper, we do not aim to propose new neural network architectures as most recent works do, but to show that it is promising to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as an "instance class ranking" problem, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain, can be more informative than the existing Top-K metrics. We further demonstrate that the dominant neural MCC recipe can be transformed to a neural ranking pipeline. Based on such generalization, we show that it is intuitive to leverage techniques from the rich information retrieval literature to improve the MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone neural models show the value of our proposed framework.
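    A small worked example shows why a ranking metric can be more informative than Top-1 accuracy: both toy models below miss the true class at rank 1 (Top-1 accuracy 0 for both), but NDCG still distinguishes ranking it second from ranking it last. The scores are made up for illustration:

```python
# NDCG for the "instance class ranking" view of multiclass classification.
import numpy as np

def ndcg_at_k(scores, true_class, k=5):
    order = np.argsort(-scores)[:k]                  # classes sorted by score
    gains = (order == true_class).astype(float)      # binary relevance: true class only
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1/log2(rank+1)
    return (gains * discounts).sum() / discounts[0]  # ideal DCG puts true class first

true_class = 3
model_a = np.array([0.5, 0.1, 0.1, 0.4, 0.1])   # true class ranked 2nd
model_b = np.array([0.5, 0.4, 0.3, 0.05, 0.2])  # true class ranked last
print(ndcg_at_k(model_a, true_class))  # ~0.63
print(ndcg_at_k(model_b, true_class))  # ~0.39; Top-1 accuracy is 0 for both
```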
    We explore a novel perspective on knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers. Unlike existing ranking distillation work, which pursues a good trade-off between performance and efficiency, SDR is able to significantly improve the ranking performance of students over the teacher rankers without increasing model capacity. The key success factors of SDR, which differ from common distillation techniques for classification, are: (1) an appropriate teacher score transformation function, and (2) a novel listwise distillation framework. Both techniques are specifically designed for ranking problems and are rarely studied in the existing knowledge distillation literature. Building upon a state-of-the-art neural ranking structure, SDR is able to push the limits of neural ranking performance above a recent rigorous benchmark study, and significantly outperforms traditionally strong gradient boosted decision tree based models on 7 out of 9 key metrics, for the first time in the literature. In addition to the strong empirical results, we give theoretical explanations of why listwise distillation is effective for neural rankers, and provide ablation studies to verify the necessity of the key factors in the SDR framework.
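    The sketch below illustrates the listwise distillation idea: the student matches the teacher's distribution over the documents of one list rather than its raw scores. The temperature-scaled softmax here is a simple stand-in for the paper's teacher score transformation, whose exact form follows the paper:

```python
# Sketch of listwise distillation for ranking (illustrative, not SDR's exact recipe).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def listwise_distillation_loss(student_scores, teacher_scores, temperature=2.0):
    teacher_dist = softmax(teacher_scores / temperature)  # transformed teacher scores
    student_log_probs = np.log(softmax(student_scores))
    return -np.sum(teacher_dist * student_log_probs)      # cross-entropy over the list

teacher = np.array([3.0, 1.0, 0.2, -0.5])  # teacher scores for one query's documents
student = np.array([2.5, 1.2, 0.0, -0.3])
print(listwise_distillation_loss(student, teacher))
```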
    Rax: Composable Learning-to-Rank using JAX
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022), 3051–3060
    Rax is a library for composable Learning-to-Rank (LTR) written entirely in JAX. The goal of Rax is to facilitate easy prototyping of LTR systems by leveraging the flexibility and simplicity of JAX. Rax provides a diverse set of popular ranking metrics and losses that integrate well with the rest of the JAX ecosystem. Furthermore, Rax implements a system of ranking-specific function transformations which allows fine-grained customization of ranking losses and metrics. Most notably Rax provides approx_t12n: a function transformation (t12n) that can transform any of our ranking metrics into an approximate and differentiable form that can be optimized. This provides a systematic way to directly optimize neural ranking models for ranking metrics that are not easily optimizable in other libraries. We empirically demonstrate the effectiveness of Rax by benchmarking neural models implemented using Flax and trained using Rax on two popular LTR benchmarks: WEB30K and Istella. Furthermore, we show that integrating ranking losses with T5, a large language model, can improve overall ranking performance on the MS MARCO passage ranking task. We are sharing the Rax library with the open source community as part of the larger JAX ecosystem at https://github.com/google/rax.
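    A short usage sketch, modeled on the examples in the Rax repository (exact signatures are best checked against the library's documentation):

```python
# Rax usage sketch: metrics, losses, and the approx_t12n transformation.
import jax.numpy as jnp
import rax

scores = jnp.asarray([2.2, -1.3, 5.4])  # model scores for one ranked list
labels = jnp.asarray([1.0, 0.0, 0.0])   # graded relevance labels

print(rax.ndcg_metric(scores, labels))   # evaluate a ranking metric
print(rax.softmax_loss(scores, labels))  # a standard listwise loss

# approx_t12n turns a (non-differentiable) ranking metric into an
# approximate, differentiable loss that can be optimized directly.
approx_ndcg_loss = rax.approx_t12n(rax.ndcg_metric)
print(approx_ndcg_loss(scores, labels))
```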
    Stochastic Retrieval-Conditioned Reranking
    Hamed Zamani
    The ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR) 2022
    The multi-stage cascaded architecture has been adopted by many search engines for efficient and effective retrieval. This architecture consists of a stack of retrieval and reranking models in which efficient retrieval models are followed by effective (neural) learning to rank models. The optimization of these learning to rank models is only loosely connected to the early-stage retrieval models; in many cases they are trained in isolation from them. This paper draws theoretical connections between the early-stage retrieval and late-stage reranking models by deriving expected reranking performance conditioned on the early-stage retrieval results. Our findings shed light on the optimization of both retrieval and reranking models. As a result, we also introduce a novel loss function for training reranking models that leads to significant improvements on multiple public benchmarks.
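    One intuitive (Monte Carlo) reading of "expected reranking performance conditioned on retrieval" is sketched below: sample which documents the stochastic first stage retrieves, let the reranker order only those, and average the resulting metric. The paper derives this expectation analytically; the per-document retrieval probabilities and the near-perfect reranker here are made-up inputs for illustration:

```python
# Toy Monte Carlo estimate of expected reranking performance given retrieval.
import numpy as np

rng = np.random.default_rng(0)
relevance = np.array([1.0, 1.0, 0.0, 0.0, 0.0])          # true labels, 5 documents
retrieval_probs = np.array([0.9, 0.4, 0.8, 0.7, 0.6])    # P(doc retrieved by stage 1)
reranker_scores = relevance + rng.normal(0, 0.1, size=5)  # a near-perfect reranker

def precision_at_1(retrieved_mask):
    docs = np.flatnonzero(retrieved_mask)
    if docs.size == 0:
        return 0.0
    best = docs[np.argmax(reranker_scores[docs])]  # reranker picks among retrieved only
    return relevance[best]

samples = [precision_at_1(rng.random(5) < retrieval_probs) for _ in range(10_000)]
print(np.mean(samples))  # expectation capped by what retrieval actually surfaces
```

    Even this toy version shows the key coupling: a near-perfect reranker's expected performance is bounded by the first-stage retrieval distribution.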
    State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws: running the model on all query-document pairs at inference time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model on the task of document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference-time speedups, since the decoder-only architecture only needs to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models.
    Despite the success of neural models in many major machine learning problems, and despite recently published neural learning to rank (LTR) papers in top venues, the effectiveness of neural models on traditional LTR problems is still not widely acknowledged. We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available tree-based implementation, which is sometimes ignored in recent neural LTR papers. We then investigate why existing neural LTR models suffer by identifying several of their weaknesses. To that end, we propose a new neural LTR framework that mitigates these weaknesses by borrowing ideas from several research fields. Our models are able to perform comparably with the strong tree-based baseline, while outperforming recently published neural LTR methods by a large margin. Our results also serve as a benchmark for neural LTR models.
    We describe how we built three recommendation products from scratch at the Google Chrome Web Store: context-based recommendations, related extension recommendations, and personalized recommendations. Unlike most existing papers that focus on novel algorithms, this paper focuses on sharing practical experiences building large-scale recommender systems under various real-world constraints, such as privacy constraints, data sparsity, highly skewed data distributions, and product design choices such as the user interface. We show how these constraints make it difficult for standard approaches to succeed in practice. We share success stories that turned very negative live metrics into very positive ones, by introducing: 1) interpretable neural models that bootstrap the systems, help identify pipeline issues, and pave the way for more advanced models; 2) a new item-item recommendation algorithm that works under highly skewed data distributions; and 3) a way for two products to help bootstrap the third, which significantly reduces development cycles and bypasses various real-world difficulties. All the explorations in this work were verified in live traffic on millions of users. We believe the findings in this work can help practitioners bootstrap and build large-scale recommender systems.
    Ensemble Distillation for BERT-Based Ranking Models
    Shuguang Han
    Proceedings of the 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR ’21)
    Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and have shown superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost performance, but also reduce performance variance. However, an ensemble of models is more costly because it needs computing resources and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, and multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and TREC-COVID data sets. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performance of the distilled models is also more stable than that of models fine-tuned on the original labeled data. The results reveal a promising direction to capitalize on the gains achieved by an ensemble of BERT-based ranking models.
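    One simple teacher-label design the abstract alludes to can be sketched as follows: average the ensemble's scores into per-document teacher labels and fit the student against them. The scores and the plain MSE loss below are illustrative stand-ins; the paper studies several label designs and ranking-tailored losses:

```python
# Sketch of ensemble distillation for ranking (illustrative label design and loss).
import numpy as np

ensemble_scores = np.array([   # 3 teacher rankers x 4 documents for one query
    [2.1, 0.3, -0.5, 1.0],
    [1.8, 0.5, -0.2, 1.2],
    [2.4, 0.1, -0.7, 0.8],
])
teacher_labels = ensemble_scores.mean(axis=0)   # one design: mean teacher score
student_scores = np.array([1.9, 0.4, -0.4, 1.1])

# A pointwise distillation loss against the teacher labels; the paper also
# studies listwise losses tailored to ranking.
mse = np.mean((student_scores - teacher_labels) ** 2)
print(teacher_labels, mse)
```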
    A well-known challenge in leveraging implicit user feedback like clicks to improve real-world search services and recommender systems is its inherent bias. Most existing click models are based on the examination hypothesis in user behaviors and differ in how to model such an examination bias. However, they are constrained by assuming a simple position-based bias or enforcing a sequential order in user examination behaviors. These assumptions are insufficient to capture complex real-world user behaviors and hardly generalize to modern user interfaces (UI) in web applications (e.g., results shown in a grid view). In this work, we propose a fully data-driven neural model for the examination bias, Cross-Positional Attention (XPA), which is more flexible in fitting complex user behaviors. Our model leverages the attention mechanism to effectively capture cross-positional interactions among displayed items and is applicable to arbitrary UIs. We employ XPA in a novel neural click model that can both predict clicks and estimate relevance. Our experiments on offline synthetic data sets show that XPA is robust among different click generation processes. We further apply XPA to a large-scale real-world recommender system, showing significantly better results than baselines in online A/B experiments that involve millions of users. This validates the necessity to model more complex user behaviors than those proposed in the literature.
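    The core mechanism can be sketched as attention across display positions, so the examination estimate at each position can depend on everything else shown in the UI. All embeddings and the projection head below are random placeholders, and XPA's actual architecture follows the paper:

```python
# Toy cross-positional attention for examination estimation (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 6, 8
item_embs = rng.normal(size=(num_items, dim))  # embeddings of the displayed items
pos_embs = rng.normal(size=(num_items, dim))   # embeddings of their display positions

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Positions act as queries; displayed items act as keys/values.
attn = softmax(pos_embs @ item_embs.T / np.sqrt(dim))  # (position, item) weights
context = attn @ item_embs                             # cross-positional context
examination_logits = context @ rng.normal(size=dim)    # toy projection head
examination_probs = 1 / (1 + np.exp(-examination_logits))
print(examination_probs)  # per-position examination estimate, aware of the whole UI
```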
    Interpretable Ranking with Generalized Additive Models
    Alexander Grushetsky
    Petr Mitrichev
    Ethan Sterling
    Nathan Bell
    Walker Ravina
    Hai Qian
    Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM) (2021)
    Interpretability of ranking models is a crucial yet relatively under-examined research area. Recent progress in this area largely focuses on generating post-hoc explanations for existing black-box ranking models. Though promising, such post-hoc methods cannot provide sufficiently accurate explanations in general, which makes them infeasible in many high-stakes scenarios, especially ones with legal or policy constraints. Thus, building an intrinsically interpretable ranking model with a transparent, self-explainable structure becomes necessary, but this remains less explored in the learning-to-rank setting. In this paper, we lay the groundwork for intrinsically interpretable learning-to-rank by introducing generalized additive models (GAMs) into ranking tasks. GAMs are intrinsically interpretable machine learning models and have been extensively studied on regression and classification tasks. We study how to extend GAMs into ranking models which can handle both item-level and list-level features, and propose a novel formulation of ranking GAMs. To instantiate ranking GAMs, we employ neural networks instead of traditional splines or regression trees. We also show that our neural ranking GAMs can be distilled into a set of simple and compact piece-wise linear functions that are much more efficient to evaluate with little accuracy loss. We conduct experiments on three data sets and show that our proposed neural ranking GAMs can outperform other traditional GAM baselines while maintaining similar interpretability.
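    The additive structure that makes GAMs interpretable is easy to sketch: the ranking score is a sum of per-feature sub-models, so each feature's contribution can be read off in isolation. The tiny fixed nonlinearity and parameters below are hypothetical; real ranking GAMs use small neural sub-networks and also handle list-level context features:

```python
# Minimal sketch of a (ranking) GAM's additive scoring structure.
import numpy as np

def sub_model(x, a, b, c):
    # Hypothetical one-feature sub-network (a tiny fixed nonlinearity here).
    return c * np.tanh(a * x + b)

params = [(1.0, 0.0, 2.0), (0.5, -1.0, 1.0), (2.0, 0.3, -0.5)]  # one triple per feature

def gam_score(features):
    # Additive structure: the final score is the sum of per-feature contributions.
    return sum(sub_model(x, *p) for x, p in zip(features, params))

doc = [0.7, 3.2, -0.1]
contributions = [sub_model(x, *p) for x, p in zip(doc, params)]
print(contributions, sum(contributions))  # per-feature effects and the final score
```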
    How to leverage cross-document interactions to improve ranking performance is an important topic in information retrieval research. The recent developments in deep learning show strength in modeling complex relationships across sequences and sets. It thus motivates us to study how to leverage cross-document interactions for learning-to-rank in the deep learning framework. In this paper, we formally define the permutation equivariance requirement for a scoring function that captures cross-document interactions. We then propose a self-attention based document interaction network that extends any univariate scoring function with contextual features capturing cross-document interactions. We show that it satisfies the permutation equivariance requirement, and can generate scores for document sets of varying sizes. Our proposed methods can automatically learn to capture document interactions without any auxiliary information, and can scale across large document sets. We conduct experiments on four ranking datasets: the public benchmarks WEB30K and Istella, as well as Gmail search and Google Drive Quick Access datasets. Experimental results show that our proposed methods lead to significant quality improvements over state-of-the-art neural ranking models, and are competitive with state-of-the-art gradient boosted decision tree (GBDT) based models on the WEB30K dataset.
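    The permutation equivariance requirement can be checked numerically for a plain self-attention scorer: permuting the input documents permutes the output scores the same way. The random features and weight vector below are placeholders for a real model:

```python
# Quick numerical check that self-attention scoring is permutation-equivariant.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))  # 5 documents, 8 features each
w = rng.normal(size=8)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(x):
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))  # every document attends to every other
    return (attn @ x) @ w                          # one contextual score per document

perm = rng.permutation(5)
print(np.allclose(attention_scores(docs)[perm],
                  attention_scores(docs[perm])))   # True: scores permute with the input
```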
    Feature Transformation for Neural Ranking Models
    Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pp. 1649-1652
    Although neural network models enjoy tremendous advantages in handling image and text data, tree-based models still remain competitive for learning-to-rank tasks with numerical data. A major strength of tree-based ranking models is their insensitivity to different feature scales, while neural ranking models may suffer from features with varying scales or skewed distributions. Feature transformation or normalization is a simple technique which preprocesses input features to mitigate their potential adverse impact on neural models. However, due to a lack of studies, it is unclear to what extent feature transformation can benefit neural ranking models. In this paper, we aim to answer this question by providing empirical evidence for learning-to-rank tasks. First, we present a list of commonly used feature transformation techniques and perform a comparative study on multiple learning-to-rank data sets. Then we propose a mixture feature transformation mechanism which can automatically derive a mixture of basic feature transformation functions to achieve the optimal performance. Our experiments show that applying feature transformation can substantially improve the performance of neural ranking models compared to directly using the raw features. In addition, the proposed mixture transformation method can further improve the performance of the ranking model without any additional human effort. View details
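    A few of the basic transformations such a study would compare, plus the mixture idea, can be sketched directly. The mixture weights below are made up; in the proposed mechanism they would be learned per feature:

```python
# Common input transformations for skewed numeric ranking features (illustrative).
import numpy as np

x = np.array([0.0, 3.0, 250.0, 10000.0])   # a heavily skewed ranking feature

log1p = np.log1p(x)                         # compresses large values
symlog = np.sign(x) * np.log1p(np.abs(x))   # like log1p, but safe for negatives
standardized = (x - x.mean()) / x.std()     # zero mean, unit variance

weights = np.array([0.6, 0.1, 0.3])         # hypothetical learned mixture weights
mixed = weights[0] * log1p + weights[1] * symlog + weights[2] * standardized
print(mixed)
```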
    Separate And Attend in Personal Email Search
    Yu Meng
    Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM) (2020)
    In personal email search, user queries often impose different requirements on different aspects of the retrieved emails. For example, the query "my recent flight to the US" requires emails to be ranked based on both the textual content and the recency of the email documents, while other queries such as "medical history" do not impose any constraints on recency. Recent deep learning-to-rank models for personal email search often directly concatenate dense numerical features with embedded sparse features (e.g., n-gram embeddings). In this paper, we first show with a set of experiments on synthetic datasets that the direct concatenation of dense and sparse features does not lead to the optimal search performance of deep neural ranking models. To effectively incorporate both sparse and dense email features into personal email search ranking, we propose a novel neural model, SepAttn. SepAttn first builds two separate neural models to learn from sparse and dense features respectively, and then applies an attention mechanism at the prediction level to derive the final prediction from these two models. We conduct a comprehensive set of experiments on a large-scale email search dataset, and demonstrate that our SepAttn model consistently improves search quality over the baseline models.
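    The separate-then-attend structure can be sketched as two sub-models whose predictions are combined by an input-conditioned gate. Everything below (feature sizes, random weights, the simple two-way softmax gate) is a hypothetical stand-in for SepAttn's actual architecture:

```python
# Toy sketch of the separate-and-attend idea (illustrative, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
dense = rng.normal(size=16)        # e.g., recency and frequency features
sparse_emb = rng.normal(size=32)   # e.g., pooled n-gram embeddings

dense_pred = dense @ rng.normal(size=16)        # prediction of the dense sub-model
sparse_pred = sparse_emb @ rng.normal(size=32)  # prediction of the sparse sub-model

# Prediction-level attention: weights conditioned on the inputs, so a query like
# "my recent flight" could up-weight the dense (recency-aware) sub-model.
gate_logits = np.array([dense @ rng.normal(size=16),
                        sparse_emb @ rng.normal(size=32)])
gate = np.exp(gate_logits) / np.exp(gate_logits).sum()
final_pred = gate[0] * dense_pred + gate[1] * sparse_pred
print(gate, final_pred)
```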