Jump to Content
Kai Hui

Kai Hui

Kai is a Senior Software Engineer at Google AI, working on IR and NLP. Prior to that, he worked at Amazon Alexa and SAP. Kai took his Ph.D. from Max-Planck Institute for Informatics in Germany, where he worked on neural IR models and IR evaluation. His research interests focus on the developments of deep learning models for ad-hoc information retrieval and question answering. Kai co-authored peer-reviewed research papers and serves as program committee members, editorial board members, and reviewers for IR/NLP conferences and journals. For complete list of his publications, please check out his homepage or visit his Google Scholar page.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract The distillation of ranking models has become an important topic in both academia and industry. In recent years, several advanced methods have been proposed to tackle this problem, often leveraging ranking information from teacher rankers that is absent in traditional classification settings. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide range of tasks and datasets make it difficult to assess or invigorate advances in this field. This paper first examines representative prior arts on ranking distillation, and raises three questions to be answered around methodology and reproducibility. To that end, we propose a systematic and unified benchmark, Ranking Distillation Suite (RD-Suite), which is a suite of tasks with 4 large realworld datasets, encompassing two major modalities (textual and numeric) and two applications (standard distillation and distillation transfer). RD-Suite consists of benchmark results that challenge some of the common wisdom in the field, and the release of datasets with teacher scores and evaluation scripts for future research. RD-Suite paves the way towards better understanding of ranking distillation, facilities more research in this direction, and presents new challenges. View details
    Learning List-Level Domain-Invariant Representations for Ranking
    Ruicheng Xian
    Hamed Zamani
    Han Zhao
    37th Conference on Neural Information Processing Systems (NeurIPS 2023)
    Preview abstract Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment—learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking. View details
    Preview abstract Popularized by the Differentiable Search Index, the emerging paradigm of Generative Retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus into the parameters of a single transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task consisting of 8.8M passages. After ablating for the most promising techniques, we then consider model scales up to 11B parameters. Along the way, we uncover several findings about scaling generative retrieval to millions of passages. Notably, the use of synthetic query generation as document representation is the only modeling technique critical to retrieval effectiveness. In addition, we find that the strongest performing architecture modifications from the literature at T5-Base initialization only perform well due to added parameters. Naively scaling to a comparable model size outperforms these proposed techniques. Finally, while model scale is necessary as corpus size increases, we find that given existing techniques, scaling model parameters past a certain point can be detrimental for retrieval effectiveness. This result might be counter-intuitive to the commonly held belief that model capacity is a limiting factor for scaling generative retrieval to larger corpora, and suggests the need for more fundamental improvements. In general, we believe that these findings will be highly valuable for the community to clarify the state of generative retrieval at scale and highlight the challenges currently facing the paradigm. View details
    RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
    Jianmo Ni
    Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2023)
    Preview abstract Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but also can be fine-tuned with "pairwise" or "listwise" ranking losses to optimize ranking performance. Our experiments show that the proposed models with ranking losses can achieve substantial ranking performance gains on different public text ranking data sets. Moreover, ranking models fine-tuned with listwise ranking losses have better zero-shot ranking performance on out-of-domain data than models fine-tuned with classification losses. View details
    Preview abstract Recent work has shown that Large Language Models (LLMs) can effectively re-rank the outputs of BM25 retrieval. This is achieved zero-shot by including task-specific instructions. However, for tasks that require scoring instead of generation, few-shot prompting remains underexplored. In this work, we improve LLM-based re-ranking performance by including demonstrations in the prompt. We show that adding even a single demonstration makes a significant impact. Our detailed analysis investigates under which conditions demonstrations are the most helpful. We propose a novel difficulty-based demonstration selection strategy instead of using the commonly used approach of semantic similarity. Furthermore, we show that demonstrations helpful for ranking are also effective at question generation. We hope our research will facilitate further studies into both question generation and passage re-ranking. View details
    Preview abstract In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup. View details
    Preview abstract Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training. View details
    Preview abstract State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws, i.e., running the model on all query-document pairs at inference-time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model using in the form of document to query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results that are comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models. View details
    Preview abstract Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?). View details
    No Results Found