Jiaming Shen
Jiaming Shen is a Senior Research Scientist at Google DeepMind, working on Natural Language Processing and Data Mining. For a complete list of publications and the latest updates, please check his primary homepage at mickeysjm.github.io or visit his Google Scholar page.
Authored Publications
LiPO: Listwise Preference Optimization through Learning-to-Rank
Misha Khalman
Yao Zhao
Jialu Liu
Peter Liu
NAACL (2025)
Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the format of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been no thorough study of directly fitting the policy to a ranked list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.
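As a rough sketch of the listwise view described above, the snippet below implements a LiPO-λ-flavored loss for one prompt in PyTorch. The DPO-style implicit reward and the LambdaLoss-style gain/discount weighting follow the abstract's description, but the exact weighting formula, function names, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Illustrative LiPO-lambda-style listwise loss (a sketch, not the paper's
# code). Inputs are per-response sequence log-probs under the policy and a
# frozen reference model, plus graded preference labels for one prompt's
# ranked list of K responses.
import torch
import torch.nn.functional as F

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """policy_logps, ref_logps, labels: float tensors of shape [K];
    a higher label means a more preferred response."""
    rewards = beta * (policy_logps - ref_logps)  # DPO-style implicit reward
    K = rewards.shape[0]
    # Ranks (1-indexed) induced by the current rewards, used for discounts.
    order = torch.argsort(rewards, descending=True)
    rank = torch.empty(K, dtype=rewards.dtype)
    rank[order] = torch.arange(1, K + 1, dtype=rewards.dtype)
    loss = rewards.new_zeros(())
    for i in range(K):
        for j in range(K):
            if labels[i] > labels[j]:
                # LambdaLoss-flavored pair weight: gain difference times
                # rank-discount difference (an assumed instantiation).
                gain = 2.0 ** labels[i] - 2.0 ** labels[j]
                disc = abs(1.0 / torch.log2(rank[i] + 1.0)
                           - 1.0 / torch.log2(rank[j] + 1.0))
                # Pairwise logistic loss on the reward margin.
                loss = loss - gain * disc * F.logsigmoid(rewards[i] - rewards[j])
    return loss
```

With a list of size two and the pair weight fixed to one, this collapses to a DPO-like pairwise objective, consistent with the special case noted above.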
RMBoost: Reward Model Training With Preference-Conditional Multi-Aspect Synthetic Data Generation
Yennie Jun
Carl Yang
Michael Bendersky
Ran Xu
2025
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. They are trained on preference datasets where each example consists of one input prompt, two responses, and a preference label. As curating a high-quality, human-labeled preference dataset is both time-consuming and expensive, people often rely on existing powerful LLMs to generate preference labels. This can potentially introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. The core idea of RMBoost is to first select a preference label and then directly generate the second, more (or less) preferred response conditioned on this preference label. Compared to traditional approaches that first generate two responses and then obtain the preference label, RMBoost has two main advantages. First, RMBoost reduces labeling noise since preference pairs are constructed intentionally. Second, RMBoost allows for the creation of more diverse responses by incorporating various quality aspects (e.g., helpfulness, relevance, completeness) into the prompts. We conduct extensive experiments on three diverse datasets and demonstrate that RMBoost outperforms other synthetic preference data generation techniques and significantly boosts the performance of five distinct reward models.
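A minimal sketch of the label-first generation flow described above; `generate` is a placeholder for any LLM call, and the prompt wording and aspect list are assumptions rather than the paper's actual prompts.

```python
# Sketch of RMBoost-style preference-conditional pair generation.
import random

ASPECTS = ["helpfulness", "relevance", "completeness"]  # example aspects

def make_preference_pair(generate, prompt):
    first = generate(f"Answer the following prompt.\n\nPrompt: {prompt}")
    # Key idea: fix the preference label *before* generating response two.
    label = random.choice(["preferred", "dispreferred"])
    aspect = random.choice(ASPECTS)
    direction = "better" if label == "preferred" else "worse"
    second = generate(
        f"Prompt: {prompt}\n"
        f"Existing response: {first}\n"
        f"Write a new response that is clearly {direction} than the "
        f"existing one in terms of {aspect}."
    )
    # The label is known by construction, so no post-hoc LLM judging
    # (and its labeling noise) is needed.
    chosen, rejected = (second, first) if label == "preferred" else (first, second)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```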
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
Jialu Liu
Michael Bendersky
Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2024)
Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs with a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-source LLMs. On TREC-DL 2019 and 2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably compared with the previous best approach in the literature, which is based on the black-box commercial GPT-4 with an estimated 50x larger model size, while outperforming other LLM-based solutions, such as InstructGPT with 175B parameters, by over 10% on all ranking metrics. Using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines, beats the black-box commercial ChatGPT solution by 4.2%, and beats pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity.
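The sketch below shows the all-pair flavor of PRP with a generic `llm_choice` callable standing in for the model; the prompt template and tie handling are simplified assumptions, and the efficiency-oriented variants mentioned above (e.g., sorting-based or sliding-window) would replace the quadratic loop here.

```python
# Sketch of Pairwise Ranking Prompting (PRP) with all-pair aggregation.
from itertools import permutations

PROMPT = (
    'Given a query "{q}", which of the following two passages is more '
    "relevant to the query?\n\nPassage A: {a}\nPassage B: {b}\n\n"
    "Output Passage A or Passage B:"
)

def prp_allpair(llm_choice, query, docs):
    """Rank docs by counting pairwise wins; querying both orderings of
    each pair mitigates the LLM's position bias."""
    scores = [0.0] * len(docs)
    for i, j in permutations(range(len(docs)), 2):
        answer = llm_choice(PROMPT.format(q=query, a=docs[i], b=docs[j]))
        if answer.strip().endswith("A"):
            scores[i] += 1.0
        elif answer.strip().endswith("B"):
            scores[j] += 1.0
        else:  # unparsable or inconsistent output: split the credit
            scores[i] += 0.5
            scores[j] += 0.5
    order = sorted(range(len(docs)), key=lambda k: -scores[k])
    return [docs[k] for k in order]
```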
Preference Distillation: Distilling Large Language Models with Teacher-Student Preference Pairs
Rongzhi Zhang
Feng Han
Chao Zhang
Michael Bendersky
Haorui Wang
Jialu Liu
2024
Large Language Models (LLMs) have exhibited impressive capabilities across a diverse range of tasks, yet their enormous parameter counts pose challenges in resource-constrained environments. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. Traditional objectives like KL divergence can falter due to the student model's restricted expressivity, and relying solely on single teacher outputs may result in poorly calibrated student models. In this work, we propose a novel LLM distillation approach that leverages teacher-student preference pairs, steering the student's focus towards understanding the relative quality of outputs instead of merely replicating teacher outputs. This method offers dual benefits: it addresses the limitations of student model expressivity and improves sequence ranking calibration, thereby facilitating more efficient knowledge transfer from teacher to student models. Extensive experiments on sequence generation tasks validate the effectiveness of our approach.
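As a hedged illustration of the idea, the snippet below scores a teacher-student preference pair with a DPO-style pairwise loss: the student is trained to rank the teacher's sequence above its own earlier sample, rather than to match the teacher's token distribution. The exact objective in the paper may differ.

```python
# Sketch of a preference-pair distillation loss (illustrative form).
import torch.nn.functional as F

def preference_distill_loss(student_logp_teacher_seq,
                            student_logp_student_seq,
                            margin=0.0):
    """Each input: the summed log-prob the *student* assigns to a whole
    sequence (the teacher's output vs. the student's own earlier sample),
    shape [B]. Minimizing pushes the student to prefer teacher outputs."""
    return -F.logsigmoid(
        student_logp_teacher_seq - student_logp_student_seq - margin
    ).mean()
```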
Knowledge Distillation with Perturbed Loss: From a Vanilla Teacher to a Proxy Teacher
Rongzhi Zhang
Jialu Liu
Michael Bendersky
Chao Zhang
Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024), ACM, pp. 4278–4289
Knowledge distillation is a popular technique to transfer knowledge from a large teacher model to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence between its output distribution and the teacher's output distribution. In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution. Therefore, forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective, PTLoss, by first representing the vanilla KL-based distillation loss function as a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss implicitly transforms the original teacher into a proxy teacher whose distribution is closer to the ground truth distribution. We establish a theoretical connection between this "distribution closeness" and the student model's generalizability, which enables us to select PTLoss's perturbation coefficients in a principled way. Extensive experiments on six public benchmark datasets demonstrate the effectiveness of PTLoss with teachers of different scales.
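To make the series view concrete: with x_i = 1 - q_i/p_i, the vanilla loss sum_i p_i log(p_i/q_i) expands via -log(1-x) = sum_k x^k/k into sum_i p_i sum_k x_i^k/k, and PTLoss perturbs the 1/k coefficients of the leading terms. The sketch below truncates the series and uses placeholder perturbation values; the principled selection of those coefficients is what the paper derives.

```python
# Sketch of a PTLoss-style perturbed distillation loss (illustrative;
# the epsilon values and truncation order are placeholders).
import torch

def pt_loss(teacher_probs, student_probs, epsilons=(0.1, -0.05), order=4):
    """teacher_probs, student_probs: [B, V] probability distributions."""
    x = 1.0 - student_probs / teacher_probs.clamp_min(1e-8)
    loss = torch.zeros_like(x[..., 0])
    for k in range(1, order + 1):
        # Vanilla Maclaurin coefficient 1/k, perturbed for leading terms.
        coeff = 1.0 / k + (epsilons[k - 1] if k <= len(epsilons) else 0.0)
        loss = loss + coeff * (teacher_probs * x.pow(k)).sum(dim=-1)
    return loss.mean()
```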
“Why is this misleading?”: Detecting News Headline Hallucinations with Explanations
Jialu Liu
Dan Finnie
Negar Rahmati
Mike Bendersky
Proceedings of the ACM Web Conference 2023 (WWW 2023)
Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely the generated headlines not being supported by the original news stories, is a critical challenge for the deployment of this feature in web-scale systems. Meanwhile, due to the infrequency of hallucination cases and the careful reading required for raters to reach the correct consensus, it is difficult to acquire a large dataset for training a hallucination detection model through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts the knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences to explain the hallucination detection results. To evaluate model performance, we carefully collect a dataset with more than six thousand labeled "article, headline" pairs. Extensive experiments on this dataset and six other public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justify its predictions with human-readable natural language explanations.
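A minimal sketch of the underlying recipe: treat the article as an NLI premise and the headline as the hypothesis, then generate a natural language justification. Both callables are assumed stand-ins for ExHalder's fine-tuned components, not its actual architecture.

```python
# Sketch of NLI-based headline hallucination detection with explanation.
def detect_hallucination(nli_classify, explain, article, headline):
    label = nli_classify(premise=article, hypothesis=headline)
    hallucinated = label != "entailment"
    rationale = explain(
        f"Article: {article}\nHeadline: {headline}\n"
        f"Explain why the headline is "
        f"{'not ' if hallucinated else ''}supported by the article."
    )
    return {"hallucinated": hallucinated, "explanation": rationale}
```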
Local Boosting for Weakly-supervised Learning
Rongzhi Zhang
Yue Yu
Xiquan Cui
Chao Zhang
Proc. of 29th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2023)
Boosting is a commonly used technique to enhance the performance of a set of base models by combining them into a strong ensemble model. Though widely adopted, boosting is typically applied in supervised learning where the data is labeled accurately. In weakly supervised learning, however, where most of the data is labeled through weak and noisy sources, designing effective boosting approaches remains nontrivial. In this work, we show that the standard implementation of the convex combination of base learners performs poorly in the presence of noisy labels. Instead, we propose LocalBoost, a novel framework for weakly-supervised boosting. LocalBoost iteratively boosts the ensemble model along two dimensions, i.e., intra-source and inter-source. The intra-source boosting introduces locality to the base learners and enables each base learner to focus on a particular feature regime by training new base learners on granularity-varying error regions. For the inter-source boosting, we leverage a conditional function to indicate the weak source where a sample is more likely to appear. To account for the weak labels, we further design an estimate-then-modify approach to compute the model weights. Experiments on seven datasets show that our method significantly outperforms vanilla boosting methods and other weakly-supervised methods.
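As a toy illustration of the estimate-then-modify step, the sketch below first computes a vanilla AdaBoost-style learner weight from an error estimate measured on weak labels, then shrinks it by an estimated weak-label noise rate; the shrinkage rule is an assumption, not the paper's exact formula.

```python
# Toy estimate-then-modify weight for a base learner under weak labels.
import math

def estimate_then_modify_weight(error, noise_rate):
    """error: weighted error of the learner measured on weak labels;
    noise_rate: estimated fraction of flipped weak labels (0 to 0.5)."""
    eps = min(max(error, 1e-6), 1.0 - 1e-6)
    alpha = 0.5 * math.log((1.0 - eps) / eps)  # vanilla AdaBoost weight
    return alpha * (1.0 - 2.0 * noise_rate)    # discounted for label noise
```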
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
Boshi Wang
Sewon Min
Xiang Deng
Luke Zettlemoyer
Huan Sun
Proc. of The 61st Annual Meeting of the Association for Computational Linguistics (2023)
Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs). CoT explicitly encourages the LLM to generate intermediate rationales for solving a problem, by providing a series of reasoning steps in the demonstrations. Despite its success, there is still little understanding of what makes CoT prompting effective and which aspects of the demonstrated reasoning steps contribute to its performance. In this paper, we show that CoT reasoning is possible even with invalid demonstrations - prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference. Further experiments show that other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning. Overall, these findings both deepen our understanding of CoT prompting, and open up new questions regarding LLMs' capability to learn to reason in context.
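For concreteness, here is what an "invalid demonstration" can look like: the rationale's arithmetic step is deliberately wrong while the CoT format and the final answer stay intact. The prompt text is a made-up example in the spirit of the study, not taken from the paper.

```python
# An invalid-rationale CoT demonstration: "3 + 4 = 9" is a wrong step,
# yet the demo keeps the CoT format and the correct final answer.
PROMPT = """Q: Roger has 3 apples and buys 4 more. How many apples now?
A: Roger starts with 3 apples. 3 + 4 = 9. The answer is 7.

Q: A baker made 5 cakes and sold 2. How many cakes are left?
A:"""
```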
Unsupervised Event Chain Mining from Multiple Documents
Yizhu Jiao
Ming Zhong
Yunyi Zhang
Chao Zhang
Jiawei Han
Proceedings of the ACM Web Conference 2023
Massive and fast-evolving news articles keep emerging on the web. To effectively summarize and provide concise insights into real-world events, we propose a new event knowledge extraction task Event Chain Mining in this paper. Given multiple documents about a super event, it aims to mine a series of salient events in temporal order. For example, the event chain of super event "Mexico Earthquake in 2017" is {"earthquake hit Mexico", "destroy houses", "kill people", "block roads"}. This task can help readers capture the gist of texts quickly, thereby improving reading efficiency and deepening text comprehension. To address this task, we regard an event as a cluster of different mentions of similar meanings. In this way, we can identify the different expressions of events, enrich their semantic knowledge and replenish relation information among them. Taking events as the basic unit, we present a novel unsupervised framework, EMiner. Specifically, we extract event mentions from texts and merge them with similar meanings into a cluster as a single event. By jointly incorporating both content and commonsense, essential events are then selected and arranged chronologically to form an event chain. Meanwhile, we annotate a multi-document benchmark to build a comprehensive testbed for the proposed task. Extensive experiments are conducted to verify the effectiveness of EMiner in terms of both automatic and human evaluations.
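A high-level sketch of the pipeline stages described above; every callable is a stand-in for one of EMiner's actual components (mention extraction, clustering, salience scoring, temporal ordering), so the names and signatures are assumptions.

```python
# Sketch of an EMiner-style event chain mining pipeline.
def mine_event_chain(extract_mentions, embed, cluster, salience, timestamp,
                     documents, top_k=5):
    mentions = [m for doc in documents for m in extract_mentions(doc)]
    # 1) Merge mentions with similar meanings into event clusters.
    events = cluster([embed(m) for m in mentions], mentions)
    # 2) Keep the most salient events (content + commonsense signals).
    salient = sorted(events, key=salience, reverse=True)[:top_k]
    # 3) Arrange chronologically to form the event chain.
    return sorted(salient, key=timestamp)
```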
Cold-Start Data Selection for Better Few-shot Language Model Fine-tuning: A Prompt-based Uncertainty Propagation Approach
Yue Yu
Rongzhi Zhang
Ran Xu
Jieyu Zhang
Chao Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2023)
Large Language Models have demonstrated remarkable few-shot performance, but this performance can be sensitive to the selection of few-shot instances. We propose PATRON, a new method that uses prompt-based uncertainty estimation to select data for pre-trained language model fine-tuning in cold-start scenarios, i.e., when no initial labeled data are available. In PATRON, we design (1) a prompt-based uncertainty propagation approach to estimate the importance of data points and (2) a partition-then-rewrite (PTR) strategy to promote sample diversity when querying for annotations. Experiments on six text classification datasets show that PATRON outperforms the strongest cold-start data selection baselines by up to 6.9%. Moreover, with only 128 labels, PATRON achieves 91.0% and 92.1% of the fully supervised performance with vanilla fine-tuning and prompt-based learning, respectively.
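A simplified sketch of the two ingredients named above: propagate prompt-based uncertainty over embedding-space neighbors, then pick one high-uncertainty example per cluster for diversity. The propagation rule and plain k-means partitioning are simplifications of PATRON's actual uncertainty propagation and partition-then-rewrite strategy.

```python
# Sketch of PATRON-style cold-start selection (simplified).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def select_for_annotation(uncertainty, embeddings, n_picks, rho=0.5):
    """uncertainty: [N] numpy array of prompt-based scores (e.g.,
    predictive entropy); embeddings: [N, D] numpy array of text
    embeddings; returns up to n_picks indices to send for labeling."""
    # Propagate uncertainty from each point's nearest neighbors.
    _, idx = NearestNeighbors(n_neighbors=10).fit(embeddings).kneighbors(embeddings)
    propagated = uncertainty + rho * uncertainty[idx].mean(axis=1)
    # One pick per cluster promotes diversity among the queried labels.
    labels = KMeans(n_clusters=n_picks, n_init=10).fit_predict(embeddings)
    picks = []
    for c in range(n_picks):
        members = np.where(labels == c)[0]
        if len(members) > 0:
            picks.append(int(members[np.argmax(propagated[members])]))
    return picks
```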