Chen-Yu Lee

Chen-Yu Lee

Chen-Yu Lee is a research scientist at Google, where he works on machine learning and its real-world applications across various tasks and modalities. Previously, he spent two years at Apple, where he published the Technology Development Group's inaugural research paper at CVPR and launched several key features in ARKit (now Vision Pro). He received his PhD from UC San Diego, advised by Prof. Zhuowen Tu.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We propose Heterogeneous Swarms, an algorithm to discover and adapt multi-LLM systems by jointly optimizing model roles and weights. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as input-output relationships and optimize the directed acyclic graph (DAG) of LLMs representing a multi-LLM system. Starting from a swarm of randomly initialized continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order with message passing, evaluate on the utility function, and optimize the adjacency matrices with swarm intelligence based on the utility score. For weight-step, we define JFK-score to evaluate the contribution of individual LLMs in the best-found DAG of the role-step, then optimize model weights with swarm intelligence based on the JFK-score. Extensive experiments demonstrate that Heterogeneous Swarms outperforms 15 baselines spanning role-based and weight-based approaches by 18.5% on average across 12 tasks and contexts. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of initial LLMs. View details
    Deep Researcher with Test-time Diffusion
    Guan Sun
    Zoey CuiZhu
    Yuanjun (Sophia) Bi
    Weiming Wen
    Hui Wan
    Chunfeng Wen
    Solène Maître
    George Lee
    Vishy Tirumalashetty
    Emily Xue
    Burak Gokturk
    2025
    Preview abstract Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design guides the report writing process to be more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents. View details
    Preview abstract Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset. View details
    Preview abstract We propose Model Swarms, a collaborative search algorithm to adapt LLM experts via swarm intelligence. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and search for adapted models that optimize the utility function. Compared to existing model composition approaches, Model Swarms offers modularity, works in low-data regimes, and doesn't need assumptions about existing experts and how they should be composed. Extensive experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single dataset, multi-dataset domains, reward models, as well as diverse human preferences. Further analysis reveals that LLM experts discover previously unseen capabilities in the search process and that Model Swarms enable the weak-to-strong transition of experts through the collaborative search process. View details
    Preview abstract Recent knowledge distillation (KD) research made significant progress on improving smaller student models to match larger teachers' performances. Two noticeable methods, supervised KD and on-policy KD emerged as the state-of-the-art approaches. However, supervised KD for auto-regressive models suffers from distribution mismatch between training over fixed dataset and inference over student generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples and the teacher's potential inaccuracies in assessing these samples. To address these limitations, we introduce Speculative Knowledge Distillation (SKD). Instead of solely training on teacher- or student-proposed samples, SKD leverages the student model to initially propose tokens following its own generation distribution. Subsequently, the teacher model is employed to replace tokens that are deemed out-of-distribution. Compared with supervised KD, the samples generated by SKD are more likely to align with the student's inference-time distribution, and 2) SKD can mitigate the generation of low-quality sequences by incorporating the teacher's feedback at each token. Furthermore, we demonstrate that SKD is a generic framework capable of implementing both supervised and on-policy knowledge distillation as specific instances. To validate SKD's effectiveness, we apply it to distill autoregressive large language models for various tasks, including translation, summarization, math, and instruction following. Our experiments consistently demonstrate SKD's superior performance compared to existing methods across different domains, tasks, data sizes, and model initialization strategies. View details
    Preview abstract Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD, are adversely impacted by the knowledge gaps between teacher-student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student’s inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies View details
    Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
    Zilong Wang
    Steven Zheng
    Swaroop Mishra
    Yuwei Zhang
    Anush Mattapalli
    Ankur Taly
    Jingbo Shang
    ICLR 2025
    Preview abstract Retrieval augmented generation (RAG) has attracted a lot of attention across both academia and industry due to its capability in inserting timely and accurate evidence to the generation by large language models. However, the introduction of retrieved evidence largely makes the input prompt longer, which would harm the understanding quality of large language models and make it slower in actual usage scenarios. To solve these issues, we propose SpeculativeRAG, which leverages a smaller LLM to conduct the retrieval augmented generation for a larger LLM. The smaller LLM can digest a few pieces of evidence and generate multiple pieces of drafts in parallel rapidly, and these drafts will be verified by a large LLM to guarantee the quality. We achieve a higher speed as well as a better quality in the RAG results. View details
    LMDX: Language Model-based Document Information Extraction And Localization
    Kai Kang
    Florian Luisier
    Xiaoyu Sun
    Ramya Sree Boppana
    Zilong Wang
    Jiaqi Mu
    Hao Zhang
    Nan Hua
    Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, pp. 15140-15168
    Preview abstract Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art and exhibiting emergent capabilities across various tasks. However, their application in extracting information from visually rich documents, which is at the core of many document processing workflows and involving the extraction of key entities from semi-structured documents, has not yet been successful. The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs, which is critical for high quality extraction, and the lack of a grounding mechanism to localize the predicted entities within the document. In this paper, we introduce Language Model-based Document Information EXtraction and Localization (LMDX), a methodology to reframe the document information extraction task for a LLM. LMDX enables extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. Finally, we apply LMDX to the PaLM 2-S and Gemini Pro LLMs and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers. View details
    Found in the middle: Calibrating Positional Attention Bias Improves Long Context Utilization
    Cheng-Yu Hsieh
    Yung-Sung Chuang
    Chun-Liang Li
    Abhishek Kumar
    James Glass
    Alexander Ratner
    Ranjay Krishna
    2024
    Preview abstract Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle to LLMs' intrinsic attention bias: LLMs exhibit a U-shaped attention bias where the tokens at the beginning and at the end of its input receive higher attention, regardless of their relevance. Second, we mitigate this positional bias through a calibration mechanism, found-in-the-middle, that allows the model to attend to contexts faithfully according to their relevance, even though when they are in the middle. Third, we show found-in-the-middle not only achieves better performance in locating relevant information within a long context, but also eventually leads to improved retrieval-augmented generation (RAG) performance across various tasks, outperforming existing methods by up to 15 percentage points. These findings open up future directions in understanding LLM attention bias and its potential consequences. View details
    Preview abstract Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, by either feeding LMs with raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. Larger LM responses that closely align with the smaller LMs' output, which relies exclusively on cited documents, are verified. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant performance gains of 1.5% to 7% absolute average without any required model fine-tuning. View details