
Hamid Palangi
Thanks for visiting my page; please refer to www.hamidpalangi.com for more information.
Authored Publications
Model Swarms: Collaborative Search of Adapted LLM Experts via Swarm Intelligence
Shangbin Feng
Yike Wang
Nathalie Rauschmayr
Yejin Choi
Yulia Tsvetkov
ICML 2025
We propose Model Swarms, a collaborative search algorithm to adapt LLM experts via swarm intelligence. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and search for adapted models that optimize the utility function. Compared to existing model composition approaches, Model Swarms offers modularity, works in low-data regimes, and makes no assumptions about existing experts or how they should be composed. Extensive experiments demonstrate that Model Swarms can flexibly adapt LLM experts to a single dataset, multi-dataset domains, reward models, as well as diverse human preferences. Further analysis reveals that LLM experts discover previously unseen capabilities during the search and that Model Swarms enables the weak-to-strong transition of experts through the collaborative search process.
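To make the search concrete, here is a minimal sketch of the particle-swarm-style optimization the abstract describes, treating each expert as a flattened weight vector guided by its personal best and the swarm's best-found checkpoint. The coefficients, the flattened-weight representation, and the toy `utility` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def model_swarms(experts, utility, steps=50, inertia=0.6, c_personal=1.5, c_global=1.5):
    """Hypothetical PSO-style search over LLM expert weights (flattened vectors).

    experts: list of 1-D np.ndarray weight vectors, one per LLM expert
    utility: callable mapping a weight vector to a scalar score to maximize
    """
    x = [w.copy() for w in experts]              # current positions in weight space
    v = [np.zeros_like(w) for w in experts]      # velocities
    p_best = [w.copy() for w in x]               # each expert's best-found checkpoint
    p_score = [utility(w) for w in x]
    g_idx = int(np.argmax(p_score))
    g_best, g_score = p_best[g_idx].copy(), p_score[g_idx]  # swarm's best checkpoint

    for _ in range(steps):
        for i in range(len(x)):
            r1, r2 = np.random.rand(), np.random.rand()
            # Move each expert toward its personal best and the swarm's global best.
            v[i] = (inertia * v[i]
                    + c_personal * r1 * (p_best[i] - x[i])
                    + c_global * r2 * (g_best - x[i]))
            x[i] = x[i] + v[i]
            s = utility(x[i])
            if s > p_score[i]:                   # update personal best
                p_best[i], p_score[i] = x[i].copy(), s
                if s > g_score:                  # update global best
                    g_best, g_score = x[i].copy(), s
    return g_best, g_score

# Toy usage: adapt four random "experts" toward a target weight vector.
target = np.ones(8)
experts = [np.random.randn(8) for _ in range(4)]
best, score = model_swarms(experts, lambda w: -np.linalg.norm(w - target))
```

On a toy utility like the negative distance above, the swarm contracts toward the best-found region, mirroring how the experts converge on adapted models.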
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialog Agents
Zhen Tan
George Lee
Anand Iyer
Tianlong Chen
Huan Liu
ACL 2025
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
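The two reflections can be pictured with a small sketch: a memory bank that stores summaries at multiple granularities and a retrieval weight that is nudged up or down depending on whether the LLM actually cited the retrieved memory. The embedding function, scoring rule, and update rule below are assumptions of this sketch, not the paper's exact algorithm.

```python
import numpy as np

class ReflectiveMemory:
    """Illustrative sketch of RMM's forward- and backward-looking reflections."""

    def __init__(self, embed, lr=0.1):
        self.embed = embed    # assumed callable: text -> 1-D np.ndarray
        self.entries = []     # each entry: [granularity, text, embedding, weight]
        self.lr = lr

    def prospective_reflection(self, summaries):
        """Store summaries at multiple granularities (utterance/turn/session)
        in the personalized memory bank for future retrieval."""
        for granularity, text in summaries:
            self.entries.append([granularity, text, self.embed(text), 1.0])

    def retrieve(self, query, k=3):
        """Rank memories by learned weight times embedding similarity."""
        q = self.embed(query)
        ranked = sorted(enumerate(self.entries),
                        key=lambda item: -item[1][3] * float(q @ item[1][2]))
        return ranked[:k]    # list of (index, entry) pairs

    def retrospective_reflection(self, retrieved, cited_indices):
        """Online RL-style update: upweight memories the LLM cited as
        evidence, downweight retrieved-but-uncited ones."""
        for idx, entry in retrieved:
            reward = 1.0 if idx in cited_indices else -1.0
            entry[3] = max(0.0, entry[3] + self.lr * reward)
```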
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
Shangbin Feng
Yike Wang
Weijia Shi
Huang Xia
Luke Zettlemoyer
Yulia Tsvetkov
NeurIPS 2025
We propose Heterogeneous Swarms, an algorithm to discover and adapt multi-LLM systems by jointly optimizing model roles and weights. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as input-output relationships and optimize the directed acyclic graph (DAG) of LLMs representing a multi-LLM system. Starting from a swarm of randomly initialized continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order with message passing, evaluate on the utility function, and optimize the adjacency matrices with swarm intelligence based on the utility score. For weight-step, we define JFK-score to evaluate the contribution of individual LLMs in the best-found DAG of the role-step, then optimize model weights with swarm intelligence based on the JFK-score. Extensive experiments demonstrate that Heterogeneous Swarms outperforms 15 baselines spanning role-based and weight-based approaches by 18.5% on average across 12 tasks and contexts. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of initial LLMs.
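The role-step can be illustrated with a short sketch: threshold a continuous adjacency matrix into a DAG, then call the LLMs in topological order, passing each node the outputs of its predecessors. The decoding scheme (keeping only forward edges under a fixed node order) and the choice of the last node as the system output are assumptions of this sketch; the weight-step and JFK-score are not shown.

```python
import numpy as np

def decode_dag(adj, threshold=0.5):
    """Decode a continuous adjacency matrix into a DAG. Keeping only edges
    that point 'forward' under a fixed node order is one simple way to
    guarantee acyclicity; the paper's decoding may differ."""
    n = adj.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i, j] > threshold]

def run_multi_llm_system(adj, llms, task_input, threshold=0.5):
    """Call LLMs in topological order with message passing. Each llms[i]
    is a callable prompt -> text; treating the last node's output as the
    system's answer is an assumption of this sketch."""
    edges = decode_dag(np.asarray(adj), threshold)
    n = len(llms)
    preds = {j: [i for (i, k) in edges if k == j] for j in range(n)}
    outputs = {}
    for node in range(n):  # ascending order is topological for forward-only edges
        messages = [outputs[p] for p in preds[node]]
        prompt = "\n".join([task_input] + messages)
        outputs[node] = llms[node](prompt)
    return outputs[n - 1]
```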
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Evaluation of instruction following capabilities for multi-modal, multi-turn chat is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters, and we show that LLM-based judges are biased towards answers from the same model. We propose a new evaluation set, MMMT-IF, an image-based multi-turn Q&A task with added global instructions between questions, constraining the format of the answers. This reveals limitations of current models for following multiple instructions and is challenging, as the models need to first retrieve multiple instructions spread out in the long chat history and then reason over them to answer image-based questions under the instruction constraints. All the instructions and constraints are program-verifiable, i.e., verifying them is objective. We propose a set of metrics referred to as Programmatic Instruction Following (PIF) to measure the fraction of the instructions that are correctly followed while performing a reasoning task, and PIF-TOP-N-K, which measures the fraction of the time that at least K out of N sampled model responses achieve a PIF score of one. This is our most challenging metric, targeting both instruction following and robustness. We show that our proposed approach for evaluating instruction following with the PIF metric also aligns with human ratings, with over 70 percent correlation. Our experiments show that the models studied in this work, Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet, have a PIF metric that deteriorates significantly for long chats, highlighting an area with significant headroom for improvement. Across all chat turns, when each response is repeated 4 times (PIF-TOP-4-4), GPT-4o and Gemini are only able to successfully follow all instructions 11 percent of the time. When, in addition to being dispersed throughout the model input context, all the instructions are also appended at the end of the model input context, we see an average 22.3 point improvement in the PIF metric, showing that the challenge of the task lies not only in following the instructions, but also in retrieving them from the model context.
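Since the instructions are program-verifiable, the two metrics reduce to simple aggregations over boolean verifier outputs. A minimal sketch, assuming each response comes with a list of per-instruction pass/fail results; the exact aggregation across chat turns in the paper may differ.

```python
def pif(followed):
    """PIF: fraction of instructions correctly followed in one response.
    `followed` is a list of booleans from the programmatic verifiers."""
    return sum(followed) / len(followed)

def pif_top_n_k(sampled_followed, k):
    """PIF-TOP-N-K for one turn: 1 if at least k of the n sampled responses
    follow every instruction (PIF == 1), else 0; averaged over turns upstream."""
    perfect = sum(1 for f in sampled_followed if all(f))
    return 1.0 if perfect >= k else 0.0

# Example: 3 sampled responses, each verified against 4 instructions.
samples = [[True, True, True, True],
           [True, False, True, True],
           [True, True, True, True]]
print(pif(samples[1]))          # 0.75: three of four instructions followed
print(pif_top_n_k(samples, 2))  # 1.0: at least 2 of 3 samples have PIF == 1
```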