Hamid Palangi

Thanks for visiting my page; please refer to www.hamidpalangi.com for more information.
Authored Publications
Evaluation of instruction-following capabilities for multi-modal, multi-turn chat is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters, and we show that LLM-based judges are biased towards answers from the same model. We propose a new evaluation set, MMMT-IF, an image-based multi-turn Q&A task with added global instructions between questions, constraining the format of the answers. This reveals limitations of current models for following multiple instructions and is challenging because the models need to first retrieve multiple instructions spread out in the long chat history, and then reason over them to answer image-based questions under instruction constraints. All the instructions and constraints are program verifiable, i.e., verifying them is objective. We propose a set of metrics referred to as Programmatic Instruction Following (PIF) to measure the fraction of the instructions that are correctly followed while performing a reasoning task, and PIF-TOP-N-K to measure the fraction of the time at least K out of N sampled model responses achieve a PIF score of one. This is our most challenging metric, targeting both instruction following and robustness. We show that our proposed approach for evaluating instruction following with the PIF metric is also aligned with human ratings, with over 70 percent correlation. Our experiments show that the models studied in this work, Gemini 1.5 Pro, GPT-4o, and Claude Sonnet 3.5, have a PIF metric that deteriorates significantly for long chats, highlighting an area with significant headroom for improvement. Across all chat turns, when each response is repeated 4 times (PIF-TOP-4-4), GPT-4o and Gemini are only able to successfully follow all instructions 11 percent of the time. When, in addition to being dispersed throughout the model input context, all the instructions are also appended at the end of the model input context, we see an average 22.3-point improvement in the PIF metric, showing that the challenge with the task lies not only in following the instructions but also in retrieving them from the model context.
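As an illustration of how program-verifiable metrics like these can be scored, here is a minimal Python sketch. It is not the paper's implementation; the verifier functions, data layout, and the toy constraints are all assumptions made for the example.

```python
from typing import Callable, List

Verifier = Callable[[str], bool]  # a program that checks one instruction against a response

def pif(response: str, verifiers: List[Verifier]) -> float:
    """PIF: fraction of program-verifiable instructions the response follows."""
    if not verifiers:
        return 1.0
    return sum(v(response) for v in verifiers) / len(verifiers)

def pif_top_n_k(responses: List[str], verifiers: List[Verifier], k: int) -> bool:
    """PIF-TOP-N-K style check: at least k of the n sampled responses must reach PIF = 1."""
    perfect = sum(pif(r, verifiers) == 1.0 for r in responses)
    return perfect >= k

# Hypothetical format constraints accumulated over a chat history.
verifiers = [
    lambda r: r.endswith("."),         # e.g. "end every answer with a period"
    lambda r: len(r.split()) <= 20,    # e.g. "answer in at most 20 words"
]
samples = ["A short answer.", "A short answer.", "a much longer answer " * 10, "A short answer."]
print(pif(samples[2], verifiers))           # 0.0: breaks both constraints
print(pif_top_n_k(samples, verifiers, 4))   # False: only 3 of 4 samples are perfect
```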
We propose Model Swarms, a collaborative search algorithm to adapt LLM experts via swarm intelligence. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and search for adapted models that optimize the utility function. Compared to existing model composition approaches, Model Swarms offers modularity, works in low-data regimes, and does not require assumptions about existing experts or how they should be composed. Extensive experiments demonstrate that Model Swarms can flexibly adapt LLM experts to a single dataset, multi-dataset domains, reward models, as well as diverse human preferences. Further analysis reveals that LLM experts discover previously unseen capabilities during the search and that Model Swarms enables a weak-to-strong transition of experts through the collaborative search process.
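To make the swarm-guided search concrete, below is a minimal particle-swarm-style sketch over flattened weight vectors. This is an assumption-laden illustration, not the paper's algorithm: the update rule, hyperparameters, and the toy utility function are hypothetical.

```python
import numpy as np

def swarm_search(experts, utility, steps=50, inertia=0.5, c_personal=1.0, c_global=1.0, seed=0):
    """PSO-style search in weight space (illustrative; not the paper's exact update rule).

    experts: list of 1-D numpy arrays, each a flattened expert's model weights.
    utility: callable mapping a weight vector to a scalar to be maximized.
    """
    rng = np.random.default_rng(seed)
    x = [w.copy() for w in experts]               # current positions in weight space
    v = [np.zeros_like(w) for w in experts]       # velocities
    pbest = [w.copy() for w in experts]           # each expert's personal best checkpoint
    pscore = [utility(w) for w in experts]
    gi = int(np.argmax(pscore))
    gbest, gscore = pbest[gi].copy(), pscore[gi]  # best-found checkpoint across the swarm

    for _ in range(steps):
        for i in range(len(x)):
            r1, r2 = rng.random(), rng.random()
            # Each expert moves guided by its own best and the swarm's best checkpoint.
            v[i] = (inertia * v[i]
                    + c_personal * r1 * (pbest[i] - x[i])
                    + c_global * r2 * (gbest - x[i]))
            x[i] = x[i] + v[i]
            s = utility(x[i])
            if s > pscore[i]:
                pbest[i], pscore[i] = x[i].copy(), s
            if s > gscore:
                gbest, gscore = x[i].copy(), s
    return gbest

# Hypothetical toy utility: prefer weights close to a target vector.
target = np.ones(8)
pool = [np.random.default_rng(i).normal(size=8) for i in range(4)]
adapted = swarm_search(pool, lambda w: -float(np.linalg.norm(w - target)))
```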