MMMT-IF: A Challenging Multi-Modal Multi-Turn Instruction Following Foundation Model Benchmark
Abstract
Evaluation of instruction following capabilities for multi-modal, multi-turn chat is challenging. With potentially multiple instructions in the model input context, the task is time-consuming for human raters, and we show that LLM-based judges are biased towards answers from the same model. We propose a new evaluation set, MMMT-IF, an image-based multi-turn Q\&A task with added global instructions between questions, constraining the format of the answers. This reveals limitations of current models in following multiple instructions and is challenging because the models need to first retrieve multiple instructions spread out in the long chat history, and then reason over them to answer image-based questions under the instruction constraints. All the instructions and constraints are program verifiable, i.e., verifying them is objective. We propose a set of metrics referred to as Programmatic Instruction Following (PIF), measuring the fraction of the instructions that are correctly followed while performing a reasoning task, and PIF-TOP-N-K, measuring the fraction of cases in which at least K out of N sampled model responses achieve a PIF score of one. This is our most challenging metric, targeting both instruction following and robustness. We show that our proposed approach for evaluating instruction following with the PIF metric also aligns with human ratings, with over 70 percent correlation. Our experiments show that the models studied in this work, Gemini 1.5 Pro, GPT-4o, and Claude Sonnet 3.5, have a PIF metric that significantly deteriorates for long chats, highlighting an area with significant headroom for improvement. Across all chat turns, when each response is sampled 4 times (PIF-TOP-4-4), GPT-4o and Gemini successfully follow all instructions only 11 percent of the time. When, in addition to being dispersed throughout the model input context, all the instructions are also appended at the end of the model input context, we see an average 22.3 point improvement in the PIF metric, showing that the challenge with the task lies not only in following the instructions, but also in retrieving them from the model context.
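As a rough illustration of the two metrics described above, the sketch below shows how a per-response PIF score and a per-turn PIF-TOP-N-K score could be computed from program-verifiable instruction checks. The function names, data layout, and example values are assumptions for illustration only, not the authors' implementation.

```python
from typing import List


def pif(instructions_followed: List[bool]) -> float:
    """PIF for a single response: fraction of program-verifiable
    instructions that the response correctly follows."""
    if not instructions_followed:
        return 1.0
    return sum(instructions_followed) / len(instructions_followed)


def pif_top_n_k(per_sample_checks: List[List[bool]], k: int) -> float:
    """PIF-TOP-N-K for one turn: 1.0 if at least k of the N sampled
    responses achieve a PIF score of exactly one, else 0.0.
    `per_sample_checks` holds one list of instruction checks per sample."""
    perfect = sum(1 for checks in per_sample_checks if pif(checks) == 1.0)
    return 1.0 if perfect >= k else 0.0


# Hypothetical example: 4 sampled responses, each checked against 3 instructions.
samples = [
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [True, True, True],
]
print(pif(samples[0]))          # 1.0
print(pif_top_n_k(samples, 4))  # 0.0 -- one sample misses an instruction
```

Aggregating either score over all turns (e.g., by averaging) then yields the benchmark-level numbers reported in the abstract.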