CURIE: Evaluating LLMs on multitask long context scientific understanding and reasoning

Matthew Abraham
Haining Pan
Zahra Shamsi
Muqthar Mohammad
Chenfei Jiang
Ruth Alcantara
Gowoon Cheon
Xuejian Ma
Michael Statt
Jackson Cui
Nayantara Mudur
Eun-Ah Kim
Paul Raccuglia
Victor V. Albert
Lizzie Dorfman
Brian Rohr
Shutong Li
Maria Tikhanovskaya
Drew Purves
Elise Kleeman
Philippe Faist
Ean Phing VanLee
International Conference on Learning Representations (ICLR) (2025)

Abstract

The core of scientific problem-solving involves synthesizing information while applying expert knowledge. Large Language Models (LLMs) have the potential to accelerate this process thanks to their extensive knowledge across a variety of domains. Recent advances have also enabled LLMs to handle very long in-context content. However, existing evaluations of long-context LLMs have focused on their ability to summarize or retrieve information within the given context, primarily on generalist tasks that do not require deep scientific expertise. To facilitate analogous assessments on domain-specific tasks, we introduce the scientific long-Context Understanding and Reasoning Inference Evaluations (CURIE) benchmark. CURIE provides a set of eight challenging tasks, derived from around 250 scientific research papers, that require domain expertise, comprehension of long in-context information, and multi-step reasoning, testing the ability of LLMs to assist scientists in realistic workflows. The tasks were collected from experts in six disciplines: materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and protein sequencing, covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on these tasks. Additionally, we propose strategies for task decomposition that allow for a more nuanced evaluation of the models and facilitate staged multi-step assessments. We hope that the insights gained from CURIE will guide the future development of LLMs.