CURIE: Evaluating LLMs on multitask long context scientific understanding and reasoning

Michael Brenner

Matthew Abraham

Haining Pan

Peter Norgaard

Zahra Shamsi

Muqthar Mohammad

Chenfei Jiang

Ruth Alcantara

Gowoon Cheon

Sameera Ponda

Xuejian Ma

Michael Statt

Jackson Cui

Dan Morris

Martyna Plomecka

Nayantara Mudur

Eun-Ah Kim

Paul Raccuglia

Victor V. Albert

Lizzie Dorfman

Brian Rohr

Shutong Li

Maria Tikhanovskaya

Viren Jain

Drew Purves

Elise Kleeman

Yasaman Bahri

Philippe Faist

Subhashini Venugopalan

Ean Phing VanLee

International Conference on Learning Representations (ICLR) (2025)

Abstract

The core of the scientific problem-solving process involves synthesizing information while applying expert knowledge. Large Language Models (LLMs) have the potential to accelerate this process due to their extensive knowledge across a variety of domains. Recent advancements have also made it possible for LLMs to handle very long "in-context" content. However, existing evaluations of long-context LLMs have focused on assessing their ability to summarize or retrieve information within the given context, primarily in generalist tasks that do not require deep scientific expertise. To facilitate analogous assessments of domain-specific tasks, we introduce the scientific long-Context Understanding and Reasoning Inference Evaluations (CURIE) benchmark. This benchmark provides a set of 8 challenging tasks, derived from around 250 scientific research papers, that require domain expertise, comprehension of long in-context information, and multi-step reasoning, and that test the ability of LLMs to assist scientists in realistic workflows. Tasks in CURIE have been collected from experts in six disciplines (materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and protein sequencing), covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on these tasks. Additionally, we propose strategies for task decomposition, which allow for a more nuanced evaluation of the models and facilitate staged multi-step assessments. We hope that insights gained from CURIE can guide the future development of LLMs.
