Expert evaluation of LLM world models: A high-Tc superconductivity case study

Haoyu Guo
Maria Tikhanovskaya
Paul Raccuglia
Alexey Vlaskin
Chris Co
Scott Ellsworth
Matthew Abraham
Lizzie Dorfman
Peter Armitage
Chunhan Feng
Antoine Georges
Olivier Gingras
Dominik Kiese
Steve Kivelson
Vadim Oganesyan
Brad Ramshaw
Subir Sachdev
Senthil Todadri
John Tranquada
Eun-Ah Kim
Proceedings of the National Academy of Sciences (2026)

Abstract

Large Language Models (LLMs) show great promise as tools for scientific literature exploration. However, their ability to provide scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. This work evaluates the performance of six LLM-based systems for answering scientific literature questions, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of these systems in the domain of high-temperature cuprate superconductors, a research area that spans materials science, experimental physics, computation, and theoretical physics. We use an expert-curated database of 1726 scientific papers and a set of 67 expert-formulated questions. The evaluation employs a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. Our results demonstrate that RAG-based systems, powered by curated data and multimodal retrieval, outperform the closed models across key metrics, particularly in providing comprehensive and well-supported answers and in retrieving relevant visual information. This study provides insights into designing and evaluating specialized scientific literature understanding systems with expert involvement, and highlights the importance of rich, domain-specific data in such systems.
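To make the evaluation setup concrete, the sketch below shows one way per-question expert rubric scores could be aggregated per system across the five criteria named in the abstract. This is a minimal illustration, not the authors' code: the function name aggregate_scores, the data layout, and the example scores are assumptions introduced here for clarity.

```python
# Minimal sketch (illustrative only, not the authors' pipeline) of aggregating
# expert rubric scores per system and criterion. Data layout and scores are
# hypothetical; only the five criterion names come from the abstract.
from collections import defaultdict
from statistics import mean

CRITERIA = [
    "balanced_perspectives",
    "factual_comprehensiveness",
    "succinctness",
    "evidentiary_support",
    "image_relevance",
]

def aggregate_scores(ratings):
    """ratings: list of dicts with 'system', 'question_id', and a numeric score
    for each rubric criterion an expert rated. Returns the mean score per
    (system, criterion), averaged over questions and raters."""
    buckets = defaultdict(list)
    for r in ratings:
        for criterion in CRITERIA:
            if criterion in r:
                buckets[(r["system"], criterion)].append(r[criterion])
    return {key: mean(vals) for key, vals in buckets.items()}

if __name__ == "__main__":
    # Two hypothetical ratings of the same question by one expert.
    example = [
        {"system": "custom_rag", "question_id": 1,
         "factual_comprehensiveness": 5, "evidentiary_support": 4},
        {"system": "closed_model_a", "question_id": 1,
         "factual_comprehensiveness": 3, "evidentiary_support": 2},
    ]
    for (system, criterion), score in sorted(aggregate_scores(example).items()):
        print(f"{system:15s} {criterion:25s} {score:.2f}")
```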