Michael P Brenner
Michael is an applied mathematician interested in the interface between machine learning and science.
Authored Publications
Expert evaluation of LLM world models: A high-Tc superconductivity case study
Haoyu Guo
Maria Tikhanovskaya
Paul Raccuglia
Alexey Vlaskin
Chris Co
Scott Ellsworth
Matthew Abraham
Lizzie Dorfman
Peter Armitage
Chunhan Feng
Antoine Georges
Olivier Gingras
Dominik Kiese
Steve Kivelson
Vadim Oganesyan
Brad Ramshaw
Subir Sachdev
Senthil Todadri
John Tranquada
Eun-Ah Kim
Proceedings of the National Academy of Sciences (2026)
Abstract
Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. This work evaluates the performance of six different LLM-based systems for answering scientific literature questions, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems in the domain of high-temperature cuprate superconductors, a research area that involves material science, experimental physics, computation, and theoretical physics. We use an expert-curated database of 1726 scientific papers and a set of 67 expert-formulated questions. The evaluation employs a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. Our results demonstrate that RAG-based systems, powered by curated data and multimodal retrieval, outperform existing closed models across key metrics, particularly in providing comprehensive and well-supported answers, and in retrieving relevant visual information. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems.
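The multimodal RAG pipeline evaluated here is proprietary, but its core retrieval step can be sketched with toy data: embed each text or figure chunk, then rank chunks by cosine similarity to the query embedding. Everything below (the corpus names and the tiny 3-dimensional "embeddings") is invented for illustration; a real system would use a learned embedding model over the curated 1,726-paper database.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, corpus, k=2):
    # Rank corpus items (id, embedding) by similarity to the query
    # and return the top-k chunk ids.
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical chunks standing in for text and figure snippets of cuprate papers.
corpus = [
    ("paper_A_text", [0.9, 0.1, 0.0]),
    ("paper_B_figure", [0.1, 0.9, 0.2]),
    ("paper_C_text", [0.4, 0.4, 0.8]),
]
print(retrieve([1.0, 0.0, 0.1], corpus, k=2))
```

The retrieved chunks (text and images alike) would then be passed to the LLM as grounding context for answer generation.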
CURIE: Evaluating LLMs on multitask long context scientific understanding and reasoning
Matthew Abraham
Haining Pan
Zahra Shamsi
Muqthar Mohammad
Chenfei Jiang
Ruth Alcantara
Gowoon Cheon
Xuejian Ma
Michael Statt
Jackson Cui
Nayantara Mudur
Eun-Ah Kim
Paul Raccuglia
Victor V. Albert
Lizzie Dorfman
Brian Rohr
Shutong Li
Maria Tikhanovskaya
Drew Purves
Elise Kleeman
Philippe Faist
Ean Phing VanLee
International Conference on Learning Representations (ICLR) (2025)
Abstract
The core of the scientific problem-solving process involves synthesizing information while applying expert knowledge. Large Language Models (LLMs) have the potential to accelerate this process due to their extensive knowledge across a variety of domains. Recent advancements have also made it possible for LLMs to handle very long "in-context" content. However, existing evaluations of long-context LLMs have focused on assessing their ability to summarize or retrieve information within the given context, primarily in generalist tasks that do not require deep scientific expertise. To facilitate analogous assessments of domain-specific tasks, we introduce the scientific long-Context Understanding and Reasoning Inference Evaluations (CURIE) benchmark. This benchmark provides a set of 8 challenging tasks, derived from around 250 scientific research papers, requiring domain expertise, comprehension of long in-context information, and multi-step reasoning that tests the ability of LLMs to assist scientists in realistic workflows. Tasks in CURIE have been collected from experts in six disciplines - materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and protein sequencing - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on these tasks. Additionally, we propose strategies for task decomposition, which allow for a more nuanced evaluation of the models and facilitate staged multi-step assessments. We hope that insights gained from CURIE can guide the future development of LLMs.
Towards AI-assisted academic writing
Malcolm Kane
Madeleine Grunde-McLaughlin
Ian Lang
Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities, Association for Computational Linguistics (2025), pp. 31-45
Abstract
We present components of an AI-assisted academic writing system including citation recommendation and introduction writing. The system recommends citations by considering the user’s current document context to provide relevant suggestions. It generates introductions in a structured fashion, situating the contributions of the research relative to prior work. We demonstrate the effectiveness of the components through quantitative evaluations. Finally, the paper presents qualitative research exploring how researchers incorporate citations into their writing workflows. Our findings indicate that there is demand for precise AI-assisted writing systems and simple, effective methods for meeting those needs.
A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations
Uri Hasson
Samuel A. Nastase
Harshvardhan Gazula
Aditi Rao
Tom Sheffer
Werner Doyle
Orrin Devinsky
Aditi Singh
Adeen Flinker
Patricia Dugan
Bobbi Aubrey
Sasha Devore
Daniel Friedman
Leonard Niekerken
Catherine Kim
Haocheng Wang
Zaid Zada
Gina Choe
Nature Human Behaviour (2025)
Abstract
This study introduces a unified computational framework connecting acoustic, speech and word-level linguistic structures to study the neural basis of everyday conversations in the human brain. We used electrocorticography to record neural signals across 100 h of speech production and comprehension as participants engaged in open-ended real-life conversations. We extracted low-level acoustic, mid-level speech and contextual word embeddings from a multimodal speech-to-text model (Whisper). We developed encoding models that linearly map these embeddings onto brain activity during speech production and comprehension. Remarkably, this model accurately predicts neural activity at each level of the language processing hierarchy across hours of new conversations not used in training the model. The internal processing hierarchy in the model is aligned with the cortical hierarchy for speech and language processing, where sensory and motor regions better align with the model’s speech embeddings, and higher-level language areas better align with the model’s language embeddings. The Whisper model captures the temporal sequence of language-to-speech encoding before word articulation (speech production) and speech-to-language encoding post articulation (speech comprehension). The embeddings learned by this model outperform symbolic models in capturing neural activity supporting natural speech and language. These findings support a paradigm shift towards unified computational models that capture the entire processing hierarchy for speech comprehension and production in real-world conversations.
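The encoding models described above are linear maps from model embeddings to electrode activity, scored by how well they predict held-out recordings. A minimal single-feature sketch with synthetic data: `feat` and `act` stand in for one Whisper embedding dimension and one recorded electrode, and the real models are the multivariate analogue of this fit.

```python
import math
import random

def fit_linear(x, y):
    # Least-squares fit y ≈ a*x + b: one embedding feature predicting
    # one electrode's activity.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

def pearson_r(u, v):
    # Correlation between predicted and recorded activity,
    # the usual encoding-model score.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    den = math.sqrt(sum((p - mu) ** 2 for p in u) * sum((q - mv) ** 2 for q in v))
    return num / den

random.seed(0)
feat = [random.gauss(0, 1) for _ in range(200)]       # synthetic embedding feature
act = [2.0 * f + random.gauss(0, 0.5) for f in feat]  # simulated electrode response
slope, intercept = fit_linear(feat[:150], act[:150])  # fit on "training conversations"
pred = [slope * f + intercept for f in feat[150:]]    # predict held-out activity
print(round(pearson_r(pred, act[150:]), 2))
```

The held-out split mirrors the paper's key test: the fitted map must generalize to conversations never seen during training.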
Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns
Ariel Goldstein
Avigail Grinstein-Dabush
Haocheng Wang
Zhuoqiao Hong
Bobbi Aubrey
Samuel A. Nastase
Zaid Zada
Eric Ham
Harshvardhan Gazula
Eliav Buchnik
Werner Doyle
Sasha Devore
Patricia Dugan
Roi Reichart
Daniel Friedman
Orrin Devinsky
Adeen Flinker
Uri Hasson
Nature Communications (2024)
Abstract
Contextual embeddings, derived from deep language models (DLMs), provide a continuous vectorial representation of language. This embedding space differs fundamentally from the symbolic representations posited by traditional psycholinguistics. We hypothesize that language areas in the human brain, similar to DLMs, rely on a continuous embedding space to represent language. To test this hypothesis, we densely record the neural activity patterns in the inferior frontal gyrus (IFG) of three participants using dense intracranial arrays while they listened to a 30-minute podcast. From these fine-grained spatiotemporal neural recordings, we derive a continuous vectorial representation for each word (i.e., a brain embedding) in each patient. We demonstrate that brain embeddings in the IFG and the DLM contextual embedding space have common geometric patterns using stringent zero-shot mapping. The common geometric patterns allow us to predict the brain embedding of a given left-out word in IFG based solely on its geometrical relationship to other nonoverlapping words in the podcast. Furthermore, we show that contextual embeddings better capture the geometry of IFG embeddings than static word embeddings. The continuous brain embedding space exposes a vector-based neural code for natural language processing in the human brain.
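The zero-shot idea can be illustrated with toy vectors: a left-out word's brain embedding is predicted purely from its geometric relations to other words. Below is a minimal nearest-neighbour sketch with hypothetical 2-d "DLM" and "brain" embeddings; the paper's actual procedure fits a full linear map rather than a single neighbour lookup.

```python
import math

def nearest(word, embeddings):
    # Closest other word in the DLM embedding space (Euclidean distance).
    others = [(w, math.dist(embeddings[word], v))
              for w, v in embeddings.items() if w != word]
    return min(others, key=lambda pair: pair[1])[0]

# Hypothetical 2-d embeddings; the brain vectors roughly mirror the DLM geometry.
dlm = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
brain = {"cat": [2.0, 0.1], "dog": [1.8, 0.3], "car": [0.1, 2.1]}

# Zero-shot prediction for the left-out word "cat": borrow the brain embedding
# of its nearest DLM-space neighbour, using no brain data for "cat" itself.
neighbour = nearest("cat", dlm)
predicted = brain[neighbour]
print(neighbour, predicted)
```

If the two spaces share geometry, the borrowed vector lands close to the true brain embedding of the left-out word, which is the quantity the zero-shot test measures.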
FEABench: Evaluating language models on real world physics reasoning ability
Jackson Cui
Nayantara Mudur
Paul Raccuglia
NeurIPS (2024)
Abstract
Building precise simulations of the real world and using numerical methods to solve quantitative problems is an essential task in engineering and physics. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA) software. We introduce a multipronged evaluation scheme to investigate the ability of LLMs to solve these problems using COMSOL Multiphysics. We further design an LLM agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solution over several iterations. Our best-performing strategy generates executable API calls 88% of the time. However, this benchmark still proves to be challenging enough that the LLMs and agents we tested were not able to completely and correctly solve any problem. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world.
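The agent's interact-inspect-refine loop can be sketched generically. The COMSOL API itself is not reproduced here; `solve_step` and `make_stub` are hypothetical placeholders for executing API calls and reading back the reported error.

```python
def run_agent(solve_step, max_iters=3, tol=1e-3):
    # Interact-inspect-refine loop: execute a solver call, read the error
    # it reports, and retry with feedback until the tolerance is met.
    feedback = None
    for i in range(1, max_iters + 1):
        result, err = solve_step(feedback)
        if err < tol:
            return i, result
        feedback = f"error {err:.4f} too large; refine the model"
    return max_iters, result

def make_stub():
    # Hypothetical stand-in for real solver API calls: its reported error
    # shrinks once it receives corrective feedback.
    state = {"err": 0.1}
    def solve_step(feedback):
        if feedback is not None:
            state["err"] /= 1000
        return "solution", state["err"]
    return solve_step

print(run_agent(make_stub()))
```

The benchmark's point is precisely that, unlike this stub, real FEA problems did not converge for any tested agent.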
Neural general circulation models for weather and climate
Dmitrii Kochkov
Janni Yuval
Ian Langmore
Jamie Smith
Griffin Mooers
Milan Kloewer
James Lottes
Peter Dueben
Samuel Hatfield
Peter Battaglia
Alvaro Sanchez
Matthew Willson
Stephan Hoyer
Nature, 632 (2024), pp. 1060-1066
Abstract
General circulation models (GCMs) are the foundation of weather and climate prediction. GCMs are physics-based simulators that combine a numerical solver for large-scale dynamics with tuned representations for small-scale processes such as cloud formation. Recently, machine-learning models trained on reanalysis data have achieved comparable or better skill than GCMs for deterministic weather forecasting. However, these models have not demonstrated improved ensemble forecasts, or shown sufficient stability for long-term weather and climate simulations. Here we present a GCM that combines a differentiable solver for atmospheric dynamics with machine-learning components and show that it can generate forecasts of deterministic weather, ensemble weather and climate on par with the best machine-learning and physics-based methods. NeuralGCM is competitive with machine-learning models for one- to ten-day forecasts, and with the European Centre for Medium-Range Weather Forecasts ensemble prediction for one- to fifteen-day forecasts. With prescribed sea surface temperature, NeuralGCM can accurately track climate metrics for multiple decades, and climate forecasts with 140-kilometre resolution show emergent phenomena such as realistic frequency and trajectories of tropical cyclones. For both weather and climate, our approach offers orders of magnitude computational savings over conventional GCMs, although our model does not extrapolate to substantially different future climates. Our results show that end-to-end deep learning is compatible with tasks performed by conventional GCMs and can enhance the large-scale physical simulations that are essential for understanding and predicting the Earth system.
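The hybrid design, a differentiable dynamical core plus learned corrections for small-scale processes, can be caricatured in a few lines. The toy `physics` and `correction` terms below are invented for illustration; NeuralGCM's actual components are a spectral atmospheric solver and trained neural networks.

```python
def hybrid_step(state, dt, physics, correction):
    # One explicit time step of a hybrid model: the dynamical-core tendency
    # plus a learned correction for unresolved processes. `correction` is a
    # placeholder for a trained neural network.
    return state + dt * (physics(state) + correction(state))

# Toy dynamics: relaxation toward zero, with the "network" supplying a small
# forcing that the solver alone would miss.
physics = lambda x: -0.5 * x
correction = lambda x: 0.1

x = 1.0
for _ in range(10):
    x = hybrid_step(x, 0.1, physics, correction)
print(round(x, 3))
```

Because the whole step is differentiable, gradients can flow through the solver during training, which is what lets the learned correction be fit end-to-end against reanalysis data.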
Mapping the ionosphere with millions of phones
Jamie Smith
Anton Geraschenko
Jade Morton
Frank van Diggelen
Nature (2024)
Abstract
The ionosphere is a layer of weakly ionized plasma bathed in Earth’s geomagnetic field extending about 50–1,500 kilometres above Earth. The ionospheric total electron content varies in response to Earth’s space environment, interfering with Global Navigation Satellite System (GNSS) signals, resulting in one of the largest sources of error for position, navigation and timing services. Networks of high-quality ground-based GNSS stations provide maps of ionospheric total electron content to correct these errors, but large spatiotemporal gaps in data from these stations mean that these maps may contain errors. Here we demonstrate that a distributed network of noisy sensors—in the form of millions of Android phones—can fill in many of these gaps and double the measurement coverage, providing an accurate picture of the ionosphere in areas of the world underserved by conventional infrastructure. Using smartphone measurements, we resolve features such as plasma bubbles over India and South America, solar-storm-enhanced density over North America and a mid-latitude ionospheric trough over Europe. We also show that the resulting ionosphere maps can improve location accuracy, which is our primary aim. This work demonstrates the potential of using a large distributed network of smartphones as a powerful scientific instrument for monitoring Earth.
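The statistical point behind using millions of noisy phones is simple: averaging N independent readings shrinks measurement noise by roughly 1/sqrt(N), so many poor sensors can rival one good one. A toy sketch with invented numbers; real total electron content (TEC) estimation also involves satellite geometry and receiver-bias calibration, none of which is modelled here.

```python
import random
import statistics

def tec_estimate(measurements):
    # Combine many noisy single-phone TEC readings into one estimate for a
    # map cell; averaging N independent readings shrinks noise ~ 1/sqrt(N).
    return statistics.fmean(measurements)

random.seed(1)
true_tec = 25.0                            # hypothetical cell value, in TEC units
one_phone = true_tec + random.gauss(0, 5)  # a single noisy reading
many_phones = [true_tec + random.gauss(0, 5) for _ in range(10_000)]
print(round(abs(one_phone - true_tec), 2),
      round(abs(tec_estimate(many_phones) - true_tec), 3))
```

With 10,000 readings the error of the mean is roughly 100 times smaller than that of a single phone, which is the leverage the paper exploits at planetary scale.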
Using large language models to accelerate communication for eye gaze typing users with ALS
Subhashini Venugopalan
Katie Seaver
Xiang Xiao
Katrin Tomanek
Sri Jalasutram
Ajit Narayanan
Bob MacDonald
Emily Kornman
Daniel Vance
Blair Casey
Steve Gleason
Nature Communications (2024)
Abstract
Accelerating text input in augmentative and alternative communication (AAC) is a long-standing area of research with bearings on the quality of life in individuals with profound motor impairments. Recent advances in large language models (LLMs) pose opportunities for re-thinking strategies for enhanced text entry in AAC. In this paper, we present SpeakFaster, consisting of an LLM-powered user interface for text entry in a highly-abbreviated form, saving 57% more motor actions than traditional predictive keyboards in offline simulation. A pilot study on a mobile device with 19 non-AAC participants demonstrated motor savings in line with simulation and relatively small changes in typing speed. Lab and field testing on two eye-gaze AAC users with amyotrophic lateral sclerosis demonstrated text-entry rates 29–60% above baselines, due to significant saving of expensive keystrokes based on LLM predictions. These findings form a foundation for further exploration of LLM-assisted text entry in AAC and other user interfaces.
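SpeakFaster's abbreviated text entry can be sketched as initial-letter abbreviation plus context-aware expansion. The `expand` function below is a lookup over candidate phrases, a crude stand-in for the LLM; note that one abbreviation can match several phrases, which is exactly the ambiguity that conversational context helps the real LLM resolve.

```python
def abbreviate(phrase):
    # Initial-letter abbreviation: the highly abbreviated form the user types.
    return "".join(word[0] for word in phrase.lower().split())

def expand(abbrev, candidates):
    # Crude stand-in for the LLM expander: return candidate phrases whose
    # initial letters match the typed abbreviation.
    return [c for c in candidates if abbreviate(c) == abbrev]

candidates = [
    "good morning how are you",
    "get me how about yes",
    "thank you very much",
]
print(abbreviate("good morning how are you"))
print(expand("gmhay", candidates))
```

Typing 5 characters instead of 24 is where the reported keystroke savings come from; the hard part, handled by the LLM in the paper, is picking the intended expansion.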
Speech Intelligibility Classifiers from 550k Disordered Speech Samples
Katie Seaver
Richard Cave
Neil Zeghidour
Rus Heywood
Jordan Green
ICASSP (2023)
Abstract
We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders and rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ∼94K utterances from 100 speakers. We further found the models to generalize well (without further training) on the TORGO database (100% accuracy), UASpeech (0.93 correlation) and ALS-TDI PMP (0.81 AUC) datasets, as well as on a dataset of realistic unprompted speech we gathered (106 dysarthric and 76 control speakers, ∼2,300 samples).