Sameera Ponda

Authored Publications
CURIE: Evaluating LLMs on multitask long context scientific understanding and reasoning
Matthew Abraham, Haining Pan, Zahra Shamsi, Muqthar Mohammad, Chenfei Jiang, Ruth Alcantara, Gowoon Cheon, Xuejian Ma, Michael Statt, Jackson Cui, Nayantara Mudur, Eun-Ah Kim, Paul Raccuglia, Victor V. Albert, Lizzie Dorfman, Brian Rohr, Shutong Li, Maria Tikhanovskaya, Drew Purves, Elise Kleeman, Philippe Faist, Ean Phing VanLee
International Conference on Learning Representations (ICLR) (2025)
Abstract: The core of the scientific problem-solving process involves synthesizing information while applying expert knowledge. Large Language Models (LLMs) have the potential to accelerate this process due to their extensive knowledge across a variety of domains. Recent advancements have also made it possible for LLMs to handle very long "in-context" content. However, existing evaluations of long-context LLMs have focused on assessing their ability to summarize or retrieve information within the given context, primarily in generalist tasks that do not require deep scientific expertise. To facilitate analogous assessments of domain-specific tasks, we introduce the scientific long-Context Understanding and Reasoning Inference Evaluations (CURIE) benchmark. This benchmark provides a set of 8 challenging tasks, derived from around 250 scientific research papers, requiring domain expertise, comprehension of long in-context information, and multi-step reasoning that tests the ability of LLMs to assist scientists in realistic workflows. Tasks in CURIE have been collected from experts in six disciplines - materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and protein sequencing - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on these tasks. Additionally, we propose strategies for task decomposition, which allow for a more nuanced evaluation of the models and facilitate staged multi-step assessments. We hope that insights gained from CURIE can guide the future development of LLMs.