Ali Heydari

Ali Heydari

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    The Anatomy of a Personal Health Agent
    Ahmed Metwally
    Ken Gu
    Jiening Zhan
    Kumar Ayush
    Hong Yu
    Amy Lee
    Qian He
    Zhihan Zhang
    Isaac Galatzer-Levy
    Xavi Prieto
    Andrew Barakat
    Ben Graef
    Yuzhe Yang
    Daniel McDuff
    Brent Winslow
    Shwetak Patel
    Girish Narayanswamy
    Conor Heneghan
    Max Xu
    Jacqueline Shreibati
    Mark Malhotra
    Orson Xu
    Tim Althoff
    Tony Faranesh
    Nova Hammerquist
    Vidya Srinivas
    arXiv (2025)
    Preview abstract Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the solution to fulfill diverse needs from individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health assistant that is able to reason about multimodal data from everyday consumer devices and personal health records. To understand end users’ needs when interacting with such an assistant, we conducted an in-depth analysis of query data from users, alongside qualitative insights from users and experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist subagent: (1) a data science agent that analyzes both personal and population-level time-series wearable and health record data to provide numerical health insights, (2) a health domain expert agent that integrates users’ health and contextual data to generate accurate, personalized insights based on medical and contextual user knowledge, and (3) a health coach agent that synthesizes data insights, drives multi-turn user interactions and interactive goal setting, guiding users using a specified psychological strategy and tracking users’ progress. Furthermore, we propose and develop a multi-agent framework, Personal Health Insight Agent Team (PHIAT), that enables dynamic, personalized interactions to address individual health needs. To evaluate these individual agents and the multi-agent system, we develop a set of N benchmark tasks and conduct both automated and human evaluations, involving 100’s of hours of evaluation from health experts, and 100’s of hours of evaluation from end-users. Our work establishes a strong foundation towards the vision of a personal health assistant accessible to everyone in the future and represents the most comprehensive evaluation of a consumer AI health agent to date. View details
    RADAR: Benchmarking Language Models on Imperfect Tabular Data
    Ken Gu
    Kumar Ayush
    Hong Yu
    Zhihan Zhang
    Yuzhe Yang
    Shwetak Patel
    Max Xu
    Mark Malhotra
    Orson Xu
    Evelyn Zhang
    Tim Althoff
    2025
    Preview abstract Language models (LMs) are increasingly being deployed to perform autonomous data analyses, yet their~\textit{\robustnessTerm}-- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies—remains under-explored. These artifacts are common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data awareness on tabular data. RADAR introduces programmatic perturbations for each unique query table pair, enabling targeted evaluation of model behavior. RADAR~ comprises 2500 queries for data analysis across 55 datasets spanning 20 domains and 5 data awareness dimensions. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance scales with input length. In our evaluation, we identify fundamental gaps in their ability to perform reliable, data-aware analyses. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning. View details
    A Scalable Framework for Evaluating Health Language Models
    Neil Mallinar
    Tony Faranesh
    Brent Winslow
    Nova Hammerquist
    Ben Graef
    Cathy Speed
    Mark Malhotra
    Shwetak Patel
    Xavi Prieto
    Daniel McDuff
    Ahmed Metwally
    (2025)
    Preview abstract Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive, labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health. View details
    Preview abstract Blood biomarkers are an essential tool for healthcare providers to diagnose, monitor, and treat a wide range of medical conditions. Establishing personalized blood biomarker ranges is crucial for accurate dis-ease diagnosis and management. Current clinical ranges often rely on population-level statistics, which may not adequately account for the substantial influence of inter-individual variability driven by factors such as lifestyle and genetics. In this work, we introduce a novel framework for predicting future blood biomarker values and personalized reference ranges through learned representations from lifestyle data (physical activity and sleep) and blood biomarkers. Our proposed method learns a similarity-based embedding space that aims to capture the complex relationship between biomarkers and lifestyle factors. UsingUK Biobank (257K participants), our results show that our deep-learned embeddings outperform traditional and cur-rent state-of-the-art representation learning techniques in predicting clinical diagnosis. Using a subset of UK Biobank of 6440 participants who have follow up visits, we validate that the inclusion of these embeddings and lifestyle factors directly in blood biomarker models improves the prediction of future lab values from a single lab visit. This personalized modeling approach provides a foundation for developing more accurate risk stratification tools and tailoring preventative care strategies. In clinical settings, this translates to the potential for earlier disease detection, more timely interventions, and ultimately, a shift towards personalized healthcare. View details
    Preview abstract Blood tests are an essential tool for healthcare providers to diagnose, monitor, and treat a wide range of medical conditions; however, quantitative approaches for personalizing such metrics are nascent and often ignore important factors such as lifestyle. Moreover, recent studies have shown that raw (untransformed) representations of health records are inadequate for constructing predictive models, especially when considering a single timepoint. In this work, we investigate the association of activity and sleep with blood test ranges, and based on our results, propose Proteus, a new deep metric learning algorithm that accounts for lifestyle. We show that Proteus significantly improves the performance of several downstream analyses, including the prediction of future health risk in currently-healthy patients using a single laboratory visit. Building upon our findings, we additionally introduce DeepRange, a novel lifestyle-informed algorithm which utilizes deep-learned embeddings for estimating personalized optimal blood test ranges. Our proposed methodology for personalized blood test ranges and single-visit health risk prediction can be readily implemented and has the potential to significantly improve health outcomes by enabling early intervention and personalized treatment. View details
    Preview abstract We propose a novel formulation of the triplet objective function that improves metric learning without additional sample mining or overhead costs. Our approach aims to explicitly regularize the distance between the positive and negative samples in a triplet with respect to the anchor-negative distance. As an initial validation, we show that our method (called No Pairs Left Behind [NPLB]) improves upon the traditional and current state-of-the-art triplet objective formulations on standard benchmark datasets. To show the effectiveness and potentials of NPLB on real-world complex data, we evaluate our approach on a large-scale healthcare dataset (UK Biobank), demonstrating that the embeddings learned by our model significantly outperform all other current representations on tested downstream tasks. Additionally, we provide a new model-agnostic single-time health risk definition that, when used in tandem with the learned representations, achieves the most accurate prediction of subjects' future health complications. Our results indicate that NPLB is a simple, yet effective framework for improving existing deep metric learning models, showcasing the potential implications of metric learning in more complex applications, especially in the biological and healthcare domains. View details