
The anatomy of a personal health agent
September 30, 2025
Xuhai “Orson” Xu, Visiting Faculty Researcher, and Ali Heydari, Research Scientist, Google Research
Learn about our research prototype, an LLM-powered personal health agent that analyzes data from everyday wellness devices paired with health data, such as blood biomarkers, to offer evidence-based health insights and to provide a personalized coaching experience.
The rapid advancement of large language models (LLMs), combined with data from wearable devices, presents a transformative opportunity to empower people on their personal health journeys. However, health needs vary from individual to individual. Answering a specific query such as "On average, how many hours have I been sleeping this last month?" requires different skills than answering an open-ended question like "What can I do to improve my sleep quality?" A single system can struggle to address this complexity.
To meet this challenge, we adopt a human-centered process and propose the Personal Health Agent (PHA). This agent is a comprehensive research framework that can reason about multimodal data to provide personalized, evidence-based guidance. Using a multi-agent architecture, PHA deconstructs personal health and wellness support into three core roles (data science, domain expert, and health coach), each handled by a specialist sub-agent. To evaluate each sub-agent and the multi-agent system, we leveraged a real-world dataset from an IRB-reviewed study in which approximately 1,200 users provided informed consent to share their Fitbit wearable data, a health questionnaire, and blood test results. We conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation toward the vision of a personal health agent accessible to everyone.
This work outlines a conceptual framework for research purposes, and should not be considered a description of any specific product, service, or feature currently in development or available to the public. Any real-world application would be subject to a separate design, validation, and review process.
An illustration of the internal functions of Personal Health Agent (PHA) that enable it to support personal health needs.
User-centered design for personal health needs
To build an agent that truly meets these diverse needs, we started with a user-centered design process. We synthesized insights from over 1,300 real-world health queries from online sources, such as health forums, survey data from more than 500 users, and a workshop with design and engineering experts. This research revealed four critical areas where people need support: understanding general health topics, interpreting their personal data, getting actionable wellness advice, and assessing symptoms. This insight led us to design the PHA system to resemble a human expert team of data scientists, domain experts, and personal health coaches.

A user-centered process to identify critical user journeys.
Evaluation of our proposed system
To validate our system, we developed a holistic, multi-level evaluation framework. We first benchmarked each sub-agent on its core capabilities against a state-of-the-art LLM used as the base model, and then assessed the overall efficacy of the fully integrated PHA. The table below shows our comprehensive evaluation, which involved both automated and extensive human evaluations across 10 benchmark tasks, incorporating over 1,100 hours of effort from both end-users and health experts to assess performance in realistic, multimodal conversations.

Description of our comprehensive evaluation of individual sub-agents and the final Personal Health Agent (PHA) system.
The data science agent: Personal data analyst
The first specialist is the data science (DS) agent, which analyzes personal time-series data from wearables plus health data, such as blood biomarkers, to provide contextualized numerical insights. The DS agent builds on top of a base model (e.g., Gemini) and is enhanced by a two-stage data science module: Stage 1) interpret underspecified and ambiguous user queries (e.g., “Am I getting more fit recently?”), and Stage 2) translate them into robust statistical analysis plans. It then generates and executes code to produce a statistically valid, data-driven answer.
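To make the two-stage idea concrete, here is a minimal Python sketch of a plan-then-execute flow. It is not the DS agent's implementation: in the agent described above, an LLM interprets the query and generates the analysis code, whereas this sketch substitutes simple keyword rules and a fixed mean computation so that it runs on its own. The names AnalysisPlan, plan_analysis, and execute_plan, and the layout of the records dictionary, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Optional


@dataclass
class AnalysisPlan:
    """Hypothetical stage 1 output: a concrete, checkable analysis plan."""
    metric: str        # which daily series to analyze
    window_days: int   # how many days back to look
    aggregation: str   # how to summarize the window


def plan_analysis(query: str) -> AnalysisPlan:
    """Stage 1 stand-in: turn a free-text query into a plan.

    Keyword rules replace the LLM here purely so the example is runnable.
    """
    text = query.lower()
    metric = "sleep_hours" if "sleep" in text else "steps"
    window_days = 30 if "month" in text else 7
    return AnalysisPlan(metric=metric, window_days=window_days, aggregation="mean")


def execute_plan(
    plan: AnalysisPlan,
    records: Dict[str, Dict[date, float]],
    today: date,
) -> Optional[float]:
    """Stage 2 stand-in: run the plan against daily records.

    Returns None when the window holds no data, so the caller can report
    insufficient data instead of a misleading number.
    """
    start = today - timedelta(days=plan.window_days)
    series = records.get(plan.metric, {})
    values = [v for day, v in series.items() if start < day <= today]
    if not values:
        return None
    if plan.aggregation == "mean":
        return mean(values)
    raise ValueError(f"unsupported aggregation: {plan.aggregation}")


if __name__ == "__main__":
    today = date(2025, 9, 30)
    # Synthetic data for illustration only: alternating 7 h and 8 h nights.
    records = {
        "sleep_hours": {today - timedelta(days=i): 7.0 + (i % 2) for i in range(30)}
    }
    plan = plan_analysis(
        "On average, how many hours have I been sleeping this last month?"
    )
    print(plan, execute_plan(plan, records, today))
```

Keeping the plan as an explicit object separate from its execution mirrors the reason for having two stages: the plan can be inspected for data sufficiency and statistical validity before any number is computed.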
We developed two auto-evaluation benchmarks, one for each stage of the DS agent's workflow. For the first stage, analysis planning, we used an auto-evaluator trained on 354 query-analysis plans curated by 10 expert data scientists. Based on a detailed rubric assessing dimensions like data sufficiency, statistical validity, and alignment with the user's query, our evaluations showed that the DS agent significantly outperforms the base model in creating high-quality analysis plans (achieving a 75.6% score vs. 53.7% for the baseline). For the second stage, code generation, the agent’s output was benchmarked against 173 rigorous unit tests written by data scientists. This confirmed that the agent is more reliable than the base model at generating accurate, executable code used to derive insights from time-series wearable data.

DS agent: Results of evaluating data analysis plans generated by the DS agent and the base model across six dimensions, as evaluated by human data scientists and auto-raters.
The domain expert agent: Grounded, trustworthy knowledge
Next is the domain expert (DE) agent, which functions as a reliable source of health and wellness knowledge. In a high-stakes domain like health and wellbeing, ensuring information is accurate and trustworthy is critical. The DE agent enhances a base model by using a multi-step reasoning framework and a toolbox that includes access to authoritative sources, such as the National Center for Biotechnology Information (NCBI) database, to ground its responses in verifiable facts. It excels at tailoring information to a user’s specific profile, such as pre-existing conditions.
We developed two auto-evaluation benchmarks to test the DE agent’s medical knowledge: one evaluating its performance on board certification and coaching exam questions, and one evaluating its ability to provide accurate differential diagnoses. We further developed two human-evaluation benchmarks (one for clinicians, and one for consumers) to measure the DE agent’s capability on personalization and multimodal reasoning. Our DE agent consistently outperforms the base model across all benchmarks. For instance, clinicians rated the DE agent's summaries of multimodal health data as significantly more clinically relevant and useful, and end-users found its responses to be substantially more personalized and trustworthy.

DE agent: Results of evaluating multimodal reasoning of the DE agent and the base model across seven clinical dimensions, as evaluated by clinical experts.
The health coach agent: Guiding behavior change
The third specialist is the health coach (HC) agent, which is designed to support users in setting goals and fostering lasting behavioral change through multi-turn conversations. Effective coaching requires a delicate balance between gathering information and providing actionable advice. The HC agent employs a modular architecture inspired by proven psychological strategies (e.g., motivational interviewing) to navigate this dynamic, leading to more natural and effective interactions. We benchmarked the HC agent’s performance in two human-evaluation setups, one with end-users and the other with health coaching experts. For the end-user evaluation, we focused on conversational experience, goal-oriented effectiveness, and motivational support. For the expert evaluation, we assessed adherence to professional coaching principles, recommendation quality, and agent credibility. Both evaluations indicate that the HC agent is significantly more capable than the baseline, underscoring a key insight from our research: for coaching agents, users prioritize core competency and the ability to provide actionable guidance.

HC agent: Results of evaluating coaching experience of the HC agent and base model on six dimensions, evaluated by human health and coaching experts.
The Personal Health Agent (PHA): A collaborative team
While each agent is powerful alone, the true potential is realized when they collaborate. The Personal Health Agent (PHA) framework integrates these three specialists into a cohesive team managed by an intelligent orchestrator. When a user poses a query, the orchestrator analyzes the user's need, dynamically assigns a "main" agent and "supporting" agents, and facilitates an iterative workflow of collaboration, reflection, and memory updates to synthesize a single, comprehensive response.
A technical breakdown of the DS, DE, and HC agents, with orchestration into the Personal Health Agent (PHA).
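The routing step can be illustrated with a small Python sketch. This is not the PHA orchestrator: keyword rules stand in for its LLM-based analysis of the user's need, stub functions stand in for the DS, DE, and HC agents, and the iterative reflection step is omitted entirely. The names Assignment, assign_roles, Orchestrator, and make_stub are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An agent takes the user query plus notes from supporting agents
# and returns a text response.
Agent = Callable[[str, List[str]], str]


@dataclass
class Assignment:
    """Which agent leads the response and which agents support it."""
    main: str
    supporting: List[str]


def assign_roles(query: str) -> Assignment:
    """Placeholder routing: keyword rules instead of an LLM-based decision."""
    text = query.lower()
    if any(word in text for word in ("average", "how many", "trend")):
        return Assignment(main="DS", supporting=["DE"])
    if any(word in text for word in ("improve", "goal", "habit")):
        return Assignment(main="HC", supporting=["DS", "DE"])
    return Assignment(main="DE", supporting=[])


@dataclass
class Orchestrator:
    agents: Dict[str, Agent]
    memory: List[str] = field(default_factory=list)

    def respond(self, query: str) -> str:
        assignment = assign_roles(query)
        # Supporting agents run first so the main agent can build on their notes.
        notes = [self.agents[name](query, []) for name in assignment.supporting]
        answer = self.agents[assignment.main](query, notes)
        # Minimal memory update: record which agent led each query.
        self.memory.append(f"{query} -> {assignment.main}")
        return answer


def make_stub(name: str) -> Agent:
    """Build a stand-in agent that only reports how many notes it received."""
    def agent(query: str, notes: List[str]) -> str:
        return f"[{name}] used {len(notes)} supporting note(s)"
    return agent


if __name__ == "__main__":
    orchestrator = Orchestrator({name: make_stub(name) for name in ("DS", "DE", "HC")})
    print(orchestrator.respond("What can I do to improve my sleep quality?"))
    print(orchestrator.memory)
```

In this sketch, only the agents selected for a given query are called, and the main agent sees the supporting agents' notes before answering. A parallel setup would instead call every agent on every query and merge the outputs afterward.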
This collaborative approach significantly outperformed the sum of its parts. In extensive evaluations using rubrics that assess how well an agent synthesizes personal health data to help users answer their health and wellness queries and achieve their personal health goals, both end-users and health experts preferred the PHA over (i) a powerful single-agent system that also builds on a base model and uses tools to fill all three roles within a single agent, and (ii) a parallel multi-agent baseline that includes the same DS, DE, and HC agents but simply calls all three and synthesizes their results without dynamic orchestration. Both end-users and experts ranked PHA as the best overall system in the majority of cases. This result is a strong example of the value of emulating the collaborative structure of human expert teams to provide truly helpful support.

PHA: Results of evaluating responses generated by the PHA and other baselines, evaluated by human experts.

PHA: Results of ranking responses generated by the PHA and other baselines, evaluated by human experts.
The future of intelligent personal health agents
Creating AI systems that can interpret complex health and wellness data and provide actionable wellness advice has been a longstanding challenge in the field. Our research provides a validated conceptual blueprint for designing the next generation of personal health AI, advocating a shift away from monolithic models toward modular, collaborative systems that are more trustworthy, coherent, and helpful.