From diagnosis to treatment: Advancing AMIE for longitudinal disease management

March 6, 2025

Valentin Liévin, Software Engineer, and Anil Palepu, Research Scientist

We advance AMIE’s capabilities beyond diagnosis towards treating and managing disease over time. In our randomized study, AMIE matched or exceeded clinicians’ management reasoning over multi-visit consultations with professional patient actors, including precisely planning investigations, treatments and prescriptions, and appropriately using trusted clinical guidelines.

Effective clinical reasoning — the totality of all the decisions that go into patient care — is a cornerstone of healthcare. High quality clinical reasoning is a hallmark of expert clinicians and requires not only accurate diagnosis but also sophisticated reasoning about disease progression, therapeutic response, safe medication prescription, and the appropriate use of accepted guidelines or evidence in shared decision-making with patients. Even once a patient’s diagnosis has been established, an optimal management plan often requires monitoring of the patient’s trajectory and experience, personalized treatment plans with informed and shared decision-making, and proactive adjustments based on individual patient needs, preferences, and system constraints. While large language models (LLMs) have shown promise in capabilities underpinning diagnostic dialogue, their capabilities for clinical management reasoning over time remain under-explored.

In “Towards Conversational AI for Disease Management”, we advance the previously demonstrated diagnostic reasoning capabilities of the Articulate Medical Intelligence Explorer (AMIE), our research AI system for medical reasoning and conversations, by integrating additional LLM agentic capabilities optimized specifically for clinical management reasoning and dialogue. This enhanced version of AMIE builds on the core strengths of the Gemini family of models, such as state-of-the-art long-context reasoning and lowest-in-class hallucination rates, to incorporate reasoning over the longitudinal (i.e., sequential over time) progression of disease, response to therapy, and information on safe medication prescription and clinical guidelines. These additions enable AMIE to go beyond diagnosis and support patients and clinicians in navigating the complexities of next steps. This latest evolution demonstrates how AMIE might engage in longitudinal interactions, ground its reasoning in an evolving body of authoritative clinical knowledge, and provide structured management plans aligned with accepted guidelines.

AMIE now supports longitudinal disease management, grounding its reasoning in clinical guidelines and adapting to patient needs across multiple visits.

The challenge of disease management

Clinical care presents unique challenges that extend beyond the initial diagnostic process. Disease management requires consideration of a multitude of factors, including treatment side effects, patient adherence, lifestyle modifications, and the ever-changing landscape of medical research and clinical guidelines. The ability to perform management reasoning has remained an underexplored challenge for AI systems until now.

AMIE leverages Gemini's long-context capabilities to access and reason over clinical guidelines, ensuring its recommendations are grounded in evidence-based medicine.

A two-agent architecture for enhanced reasoning

Our work addresses this challenge with a novel approach based on the interplay of two LLM-driven agents, which has similarities to how human clinicians tackle management problems.

The Dialogue Agent is user-facing and equipped to rapidly respond based on its current understanding of the patient. This agent handles the conversational aspects of the interaction, gathering information about the patient’s condition, addressing their concerns, and building rapport. By leveraging natural language processing and empathetic communication techniques, the Dialogue Agent ensures a seamless and engaging user experience.

The Mx Agent (Management Reasoning Agent) deliberately and continuously analyzes the available information, including clinical guidelines and patient-specific data, to optimize management of the patient. Leveraging Gemini’s state-of-the-art long-context capabilities, this agent synthesizes and reasons over large amounts of information — patient dialogues across several visits in addition to hundreds of pages of clinical guidelines — all at once. Using this approach, it produces structured plans for investigations, treatments, and follow-up care, taking into account the latest medical evidence, information gathered during previous visits, and individual patient preferences.

AMIE's two-agent architecture: The Dialogue Agent interacts with the patient, while the Mx Agent creates structured management plans based on clinical guidelines. Management plans define the sequence of investigations and treatments recommended for that patient.
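As a rough illustration of how a fast, user-facing agent and a slower planning agent might share state, here is a minimal sketch. It is not AMIE's implementation: `PatientState`, `run_turn`, and the two callables standing in for LLM calls are hypothetical names, and the choice to reply first and re-plan afterwards is an assumption made to mirror the "rapid response, deliberate analysis" split described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PatientState:
    """Hypothetical shared state carried across turns and visits."""
    transcript: List[str] = field(default_factory=list)
    plan: str = ""  # latest management plan; empty until the Mx agent has run


# Hypothetical stand-ins for LLM calls (assumptions, not a real API).
DialogueFn = Callable[[List[str], str], str]    # (transcript, current plan) -> reply
MxFn = Callable[[List[str], List[str]], str]    # (transcript, guidelines) -> new plan


def run_turn(
    state: PatientState,
    patient_message: str,
    dialogue_agent: DialogueFn,
    mx_agent: MxFn,
    guidelines: List[str],
) -> str:
    """Handle one patient message with both agents."""
    state.transcript.append(f"patient: {patient_message}")

    # Fast path: reply immediately using whatever plan is currently available.
    reply = dialogue_agent(state.transcript, state.plan)
    state.transcript.append(f"clinician: {reply}")

    # Slow path: re-plan over the full history and the guideline documents.
    state.plan = mx_agent(state.transcript, guidelines)
    return reply
```

Because the whole transcript is kept in `state`, the same object can be reused at a later visit so that the planning step sees earlier visits as well as the current one.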

Grounding management in clinical guidelines

To ensure reliability and safety, AMIE’s management reasoning capabilities are primarily enabled by scaling test-time compute to perform deep reasoning with structural constraints while grounding recommendations in authoritative clinical knowledge. Here, too, AMIE relies on Gemini for long-context understanding to align its output with relevant and up-to-date clinical practice guidelines and drug formularies.

This involves selecting and processing documents from a comprehensive corpus of clinical guidelines that encompass trusted sources, such as the UK National Institute for Health and Care Excellence Guidance and the BMJ Best Practice guidelines. The Mx Agent then uses these guidelines to inform its decision-making process, ensuring that its recommendations are evidence-based and aligned with community-established best practices.

Intricate structural constraints help guide the model through specified reasoning strategies, while iterative drafting and merging of generated plans help refine their quality. This allows AMIE to create personalized management plans that are both evidence-based and tailored to the individual patient's needs.
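To make the idea of drafting and merging concrete, the sketch below generates several draft plans and keeps only the items that enough drafts agree on. This is a simple majority-vote merge chosen for illustration; the post does not describe AMIE's actual merging procedure, and `draft_and_merge`, `draft_fn`, `num_drafts`, and `min_votes` are hypothetical names.

```python
from collections import Counter
from typing import Callable, List


def draft_and_merge(
    draft_fn: Callable[[int], List[str]],
    num_drafts: int = 3,
    min_votes: int = 2,
) -> List[str]:
    """Generate several drafts and keep items that appear in >= min_votes drafts.

    draft_fn is a hypothetical stand-in for one LLM drafting call; it takes
    the draft index and returns a list of recommended items.
    """
    drafts = [draft_fn(i) for i in range(num_drafts)]

    # Count each item at most once per draft.
    votes = Counter(item for draft in drafts for item in set(draft))

    # Preserve first-seen order so the merged plan is deterministic.
    merged: List[str] = []
    for draft in drafts:
        for item in draft:
            if votes[item] >= min_votes and item not in merged:
                merged.append(item)
    return merged
```

A real system would likely merge semantically rather than by exact string match; exact matching is used here only to keep the example self-contained.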

AMIE uses deep reasoning with structural constraints (A) to create structured management plans (B) grounded in a case analysis (C) and explicit management goals (D) that include in-visit investigations, ordered investigations, and treatment recommendations, all supported by citations (E). Here we present an example reasoning trace for a fictitious patient.
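One way to picture the structured plan described in this caption is as a small data structure in which every recommendation carries its own citations. The sketch below is an assumption about shape only: the class and field names (`Recommendation`, `ManagementPlan`, `uncited`) are hypothetical and do not come from AMIE.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Recommendation:
    """One investigation or treatment, with supporting guideline citations."""
    text: str
    citations: List[str] = field(default_factory=list)


@dataclass
class ManagementPlan:
    """Hypothetical container mirroring the elements named in the caption."""
    case_analysis: str
    goals: List[str]
    in_visit_investigations: List[Recommendation] = field(default_factory=list)
    ordered_investigations: List[Recommendation] = field(default_factory=list)
    treatments: List[Recommendation] = field(default_factory=list)

    def uncited(self) -> List[str]:
        """Return the text of every recommendation that lacks a citation."""
        items = (
            self.in_visit_investigations
            + self.ordered_investigations
            + self.treatments
        )
        return [rec.text for rec in items if not rec.citations]
```

A check such as `plan.uncited()` could then flag recommendations that are not yet grounded in a guideline before the plan is shown to anyone.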

Evaluating AMIE's performance: The multi-visit OSCE study

To rigorously evaluate AMIE's ability to handle longitudinal disease management, we conducted a randomized, blinded virtual objective structured clinical examination (OSCE) study of simulated text-chat consultations. In this study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios, allowing us to assess its performance in realistic clinical settings.

Overview of randomized multi-visit OSCE study.

The multi-visit design of the OSCE study allowed us to evaluate AMIE's ability to 1) remember and synthesize information from previous interactions, 2) adapt management plans based on evolving patient symptoms and test results, and 3) maintain consistent and empathetic communication with a patient throughout the course of treatment.

Specialist physicians evaluated the quality of AMIE's management plans across a range of criteria, including appropriateness, completeness, the use of clinical guidelines, and patient-centeredness.

Specialist physicians (blinded to the source of the plans) rated AMIE's management plans as non-inferior to those of PCPs, with statistically significant improvements in treatment preciseness. Key measures here included selecting appropriate investigations and avoiding inappropriate investigations (i.e., tests that should be avoided given the information known). P-values are shown for statistically significant (p < 0.05) differences.

Furthermore, both patient actors and specialist physicians evaluated AMIE to determine whether its behaviors reflected clinical needs and priorities. We drew inspiration from prior work determining a set of key features of management reasoning and created a pilot evaluation rubric based on these features, which we refer to as Management Reasoning Empirical Key Features (MXEKF). Key measures of MXEKF included prioritization of preferences, constraints and values, communication and shared decision-making, contrasting and selection among different options, monitoring and adjustment of the management plan, and prognostication abilities.

AMIE demonstrated consistent performance on key management reasoning metrics (MXEKF), receiving favorable ratings from both patient actors and specialist physicians.

RxQA: Benchmarking medication reasoning

A critical aspect of disease management is the safe and effective use of medications. It is necessary, but not sufficient, to reliably recall medication-specific knowledge with appropriate factuality and topic-specific reasoning. To benchmark AMIE's capabilities along these axes, we contribute RxQA, a novel multiple-choice question set derived from national drug formularies, including the US Food & Drug Administration and the British National Formulary.

RxQA comprises 600 questions designed to assess knowledge of medication indications, contraindications, dosages, side effects, and interactions. The questions were carefully validated by board-certified pharmacists to ensure their accuracy and relevance to clinical practice.

Example question from the RxQA benchmark, designed to assess medication knowledge and reasoning. All data shown is synthetic (realistic but not real) patient data.

AMIE achieved strong performance on the RxQA benchmark, demonstrating a robust understanding of medication information and guidelines. The dotted line represents accuracy achievable through random guessing.
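For readers unfamiliar with how a multiple-choice benchmark is scored against a random-guessing baseline, the following minimal sketch computes both quantities. The function names are hypothetical, and the number of answer options per question is left as a parameter because it is not stated here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options
```

A model is only informative on such a benchmark to the extent that `accuracy(...)` clearly exceeds `chance_accuracy(...)` for the same question format.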

Limitations

While these results showcase AMIE's potential in a new and important area for medical applications of AI, several limitations warrant consideration. The simulated OSCE scenarios, while valuable for standardized evaluation, intentionally simplify the complexities of real-world clinical practice, which includes chart review, interaction with an electronic health record, and a far broader range of patients and pathologies. In this evaluation, guidelines from a single health system were selected and no attempts were made to adapt them to local contexts, whereas that ability is one of the potential benefits of AMIE. The short intervals between simulated visits and the text-only interface, which differs from the multimodal experience of real telehealth, mean that the study likely underestimated real-world difficulty. The MXEKF scale, though promising as a pilot assessment rubric, requires further validation.

Conclusion

AMIE's strong performance across these evaluations represents a significant step towards demonstrating the potential of conversational AI as a powerful tool to assist physicians in disease management. By combining longitudinal reasoning, clinical guideline grounding, and multi-agent system design, AMIE demonstrates the “art of the possible” for AI systems beyond differential diagnosis, towards longitudinal management.

Further research is needed before real-world translation to better understand the potential impacts of AMIE on clinical workflow and patient outcomes, as well as the safety and reliability of the system under real-world constraints. We are already embarking on a prospective research study with our clinical partners. Nonetheless, this work is an important milestone in the responsible development of AI and in demonstrating its potential to improve access to evidence-based care.

Acknowledgements

The research described here is joint work across many teams at Google Research and Google DeepMind. We are grateful to all our co-authors and would like to thank John Guilyard, Brian Gabriel and Jenn Sturgeon for contributions to the narratives and visuals. We are grateful to our partners at BMJ Best Practice, the UK National Institute for Health and Care Excellence, and the Royal Pharmaceutical Society. Finally, we thank Avinatan Hassidim, Yossi Matias, James Manyika, Ewa Dominowska, Juro Gottweis, Katherine Chou, Claire Cui, Ali Eslami, Greg S. Corrado, Michael Howell, Karen DeSalvo, Jeff Dean, Zoubin Ghahramani and Demis Hassabis for their support during the course of this project.