Yossi Matias

Yossi Matias

Yossi Matias is Vice President, Google, and the Head of Google Research.

Under Yossi’s leadership, world-class global teams are leading breakthrough research on Foundational Machine Learning & Algorithms, Computing Systems & Quantum, Science, AI for Societal Impact in Health, Climate, Sustainability, Education and Socio-technical research, and advancement of Generative AI - driving real-world impact and shaping the future of technology.

Yossi was previously on Google Search leadership for over a decade, driving strategic features and technologies, and pioneered Conversational AI innovations to help transform the phone experience and help remove barriers of modality and languages. He was also the founding lead of Google center in Israel and supported other global sites. During his tenure at Google Yossi founded and spearheaded initiatives such as Google's AI for Social Good, Crisis Response, Google for Startups Accelerator, social and cultural initiatives seeding Google Arts & Culture, and programs fostering startups, sustainability, and STEM and AI literacy for youth.

Prior to Google Yossi was on the Computer Science faculty at Tel Aviv University, a visiting professor at Stanford, and a Research Scientist at Bell Labs. He’s published over 150 papers and is the inventor of over 70 patents. He pioneered some of the early technologies for internet privacy, contextual search, and the effective analysis of Big Data. He is a recipient of the Gödel Prize, an ACM Fellow, and a recipient of the ACM Kanellakis Theory and Practice Award for seminal work on streaming algorithms, data sketches, and large-scale data analytics

Yossi has a track record of impact-driven breakthrough research, innovation, advancing society-centered AI to help address global challenges and extensive product leadership. He is committed to advancing research and AI to help improve lives, transform society, and create a better future for all.

--------------------------------------------------

More about some of Yossi's work over the past years:

His team's work on Google’s Health AI is driving AI research to help transform healthcare, and help make healthcare more accessible for everyone, with multiple breakthroughs including Med-PaLM. Leadership of AI for Climate and Sustainability, work on climate crisis mitigation (e.g., Greenlight) as well as adaptation - with leadership work on Google’s Crisis Response initiative (SOS alerts, flood forecasting, wildfire detection).

Yossi's work on Generative AI includes efforts on Factuality (inc leading to Bard’s Double Check) and Efficiency (Speculative Decoding). He pioneered Conversational AI innovations towards ambient intelligence to help transforming the phone experience (Google Duplex, Call Screen, Hold for Me) and help remove barriers of modality and languages (Live Caption, Live Relay, Euphonia, Read Aloud).

He is a founding lead of Google’s AI for Social Good, Google for Startup Accelerator (with particular focus on AI & ML), Startups for Sustainable Development and Mind the Gap. He pioneered an initiative of bringing online hundreds of heritage collections (including the Dead Sea Scrolls), and helped establish Google’s Cultural Institute.

His work on Search included (Autocomplete, Google Trends, Search Console, and Search experiences in weather, sports, dictionary and more). He founded and has lead Google’s center in Israel, through growth to over 2500 on staff, supported Google’s growth (4X) in Bangalore, India, and oversaw Google’s Expanding Research Center in Africa.

Yossi is an ACM Fellow for contributions to the analysis of large data sets and data streams. His foundational work on data streams, data synopses and sketches, motivated by computational challenges in what was then the world’s largest data warehouses, was recognized with the the Gödel Prize in Theoretical Computer Science, and with the ACM Kanellakis Theory and Practice Award for the instrumental role it played "in the development of the field of streaming algorithms, which is one of the most prolific and highly regarded areas of data management research" and its broad applicability to large-scale data analytics.

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Differences between Patient and Clinician Submitted Images: Implications for Virtual Care of Skin Conditions
    Rajeev Rikhye
    Grace Eunhae Hong
    Margaret Ann Smith
    Aaron Loh
    Vijaytha Muralidharan
    Doris Wong
    Michelle Phung
    Nicolas Betancourt
    Bradley Fong
    Rachna Sahasrabudhe
    Khoban Nasim
    Alec Eschholz
    Kat Chou
    Peggy Bui
    Justin Ko
    Steven Lin
    Mayo Clinic Proceedings: Digital Health (2024)
    Preview abstract Objective: To understand and highlight the differences in clinical, demographic, and image quality characteristics between patient-taken (PAT) and clinic-taken (CLIN) photographs of skin conditions. Patients and Methods: This retrospective study applied logistic regression to data from 2500 deidentified cases in Stanford Health Care’s eConsult system, from November 2015 to January 2021. Cases with undiagnosable or multiple conditions or cases with both patient and clinician image sources were excluded, leaving 628 PAT cases and 1719 CLIN cases. Demographic characteristic factors, such as age and sex were self-reported, whereas anatomic location, estimated skin type, clinical signs and symptoms, condition duration, and condition frequency were summarized from patient health records. Image quality variables such as blur, lighting issues and whether the image contained skin, hair, or nails were estimated through a deep learning model. Results: Factors that were positively associated with CLIN photographs, post-2020 were as follows: age 60 years or older, darker skin types (eFST V/VI), and presence of skin growths. By contrast, factors that were positively associated with PAT photographs include conditions appearing intermittently, cases with blurry photographs, photographs with substantial nonskin (or nail/hair) regions and cases with more than 3 photographs. Within the PAT cohort, older age was associated with blurry photographs. Conclusion: There are various demographic, clinical, and image quality characteristic differences between PAT and CLIN photographs of skin concerns. The demographic characteristic differences present important considerations for improving digital literacy or access, whereas the image quality differences point to the need for improved patient education and better image capture workflows, particularly among elderly patients. View details
    Preview abstract Background: Skin conditions are extremely common worldwide, and are an important cause of both anxiety and morbidity. Since the advent of the internet, individuals have used text-based search (eg, “red rash on arm”) to learn more about concerns on their skin, but this process is often hindered by the inability to accurately describe the lesion’s morphology. In the study, we surveyed respondents’ experiences with an image-based search, compared to the traditional text-based search experience. Methods: An internet-based survey was conducted to evaluate the experience of text-based vs image-based search for skin conditions. We recruited respondents from an existing cohort of volunteers in a commercial survey panel; survey respondents that met inclusion/exclusion criteria, including willingness to take photos of a visible concern on their body, were enrolled. Respondents were asked to use the Google mobile app to conduct both regular text-based search (Google Search) and image-based search (Google Lens) for their concern, with the order of text vs. image search randomized. Satisfaction for each search experience along six different dimensions were recorded and compared, and respondents’ preferences for the different search types along these same six dimensions were recorded. Results: 372 respondents were enrolled in the study, with 44% self-identifying as women, 86% as White and 41% over age 45. The rate of respondents who were at least moderately familiar with searching for skin conditions using text-based search versus image-based search were 81.5% and 63.5%, respectively. After using both search modalities, respondents were highly satisfied with both image-based and text-based search, with >90% at least somewhat satisfied in each dimension and no significant differences seen between text-based and image-based search when examining the responses on an absolute scale per search modality. When asked to directly rate their preferences in a comparative way, survey respondents preferred image-based search over text-based search in 5 out of 6 dimensions, with an absolute 9.9% more preferring image-based search over text-based search overall (p=0.004). 82.5% (95% CI 78.2 - 86.3) reported a preference to leverage image-based search (alone or in combination with text-based search) in future searches. Of those who would prefer to use a combination of both, 64% indicated they would like to start with image-based search, indicating that image-based search may be the preferred entry point for skin-related searches. Conclusion: Despite being less familiar with image-based search upon study inception, survey respondents generally preferred image-based search to text-based search and overwhelmingly wanted to include this in future searches. These results suggest the potential for image-based search to play a key role in people searching for information regarding skin concerns. View details
    Preview abstract Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare. View details
    Conversational AI in health: Design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM
    Brenna Li
    Amy Wang
    Patricia Strachan
    Julie Anne Seguin
    Sami Lachgar
    Karyn Schroeder
    Renee Wong
    Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 10
    Preview abstract Although skin concerns are common, access to specialist care is limited. Artificial intelligence (AI)-assisted tools to support medical decisions may provide patients with feedback on their concerns while also helping ensure the most urgent cases are routed to dermatologists. Although AI-based conversational agents have been explored recently, how they are perceived by patients and clinicians is not well understood. We conducted a Wizard-of-Oz study involving 18 participants with real skin concerns. Participants were randomly assigned to interact with either a clinician agent (portrayed by a dermatologist) or an LLM agent (supervised by a dermatologist) via synchronous multimodal chat. In both conditions, participants found the conversation to be helpful in understanding their medical situation and alleviate their concerns. Through qualitative coding of the conversation transcripts, we provide insight on the importance of empathy and effective information-seeking. We conclude with design considerations for future AI-based conversational agents in healthcare settings. View details
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI (2024)
    Preview abstract BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. View details
    Preview abstract Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention is critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential to use photoplethysmography (PPG), a sensing technology available on most smartphones that can potentially enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of having major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compare the DLS with the office-based refit-WHO score, which adopts the shared predictors from WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but refitted on the UK Biobank (UKB) cohort. All models were trained on a development dataset (141,509 participants) and evaluated on a geographically separate test (54,856 participants) dataset, both from UKB. DLS’s C-statistic (71.1%, 95% CI 69.9–72.4) is non-inferior to office-based refit-WHO score (70.9%, 95% CI 69.7–72.2; non-inferiority margin of 2.5%, p<0.01) in the test dataset. The calibration of the DLS is satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increases the C-statistic by 1.0% (95% CI 0.6–1.4). DLS predicts ten-year MACE risk comparable with the office-based refit-WHO score. Interpretability analyses suggest that the DLS-extracted features are related to PPG waveform morphology and are independent of heart rate. Our study provides a proof-of-concept and suggests the potential of a PPG-based approach strategies for community-based primary prevention in resource-limited regions. View details
    Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
    Terry Spitz
    Malcolm Chelliah
    Heather Cole-Lewis
    Stephanie Farquhar
    Qinghan Xue
    Jenna Lester
    Cían Hughes
    Patricia Strachan
    Fraser Tan
    Peggy Bui
    Craig Mermel
    Lily Peng
    Sunny Virmani
    Ivor Horn
    Cameron Chen
    The Lancet eClinicalMedicine (2024)
    Preview abstract Background Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients of 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated to pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., “R”) was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p (R >0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case. Findings Across all dermatologic conditions, the HEAL metric was 80.5% for prioritizing AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0% respectively for prioritizing AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritizing AI performance of age subpopulations based on DALYs. Interpretation Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. View details
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Arxiv (2024) (to appear)
    Preview abstract At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI. View details
    Preview abstract Understanding complex relationships between human behavior and local contexts is crucial for various applications in public health, social science, and environmental studies. Traditional approaches often make use of small sets of manually curated, domain-specific variables to represent human behavior, and struggle to capture these intricate connections, particularly when dealing with diverse data types. To address this challenge, this work introduces a novel approach that leverages the power of graph neural networks (GNNs). We first construct a large dataset encompassing human-centered variables aggregated at postal code and county levels across the United States. This dataset captures rich information on human behavior (internet search behavior and mobility patterns) along with environmental factors (local facility availability, temperature, and air quality). Next, we propose a GNN-based framework designed to encode the connections between these diverse features alongside the inherent spatial relationships between postal codes and their containing counties. We then demonstrate the effectiveness of our approach by benchmarking the model on 27 target variables spanning three distinct domains: health, socioeconomic factors, and environmental measurements. Through spatial interpolation, extrapolation, and super-resolution tasks, we show that the proposed method can effectively utilize the rich feature set to achieve accurate predictions across diverse geospatial domains. View details
    Preview abstract Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components—GPPEs—from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended. View details