Pi-Chuan Chang

Pi-Chuan is the technical lead for the open source project DeepVariant at Google Health. She began working on DeepVariant before its first open source release in December 2017, and has led multiple releases over the years. At Google, she has led machine learning projects with public launches in various product areas, such as YouTube and Search. Pi-Chuan holds a CS PhD from Stanford, specializing in natural language processing and machine translation. Pi-Chuan also has a BS and MS from National Taiwan University, where she worked on better language modeling for Chinese speech recognition systems.
Authored Publications
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI (2024)
    BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
    Local read haplotagging enables accurate long-read small variant calling
    Daniel Cook
    Maria Nattestad
    John E. Gorzynski
    Sneha D. Goenka
    Euan Ashley
    Miten Jain
    Karen Miga
    Benedict Paten
    Andrew Carroll
    Kishwar Shafin
    bioRxiv (2023)
    Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing technologies like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they are introduced. In this work, we have developed a local haplotype approximation method that enables state-of-the-art variant calling performance with multiple sequencing platforms, including the PacBio Revio platform and ONT R10.4 simplex and duplex data. This addition of local haplotype approximation makes DeepVariant a universal variant calling solution for all long-read sequencing platforms.
    Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
    A deep-learning-based RNA-seq germline variant caller
    Aarti Venkat
    Andrew Carroll
    Daniel Cook
    Dennis Yelizarov
    Francisco De La Vega
    Yannick Pouliot
    Bioinformatics Advances (2023)
    RNA-seq is a widely used technology for quantifying and studying gene expression. Many other applications have been developed for RNA-seq as well, such as identifying quantitative trait loci or gene fusion events. However, germline variant calling has not been widely used because RNA-seq data tend to have high error rates and require special processing by variant callers. Here, we introduce a DeepVariant RNA-seq model capable of producing highly accurate variant calls from RNA-sequencing data. Our model outperforms existing approaches such as Platypus and GATK. We examine factors that influence accuracy, how our model addresses RNA editing events, and how additional thresholding can be used to allow for our models' use in a production pipeline.
    A draft human pangenome reference
    Wen-Wei Liao
    Mobin Asri
    Jana Ebler
    Daniel Doerr
    Marina Haukness
    Shuangjia Lu
    Julian K. Lucas
    Jean Monlong
    Haley J. Abel
    Silvia Buonaiuto
    Xian Chang
    Haoyu Cheng
    Justin Chu
    Vincenza Colonna
    Jordan M. Eizenga
    Xiaowen Feng
    Christian Fischer
    Robert S. Fulton
    Shilpa Garg
    Cristian Groza
    Andrea Guarracino
    William T. Harvey
    Simon Heumos
    Kerstin Howe
    Miten Jain
    Tsung-Yu Lu
    Charles Markello
    Fergal J. Martin
    Matthew W. Mitchell
    Katherine M. Munson
    Moses Njagi Mwaniki
    Adam M. Novak
    Hugh E. Olsen
    Trevor Pesout
    David Porubsky
    Pjotr Prins
    Jonas A. Sibbesen
    Jouni Sirén
    Chad Tomlinson
    Flavia Villani
    Mitchell R. Vollger
    Lucinda L Antonacci-Fulton
    Gunjan Baid
    Carl A. Baker
    Anastasiya Belyaeva
    Konstantinos Billis
    Andrew Carroll
    Sarah Cody
    Daniel Cook
    Robert M. Cook-Deegan
    Omar E. Cornejo
    Mark Diekhans
    Peter Ebert
    Susan Fairley
    Olivier Fedrigo
    Adam L. Felsenfeld
    Giulio Formenti
    Adam Frankish
    Yan Gao
    Nanibaa’ A. Garrison
    Carlos Garcia Giron
    Richard E. Green
    Leanne Haggerty
    Kendra Hoekzema
    Thibaut Hourlier
    Hanlee P. Ji
    Eimear E. Kenny
    Barbara A. Koenig
    Jan O. Korbel
    Jennifer Kordosky
    Sergey Koren
    HoJoon Lee
    Alexandra P. Lewis
    Hugo Magalhães
    Santiago Marco-Sola
    Pierre Marijon
    Ann McCartney
    Jennifer McDaniel
    Jacquelyn Mountcastle
    Maria Nattestad
    Sergey Nurk
    Nathan D. Olson
    Alice B. Popejoy
    Daniela Puiu
    Mikko Rautiainen
    Allison A. Regier
    Arang Rhie
    Samuel Sacco
    Ashley D. Sanders
    Valerie A. Schneider
    Baergen I. Schultz
    Kishwar Shafin
    Michael W. Smith
    Heidi J. Sofia
    Ahmad N. Abou Tayoun
    Francoise Thibauld-Nissen
    Francesa Floriana Tricomi
    Justin Wagner
    Brian Walenz
    Jonathan M. D. Wood
    Aleksey V. Zimin
    Guillaume Borque
    Mark J. P. Chaisson
    Paul Flicek
    Adam M. Phillippy
    Justin Zook
    Evan E. Eichler
    David Haussler
    Ting Wang
    Erich D. Jarvis
    Karen H. Miga
    Glenn Hickey
    Erik Garrison
    Tobias Marschall
    Ira M. Hall
    Heng Li
    Benedict Paten
    Nature (2023)
    Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
    Accurate human genome analysis with Element Avidity sequencing
    Andrew Carroll
    Bryan Lajoie
    Daniel Cook
    Kelly N. Blease
    Kishwar Shafin
    Lucas Brambrink
    Maria Nattestad
    Semyon Kruglyak
    bioRxiv (2023)
    We investigate the new sequencing technology Avidity from Element Biosciences. We show that Avidity whole genome sequencing matches mapping and variant calling accuracy with Illumina at high coverages (30x-50x) and is noticeably more accurate at lower coverages (20x-30x). We quantify base error rates of Element reads, finding lower error rates, especially in homopolymer and tandem repeat regions. We use Element’s ability to generate paired end sequencing with longer insert sizes than typical short-read sequencing. We show that longer insert sizes result in even higher accuracy, with long insert Element sequencing giving noticeably more accurate genome analyses at all coverages.
    DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
    Aaron Wenger
    Andrew Walker Carroll
    Armin Töpfer
    Ashish Teku Vaswani
    Daniel Cook
    Felipe Llinares
    Gunjan Baid
    Howard Cheng-Hao Yang
    Jean-Philippe Vert
    Kishwar Shafin
    Maria Nattestad
    Waleed Ammar
    William J. Rowell
    Nature Biotechnology (2022)
    Genomic analysis requires accurate sequencing in sufficient coverage and over difficult genome regions. Through repeated sampling of a circular template, Pacific Biosciences developed long (10-25kb) reads with high overall accuracy, but lower homopolymer accuracy. Here, we introduce DeepConsensus, a transformer-based approach which leverages a unique alignment loss to correct sequencing errors. DeepConsensus reduces errors in PacBio HiFi reads by 42%, compared to the current approach. We show this increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9Mb to 17.2Mb), increase gene completeness (94% to 97%), reduce false gene duplication rate (1.1% to 0.5%), and improve assembly base accuracy (QV43 to QV45), and also reduce variant calling errors by 24%.
    Ultra-rapid whole genome nanopore sequencing in a critical care setting
    Andrew Carroll
    Ankit Sethia
    Benedict Paten
    Christopher Wright
    Courtney J. Wusthoff
    Daniel R Garalde
    Dianna G. Fisk
    Elizabeth Spiteri
    Euan Ashley
    Fritz J. Sedlazeck
    Gunjan Baid
    Henry Chubb
    Jeffrey W Christle
    John E. Gorzynski
    Jonathan A Bernstein
    Joseph Guillory
    Joshua W. Knowles
    Katherine Xiong
    Kishwar Shafin
    Kyla Dunn
    Marco Perez
    Maria Nattestad
    Maura RZ Ruzhnikov
    Megan E. Grove
    Mehrzad Samadi
    Michael Ma
    Miten Jain
    Scott R. Ceresnak
    Sneha D. Goenka
    Tanner D. Jensen
    Tia Moscarello
    Tong Zhu
    Trevor Pesout
    New England Journal of Medicine (2022)
    BACKGROUND: Genetic disease is a major contributor to critical care hospitalization, especially in younger patients. While early genetic diagnosis can guide clinical management, the turnaround time for whole genome based diagnostic testing has traditionally been measured in months. Recent programs in neonatal populations have reduced turnaround time into the range of days and shown that rapid genetic diagnosis enhances patient care and reduces healthcare costs. Yet, most decisions in critical care need to be made on hourly timescales. METHODS: We developed a whole genome sequencing approach designed to provide a genetic diagnosis within hours. Optimized highly parallel nanopore sequencing was coupled to a high-performance cloud compute system to implement near real-time basecalling and alignment followed by accelerated central and graphics processor unit variant calling. A custom scheme for variant prioritization took only minutes to rank variants most likely to be deleterious allowing efficient manual review and classification according to American College of Medical Genetics and Genomics guidelines. RESULTS: We performed whole genome sequencing on 12 patients from the critical care units of Stanford hospitals. In 10 cases, the pipeline produced diagnostic results faster than all previously published clinical genome analyses. Per patient, DNA extraction, library preparation, and nanopore sequencing across 48 flow cells generated 173–236 GigaBases of sequencing data in as little as 1:50 hours. After optimization, the average turnaround time was 7:58 hours (range 7:18–9:00 hours). A pathogenic or likely pathogenic variant was identified in five out of 12 patients (42%). After Sanger or short read sequencing confirmation in a CLIA-approved laboratory, this validated diagnosis altered clinical management in every case. CONCLUSIONS: We developed an approach to make a genetic diagnosis from whole genome sequencing in hours, returning actionable, cost-saving diagnostic information on critical care timescales.
    The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with fastq files, 20 challenge participants applied their variant calling pipeline and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based methods and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods out-performed the winners of the 2016 Truth Challenge across technologies, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.