Pi-Chuan Chang
Pi-Chuan is the technical lead for the open-source project DeepVariant at Google Health. She began working on DeepVariant before its first open-source release in December 2017 and has led multiple releases over the years. At Google, she has led machine learning projects with public launches in product areas such as YouTube and Search. Pi-Chuan holds a PhD in computer science from Stanford, specializing in natural language processing and machine translation. She also holds a BS and an MS from National Taiwan University, where she worked on improved language modeling for Chinese speech recognition systems.
Authored Publications
Google Publications
Other Publications
Towards Generalist Biomedical AI
Danny Driess
Andrew Carroll
Chuck Lau
Ryutaro Tanno
Ira Ktena
Anil Palepu
Basil Mustafa
Aakanksha Chowdhery
Simon Kornblith
Philip Mansfield
Sushant Prakash
Renee Wong
Sunny Virmani
Christopher Semturs
Sara Mahdavi
Bradley Green
Ewa Dominowska
Joelle Barral
Karan Singhal
Pete Florence
NEJM AI (2024)
Abstract
BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery.
METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports.
RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
A deep-learning-based RNA-seq germline variant caller
Aarti Venkat
Andrew Carroll
Daniel Cook
Dennis Yelizarov
Francisco De La Vega
Yannick Pouliot
Bioinformatics Advances (2023)
Abstract
RNA-seq is a widely used technology for quantifying and studying gene expression. Many other applications have also been developed for RNA-seq, such as identifying quantitative trait loci or gene fusion events. However, germline variant calling has not been widely used because RNA-seq data tend to have high error rates and require special processing by variant callers. Here, we introduce a DeepVariant RNA-seq model capable of producing highly accurate variant calls from RNA-sequencing data. Our model outperforms existing approaches such as Platypus and GATK. We examine factors that influence accuracy, how our model addresses RNA editing events, and how additional thresholding allows our model's use in a production pipeline.
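The additional thresholding mentioned above can be illustrated with a minimal sketch. The record fields and the GQ cutoff below are illustrative assumptions, not DeepVariant's actual output schema or production settings:

```python
# Minimal sketch of quality-threshold filtering for variant calls.
# The dict fields and the default cutoff are hypothetical, chosen
# only to illustrate the idea of gating calls for a production pipeline.

def filter_calls(calls, min_gq=20):
    """Keep only calls whose genotype quality (GQ) meets the threshold."""
    return [c for c in calls if c["GQ"] >= min_gq]

calls = [
    {"pos": 101, "ref": "A", "alt": "G", "GQ": 45},
    {"pos": 202, "ref": "C", "alt": "T", "GQ": 12},  # low confidence, filtered out
    {"pos": 303, "ref": "G", "alt": "A", "GQ": 33},
]

kept = filter_calls(calls)
```

Raising the threshold trades recall for precision, which is why the cutoff would be tuned against benchmark data before production use.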
Accurate human genome analysis with Element Avidity sequencing
Andrew Carroll
Bryan Lajoie
Daniel Cook
Kelly N. Blease
Kishwar Shafin
Lucas Brambrink
Maria Nattestad
Semyon Kruglyak
bioRxiv (2023)
Abstract
We investigate the new sequencing technology Avidity from Element Biosciences. We show that Avidity whole genome sequencing matches the mapping and variant calling accuracy of Illumina at high coverages (30x-50x) and is noticeably more accurate at lower coverages (20x-30x). We quantify base error rates of Element reads, finding lower error rates, especially in homopolymer and tandem repeat regions. We use Element's ability to generate paired-end sequencing with longer insert sizes than typical short-read sequencing. We show that longer insert sizes result in even higher accuracy, with long-insert Element sequencing giving noticeably more accurate genome analyses at all coverages.
Improving variant calling using population data and deep learning
Andrew Carroll
Nae-Chyun Chen
Sidharth Goel
BMC Bioinformatics (2023)
Abstract
Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
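A minimal sketch of the idea behind an allele-frequency channel, assuming a simple 0-254 intensity scaling; the exact encoding used in the released population-aware models may differ:

```python
# Sketch of encoding population allele frequencies as a pileup channel.
# The 0-254 pixel scaling and the site-key format are assumptions for
# illustration, not the precise encoding in the DeepVariant models.

def af_channel_value(allele_frequency, max_value=254):
    """Map an allele frequency in [0, 1] to an integer channel intensity."""
    return int(round(allele_frequency * max_value))

# Frequencies as they might appear in a reference panel such as 1000 Genomes.
panel_afs = {"chr1:12345:A>G": 0.42, "chr1:67890:C>T": 0.003}
channel = {site: af_channel_value(af) for site, af in panel_afs.items()}
```

The point of such a channel is that the network sees population frequency alongside the read evidence, rather than frequency being applied as a post-hoc filter.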
A draft human pangenome reference
Wen-Wei Liao
Mobin Asri
Jana Ebler
Daniel Doerr
Marina Haukness
Shuangjia Lu
Julian K. Lucas
Jean Monlong
Haley J. Abel
Silvia Buonaiuto
Xian Chang
Haoyu Cheng
Justin Chu
Vincenza Colonna
Jordan M. Eizenga
Xiaowen Feng
Christian Fischer
Robert S. Fulton
Shilpa Garg
Cristian Groza
Andrea Guarracino
William T. Harvey
Simon Heumos
Kerstin Howe
Miten Jain
Tsung-Yu Lu
Charles Markello
Fergal J. Martin
Matthew W. Mitchell
Katherine M. Munson
Moses Njagi Mwaniki
Adam M. Novak
Hugh E. Olsen
Trevor Pesout
David Porubsky
Pjotr Prins
Jonas A. Sibbesen
Jouni Sirén
Chad Tomlinson
Flavia Villani
Mitchell R. Vollger
Lucinda L Antonacci-Fulton
Gunjan Baid
Carl A. Baker
Anastasiya Belyaeva
Konstantinos Billis
Andrew Carroll
Sarah Cody
Daniel Cook
Robert M. Cook-Deegan
Omar E. Cornejo
Mark Diekhans
Peter Ebert
Susan Fairley
Olivier Fedrigo
Adam L. Felsenfeld
Giulio Formenti
Adam Frankish
Yan Gao
Nanibaa’ A. Garrison
Carlos Garcia Giron
Richard E. Green
Leanne Haggerty
Kendra Hoekzema
Thibaut Hourlier
Hanlee P. Ji
Eimear E. Kenny
Barbara A. Koenig
Jan O. Korbel
Jennifer Kordosky
Sergey Koren
HoJoon Lee
Alexandra P. Lewis
Hugo Magalhães
Santiago Marco-Sola
Pierre Marijon
Ann McCartney
Jennifer McDaniel
Jacquelyn Mountcastle
Maria Nattestad
Sergey Nurk
Nathan D. Olson
Alice B. Popejoy
Daniela Puiu
Mikko Rautiainen
Allison A. Regier
Arang Rhie
Samuel Sacco
Ashley D. Sanders
Valerie A. Schneider
Baergen I. Schultz
Kishwar Shafin
Michael W. Smith
Heidi J. Sofia
Ahmad N. Abou Tayoun
Françoise Thibaud-Nissen
Francesca Floriana Tricomi
Justin Wagner
Brian Walenz
Jonathan M. D. Wood
Aleksey V. Zimin
Guillaume Bourque
Mark J. P. Chaisson
Paul Flicek
Adam M. Phillippy
Justin Zook
Evan E. Eichler
David Haussler
Ting Wang
Erich D. Jarvis
Karen H. Miga
Glenn Hickey
Erik Garrison
Tobias Marschall
Ira M. Hall
Heng Li
Benedict Paten
Nature (2023)
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Local read haplotagging enables accurate long-read small variant calling
Daniel Cook
Maria Nattestad
John E. Gorzynski
Sneha D. Goenka
Euan Ashley
Miten Jain
Karen Miga
Benedict Paten
Andrew Carroll
Kishwar Shafin
bioRxiv (2023)
Abstract
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing providers like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they are introduced. In this work, we have developed a local haplotype approximation method that enables state-of-the-art variant calling performance on multiple sequencing platforms, including the PacBio Revio platform and ONT R10.4 simplex and duplex data. This addition of local haplotype approximation makes DeepVariant a universal variant calling solution for all long-read sequencing platforms.
Ultra-Rapid Nanopore Whole Genome Genetic Diagnosis of Dilated Cardiomyopathy in an Adolescent With Cardiogenic Shock
John E. Gorzynski
Sneha D. Goenka
Kishwar Shafin
Dianna G. Fisk
Elizabeth Spiteri
Fritz J. Sedlazeck
Miten Jain
Jean Monlong
Trevor Pesout
Jonathan A Bernstein
Andrew Carroll
Kyla Dunn
Benedict Paten
Euan Ashley
Circulation: Genomic and Precision Medicine (2022)
Abstract
Rapid genetic diagnosis has the potential to guide clinical treatment in critically ill patients, leading to improved prognosis and decreased health care costs [1]. Until recently, the turnaround time for whole genome diagnostic testing precluded its integration into critical care decision making (typical rapid whole genome sequencing clinical testing returns results in 5-7 days). Here, we describe a case of a teenager presenting with cardiogenic shock in whom a genetic diagnosis was made in under 12 hours using a new ultra-rapid long-read whole genome sequencing assay and workflow [2,3]. A 13-year-old male previously in good health presented to his primary care provider with a nocturnal dry cough, decreased appetite, intermittent chest pain, and fatigue. Thoracic radiographs showed cardiomegaly, leading to echocardiography, which revealed a dilated left ventricle with an ejection fraction of 29%.
Ultra-rapid whole genome nanopore sequencing in a critical care setting
Andrew Carroll
Ankit Sethia
Benedict Paten
Christopher Wright
Courtney J. Wusthoff
Daniel R Garalde
Dianna G. Fisk
Elizabeth Spiteri
Euan Ashley
Fritz J. Sedlazeck
Gunjan Baid
Henry Chubb
Jeffrey W. Christle
John E. Gorzynski
Jonathan A Bernstein
Joseph Guillory
Joshua W. Knowles
Katherine Xiong
Kishwar Shafin
Kyla Dunn
Marco Perez
Maria Nattestad
Maura RZ Ruzhnikov
Megan E. Grove
Mehrzad Samadi
Michael Ma
Miten Jain
Scott R. Ceresnak
Sneha D. Goenka
Tanner D. Jensen
Tia Moscarello
Tong Zhu
Trevor Pesout
New England Journal of Medicine (2022)
Abstract
Background
Genetic disease is a major contributor to critical care hospitalization, especially in younger patients. While early genetic diagnosis can guide clinical management, the turnaround time for whole genome based diagnostic testing has traditionally been measured in months. Recent programs in neonatal populations have reduced turnaround time into the range of days and shown that rapid genetic diagnosis enhances patient care and reduces healthcare costs. Yet, most decisions in critical care need to be made on hourly timescales.
Methods
We developed a whole genome sequencing approach designed to provide a genetic diagnosis within hours. Optimized highly parallel nanopore sequencing was coupled to a high-performance cloud compute system to implement near real-time basecalling and alignment followed by accelerated central and graphics processor unit variant calling. A custom scheme for variant prioritization took only minutes to rank variants most likely to be deleterious allowing efficient manual review and classification according to American College of Medical Genetics and Genomics guidelines.
Results
We performed whole genome sequencing on 12 patients from the critical care units of Stanford hospitals. In 10 cases, the pipeline produced diagnostic results faster than all previously published clinical genome analyses. Per patient, DNA extraction, library preparation, and nanopore sequencing across 48 flow cells generated 173-236 gigabases of sequencing data in as little as 1:50 hours. After optimization, the average turnaround time was 7:58 hours (range, 7:18-9:00 hours). A pathogenic or likely pathogenic variant was identified in five of 12 patients (42%). After Sanger or short-read sequencing confirmation in a CLIA-approved laboratory, this validated diagnosis altered clinical management in every case.
Conclusions
We developed an approach to make a genetic diagnosis from whole genome sequencing in hours, returning actionable, cost-saving diagnostic information on critical care timescales.
precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions.
Andrew Carroll
Gunjan Baid
Howard Yang
Maria Nattestad
Sidharth Goel
Cell Genomics (2022)
Abstract
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with fastq files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based methods and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods outperformed the winners of the 2016 Truth Challenge across technologies, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.
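The small-variant benchmarking metrics behind these evaluations can be sketched as follows; the TP/FP/FN counts below are made up for illustration:

```python
# Sketch of the precision/recall/F1 metrics used to score variant
# callsets against a benchmark set such as GIAB. A true positive (TP)
# is a benchmark variant the caller found, a false positive (FP) is a
# call absent from the benchmark, and a false negative (FN) is a
# benchmark variant the caller missed. The counts are illustrative.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=99_000, fp=500, fn=1_000)
```

Stratifications simply repeat this computation within subsets of the genome (e.g. difficult-to-map regions or the MHC) so methods can be compared where calling is hardest.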
Abstract
Traditional methods that use a linear reference genome for analyses of whole genome sequencing data have been found to be inadequate for detection of structural variants, rare variation and variants that originate in high-complexity or repetitive regions of the human genome. Genome graphs help to systematically embed genetic variation from a population of samples into one reference structure. Though genome graphs have helped to reduce this mapping bias, there are still performance improvements that can be made. Here we present a workflow that uses population and pedigree genetic information to reduce reference bias and improve variant detection sensitivity as well as to generate a small list of candidate variants that are causal to rare genetic disorders at the genome scale.
Best: A Tool for Characterizing Sequencing Errors
Anastasiya Belyaeva
Andrew Carroll
Daniel Cook
Daniel Liu
Kishwar Shafin
bioRxiv (2022)
Abstract
Platform-dependent sequencing errors must be understood to develop more accurate sequencing technologies. We propose a new tool, best (Bam Error Stats Tool), for efficiently analyzing and summarizing error types in sequencing reads. best ingests reads aligned to a high-quality reference assembly and produces per-read stats, overall stats, stats for specific genomic intervals, and more, running 27 times faster than a previous tool. best has applications in quality control of sequencing runs and evaluating approaches for improving sequencing accuracy.
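The kind of per-read tallying best performs can be sketched on a toy alignment. The real tool parses BAM records; the column representation below is a simplification for illustration:

```python
# Toy sketch of per-read error counting in the spirit of best: given a
# read aligned to a trusted assembly, tally mismatches, insertions, and
# deletions. Each alignment column is a hypothetical
# (operation, read_base, ref_base) tuple rather than a real BAM record.

from collections import Counter

def count_errors(columns):
    stats = Counter()
    for op, read_base, ref_base in columns:
        if op == "M" and read_base != ref_base:
            stats["mismatch"] += 1
        elif op == "I":   # base present in the read but not the reference
            stats["insertion"] += 1
        elif op == "D":   # reference base missing from the read
            stats["deletion"] += 1
    return stats

alignment = [
    ("M", "A", "A"), ("M", "C", "T"), ("I", "G", None),
    ("M", "G", "G"), ("D", None, "A"), ("M", "T", "T"),
]
stats = count_errors(alignment)
```

Aggregating such counts per read, genome-wide, or within intervals (e.g. homopolymers) gives the kinds of summaries the abstract describes.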
Technical development of rapid whole genome nanopore sequencing and variant identification pipeline
Andrew Carroll
Ankit Sethia
Benedict Paten
Christopher Wright
Daniel R Garalde
Dianna G. Fisk
Elizabeth Spiteri
Euan Ashley
Fritz J. Sedlazeck
Gunjan Baid
Jean Monlong
Jeffrey W Christle
John E. Gorzynski
Jonathan A Bernstein
Joseph Guillory
Karen P. Dalton
Katherine Xiong
Kishwar Shafin
Maria Nattestad
Maura RZ Ruzhnikov
Megan E. Grove
Mehrzad Samadi
Miten Jain
Sneha D. Goenka
Tanner D. Jensen
Tong Zhu
Trevor Pesout
Nature Biotechnology (2022)
Abstract
Whole genome sequencing can identify pathogenic variants for genetic disease, but the time required for sequencing and analysis has been a barrier to its use in acutely ill patients. Here, we develop an approach to ultra-rapid nanopore whole genome sequencing that combines an efficient sample preparation protocol, distributed sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling, and fast variant filtration. We show that this framework provides accurate variant prioritization in less than half the fastest time recorded for an equivalent analysis to date.
DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
Aaron Wenger
Andrew Walker Carroll
Armin Töpfer
Ashish Teku Vaswani
Daniel Cook
Felipe Llinares
Gunjan Baid
Howard Cheng-Hao Yang
Jean-Philippe Vert
Kishwar Shafin
Maria Nattestad
Waleed Ammar
William J. Rowell
Nature Biotechnology (2022)
Abstract
Genomic analysis requires accurate sequencing in sufficient coverage and over difficult genome regions. Through repeated sampling of a circular template, Pacific Biosciences developed long (10-25 kb) reads with high overall accuracy, but lower homopolymer accuracy. Here, we introduce DeepConsensus, a transformer-based approach that leverages a unique alignment loss to correct sequencing errors. DeepConsensus reduces errors in PacBio HiFi reads by 42%, compared to the current approach. We show this increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9Mb to 17.2Mb), increase gene completeness (94% to 97%), reduce false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (QV43 to QV45), and reduce variant calling errors by 24%.
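The Q20/Q30/Q40 yields cited above are defined via the Phred scale. A minimal sketch with illustrative per-read error rates:

```python
# Sketch relating per-read error rates to the Phred-scaled quality
# thresholds (Q20 = 1% error, Q30 = 0.1%, Q40 = 0.01%) used to report
# read yield. The example error rates are illustrative, not real data.

import math

def phred_q(error_rate):
    """Phred quality: Q = -10 * log10(error rate)."""
    return -10 * math.log10(error_rate)

def yield_at_q(read_error_rates, q_threshold):
    """Fraction of reads whose empirical quality meets the threshold."""
    passing = [e for e in read_error_rates if phred_q(e) >= q_threshold]
    return len(passing) / len(read_error_rates)

error_rates = [1e-2, 5e-3, 1e-3, 5e-4, 1e-4]  # roughly Q20..Q40
q30_yield = yield_at_q(error_rates, 30)
```

Reducing read errors shifts the whole distribution of per-read qualities upward, which is why the relative yield gains grow at stricter thresholds (9% at Q20 but 90% at Q40).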
How DeepConsensus Works
Aaron Wenger
Anastasiya Belyaeva
Andrew Carroll
Armin Töpfer
Ashish Teku Vaswani
Daniel Cook
Felipe Llinares
Gunjan Baid
Howard Yang
Jean-Philippe Vert
Kishwar Shafin
Maria Nattestad
Waleed Ammar
William J. Rowell
(2022)
Abstract
These are slides for a public video about DeepConsensus
Knowledge distillation for fast and accurate DNA sequence correction
Anastasiya Belyaeva
Joel Shor
Daniel Cook
Kishwar Shafin
Daniel Liu
Armin Töpfer
Aaron Wenger
William J. Rowell
Howard Yang
Andrew Carroll
Maria Nattestad
Learning Meaningful Representations of Life (LMRL) Workshop, NeurIPS (2022)
Abstract
Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus, a distilled transformer-encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind. Distilled DeepConsensus is 1.3x faster and 1.5x smaller than its larger counterpart while improving the yield of high quality reads (Q30) over the HMM-based method by 1.69x (vs. 1.73x for the larger model). With improved accuracy of genomic sequences, Distilled DeepConsensus improves downstream applications of genomic sequence analysis such as reducing variant calling errors by 39% (34% for the larger model) and improving genome assembly quality by 3.8% (4.2% for the larger model). We show that the representations learned by Distilled DeepConsensus are similar between faster and slower models.
A population-specific reference panel for improved genotype imputation in African Americans
Jared O’Connell
Meghan Moreno
Helen Li
Nadia Litterman
Elizabeth Noblin
Anjali Shastri
Elizabeth H. Dorfman
Suyash Shringarpure
23andMe Research Team
Adam Auton
Andrew Carroll
Communications Biology (2021)
Abstract
There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.
DeepTrio: Variant Calling in Families Using Deep Learning
Gunjan Baid
Howard Yang
Maria Nattestad
Sidharth Goel
bioRxiv (2021)
Abstract
Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. This allows DeepTrio to learn how to weigh sequencing error, mapping error, de novo rates, and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling and calling with duos (child and one parent) solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.
Accurate, scalable cohort variant calls using DeepVariant and GLnexus
Helen Li
Michael F. Lin
Andrew Walker Carroll
Bioinformatics (2021)
Abstract
Motivation
Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging.
Results
We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline.
Availability and Implementation
We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later.
Supplementary information
Supplementary data are available at Bioinformatics online.
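The Mendelian-consistency metric used above can be sketched for biallelic sites; the genotypes below are illustrative unordered allele pairs, not real callset data:

```python
# Sketch of the Mendelian-consistency check used as a callset quality
# metric in father-mother-child trios: a child's genotype is consistent
# if one allele can have come from the mother and the other from the
# father. Alleles are coded 0 (reference) and 1 (alternate).

def mendelian_consistent(child, mother, father):
    a, b = child
    return ((a in mother and b in father) or
            (b in mother and a in father))

trio_sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # violation: mother carries no alt allele
]
violations = sum(not mendelian_consistent(c, m, f) for c, m, f in trio_sites)
```

Counting such violations across a trio callset gives a benchmark-free quality signal, since true variants should rarely violate inheritance.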
Abstract
In this blog, we discuss a new channel in DeepVariant which encodes haplotype information in long-read data, and was released with DeepVariant v1.1. We review how haplotypes relate to variant calling, show examples improved by the channel, and quantify the accuracy improvement with PacBio HiFi.
DeepVariant over the years
Andrew Carroll
Daniel Cook
Gunjan Baid
Howard Yang
Maria Nattestad
(2021)
Abstract
The development of DeepVariant was motivated by the following question: if computational biologists can look at pileup images of reads to identify variants, can we train an image classification model to perform this task? To answer this question, we began working on DeepVariant in 2015, and the first open-source version (v0.4) of the software was released in late 2017. Since v0.4, the project has come a long way, and there have been eight additional releases. We originally began development on Illumina whole-genome sequencing (WGS) data, and the first release included one model for this data type. Over the years, we have added support for additional sequencing technologies, and we now provide models for Illumina whole-exome sequencing (WES) data, Pacific Biosciences (PacBio) HiFi data, and a hybrid model for Illumina and PacBio WGS data combined. We have also collaborated with a team at UC Santa Cruz to train DeepVariant using Oxford Nanopore data. The resulting tool, PEPPER-DeepVariant, uses PEPPER to generate candidates more effectively for Nanopore data. In addition to new models, new capabilities have been added, such as the best practices for cohort calling in v0.9 and DeepTrio, a trio and duo caller, in v1.1. For each release, we focus on building highly accurate models, reducing runtime, and improving the user experience. In this post, we summarize the improvements in accuracy and runtime over the years and highlight a few categories of changes that have led to these improvements.
Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks
Kishwar Shafin
Trevor Pesout
Maria Nattestad
Sidharth Goel
Gunjan Baid
Mikhail Kolmogorov
Jordan M. Eizenga
Karen Miga
Paolo Carnevali
Miten Jain
Andrew Carroll
Benedict Paten
Nature Methods (2021)
Abstract
Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).
Pangenomics enables genotyping of known structural variants in 5202 diverse genomes
Jouni Sirén
Jean Monlong
Xian Chang
Adam M. Novak
Jordan M. Eizenga
Charles Markello
Jonas A. Sibbesen
Glenn Hickey
Andrew Carroll
Namrata Gupta
Stacey Gabriel
Thomas W. Blackwell
Aakrosh Ratan
Kent D. Taylor
Stephen S. Rich
Jerome I. Rotter
David Haussler
Erik Garrison
Benedict Paten
Science (2021)
Abstract
INTRODUCTION
Modern genomics depends on inexpensive short-read sequencing. Sequenced reads up to a few hundred base pairs in length are computationally mapped to estimated source locations in a reference genome. These read mappings are used in myriad sequencing-based assays. For example, through a process called genotyping, mapped reads from a DNA sample can be used to infer the combination of alleles present at each site in the reference genome.
RATIONALE
A single reference genome cannot capture the diversity within even a single person (who gets a genome copy from each parent), let alone in the whole human population. Genomes differ not only by point variations, where one or a few bases are different, but also by structural variations, where differences can be much larger than an individual read. When a person’s genome differs from the reference by a structural variation, the reference may contain no location to correctly map the corresponding reads. Although newer long-read sequencing allows structural variation to be more directly observed in sequencing reads, short-read sequencing is still less expensive and more widely available.
RESULTS
We present a short read–mapping tool, Giraffe. Giraffe maps to a pangenome reference that describes many genomes and the differences between them. Giraffe can accurately map reads to thousands of genomes embedded in a pangenome reference as quickly as existing tools map to a single reference genome. Simulations in which the true mapping for each read is known show that Giraffe is as accurate as the most accurate previously published tool. Giraffe achieves this speed and accuracy by using a variety of algorithmic techniques. In particular, and in contrast to previous tools, it focuses on mapping to the paths in the pangenome that are observed in individuals’ genomes: the reference haplotypes. This has two key benefits. First, it prioritizes alignments that are consistent with known sequences, avoiding combinations of alleles that are biologically unlikely. Second, it reduces the size of the problem by limiting the sequence space to which the reads could be aligned. This deals effectively with complex graph regions where most paths represent rare or nonexistent sequences.
Using Giraffe in place of a single reference genome reduces mapping bias, which is the tendency to incorrectly map reads that differ from the reference genome. Combining Giraffe with state-of-the-art genotyping algorithms demonstrates that Giraffe mappings produce accurate genotyping results.
Using mappings from Giraffe, we genotyped 167,000 recently discovered structural variations in short-read samples for 5202 people at an average computational cost of $1.50 per sample. We present estimates for the frequency of different versions of these structural variations in the human population as a whole and within individual subpopulations. We identify thousands of these structural variations as expression quantitative trait loci (eQTLs), which are associated with gene-expression levels.
CONCLUSION
Giraffe demonstrates the practicality of a pangenomic approach to short-read mapping. This approach allows short-read data to genotype single-nucleotide variations, short insertions and deletions, and structural variations more accurately. For structural variations, this allowed the estimation of population frequencies across a diverse cohort of 5000 individuals. A single reference genome must choose one version of any variation to represent, leaving the other versions unrepresented. By making more broadly representative pangenome references practical, Giraffe attempts to make genomics more inclusive.
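The genotyping step described in the introduction, inferring the alleles present at a site from the reads mapped there, can be caricatured as thresholding the alt-allele fraction. A toy sketch (real genotypers use likelihood models over base and mapping qualities; the thresholds here are invented):

```python
def naive_genotype(bases, ref, alt, het_band=(0.2, 0.8)):
    """Toy diploid genotyper: classify a site by its alt-allele fraction
    among the read bases observed there. Thresholds are illustrative."""
    frac = sum(b == alt for b in bases) / len(bases)
    if frac < het_band[0]:
        return (ref, ref)      # homozygous reference
    if frac > het_band[1]:
        return (alt, alt)      # homozygous alternate
    return (ref, alt)          # heterozygous

# 4 of 10 read bases carry the alternate allele -> heterozygous call.
print(naive_genotype("GGAAGAGAGG", ref="G", alt="A"))  # ('G', 'A')
```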
View details
Preview abstract
Exome and genome sequencing typically use a reference genome against which reads are mapped and variants are called. Many, if not most, clinical and research workflows still use the prior version of the human reference genome (GRCh37), although an updated and more complete version (GRCh38) was produced in 2013.
We present a method that identifies potential artifacts when using one reference relative to a different reference. We simulate error-free reads from GRCh37 and GRCh38, and map and call variants from one read set to the opposite reference.
When simulated reads are analyzed relative to their own reference, there are no variants called on GRCh37 and 14 on GRCh38. However, when GRCh38 reads are analyzed on GRCh37, there are 69,720 heterozygous variants called with GATK4-HC. Since the reference is monoploid, a heterozygous call is likely an artifact.
Inspection suggests these represent segmental duplications captured in GRCh38 but excluded or collapsed in GRCh37. Some overlap with common resources: 32,688 are present in dbSNP, 28,830 are present in gnomAD (with 25,062 listed as filtered for HWE violation), and 19 HET and 199 HOM variants overlap ClinVar. In the v3.3.2 Genome in a Bottle truth set, 1,123 of these variants overlap the confident regions for HG002, and they are inconsistently labelled as variants or reference. DeepVariant, which is trained on the truth set, seems to have learned about this variability, allowing some measurement of segmental duplication to be made from its output.
Reverse comparison using GRCh37 reads on GRCh38 finds only 30% as many HET variants. This suggests that migrating workflows to GRCh38 eliminates a number of recurrent artifacts, and could present an additional filtration resource for GRCh37 variant files and annotation resources.
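The artifact signal described above, heterozygous calls from a monoploid simulated read set, can be tallied with a few lines of VCF parsing. A minimal sketch over hypothetical records (the study itself used GATK4-HC and DeepVariant output on full genomes):

```python
def is_het(gt: str) -> bool:
    """True if a diploid GT field (e.g. '0/1' or '1|0') has two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and alleles[0] != alleles[1]

def count_het_calls(vcf_lines):
    """Count heterozygous genotype calls in minimal VCF-style records.
    On a monoploid simulated read set, each HET call flags a likely
    reference artifact such as a collapsed segmental duplication."""
    het = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        fmt, sample = fields[8].split(":"), fields[9].split(":")
        if is_het(sample[fmt.index("GT")]):
            het += 1
    return het

records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG_SIM",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",  # HET -> artifact candidate
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:28",  # HOM -> plausible real difference
]
print(count_het_calls(records))  # 1
```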
View details
Preview abstract
Introduction: Around 5% (1,168) of protein-coding genes in the human genome contain an exon that is difficult to map with typical next-generation sequencing (NGS) read lengths due to homologous pseudogenes or segmental duplications. Among the difficult-to-map genes are 193 with known medical relevance, including CYP2D6, GBA, SMN1/2, and VWF. Long-read DNA sequencing provides increased mappability, accessing many of the difficult-to-map regions by connecting the homologous exon to neighboring unique sequence. Until recently, the read-level accuracy of long-read sequencing had made it challenging to accurately call small variants. The recently developed HiFi reads from the PacBio Sequel II System provide both long read length (15 kb - 25 kb) for mappability and high read quality (>99%) for accurate variant calling, expanding the regions of the genome that can be characterized with high precision and recall.
Materials and Methods: Human reference sample HG002 was sequenced to 35-fold HiFi read coverage on the PacBio Sequel II System. Matched 35-fold coverage with NGS reads was obtained on the Illumina NovaSeq. Reads were mapped to the GRCh38 reference genome using pbmm2 for HiFi reads and BWA for NGS reads. Small variants were called using DeepVariant. The variant callsets were compared to each other and to the Genome in a Bottle (GIAB) v4.1 benchmark within exons previously reported to be problematic for NGS.
Results: For difficult-to-map exons within the GIAB benchmark, HiFi reads detect 1,269 true benchmark variants, 21% more than are detected with NGS reads (1,053). Small variant precision in difficult-to-map exons is 97.7% for HiFi reads, markedly higher than 92.0% for NGS reads. Extending outside of the benchmark, HiFi reads detect 241 small variants missed by NGS reads across 42 difficult-to-map exons of medically relevant genes, including 14 variants in C4A, 5 in SMN1, and 2 in STRC.
Conclusion: HiFi reads have both high mappability and high read quality, which enables accurate small variant calling in difficult-to-map genes that are challenging for NGS. We predict that large-scale use of HiFi reads in disease cohort studies will discover additional disease genes and variants that have remained beyond the reach of NGS.
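The precision figures above come from comparing a callset against benchmark truth variants; with variants keyed by position and alleles, the arithmetic is set algebra. A sketch with made-up variants (not GIAB data; production comparisons use haplotype-aware benchmarking tools):

```python
# Variants keyed as (chrom, pos, ref, alt); all entries are invented.
truth   = {("chr22", 100, "A", "G"), ("chr22", 250, "C", "T"), ("chr22", 400, "G", "A")}
callset = {("chr22", 100, "A", "G"), ("chr22", 250, "C", "T"), ("chr22", 500, "T", "C")}

tp = len(truth & callset)   # called and in truth
fp = len(callset - truth)   # called but not in truth
fn = len(truth - callset)   # in truth but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn)           # 2 1 1
print(round(precision, 3))  # 0.667
```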
View details
Preview abstract
In this blog we discuss the newly published use of PacBio Circular Consensus Sequencing (CCS) at human genome scale. We demonstrate that DeepVariant trained for this data type achieves accuracy similar to available Illumina genomes, and is the only method to achieve competitive accuracy in indel calling. Early access to this model is available now by request, and we expect general availability in our next DeepVariant release (v0.8).
View details
Preview abstract
In this post, we formulate DNA sequencing error correction as a multiclass classification problem and propose two deep learning solutions. Our first approach corrects errors in a single read, whereas the second approach, shown in Figure 1, builds a consensus from several reads to predict the correct DNA sequence. Our Colab notebook tutorial implements the second approach using the Nucleus and TensorFlow libraries. Our goal is to show how Nucleus can be used alongside TensorFlow for solving machine learning problems in genomics.
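The consensus idea behind the second approach can be illustrated without a neural network: a per-column majority vote over aligned reads. A minimal sketch (the post's actual model learns this mapping with Nucleus and TensorFlow):

```python
from collections import Counter

def consensus(reads):
    """Majority-vote base at each column of equal-length aligned reads.
    A toy stand-in for the learned consensus model described in the post."""
    return "".join(
        Counter(col).most_common(1)[0][0]  # most frequent base in the column
        for col in zip(*reads)
    )

reads = ["ACGTA", "ACGTT", "ACCTA"]  # invented aligned reads with errors
print(consensus(reads))  # ACGTA
```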
View details
Preview abstract
Next-generation sequencing can sample the whole genome (WGS) or the whole exome (WES), the 1-2% of the genome that codes for proteins. Machine learning approaches to variant calling achieve high accuracy on WGS data, but the reduced number of training examples causes training with WES data alone to achieve lower accuracy. We propose and compare three data augmentation strategies for improving performance on WES data: 1) joint training with WES and WGS data, 2) warmstarting the WES model from a WGS model, and 3) joint training with the sequencing type specified. All three approaches improve accuracy over a model trained on WES data alone, suggesting that models can generalize insights from the larger WGS data while retaining performance on the specialized WES problem. These data augmentation approaches may apply to other problem areas in genomics where several specialized models would each see only a subset of the genome.
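Strategy 3 above amounts to adding the sequencing type as an input feature so that one model can condition on WES versus WGS context. An illustrative sketch (invented feature encoding, not DeepVariant's actual input format):

```python
def tag_examples(examples, seq_type):
    """Append a sequencing-type flag to each example's feature vector.
    Feature values and the 0/1 encoding are illustrative only."""
    flag = 1.0 if seq_type == "WES" else 0.0
    return [features + [flag] for features in examples]

wgs = tag_examples([[0.2, 0.9], [0.4, 0.1]], "WGS")
wes = tag_examples([[0.3, 0.8]], "WES")
combined = wgs + wes  # one pool for joint training, type visible to the model
print(combined)  # [[0.2, 0.9, 0.0], [0.4, 0.1, 0.0], [0.3, 0.8, 1.0]]
```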
View details
Preview abstract
In this blog, we discuss how sequencing coverage involves trade-offs between cost and accuracy. We explore how computational methods that improve accuracy can also be understood as reducing cost. We compare current methods to historical accuracies. Finally, we explore the types of errors present at low and high coverages.
View details
Preview abstract
We explore three different training strategies to leverage whole-genome sequencing data to improve model performance on the specialized task of variant calling from whole-exome sequencing data: 1) jointly training with both WGS and WES data, 2) warmstarting from a pre-trained WGS model, and 3) including the sequencing type as an input to the model.
View details
Single Molecule High-Fidelity (HiFi) Sequencing with >10 kb Libraries
Aaron Wenger
Andrew Carroll
Arkarachai Fungtammasan
Chen-Shan Chin
Dario Cantu
David R. Rank
Gregory T. Concepcion
Jue Ruan
Paul Peluso
Richard J. Hall
Sergey Koren
William J. Rowell
Plant and Animal Genomes(2019)
Preview abstract
Recent improvements in sequencing chemistry and instrument performance combine to create a new PacBio data type, Single Molecule High-Fidelity reads (HiFi reads). Increased read length and improvements in library construction enable average read lengths of 10-20 kb with average sequence identity greater than 99% from raw single-molecule reads. The resulting reads have accuracy comparable to short-read NGS but with 50-100 times longer read length. Here we benchmark the performance of this data type by sequencing and genotyping the Genome in a Bottle (GIAB) HG002 human reference sample from the National Institute of Standards and Technology (NIST). We further demonstrate the general utility of HiFi reads by analyzing multiple clones of Cabernet Sauvignon. Three different clones were sequenced and de novo assembled with the CANU assembly algorithm, generating draft assemblies of very high contiguity equal to or better than earlier assembly efforts using PacBio long reads. Using the Cabernet Sauvignon Clone 8 assembly as a reference, we mapped the HiFi reads generated from Clones 6 and 47 to identify single-nucleotide polymorphisms (SNPs) and structural variants (SVs) specific to each of the three samples.
View details
Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
Aaron M. Wenger
Paul Peluso
William J. Rowell
Richard J. Hall
Gregory T. Concepcion
Jana Ebler
Arkarachai Fungtammasan
Nathan D. Olson
Armin Töpfer
Michael Alonge
Medhat Mahmoud
Yufeng Qian
Chen-Shan Chin
Adam M. Phillippy
Michael C. Schatz
Gene Myers
Mark A. DePristo
Jue Ruan
Tobias Marschall
Fritz J. Sedlazeck
Justin M. Zook
Heng Li
Sergey Koren
Andrew Carroll
David R. Rank
Michael W. Hunkapiller
Nature Biotechnology(2019)
Preview abstract
The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the ‘genome in a bottle’ (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
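The concordance figures above map directly onto Phred-scaled quality values (QV = -10 log10 of the error rate), the scale behind labels like Q45 elsewhere on this page. A worked sketch:

```python
import math

def phred_qv(concordance):
    """Phred-scaled quality value from a concordance (accuracy) fraction:
    QV = -10 * log10(error rate)."""
    return -10 * math.log10(1 - concordance)

print(round(phred_qv(0.99997), 1))  # 45.2  (assembly concordance of 99.997%)
print(round(phred_qv(0.998), 1))    # 27.0  (99.8% read-level HiFi accuracy)
```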
View details
Preview abstract
In this work, we investigate variant calling across a pedigree of mosquito (Anopheles gambiae) genomes. Using rates of Mendelian violation, we assess pipelines developed to call variation in humans when applied to mosquito samples. We demonstrate the ability to rapidly retrain DeepVariant without the need for a gold standard set by using sites that are consistent versus inconsistent with Mendelian inheritance. We show that this substantially improves calling accuracy by tuning DeepVariant for the new genome context. Finally, we generate a model for accurate variant calling on low-coverage mosquito genomes and a corresponding variant callset.
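The Mendelian-violation signal used above reduces to checking whether a child's genotype can be assembled from one allele of each parent. A minimal sketch with genotypes as allele tuples (an invented encoding, not the pipeline's actual representation):

```python
from itertools import product

def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed by taking one allele
    from each parent. Genotypes are unordered tuples of alleles."""
    return any(
        sorted(child) == sorted((m, f))
        for m, f in product(mother, father)
    )

print(mendelian_consistent((0, 1), (0, 0), (1, 1)))  # True
print(mendelian_consistent((1, 1), (0, 0), (0, 1)))  # False: a violation
```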
View details
Preview abstract
The human genome is now investigated through high-throughput functional assays, and through the generation of population genomic data. These advances support the identification of functional genetic variants and the prediction of traits (e.g. deleterious variants and disease). This review summarizes lessons learned from the large-scale analyses of genome and exome data sets, modeling of population data and machine-learning strategies to solve complex genomic sequence regions. The review also portrays the rapid adoption of artificial intelligence/deep neural networks in genomics; in particular, deep learning approaches are well suited to model the complex dependencies in the regulatory landscape of the genome, and to provide predictors for genetic variant calling and interpretation.
View details
A universal SNP and small-indel variant caller using deep neural networks
Scott Schwartz
Dan Newburger
Jojo Dijamco
Nam Nguyen
Pegah T. Afshar
Sam S. Gross
Lizzie Dorfman
Mark A. DePristo
Nature Biotechnology(2018)
Preview abstract
Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.
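The pileup-image input described above stacks the reads overlapping a candidate site into a tensor. A toy sketch of the idea (DeepVariant's real pileup images encode base, quality, strand, and other channels; this integer encoding is invented):

```python
# Map each base to a small integer; '-' marks a deletion or missing base.
BASE = {"A": 1, "C": 2, "G": 3, "T": 4, "-": 0}

def pileup_matrix(reads):
    """Stack aligned read fragments into a matrix a CNN could consume.
    Rows are reads, columns are reference positions in the window."""
    return [[BASE[b] for b in read] for read in reads]

reads = ["ACGT", "ACTT", "AC-T"]  # invented reads aligned to one window
print(pileup_matrix(reads))  # [[1, 2, 3, 4], [1, 2, 4, 4], [1, 2, 0, 4]]
```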
View details
A deep learning approach to pattern recognition for short DNA sequences
Akosua Busia
George Dahl
Clara Fannjiang
Lizzie Dorfman
Mark DePristo
bioRxiv(2018)
Preview abstract
Sequence-to-sequence alignment is a widely-used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16s ribosomal DNA. In particular, we show that when trained on 16s sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16s read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences.
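A DNN that consumes raw reads needs the DNA string turned into numeric input; one-hot encoding is the standard first step. A minimal sketch (the paper's actual featurization may differ):

```python
def one_hot(seq):
    """One-hot encode a DNA string over the alphabet ACGT,
    producing one four-element row per base."""
    order = "ACGT"
    return [[1 if base == b else 0 for b in order] for base in seq]

print(one_hot("ACG"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
```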
View details
Deterministic Statistical Mapping of Sentences to Underspecified Semantics
Preview
Hiyan Alshawi
Michael Ringgaard
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)
Uptraining for Accurate Deterministic Question Parsing
Preview
Slav Petrov
Michael Ringgaard
Hiyan Alshawi
Proceedings of the 2010 Conference on Empirical Methods on Natural Language Processing (EMNLP '10)
Improving Chinese-English machine translation through better source-side linguistic processing
Ph.D. Thesis(2009)
Disambiguating DE for Chinese-English machine translation
Discriminative reordering with Chinese grammatical relations features
Stanford University’s Chinese-to-English Statistical Machine Translation System for the 2008 NIST Evaluation
Michel Galley
Jenny R. Finkel
Christopher D. Manning
The 2008 NIST Open Machine Translation Evaluation Meeting(2008)
Preview abstract
This document describes Stanford University’s first entry into a NIST MT evaluation. Our entry to the 2008 evaluation mainly focused on establishing a competent baseline with a phrase-based system similar to (Och and Ney, 2004; Koehn et al., 2007). In a three-week effort prior to the evaluation, our attention focused on scaling up our system to exploit nearly all Chinese-English parallel data permissible under the constrained track, incorporating competitive language models into the decoder using Gigaword and Google n-grams, evaluating Chinese word segmentation models, and incorporating a document classifier as a pre-processing stage to the decoder.
This document is organized as follows: in Section 2, we describe linguistic resources used for our submission. In Section 3, we present the four main components of our translation system, i.e., a phrase-based translation system, a Chinese word segmenter, a text categorizer, and a truecaser. Finally, we discuss our results in Section 4.
View details
Optimizing Chinese word segmentation for machine translation performance
A discriminative syntactic word order model for machine translation
Kristina Toutanova
Association for Computational Linguistics, Prague, Czech Republic(2007), pp. 9-16
Preview abstract
We present a global discriminative statistical word order model for machine translation. Our model combines syntactic movement and surface movement information, and is discriminatively trained to choose among word orders. We show that combining discriminative training with features to detect these two different kinds of movement phenomena leads to substantial improvements in word-ordering performance over strong baselines. Integrating this word order model into a baseline MT system results in a 2.4-point BLEU improvement for English-to-Japanese translation.
View details
Automatically Detecting Action Items in Audio Meeting Recordings
William Morgan
Surabhi Gupta
Jason M. Brenier
SIGdial, Association for Computational Linguistics, Sydney, Australia(2006), pp. 96-103
Preview abstract
Identification of action items in meeting recordings can provide immediate access to salient information in a medium notoriously difficult to search and summarize. To this end, we use a maximum entropy model to automatically detect action item related utterances from multi-party audio meeting recordings. We compare the effect of lexical, temporal, syntactic, semantic, and prosodic features on system performance. We show that on a corpus of action item annotations on the ICSI meeting recordings, characterized by high imbalance and low inter-annotator agreement, the system performs at an F measure of 31.92%. While this is low compared to better-studied tasks on more mature corpora, the relative usefulness of the features towards this task is indicative of their usefulness on more consistent annotations, as well as to related tasks.
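The F measure reported above is the harmonic mean of precision and recall. A worked sketch with hypothetical precision and recall values (the paper's underlying P and R are not reported here):

```python
def f_measure(precision, recall):
    """Balanced F measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical values chosen only to illustrate the formula.
print(round(f_measure(0.34, 0.30), 3))  # 0.319
```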
View details
A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005
Improved language model adaptation using existing and derived external resources
Lin-Shan Lee
IEEE Workshop on Automatic Speech Recognition and Understanding, U.S. Virgin Islands(2003)
Preview abstract
Adaptation of language models to obtain better parameters for the topics addressed by the spoken documents to be recognized has been a key issue for speech recognition. In this paper, we propose to collect existing as well as derived external resources for improved language model adaptation. The derived external resources are those retrieved from the Internet with a search engine, using the baseline transcriptions of the input spoken documents as queries. The design of such queries is also analyzed, taking into account the special structure of the Chinese language. The obtained existing and derived external resources are then used in model adaptation under a Clustering-Classification framework. Very encouraging results were obtained in preliminary experiments with two test sets: broadcast news and interview recordings.
View details
Improved Chinese Broadcast News Transcription by Language Modeling with Temporally Consistent Training Corpora and Iterative Phrase Extraction
Preview abstract
In this paper, an iterative Chinese new-phrase extraction method based on intra-phrase association and context-variation statistics is proposed. A Chinese language model enhancement framework including lexicon expansion is then developed. Extensive experiments on Chinese broadcast news transcription were performed to explore the achievable improvements with respect to the degree of temporal consistency of the adaptation corpora. Very encouraging results were obtained, and detailed analysis is discussed.
View details