September 27, 2024
Mercy Asiedu, Research Scientist, Google Research, and Nichole Young-Lin, Clinical Lead, Google Health
We present our work on developing and evaluating a machine learning model for cardiotocography, to predict fetal well-being, and to understand what factors influence model performance.
Cardiotocography (CTG) is a Doppler ultrasound-based technique used during pregnancy and labor to monitor fetal well-being by recording fetal heart rate (FHR) and uterine contractions (UC). CTG can be done continuously or intermittently, with leads placed either externally or internally. External CTG involves the use of two sensors placed on the birthing parent’s belly: an ultrasound transducer placed above the fetal heart position to monitor FHR, and a tocodynamometer (pressure sensor) placed on the fundus of the uterus to measure UC.
Currently, providers interpret CTG recordings using guidelines like those from the National Institute of Child Health and Human Development (NICHD) or the International Federation of Gynecology and Obstetrics (FIGO). These standards define different patterns in the CTG and FHR traces that may indicate fetal distress.
Today we present work from our recent paper, “Development and evaluation of deep learning models for cardiotocography interpretation”, in which we describe research on our new machine learning (ML) model, which is intended to provide objective interpretation assistance to health providers, reduce their burden, and potentially improve fetal outcomes. Using an open-source CTG dataset, we develop end-to-end neural network-based models to predict measures of fetal well-being, including both objective (fetal arterial cord blood pH, i.e., fetal acidosis) and subjective (fetal Apgar scores) measures. Given the potentially high-stakes nature of the use case if utilized in a clinical setting, we perform extensive evaluations to examine how the model performs with varying inputs, including FHR only, FHR+UC, and FHR+UC+Metadata.
This work builds a model development and evaluation pipeline to enable fetal well-being prediction that takes into account limited data, clinical metadata, and intermittent methods used in low-resource settings.
Presently, CTG and ultrasound are the primary means of evaluating fetal well-being in utero. Although CTG is routinely used in medical practice, its application to continuous intrapartum fetal monitoring is associated with a high false-positive rate with limited demonstration of improvements in fetal outcomes. This high false-positive rate has led to an increase in cesarean section and operative vaginal delivery rates with limited improvements in neonatal outcomes. This is likely due to the complexity of reading and interpreting the fetal heart tracings, and the subjective nature of visual interpretation methods and intra- and inter-observer variability when reading the tracings. These issues are exacerbated in low-resource facilities where access to skilled interpreters is even more limited.
Top: Challenges of current visual CTG interpretation. Bottom: Proposed clinical use-case of deep learning algorithms for CTG interpretation and assistive clinical decision-making.
Current research methods using ML algorithms to classify abnormal CTGs typically use tabulated rules-based extraction of diagnostic features, such as summary statistics of fetal heart rate. While this approach has shown promise for improving clinical decision support, feature extraction discards much of the rich information in the CTG time series. As a result, CTG interpretation research has recently shifted toward deep learning methods that take the physiological time series data directly as input [1, 2, 3]. However, these methods do not typically compare performance differences between objective and subjective ground truth labels, and do not explore the effects of intermittent measurements or clinical metadata.
Using an open license dataset, CTU-UHB Intrapartum Cardiotocography Database, which has 552 FHR and UC CTG signal pairs up to 90 minutes before delivery for a total of ~50,000 minutes of recordings:
We begin with the CTG-net network architecture, which convolves the paired FHR and UC input signals temporally before conducting a depthwise convolution to learn the relationship between them. We add the following methodological configurations:
Development pipeline for the model, which uses FHR and UC inputs and generates a predicted outcome depending on the classification task.
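To make the two convolution stages concrete, here is a minimal pure-Python sketch of a temporal convolution followed by a depthwise convolution over the FHR and UC channels. It is an illustration under stated assumptions, not the CTG-net implementation: the function names, kernel values, and the depth multiplier of 1 are all hypothetical, and, as in most deep learning frameworks, the "convolution" is computed as a cross-correlation.

```python
def temporal_conv(channels, kernels):
    """Slide each 1-D kernel along time over each channel separately.

    channels: [fhr, uc], two equal-length lists of floats.
    kernels:  list of 1-D kernels (hypothetical values, normally learned).
    Returns out[f][c][t] using 'valid' padding.
    """
    out = []
    for k in kernels:
        per_channel = []
        for x in channels:
            per_channel.append([
                sum(k[j] * x[t + j] for j in range(len(k)))
                for t in range(len(x) - len(k) + 1)
            ])
        out.append(per_channel)
    return out


def depthwise_conv(features, weights):
    """Combine the FHR and UC feature maps of each temporal filter.

    Each temporal filter f has its own pair weights[f] = (w_fhr, w_uc);
    there is no mixing across filters, which is what makes it depthwise.
    """
    return [
        [w[0] * a + w[1] * b for a, b in zip(fmap[0], fmap[1])]
        for fmap, w in zip(features, weights)
    ]


fhr = [1.0, 2.0, 3.0, 4.0]
uc = [0.0, 1.0, 0.0, 1.0]
features = temporal_conv([fhr, uc], kernels=[[1.0, 1.0], [1.0, -1.0]])
mixed = depthwise_conv(features, weights=[(1.0, 1.0), (1.0, -1.0)])
```

The second stage is where the relationship between heart rate and contractions can be learned, since each output sample depends on both channels at the same time step.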
We create a pre-processing pipeline for input signals to improve data quality, smooth the signal, and account for gaps. This includes imputing missing measurements, random cropping (for pre-training and specific training evaluations), additive multiscale noise for data augmentation, and downsampling. This generates 4.3M minutes (n=496 patients) of signals for pre-training, ~150k minutes (n=496 patients) for training, and ~1,700 minutes (n=56 patients) for testing.
Pre-processing pipeline.
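The pre-processing steps can be sketched roughly as follows. This is a minimal illustration rather than our pipeline: linear interpolation is only one possible imputation method, "multiscale" noise is read here as Gaussian noise summed over several standard deviations, and all function names and parameter values are hypothetical.

```python
import random


def impute_gaps(signal):
    """Fill missing samples (None) by linear interpolation (assumed method)."""
    valid = [i for i, v in enumerate(signal) if v is not None]
    if not valid:
        raise ValueError("signal has no valid samples")
    filled = list(signal)
    for i in range(len(filled)):
        if filled[i] is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:            # gap at the start: copy first valid value
            filled[i] = signal[right]
        elif right is None:         # gap at the end: copy last valid value
            filled[i] = signal[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * signal[left] + w * signal[right]
    return filled


def random_crop(signal, length, rng):
    """Return a contiguous segment of `length` samples at a random offset."""
    start = rng.randrange(len(signal) - length + 1)
    return signal[start:start + length]


def add_multiscale_noise(signal, scales, rng):
    """Add zero-mean Gaussian noise at each standard deviation in `scales`."""
    return [v + sum(rng.gauss(0.0, s) for s in scales) for v in signal]


def downsample(signal, factor):
    """Average consecutive blocks of `factor` samples; drop any remainder."""
    n = len(signal) // factor
    return [sum(signal[i * factor:(i + 1) * factor]) / factor for i in range(n)]


rng = random.Random(0)
fhr = [140.0, None, None, 146.0, 150.0, None, 148.0, 144.0]
clean = impute_gaps(fhr)
augmented = add_multiscale_noise(random_crop(clean, 4, rng), [0.5, 2.0], rng)
coarse = downsample(clean, 2)
```

Random cropping and noise are applied only when generating training examples, which is how a few hundred recordings can yield millions of minutes of pre-training signal.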
Given the limited number of patients in the open license dataset (n=552), we pre-train the model on cropped signal segments taken from before the last 30 minutes of each recording and then fine-tune on the last 30 minutes, which we use as our primary time point of interest.
Pre-training and fine-tuning pipeline used to improve performance on small datasets.
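As a rough sketch of how a single recording could be divided between the two stages, the snippet below separates the final 30 minutes from the earlier signal and draws random pre-training crops from the earlier part only. The 4 Hz sampling rate, the crop length, and the function names are assumptions made for illustration.

```python
import random

SAMPLES_PER_MINUTE = 4 * 60  # assumption: signals sampled at 4 Hz


def split_recording(recording, final_minutes=30):
    """Return (earlier_part, final_segment) of one recording."""
    cut = len(recording) - final_minutes * SAMPLES_PER_MINUTE
    if cut < 0:
        raise ValueError("recording is shorter than the final segment")
    return recording[:cut], recording[cut:]


def sample_pretraining_crops(earlier, crop_minutes, n_crops, rng):
    """Draw random crops that never touch the final segment."""
    crop_len = crop_minutes * SAMPLES_PER_MINUTE
    if crop_len > len(earlier):
        raise ValueError("crop is longer than the available signal")
    crops = []
    for _ in range(n_crops):
        start = rng.randrange(len(earlier) - crop_len + 1)
        crops.append(earlier[start:start + crop_len])
    return crops


recording = list(range(90 * SAMPLES_PER_MINUTE))  # stand-in for a 90-minute trace
earlier, final_segment = split_recording(recording)
crops = sample_pretraining_crops(earlier, crop_minutes=30, n_crops=5,
                                 rng=random.Random(0))
```

Keeping the pre-training crops disjoint from the final 30 minutes means the model sees that segment for the first time during fine-tuning.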
CTG use comes in two primary formats, intermittent and continuous. In most high-resource settings, continuous CTGs are used in the clinic throughout labor to continuously monitor fetal heart rate. These signals, which are typically digital, record uterine contractions and the fetal heart rate. However, in low-resource settings, intermittent CTGs are often used, which may cover only about 30 minutes at any point during labor, and are then printed out for interpretation by the provider.
The open source data from the CTU-UHB database came from a continuous CTG setting, in contrast with the intermittent analog CTGs typically seen in low-resource settings. One of our key contributions is to understand how training and evaluating on intermittent time points impacts model performance. We simulated intermittent settings as part of our evaluation process by splitting the 90-minute signals in the dataset into 30-minute signals and training and evaluating the model at different time points.
Intermittent CTG measurements are more common in low-resource settings. An intermittent CTG is a 20–30 minute measurement taken during labor at an unknown time point before birth. We simulate this setting in our evaluation to understand how models either trained or tested on intermittent datasets perform.
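The simulation of intermittent measurements can be sketched as slicing each recording into non-overlapping 30-minute windows counted back from delivery. This is an illustrative sketch only: the 240 samples per minute (4 Hz) and the function name are assumptions.

```python
def intermittent_windows(recording, window_minutes=30, samples_per_minute=240):
    """Split a recording into non-overlapping windows, counted back from delivery.

    Returns a dict keyed by (start, end) in minutes before delivery,
    e.g. (30, 0) is the final 30 minutes. Leftover samples at the
    beginning that do not fill a whole window are dropped.
    """
    window_len = window_minutes * samples_per_minute
    windows = {}
    end = len(recording)
    while end - window_len >= 0:
        start = end - window_len
        minutes_before = (len(recording) - end) // samples_per_minute
        key = (minutes_before + window_minutes, minutes_before)
        windows[key] = recording[start:end]
        end = start
    return windows


recording = list(range(90 * 240))  # stand-in for a 90-minute trace
windows = intermittent_windows(recording)
# keys: (30, 0), (60, 30), (90, 60) minutes before delivery
```

A model can then be trained on one window and evaluated on another to probe how sensitive it is to when during labor the measurement was taken.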
Another key methodological contribution is our use of three outcome labels from the dataset:
For evaluation purposes we perform the following comparisons.
We find that our approach performs comparably to the reported AUROC in CTG-net, even though it is trained on a smaller dataset. When we train and evaluate both methods on the same dataset we find that our method improves model performance by 10+ AUROC percentage points.
| | AUROC | |
|---|---|---|
| CTG-net* | 0.68 ± 0.03 | – |
| CTG-net (on the same dataset we used) | 0.57 ± 0.08 | – |
| Our Model† | 0.68 ± 0.07 | 0.27 (0.18) |
| Clinicians | – | 0.45 (95% CI: 0.23-0.68) |
This table compares the performance of the models and of clinicians on CTG data for binary classification of abnormal fetal status (pH < 7.2 or Apgar at 1-min < 7). *=Mean and std reported over 10 random training seeds, †=Mean and std reported over 1,000 bootstrap samples.
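For readers who want to reproduce this style of reporting, the following sketch shows how the binary labels and a bootstrapped AUROC could be computed. The thresholds come from the caption above; the function names, the resampling details (such as skipping single-class resamples), and the toy data are assumptions for illustration and not our evaluation code.

```python
import random
import statistics


def ph_label(ph):
    """1 = abnormal (cord blood pH < 7.2), 0 = normal."""
    return int(ph < 7.2)


def apgar_label(apgar_1min):
    """1 = abnormal (1-minute Apgar < 7), 0 = normal."""
    return int(apgar_1min < 7)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_boot=1000, seed=0):
    """Mean and std of AUROC over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            values.append(auroc(ys, [scores[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)


# Toy data: cord pH values and hypothetical model scores.
ph_values = [7.05, 7.31, 7.18, 7.25, 7.40, 7.10, 7.22, 7.35]
scores = [0.90, 0.20, 0.40, 0.50, 0.10, 0.70, 0.60, 0.30]
labels = [ph_label(p) for p in ph_values]
mean_auc, std_auc = bootstrap_auroc(labels, scores, n_boot=200)
```

With a test set of only 56 patients, the spread across bootstrap samples is a useful reminder of how uncertain any single AUROC estimate is.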
We find that combining FHR+UC achieves the highest model performance for both pH and Apgar classification. For the intermittent training and evaluation tasks, we find that the Apgar prediction task exhibits less robustness and more variability across different trained time periods, whereas the objective pH value is more stable. We also find that the pre-training step enables the highest model performance. Adding clinical metadata, such as maternal age and health status (e.g., pre-eclampsia / gestational diabetes), slightly improves model performance for pH, but less so for Apgar.
AUROC for pH (top) and Apgar (bottom) classification models evaluated on a cropped signal from different time intervals. Error bars depict the standard error, computed over 1,000 bootstrap samples. Markers to the left of the vertical gray line indicate the paradigm used to train and evaluate the baseline models.
We found significant differences in baseline performance between subgroups with frequent and infrequent UC signal gaps (UC missing) for pH prediction, and between subgroups with frequent and infrequent FHR signal gaps (FHR missing) for Apgar prediction. With metadata, the performance disparities observed with pH prediction were mitigated. However, including metadata increased the AUROC performance disparities for demographic and clinical-related subgroups on this task.
Subgroup AUROC performance for the pH (top) and Apgar (bottom) classification baseline models. Error bars depict the standard error, computed over 1,000 bootstrap samples.
We are currently exploring open-sourcing our models in hopes that other researchers and stakeholders can build on this work with their own datasets to evaluate it for their clinical use cases, keeping in mind the limitations described below.
This study had limitations that constrain the generalizability of our findings. First, we used de-identified open-source CTGs from 552 patients at a single hospital in Prague, Czech Republic. To enhance the robustness of our findings, future investigations should involve a larger and more diverse dataset sourced from maternity centers worldwide, encompassing varied clinical contexts, demographics, and outcomes. Second, because many resource-limited settings lack automated CTG digitization infrastructure, we had to simulate intermittent CTG use cases using recordings from a facility with digitized recordings. Third, our study did not include a comparison of algorithmic performance against clinicians viewing the same dataset; future research should explore different combinations of human and algorithmic use. Finally, further work is needed to understand how such prediction algorithms can be optimally integrated into clinical workflows to improve neonatal outcomes.
We would like to acknowledge Dr. Kwaku Asah-Opoku who inspired this work. We would also like to acknowledge Nicole Chiou who worked passionately on this project during her internship at Google, and core contributors: Mercy Asiedu, Nichole Young-Lin, Christopher Kelly, Tiya Tiyasirichokchai, Abdoulaye Diack, Julie Cattiau, Sanmi Koyejo, and Katherine Heller. Thanks to Marian Croak for her support and leadership.