Predicting fetal well-being from cardiotocography signals using AI
September 27, 2024
Mercy Asiedu, Research Scientist, Google Research, and Nichole Young-Lin, Clinical Lead, Google Health
We present our work on developing and evaluating a machine learning model that interprets cardiotocography signals to predict fetal well-being, and on understanding which factors influence model performance.
Cardiotocography (CTG) is a Doppler ultrasound–based technique used during pregnancy and labor to monitor fetal well-being by recording fetal heart rate (FHR) and uterine contractions (UC). CTG can be done continuously or intermittently, with leads placed either externally or internally. External CTG involves the use of two sensors placed on the birthing parent’s belly: an ultrasound transducer placed above the fetal heart position to monitor FHR, and a tocodynamometer (pressure sensor) placed on the fundus of the uterus to measure UC.
Currently, providers interpret CTG recordings using guidelines such as those from the National Institute of Child Health and Human Development (NICHD) or the International Federation of Gynecology and Obstetrics (FIGO). These standards define patterns in the FHR and UC traces that may indicate fetal distress.
Today we present work from our recent paper, “Development and evaluation of deep learning models for cardiotocography interpretation”, in which we describe a new machine learning (ML) model designed to provide objective interpretation assistance to health providers, reduce burden, and potentially improve fetal outcomes. Using an open-source CTG dataset, we develop end-to-end neural network–based models to predict measures of fetal well-being, including both objective (fetal arterial cord blood pH, i.e., fetal acidosis) and subjective (fetal Apgar score) measures. Given the potentially high-stakes nature of the use case if deployed in a clinical setting, we perform extensive evaluations to examine how the model performs with varying inputs, including FHR only, FHR+UC, and FHR+UC+Metadata.
Improving CTG interpretation with deep neural networks
Presently, CTG and ultrasound are the primary means of evaluating fetal well-being in utero. Although CTG is routinely used in medical practice, continuous intrapartum fetal monitoring with CTG is associated with a high false-positive rate, which has led to an increase in cesarean section and operative vaginal delivery rates with limited demonstrated improvement in neonatal outcomes. This is likely due to the complexity of reading and interpreting fetal heart tracings, the subjective nature of visual interpretation, and intra- and inter-observer variability. These issues are exacerbated in low-resource facilities, where access to skilled interpreters is even more limited.
Current research methods using ML algorithms to classify abnormal CTGs typically rely on rules-based extraction of tabulated diagnostic features, such as summary statistics of the fetal heart rate. While this approach has shown promise for clinical decision support, feature extraction discards much of the rich information in the raw CTG time series. As a result, recent work on CTG interpretation has shifted toward deep learning methods that take the physiological time series directly as input [1, 2, 3]. However, these methods typically do not compare performance between objective and subjective ground truth labels, and do not explore the effects of intermittent measurements or clinical metadata.
Using an open-license dataset, the CTU-UHB Intrapartum Cardiotocography Database, which contains 552 paired FHR and UC CTG recordings covering up to 90 minutes before delivery (a total of ~50,000 minutes of recordings):
- We highlight the feasibility of using deep learning methods to predict fetal hypoxia from CTG.
- We conduct evaluation studies to analyze the effect of:
- The choice of objective (arterial umbilical cord blood pH) vs. subjective (Apgar score) labels,
- The signal time interval used for training and testing,
- Simulated intermittent signals, as seen in low-resource environments, on predictive performance.
Building on current state-of-the-art models for CTG interpretation
Model architecture
We begin with the CTG-net network architecture, which convolves the paired FHR and UC input signals temporally before applying a depthwise convolution to learn the relationship between them. We then make the following methodological additions (see the sketch after this list):
- Architecture and hyperparameter optimization: We run a randomized convolution parameter and hyperparameter search to select the optimal model configurations and hyperparameters for the tasks.
- Single input variation: We develop a 1D convolutional neural network variant of the model that takes a single signal (FHR or UC) as input, allowing us to explore the performance of each signal individually.
- Addition of metadata: We add the clinical metadata as a vector to the input.
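To make the architecture concrete, here is a minimal Keras sketch of how a CTG-net-style model with these modifications might be assembled. The layer sizes, kernel widths, and default signal length are illustrative assumptions rather than the configurations chosen by our hyperparameter search, and the single-signal case (n_signals=1) only approximates the separate 1D convolutional variant described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ctg_model(n_signals=2, signal_len=7200, n_metadata=0):
    """Illustrative CTG-net-style classifier; layer sizes, kernel widths, and the
    7200-sample (30 min at 4 Hz) default are placeholder choices, not the
    configuration selected by our hyperparameter search."""
    # Paired FHR and UC traces stacked as rows of a (signals, time, 1) "image".
    signals = layers.Input(shape=(n_signals, signal_len, 1), name="fhr_uc")

    # Temporal convolution applied to each signal independently.
    x = layers.Conv2D(16, kernel_size=(1, 64), padding="same", use_bias=False)(signals)
    x = layers.BatchNormalization()(x)

    # Depthwise convolution spanning the signal axis learns the FHR-UC relationship.
    x = layers.DepthwiseConv2D(kernel_size=(n_signals, 1), use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    x = layers.AveragePooling2D(pool_size=(1, 8))(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)

    if n_metadata > 0:
        # Clinical metadata enters as a flat vector concatenated with the signal features.
        metadata = layers.Input(shape=(n_metadata,), name="metadata")
        x = layers.Concatenate()([x, metadata])
        inputs = [signals, metadata]
    else:
        inputs = signals

    output = layers.Dense(1, activation="sigmoid", name="abnormal")(x)
    return tf.keras.Model(inputs=inputs, outputs=output)
```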
Pre-processing
We create a pre-processing pipeline for the input signals to improve data quality, smooth the signal, and account for gaps. This includes imputing missing measurements, random cropping (for pre-training and specific training evaluations), additive multiscale noise for data augmentation, and downsampling. This generates 4.3M minutes (n=496 patients) of signals for pre-training, ~150k minutes (n=496 patients) for training, and ~1,700 minutes (n=56 patients) for testing.
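As a rough illustration of these steps, a minimal pre-processing function for a single trace might look like the following; the imputation rule, noise scales, and downsampling factor are assumptions for the sketch, not our exact pipeline.

```python
import numpy as np

def preprocess_trace(signal, crop_len=7200, noise_std=1.0, downsample=4, rng=None):
    """Illustrative pre-processing for a single FHR or UC trace (1D array).

    The gap-filling rule, noise scales, and downsampling factor are placeholder
    assumptions, not the exact values used in our pipeline."""
    rng = rng or np.random.default_rng()
    x = np.asarray(signal, dtype=np.float32)

    # Impute missing measurements (recorded as 0 or NaN) by linear interpolation.
    missing = np.isnan(x) | (x == 0)
    if missing.any() and not missing.all():
        idx = np.arange(len(x))
        x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])

    # Random crop to a fixed window (used for pre-training and training-time evaluations).
    if len(x) > crop_len:
        start = rng.integers(0, len(x) - crop_len + 1)
        x = x[start:start + crop_len]

    # Additive noise at multiple temporal scales (coarse noise upsampled to signal length).
    for factor in (1, 16, 256):
        coarse = rng.normal(0.0, noise_std, size=int(np.ceil(len(x) / factor)))
        x = x + np.repeat(coarse, factor)[: len(x)].astype(np.float32)

    # Downsample by simple decimation.
    return x[::downsample]
```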
Pre-training
Given the limited number of patients in the open-license dataset (n=552), we pre-train the model on cropped signal segments from before the last 30 minutes of each recording, and then fine-tune on the last 30 minutes, which we use as our primary time point of interest.
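Sketched with the illustrative build_ctg_model above and random placeholder data, this two-stage scheme looks roughly like the following; the window length, optimizer, and epoch counts are assumptions, not our tuned settings.

```python
import numpy as np
import tensorflow as tf

# Random placeholder arrays standing in for the pre-processed dataset; in practice these
# would come from the pipeline above. Assumes 30-minute windows of 1800 samples (1 Hz
# after downsampling) and the two-signal (FHR + UC) input layout from the earlier sketch.
WINDOW = 1800
pretrain_x = np.random.rand(496, 2, WINDOW, 1)    # crops from before the last 30 minutes
pretrain_y = np.random.randint(0, 2, size=496)
finetune_x = np.random.rand(496, 2, WINDOW, 1)    # the last 30 minutes of each recording
finetune_y = np.random.randint(0, 2, size=496)

model = build_ctg_model(signal_len=WINDOW)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])

# Stage 1: pre-train on cropped segments drawn from earlier in labor.
model.fit(pretrain_x, pretrain_y, epochs=20, batch_size=32)

# Stage 2: fine-tune on the final 30-minute windows, the primary time point of interest.
model.fit(finetune_x, finetune_y, epochs=10, batch_size=32)
```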
Intermittent versus continuous CTG use cases
CTG use comes in two primary formats: intermittent and continuous. In most high-resource settings, continuous CTG is used in the clinic throughout labor, producing typically digital recordings of both uterine contractions and fetal heart rate. However, in low-resource settings, intermittent CTG is often used, which may cover only about 30 minutes at any point during labor and is then printed out for interpretation by the provider.
The open-source data from the CTU-UHB database came from a continuous CTG setting, in contrast to the intermittent, analog CTGs typically seen in low-resource settings. One of our key contributions is understanding how training and evaluating on intermittent time points impacts model performance. We simulated intermittent settings as part of our evaluation process by splitting the 90-minute signals in the dataset into 30-minute segments and training and evaluating the model at different time points.
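To illustrate, assuming each recording is a NumPy array with a known sampling rate, the 30-minute intermittent windows can be cut from the end of the trace as follows (the sampling rate and helper name are our own assumptions):

```python
import numpy as np

def intermittent_windows(signal, fs=4, window_min=30):
    """Split a continuous trace into consecutive 30-minute windows, ordered from the end
    of the recording (closest to delivery) backwards.

    Returns a list of (minutes_before_delivery, window) pairs, e.g., for a 90-minute
    trace: [(0, last 30 min), (30, middle 30 min), (60, first 30 min)]. The sampling
    rate `fs` is an assumption about the recording format.
    """
    window_len = window_min * 60 * fs
    windows = []
    end = len(signal)
    minutes_before = 0
    while end - window_len >= 0:
        windows.append((minutes_before, signal[end - window_len:end]))
        end -= window_len
        minutes_before += window_min
    return windows
```

The model can then be trained and evaluated on each offset (0, 30, or 60 minutes before delivery) separately to compare continuous and intermittent use cases.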
Predicting objective and subjective ground truth labels
Another key methodological contribution is our use of three outcome labels from the dataset (see the sketch after this list):
- The arterial umbilical cord blood pH is an objective measurement, typically available in high-resource settings, that tracks fetal acidosis, which is an indication of fetal distress. The pH was considered abnormal if the value was less than 7.2.
- The Apgar score is a subjective measure (ranging from 0–10) recorded by a clinician after delivery that reflects the general health of the newborn. Apgar scores are the primary delivery outcome descriptor in low-resource settings due to their simplicity, cost-effectiveness, and the potential financial burden of umbilical cord blood pH analysis. The 1-minute Apgar score was considered abnormal if the score was less than 7.
- A combined label of “abnormal” if either the Apgar score or the pH was abnormal.
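In code, these three labels reduce to simple threshold checks; the function and field names below are illustrative, but the thresholds (pH < 7.2, 1-minute Apgar < 7) are those described above.

```python
PH_THRESHOLD = 7.2      # arterial umbilical cord blood pH below this is abnormal
APGAR_THRESHOLD = 7     # 1-minute Apgar score below this is abnormal

def outcome_labels(ph: float, apgar_1min: int) -> dict:
    """Return the three binary outcome labels used as prediction targets."""
    ph_abnormal = ph < PH_THRESHOLD
    apgar_abnormal = apgar_1min < APGAR_THRESHOLD
    return {
        "ph_abnormal": ph_abnormal,
        "apgar_abnormal": apgar_abnormal,
        # Combined label: abnormal if either the pH or the Apgar criterion is met.
        "any_abnormal": ph_abnormal or apgar_abnormal,
    }
```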
Evaluating model prediction robustness
For evaluation purposes, we perform the following comparisons (see the sketch after this list):
- Our performance on the dataset versus the state-of-the-art CTG-net model
- Apgar versus pH classification tasks
- FHR-only versus FHR+UC
- Base model using last 30 minutes of labor (continuous case) versus intermittent measurements
- Base model of FHR+UC versus FHR+UC+Metadata
- Subgroup performance of the base model (FHR+UC) with subgroups determined by binarizing clinical metadata (e.g., low/high maternal age, low/high gestational weeks at birth, frequent/infrequent signal gaps)
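To make these comparisons concrete, the sketch below shows how a bootstrapped AUROC (mean and standard deviation over resampled test sets) can be computed, mirroring the 1,000-bootstrap-sample reporting in the table below; the function name and bootstrap logic are standard choices rather than our exact evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(y_true, y_score, n_boot=1000, seed=0):
    """Mean and standard deviation of AUROC over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        # Skip degenerate resamples that contain only one class.
        if len(np.unique(y_true[idx])) < 2:
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return float(np.mean(aurocs)), float(np.std(aurocs))
```

The same routine can be applied within each metadata-defined subgroup (e.g., low vs. high maternal age) to surface performance disparities.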
We find that our approach performs comparably to the AUROC reported for CTG-net, even though it is trained on a smaller dataset. When we train and evaluate both methods on the same dataset, our method improves performance by more than 10 AUROC percentage points.
| | AUROC | |
| --- | --- | --- |
| CTG-net* | 0.68 ± 0.03 | – |
| CTG-net (on the same dataset we used) | 0.57 ± 0.08 | – |
| Our Model† | 0.68 ± 0.07 | 0.27 (0.18) |
| Clinicians | – | 0.45 (95% CI: 0.23-0.68) |
This table compares the performance of the models and clinicians on CTG data for the binary classification task of abnormal fetal status (pH < 7.2 or 1-minute Apgar < 7). *Mean and standard deviation reported over 10 random training seeds. †Mean and standard deviation reported over 1,000 bootstrap samples.
We find that combining FHR+UC achieves the highest model performance for both pH and Apgar classification. For the intermittent training and evaluation tasks, the Apgar prediction task is less robust and more variable across the different time windows used for training, whereas prediction of the objective pH value is more stable. We also find that the pre-training step yields the highest model performance. Adding clinical metadata, such as maternal age and health status (e.g., pre-eclampsia or gestational diabetes), slightly improves model performance for pH, but less so for Apgar.
Subgroup evaluations
We found significant differences in baseline performance between subgroups with frequent and infrequent UC signal gaps (UC missing) for pH prediction, and between subgroups with frequent and infrequent FHR signal gaps (FHR missing) for Apgar prediction. Adding metadata mitigated the performance disparities observed for pH prediction. However, including metadata increased the AUROC disparities across demographic and clinical subgroups for this task.
An open CTG model for research use cases
We are currently exploring open-sourcing our models in the hope that other researchers and stakeholders can build on this work with their own datasets and evaluate the models for their clinical use cases, keeping in mind the limitations described below.
Limitations and future work
This study had limitations that constrain the generalizability of our findings. First, we used de-identified open-source CTGs from 552 patients at a single hospital in Prague, Czech Republic. To enhance the robustness of our findings, future investigations should involve a larger and more diverse dataset sourced from maternity centers worldwide, encompassing varied clinical contexts, demographics, and outcomes. Second, the absence of automated CTG digitization infrastructure in many resource-limited settings necessitated simulating intermittent CTG use cases from facilities with digitized recordings. Additionally, our study did not compare algorithmic performance against clinicians viewing the same dataset, so future research should explore different combinations of human and algorithmic interpretation. Finally, further work is needed to understand how such prediction algorithms can be optimally integrated into clinical workflows to improve neonatal outcomes.
Acknowledgements
We would like to acknowledge Dr. Kwaku Asah-Opoku who inspired this work. We would also like to acknowledge Nicole Chiou who worked passionately on this project during her internship at Google, and core contributors: Mercy Asiedu, Nichole Young-Lin, Christopher Kelly, Tiya Tiyasirichokchai, Abdoulaye Diack, Julie Cattiau, Sanmi Koyejo, and Katherine Heller. Thanks to Marian Croak for her support and leadership.