Marc Berndl
Marc Berndl has been at Google since 2005, holds a Master's degree in Computer Science from McGill University, and is the Engineering Lead for Google Accelerated Science (GAS). He spent eight years in Ads working on auction theory, data analysis, and experimental design. Within GAS, Marc has established ongoing research efforts in materials science, biochemistry, cell biology, and drug screening.
His current research includes predictive model semantics, solar thermal energy optimization, aptamer design, and methods for detecting, localizing, and quantifying cellular proteins.
Authored Publications
Longitudinal fundus imaging and its genome-wide association analysis provides evidence for a human retinal aging clock
Sara Ahadi
Kenneth A Wilson Jr
Orion Pritchard
Ajay Kumar
Enrique M Carrera
Ricardo Lamy
Jay M Stewart
Avinash Varadarajan
Pankaj Kapahi
Ali Bashir
eLife (2023)
Abstract
Background
Biological age, distinct from an individual's chronological age, has been studied extensively through predictive aging clocks. However, these clocks have limited accuracy on short time-scales. Deep learning approaches on imaging datasets of the eye have proven powerful for a variety of quantitative phenotype inference tasks and provide an opportunity to explore organismal aging and tissue health.
Methods
Here we trained deep learning models on fundus images from the EyePACS dataset to predict individuals' chronological age. These predictions led to the concept of a retinal aging clock, which we then employed for a series of downstream longitudinal analyses. The retinal aging clock was used to assess the predictive power of aging inference, termed eyeAge, on short time-scales using longitudinal fundus imaging data from a subset of patients. Additionally, the model was applied to a separate cohort from the UK Biobank for validation and to perform a GWAS. The top candidate gene was then tested in a fly model of eye aging.
Findings
EyeAge predicted chronological age with a mean absolute error of 3.26 years, considerably lower than that of other aging clocks. Additionally, eyeAge was highly independent of blood-marker-based measures of biological age (e.g. “phenotypic age”), maintaining a hazard ratio of 1.026 even in the presence of phenotypic age. Longitudinal analyses showed that the resulting models could predict individuals' aging on time-scales of less than a year with 71% accuracy. Notably, we observed a significant individual-specific component to the prediction. This observation was confirmed by the identification of multiple GWAS hits in the independent UK Biobank cohort. Knockdown of the top hit, ALKAL2, which was previously shown to extend lifespan in flies, also slowed age-related decline in vision in flies.
Interpretation
In conclusion, predicted age from retinal images can be used as a biomarker of biological aging in a given individual, independently of phenotypic age. This study demonstrates the utility of a retinal aging clock for studying aging and age-related diseases and for quantitatively measuring aging on very short time-scales, potentially opening avenues for quick and actionable evaluation of gero-protective therapeutics.
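A central quantity in this kind of analysis is the gap between predicted retinal age and chronological age. The sketch below is not code from the study; the residual construction and the toy values are illustrative assumptions showing one common way an "eyeAge acceleration" could be computed once a trained regression model has produced age predictions.

```python
import numpy as np

def eye_age_acceleration(chronological_age, predicted_age):
    """Residual of predicted retinal age after regressing out chronological age.

    A positive value means the retina appears older than expected for the
    person's chronological age. This mirrors the usual "age acceleration"
    construction for aging clocks; it is an illustration, not the paper's code.
    """
    chronological_age = np.asarray(chronological_age, dtype=float)
    predicted_age = np.asarray(predicted_age, dtype=float)
    # Fit predicted_age ~ slope * chronological_age + intercept by least squares.
    slope, intercept = np.polyfit(chronological_age, predicted_age, deg=1)
    expected = slope * chronological_age + intercept
    return predicted_age - expected

# Toy example with synthetic values (not data from the study).
ages = np.array([45.0, 52.0, 63.0, 70.0])
preds = np.array([48.1, 50.3, 66.0, 69.2])
print(eye_age_acceleration(ages, preds))
```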
ProtSeq: towards high-throughput, single-molecule protein sequencing via amino acid conversion into DNA barcodes
Jessica Hong
Michael Connor Gibbons
Ali Bashir
Diana Wu
Shirley Shao
Zachary Cutts
Mariya Chavarha
Ye Chen
Lauren Schiff
Mikelle Foster
Victoria Church
Llyke Ching
Sara Ahadi
Anna Hieu-Thao Le
Alexander Tran
Michelle Therese Dimon
Phillip Jess
iScience, 25 (2022), pp. 32
Abstract
We demonstrate early progress toward constructing a high-throughput, single-molecule protein sequencing technology utilizing barcoded DNA aptamers (binders) to recognize terminal amino acids of peptides (targets) tethered on a next-generation sequencing chip. DNA binders deposit unique, amino acid identifying barcodes on the chip. The end goal is that over multiple binding cycles, a sequential chain of DNA barcodes will identify the amino acid sequence of a peptide. Toward this, we demonstrate successful target identification with two sets of target-binder pairs: DNA-DNA and Peptide-Protein. For DNA-DNA binding, we show assembly and sequencing of DNA barcodes over 6 consecutive binding cycles. Intriguingly, our computational simulation predicts that a small set of semi-selective DNA binders offers significant coverage of the human proteome. Toward this end, we introduce a binder discovery pipeline that ultimately could merge with the chip assay into a technology called ProtSeq, for future high-throughput, single-molecule protein sequencing.
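The simulation result mentioned above, that a small set of semi-selective binders can offer significant proteome coverage, can be illustrated with a toy calculation. In the sketch below the binder groupings, peptide lengths, and random "proteome" are all made-up assumptions; it demonstrates only the counting idea, not the paper's actual simulation.

```python
import collections
import random

# Hypothetical semi-selective binders: each recognizes a group of amino acids
# rather than a single one (groupings chosen purely for illustration).
BINDER_GROUPS = {
    "hydrophobic": set("AVLIMFWY"),
    "polar": set("STNQCG"),
    "charged+": set("KRH"),
    "charged-": set("DE"),
    "proline": set("P"),
}

def barcode(peptide, cycles=6):
    """Label string a peptide would accumulate over `cycles` binding cycles,
    assuming each cycle deposits the group label of the current terminal residue."""
    labels = []
    for residue in peptide[:cycles]:
        for name, group in BINDER_GROUPS.items():
            if residue in group:
                labels.append(name)
                break
    return "-".join(labels)

# Toy stand-in for a proteome: random peptides (not real sequences).
random.seed(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
peptides = ["".join(random.choices(AMINO_ACIDS, k=12)) for _ in range(10000)]

codes = [barcode(p) for p in peptides]
counts = collections.Counter(codes)
unique = sum(1 for c in codes if counts[c] == 1)
print(f"{unique / len(peptides):.1%} of peptides receive a unique 6-cycle barcode")
```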
Machine learning guided aptamer discovery
Ali Bashir
Geoff Davis
Michelle Therese Dimon
Qin Yang
Scott Ferguson
Zan Armstrong
Nature Communications (2021)
Abstract
Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to predict affinity in silico. These predictions were used both to locally improve physically derived aptamers and to identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally, and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents.
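The "local improvement" step described above is essentially model-guided hill-climbing over single-base mutations. A minimal sketch follows; the toy_affinity_model function is a hypothetical stand-in for the particle-display-trained neural network, and the seed sequence is arbitrary.

```python
import itertools

BASES = "ACGT"

def toy_affinity_model(seq):
    """Stand-in for a trained neural network affinity predictor.

    Scores sequences by GC content plus a bonus for a fixed motif; this exists
    only to make the sketch runnable and is not a real binding model.
    """
    gc = sum(1 for b in seq if b in "GC") / len(seq)
    motif_bonus = 0.5 if "GGTT" in seq else 0.0
    return gc + motif_bonus

def improve_locally(seed, score, rounds=5):
    """Greedy hill-climbing over single-base mutations, as one might do when
    using model predictions to locally improve a physically derived aptamer."""
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        improved = False
        for i, new_base in itertools.product(range(len(best)), BASES):
            if new_base == best[i]:
                continue
            candidate = best[:i] + new_base + best[i + 1:]
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if not improved:
            break
    return best, best_score

seed = "ACGTACGTACGTACGTACGT"
print(improve_locally(seed, toy_affinity_model))
```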
Discovery of complex oxides via automated experiments and data science
Joel A Haber
Zan Armstrong
Kevin Kan
Lan Zhou
Matthias H Richter
Christopher Roat
Nicholas Wagner
Patrick Francis Riley
John M Gregoire
Proceedings of the National Academy of Sciences (2021)
Abstract
The quest to identify materials with tailored properties is increasingly expanding into high-order composition spaces, where materials discovery efforts have been met with the dual challenges of a combinatorial explosion in the number of candidate materials and a lack of predictive computation to guide experiments. The traditional approach to predictive materials science involves establishing a model that maps composition and structure to properties. We explore an inverse approach wherein a data science workflow uses high throughput measurements of optical properties to identify the composition spaces with interesting materials science. By identifying composition regions whose optical trends cannot be explained by trivial phase behavior, the data science pipeline identifies candidate combinations of elements that form 3-cation metal oxide phases. The identification of such novel phase behavior elevates the measurement of optical properties to the discovery of materials with complex phase-dependent properties. This conceptual workflow is illustrated with Co-Ta-Sn oxides wherein a new rutile alloy is discovered via data science guidance from the high throughput optical characterization. The composition-tuned properties of the rutile oxide alloys include transparency, catalytic activity, and stability in strong acid electrolytes. In addition to the unprecedented mapping of optical properties in 108 unique 3-cation oxide composition spaces, we present a critical discussion of coupling data validation to experiment design to generate a reliable end-to-end high throughput workflow for accelerating scientific discovery.
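One way to operationalize "optical trends that cannot be explained by trivial phase behavior" is a residual test: fit each composition's spectrum as a non-negative combination of end-member spectra and flag compositions with large residuals. The sketch below uses synthetic spectra and SciPy's non-negative least squares; it illustrates the concept only and is not the paper's pipeline.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_wavelengths = 200

# Synthetic end-member spectra standing in for single-cation oxides (illustrative).
endmembers = np.abs(rng.normal(size=(3, n_wavelengths))).cumsum(axis=1)
endmembers /= endmembers.max(axis=1, keepdims=True)

def mixing_residual(spectrum, endmembers):
    """Relative residual of the best non-negative linear fit of `spectrum` by the
    end-member spectra. Large values suggest optical behavior not explained by
    simple phase mixing."""
    _, resid = nnls(endmembers.T, spectrum)
    return resid / np.linalg.norm(spectrum)

# One spectrum that *is* a mixture, and one with an extra absorption feature.
mixture = 0.6 * endmembers[0] + 0.4 * endmembers[2]
novel = mixture.copy()
novel[80:110] += 0.8  # new absorption band absent from every end member

for name, spec in [("mixture", mixture), ("novel", novel)]:
    print(name, round(mixing_residual(spec, endmembers), 3))
```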
Abstract
We present IDEA (the Induction Dynamics gene Expression Atlas), a dataset constructed by independently inducing hundreds of transcription factors (TFs) and measuring timecourses of the resulting gene expression responses in budding yeast. Each experiment captures a regulatory cascade connecting a single induced regulator to the genes it causally regulates. We discuss the regulatory cascade of a single TF, Aft1, in detail; however, IDEA contains >200 TF induction experiments with 20 million individual observations and 100,000 signal-containing dynamic responses. As an application of IDEA, we integrate all timecourses into a whole-cell transcriptional model, which is used to predict and validate multiple new and underappreciated transcriptional regulators. We also find that the magnitudes of coefficients in this model are predictive of genetic interaction profile similarities. In addition to being a resource for exploring regulatory connectivity between TFs and their target genes, our modeling approach shows that combining rapid perturbations of individual genes with genome-scale time-series measurements is an effective strategy for elucidating gene regulatory networks.
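A minimal sketch of the kind of whole-cell linear model described above, in which gene responses are regressed on induced-TF activity profiles and coefficient magnitudes are read as candidate regulatory links, is given below. The data are synthetic and the ridge-regression formulation is an assumption, not the authors' model.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_timepoints, n_tfs, n_genes = 60, 25, 40

# Synthetic TF activity timecourses (rows: timepoints, columns: induced TFs).
tf_activity = rng.normal(size=(n_timepoints, n_tfs))

# Synthetic gene responses generated from a sparse "true" regulatory matrix.
true_weights = rng.normal(size=(n_tfs, n_genes)) * (rng.random((n_tfs, n_genes)) < 0.1)
expression = tf_activity @ true_weights + 0.1 * rng.normal(size=(n_timepoints, n_genes))

# Fit one regularized linear model per gene: expression ~ TF activities.
model = Ridge(alpha=1.0).fit(tf_activity, expression)
coef = model.coef_  # shape (n_genes, n_tfs)

# Rank candidate regulators of gene 0 by coefficient magnitude.
gene = 0
top = np.argsort(-np.abs(coef[gene]))[:3]
print("top candidate regulators of gene 0:", top, coef[gene, top].round(2))
```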
It's easy to fool yourself: Case studies on identifying bias and confounding in bio-medical datasets
Arunachalam Narayanaswamy
Anton Geraschenko
Scott Lipnick
Nina Makhortova
James Hawrot
Christine Marques
Joao Pereira
Lee Rubin
Brian Wainger
NeurIPS LMRL Workshop (2019)
Abstract
Confounding variables are a well-known source of nuisance in biomedical studies. They present an even greater challenge when combined with black-box machine learning techniques that operate on raw data. This work presents two case studies. In one, we discovered biases arising from systematic errors in the data generation process. In the other, we found a spurious source of signal unrelated to the prediction task at hand. In both cases, our prediction models performed well, but careful examination revealed hidden confounders and biases. These are cautionary tales on the limits of applying machine learning techniques to raw data from scientific experiments.
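A simple diagnostic in the spirit of these case studies is to ask whether a model can predict a nuisance variable (such as batch) from the learned representation; if it can, the representation is confounded. The sketch below uses synthetic embeddings with a deliberately leaked batch signal to illustrate the check; nothing here comes from the paper's datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 300, 50

# Synthetic embeddings with a deliberate batch offset baked in.
batch = rng.integers(0, 3, size=n_samples)          # nuisance variable
features = rng.normal(size=(n_samples, n_features))
features[:, 0] += 2.0 * batch                        # leaked batch signal

# If a simple classifier predicts the nuisance variable well above chance,
# the representation is confounded and downstream results may be suspect.
scores = cross_val_score(LogisticRegression(max_iter=1000), features, batch, cv=5)
print(f"batch-prediction accuracy: {scores.mean():.2f} (chance is about 0.33)")
```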
Applying Deep Neural Network Analysis to High-Content Image-Based Assays
Scott L. Lipnick
Nina R. Makhortova
Minjie Fan
Zan Armstrong
Thorsten M. Schlaeger
Liyong Deng
Wendy K. Chung
Liadan O'Callaghan
Anton Geraschenko
Dosh Whye
Jon Hazard
Arunachalam Narayanaswamy
D. Michael Ando
Lee L. Rubin
SLAS DISCOVERY: Advancing Life Sciences R&D (2019)
Abstract
The etiological underpinnings of many CNS disorders are not well understood. This is likely due to the fact that individual diseases aggregate numerous pathological subtypes, each associated with a complex landscape of genetic risk factors. To overcome these challenges, researchers are integrating novel data types from numerous patients, including imaging studies capturing broadly applicable features from patient-derived materials. These datasets, when combined with machine learning, potentially hold the power to elucidate the subtle patterns that stratify patients by shared pathology. In this study, we interrogated whether high-content imaging of primary skin fibroblasts, using the Cell Painting method, could reveal disease-relevant information among patients. First, we showed that technical features such as batch/plate type, plate, and location within a plate lead to detectable nuisance signals, as revealed by a pre-trained deep neural network and analysis with deep image embeddings. Using a plate design and image acquisition strategy that accounts for these variables, we performed a pilot study with 12 healthy controls and 12 subjects affected by the severe genetic neurological disorder spinal muscular atrophy (SMA), and evaluated whether a convolutional neural network (CNN) generated using a subset of the cells could distinguish disease states on cells from the remaining unseen control–SMA pair. Our results indicate that these two populations could effectively be differentiated from one another and that model selectivity is insensitive to batch/plate type. One caveat is that the samples were also largely separated by source. These findings lay a foundation for how to conduct future studies exploring diseases with more complex genetic contributions and unknown subtypes.
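The evaluation design described above, training on some subjects and testing on an unseen control-SMA pair, is a form of grouped cross-validation. The sketch below illustrates it with scikit-learn's LeaveOneGroupOut, using synthetic features in place of CNN embeddings and a generic classifier; none of the numbers correspond to the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_pairs, cells_per_subject, n_features = 12, 20, 64

features, labels, groups = [], [], []
for pair in range(n_pairs):                 # each pair: one control, one SMA subject
    for disease in (0, 1):
        offset = 0.6 * disease              # synthetic disease signal
        x = rng.normal(size=(cells_per_subject, n_features)) + offset
        features.append(x)
        labels += [disease] * cells_per_subject
        groups += [pair] * cells_per_subject
features = np.vstack(features)

# Hold out one control-SMA pair at a time, so test subjects are never seen in training.
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         features, labels, groups=groups, cv=LeaveOneGroupOut())
print(f"held-out pair accuracy: {np.mean(scores):.2f}")
```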
In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
Eric Christiansen
Mike Ando
Ashkan Javaherian
Gaia Skibinski
Scott Lipnick
Elliot Mount
Alison O'Neil
Kevan Shah
Alicia K. Lee
Piyush Goyal
Liam Fedus
Andre Esteva
Lee Rubin
Steven Finkbeiner
Cell (2018)
Abstract
Imaging is a central method in life sciences, and the drive to extract information from microscopy approaches has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels to a few that can be resolved simultaneously. Here, we developed a deep learning computational approach called “in silico labeling (ISL)” that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, for negligible additional cost, ISL extracts biological information from images of unlabeled samples that would otherwise be undesirable or impossible to measure directly.
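At its core, the task is image-to-image regression: a transmitted-light z-stack in, predicted fluorescence channels out. The toy model below only illustrates that input/output structure; the actual ISL network is far deeper and predicts per-pixel distributions rather than a mean-squared-error point estimate, and the slice and channel counts here are assumptions.

```python
import tensorflow as tf

# Toy image-to-image regressor: transmitted-light z-stack in, predicted
# fluorescence channels out. Illustrative only; not the ISL architecture.
Z_SLICES, N_LABELS = 13, 3   # assumed numbers of z-slices and label channels

inputs = tf.keras.Input(shape=(None, None, Z_SLICES))
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
outputs = tf.keras.layers.Conv2D(N_LABELS, 1, padding="same")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()

# Training would pair transmitted-light stacks with registered fluorescence
# images, e.g.:
# model.fit(transmitted_stacks, fluorescence_images, epochs=..., batch_size=...)
```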
Assessing microscope image focus quality with deep learning
D. Michael Ando
Mariya Barch
Arunachalam Narayanaswamy
Eric Christiansen
Chris Roat
Jane Hung
Curtis T. Rueden
Asim Shankar
Steven Finkbeiner
BMC Bioinformatics, 19 (2018), pp. 77
Abstract
Background: Large image datasets acquired on automated microscopes typically have some fraction of low quality, out-of-focus images, despite the use of hardware autofocus systems. Identification of these images using automated image analysis with high accuracy is important for obtaining a clean, unbiased image dataset. Complicating this task is the fact that image focus quality is only well-defined in foreground regions of images, and as a result, most previous approaches only enable a computation of the relative difference in quality between two or more images, rather than an absolute measure of quality.
Results: We present a deep neural network model capable of predicting an absolute measure of image focus on a single image in isolation, without any user-specified parameters. The model operates at the image-patch level, and also outputs a measure of prediction certainty, enabling interpretable predictions. The model was trained on only 384 in-focus Hoechst (nuclei) stain images of U2OS cells, which were synthetically defocused to one of 11 absolute defocus levels during training. The trained model generalizes to previously unseen real Hoechst stain images, identifying the absolute image focus to within one defocus level (approximately 3 pixel blur diameter difference) with 95% accuracy. On a simpler binary in/out-of-focus classification task, the trained model outperforms previous approaches on both Hoechst and Phalloidin (actin) stain images (F-scores of 0.89 and 0.86, respectively, versus 0.84 and 0.83), despite having been presented only Hoechst stain images during training. Lastly, we observe qualitatively that the model generalizes to two additional stains, Hoechst and Tubulin, of an unseen cell type (human MCF-7) acquired on a different instrument.
Conclusions: Our deep neural network enables classification of out-of-focus microscope images with both higher accuracy and greater precision than previous approaches via interpretable patch-level focus and certainty predictions. The use of synthetically defocused images precludes the need for a manually annotated training dataset. The model also generalizes to different image and cell types. The framework for model training and image prediction is available as a free software library and the pre-trained model is available for immediate use in Fiji (ImageJ) and CellProfiler.
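The synthetic-defocus training strategy can be sketched compactly: blur in-focus images to one of a fixed number of defocus levels and label image patches with that level. The blur sigmas, patch size, and random images below are illustrative assumptions rather than the paper's calibration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def synthetic_defocus_dataset(in_focus_images, n_levels=11, max_sigma=5.0):
    """Turn in-focus images into (patch, defocus-level) training pairs by
    blurring each image to one of `n_levels` absolute defocus levels.
    Sigma values and the patch size are illustrative choices."""
    sigmas = np.linspace(0.0, max_sigma, n_levels)
    patches, labels = [], []
    for img in in_focus_images:
        level = rng.integers(0, n_levels)
        blurred = gaussian_filter(img, sigma=sigmas[level])
        for i in range(0, img.shape[0] - 84 + 1, 84):
            for j in range(0, img.shape[1] - 84 + 1, 84):
                patches.append(blurred[i:i + 84, j:j + 84])
                labels.append(level)
    return np.stack(patches), np.array(labels)

# Toy "in-focus" images standing in for Hoechst-stained nuclei.
images = [rng.random((168, 168)) for _ in range(4)]
X, y = synthetic_defocus_dataset(images)
print(X.shape, np.bincount(y, minlength=11))
```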
Abstract
Image-based screening is a powerful technique to reveal how chemical, genetic, and environmental perturbations affect cellular state. Its potential is restricted by the current analysis algorithms that target a small number of cellular phenotypes and rely on expert-engineered image features. Newer algorithms that learn how to represent an image are limited by the small amount of labeled data for ground-truth, a common problem for scientific projects. We demonstrate a sensitive and robust method for distinguishing cellular phenotypes that requires no additional ground-truth data or training. It achieves state-of-the-art performance classifying drugs by similar molecular mechanism, using a Deep Metric Network that has been pre-trained on consumer images and a transformation that improves sensitivity to biological variation. However, our method is not limited to classification into predefined categories. It provides a continuous measure of the similarity between cellular phenotypes that can also detect subtle differences such as from increasing dose. The rich, biologically-meaningful image representation that our method provides can help therapy development by supporting high-throughput investigations, even exploratory ones, with more sophisticated and disease-relevant models.
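A rough sketch of the profile-similarity idea follows: per-well embeddings are normalized against control wells (a simple stand-in for the transformation mentioned above), averaged per treatment, and compared by cosine similarity, so a compound's predicted mechanism is that of its nearest neighbor. The data and the normalization below are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def normalize_to_controls(embeddings, control_embeddings):
    """Center and scale embeddings using control wells, a simple stand-in for a
    transformation that emphasizes biological rather than technical variation."""
    mu = control_embeddings.mean(axis=0)
    sd = control_embeddings.std(axis=0) + 1e-8
    return (embeddings - mu) / sd

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic per-well embeddings: controls plus two mechanisms, two drugs each.
controls = rng.normal(size=(50, dim))
mechanism_dirs = {"MoA_A": rng.normal(size=dim), "MoA_B": rng.normal(size=dim)}
profiles = {}
for moa, direction in mechanism_dirs.items():
    for drug in (f"{moa}_drug1", f"{moa}_drug2"):
        wells = rng.normal(size=(8, dim)) + direction
        profiles[drug] = normalize_to_controls(wells, controls).mean(axis=0)

# Leave-one-compound-out: predict each drug's mechanism from its nearest neighbor.
for drug, profile in profiles.items():
    others = {d: p for d, p in profiles.items() if d != drug}
    nearest = max(others, key=lambda d: cosine(profile, others[d]))
    print(drug, "->", nearest)
```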