Samuel J. Yang
Samuel J. Yang is a research scientist on the Google Accelerated Science team. Prior to that, he got his Ph.D. from Stanford University, working on computational imaging & display and computational microscopy for systems neuroscience applications.
Authored Publications
Speech Intelligibility Classifiers from 550k Disordered Speech Samples
Katie Seaver
Richard Cave
Neil Zeghidour
Rus Heywood
Jordan Green
ICASSP (2023)
Preview abstract
We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders and rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ∼94K utterances from 100 speakers. We further found the models to generalize well (without further training) to the TORGO database (100% accuracy), UASpeech (0.93 correlation), and ALS-TDI PMP (0.81 AUC) datasets, as well as to a dataset of realistic unprompted speech we gathered (106 dysarthric and 76 control speakers, ∼2300 samples).
Preview abstract
Understanding speech in the presence of noise with hearing aids can be challenging. Here we describe our entry, submission E003, to the 2021 Clarity Enhancement Challenge Round 1 (CEC1), a machine learning challenge for improving hearing aid processing. We apply and evaluate a deep neural network speech enhancement model with a low-latency recursive least squares (RLS) adaptive beamformer, and a linear equalizer, to improve speech intelligibility in the presence of speech or noise interferers. The enhancement network is trained only on the CEC1 data, and all processing obeys the 5 ms latency requirement. We quantify the improvement using the CEC1 provided hearing loss model and Modified Binaural Short-Time Objective Intelligibility (MBSTOI) score (ranging from 0 to 1, higher being better). On the CEC1 test set, we achieve a mean of 0.644 and median of 0.652 compared to the 0.310 mean and 0.314 median for the baseline. In the CEC1 subjective listener intelligibility assessment, for scenes with noise interferers, we achieve the second highest improvement in intelligibility, from 33.2% to 85.5%, but for speech interferers, we see more mixed results, potentially from listener confusion.
Discovery of complex oxides via automated experiments and data science
Joel A Haber
Zan Armstrong
Kevin Kan
Lan Zhou
Matthias H Richter
Christopher Roat
Nicholas Wagner
Patrick Francis Riley
John M Gregoire
Proceedings of the National Academy of Sciences (2021)
Preview abstract
The quest to identify materials with tailored properties is increasingly expanding into high-order composition spaces, where materials discovery efforts have been met with the dual challenges of a combinatorial explosion in the number of candidate materials and a lack of predictive computation to guide experiments. The traditional approach to predictive materials science involves establishing a model that maps composition and structure to properties. We explore an inverse approach wherein a data science workflow uses high throughput measurements of optical properties to identify the composition spaces with interesting materials science. By identifying composition regions whose optical trends cannot be explained by trivial phase behavior, the data science pipeline identifies candidate combinations of elements that form 3-cation metal oxide phases. The identification of such novel phase behavior elevates the measurement of optical properties to the discovery of materials with complex phase-dependent properties. This conceptual workflow is illustrated with Co-Ta-Sn oxides wherein a new rutile alloy is discovered via data science guidance from the high throughput optical characterization. The composition-tuned properties of the rutile oxide alloys include transparency, catalytic activity, and stability in strong acid electrolytes. In addition to the unprecedented mapping of optical properties in 108 unique 3-cation oxide composition spaces, we present a critical discussion of coupling data validation to experiment design to generate a reliable end-to-end high throughput workflow for accelerating scientific discovery.
Preview abstract
Scanning Electron Microscopes (SEM) and Dual Beam Focused Ion Beam Microscopes (FIB-SEM) are essential tools in the semiconductor industry and, in relation to this work, for wafer inspection in the production of hard drives at Seagate. These microscopes provide essential metrology during the build and help determine process bias and control. However, they naturally drift out of focus over time, and if the drift is not detected immediately, the consequences include incorrect measurements, scrap, wasted resources, tool downtime, and ultimately delays in production.
This paper presents an automated solution that uses deep learning to remove anomalous images and determine the degree of blurriness for SEM and FIB-SEM images. Since its first deployment, the first of its kind at Seagate, it has replaced the need for manual inspection on the covered processes and mitigated delays in production, realizing return on investment in the order of millions of US dollars annually in both cost savings and cost avoidance.
The proposed solution can be broken into two deep learning steps. First, we train a deep convolutional neural network, a RetinaNet object detector, to detect and locate a Region Of Interest (ROI) containing the main feature of the image. For the second step, we train another deep convolutional neural network using the ROI, to determine the sharpness of the image. The second model identifies focus level based on a training dataset consisting of synthetically degraded in-focus images, based on work by Google Research, achieving up to 99.3% test set accuracy.
Preview abstract
Profiling cellular phenotypes from microscopic imaging can provide meaningful biological information resulting from various factors affecting the cells. One motivating application is drug development: morphological cell features can be captured from images, from which similarities between different drug compounds applied at different doses can be quantified. The general approach is to find a function mapping the images to an embedding space of manageable dimensionality whose geometry captures relevant features of the input images. An important known issue for such methods is separating relevant biological signal from nuisance variation. For example, the embedding vectors tend to be more correlated for cells that were cultured and imaged during the same week than for those from different weeks, despite having identical drug compounds applied in both cases. In this case, the particular batch in which a set of experiments were conducted constitutes the domain of the data; an ideal set of image embeddings should contain only the relevant biological information (e.g., drug effects). We develop a general framework for adjusting the image embeddings in order to “forget” domain-specific information while preserving relevant biological information. To achieve this, we minimize a loss function based on distances between marginal distributions (such as the Wasserstein distance) of embeddings across domains for each replicated treatment. For the dataset we present results with, the only replicated treatment happens to be the negative control treatment, for which we do not expect any treatment-induced cell morphology changes. We find that for our transformed embeddings (i) the underlying geometric structure is not only preserved but the embeddings also carry improved biological signal; and (ii) less domain-specific information is present.
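To make the idea of a distribution-distance loss across domains concrete, here is a minimal sketch, not the framework from the abstract. It assumes equal-sized batches of toy control-treatment embeddings, measures the discrepancy as a sum of per-coordinate 1-D Wasserstein distances (a simplification of a full multivariate distance), and uses a fixed mean shift in place of a learned transformation. The names `wasserstein_1d`, `batch_discrepancy`, and `mean_shift` are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-sized 1-D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def batch_discrepancy(emb_a, emb_b):
    """Sum of per-coordinate W1 distances between two batches of embeddings."""
    dims = len(emb_a[0])
    return sum(
        wasserstein_1d([e[d] for e in emb_a], [e[d] for e in emb_b])
        for d in range(dims)
    )


def mean_shift(embeddings, offset):
    """Subtract a fixed per-coordinate offset (stand-in for a learned transform)."""
    return [[v - o for v, o in zip(e, offset)] for e in embeddings]


# Toy negative-control embeddings from two batches (assumed data).
# Batch B equals batch A plus a nuisance offset of (+1.0, -0.5).
batch_a = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
batch_b = [[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]]

before = batch_discrepancy(batch_a, batch_b)
after = batch_discrepancy(batch_a, mean_shift(batch_b, [1.0, -0.5]))
print(before, after)
```

In this toy case the batch offset is a pure translation, so removing it drives the discrepancy to zero; real nuisance variation is generally not this simple.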
It's easy to fool yourself: Case studies on identifying bias and confounding in bio-medical datasets
Arunachalam Narayanaswamy
Anton Geraschenko
Scott Lipnick
Nina Makhortova
James Hawrot
Christine Marques
Joao Pereira
Lee Rubin
Brian Wainger
NeurIPS LMRL workshop 2019 (2019)
Preview abstract
Confounding variables are a well known source of nuisance in biomedical studies. They present an even greater challenge when we combine them with black-box machine learning techniques that operate on raw data. This work presents two case studies. In one, we discovered biases arising from systematic errors in the data generation process. In the other, we found a spurious source of signal unrelated to the prediction task at hand. In both cases, our prediction models performed well but under careful examination hidden confounders and biases were revealed. These are cautionary tales on the limits of using machine learning techniques on raw data from scientific experiments.
Applying Deep Neural Network Analysis to High-Content Image-Based Assays
Scott L. Lipnick
Nina R. Makhortova
Minjie Fan
Zan Armstrong
Thorsten M. Schlaeger
Liyong Deng
Wendy K. Chung
Liadan O'Callaghan
Anton Geraschenko
Dosh Whye
Jon Hazard
Arunachalam Narayanaswamy
D. Michael Ando
Lee L. Rubin
SLAS DISCOVERY: Advancing Life Sciences R&D, 0 (2019), pp. 2472555219857715
Preview abstract
The etiological underpinnings of many CNS disorders are not well understood. This is likely due to the fact that individual diseases aggregate numerous pathological subtypes, each associated with a complex landscape of genetic risk factors. To overcome these challenges, researchers are integrating novel data types from numerous patients, including imaging studies capturing broadly applicable features from patient-derived materials. These datasets, when combined with machine learning, potentially hold the power to elucidate the subtle patterns that stratify patients by shared pathology. In this study, we interrogated whether high-content imaging of primary skin fibroblasts, using the Cell Painting method, could reveal disease-relevant information among patients. First, we showed that technical features such as batch/plate type, plate, and location within a plate lead to detectable nuisance signals, as revealed by a pre-trained deep neural network and analysis with deep image embeddings. Using a plate design and image acquisition strategy that accounts for these variables, we performed a pilot study with 12 healthy controls and 12 subjects affected by the severe genetic neurological disorder spinal muscular atrophy (SMA), and evaluated whether a convolutional neural network (CNN) generated using a subset of the cells could distinguish disease states on cells from the remaining unseen control–SMA pair. Our results indicate that these two populations could effectively be differentiated from one another and that model selectivity is insensitive to batch/plate type. One caveat is that the samples were also largely separated by source. These findings lay a foundation for how to conduct future studies exploring diseases with more complex genetic contributions and unknown subtypes.
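The held-out evaluation described above, training on some subjects and testing on an unseen control-SMA pair, can be sketched as a leave-one-pair-out split. This is an illustration only: how subjects were actually paired is not stated here, so the pairing below and the names `leave_one_pair_out`, `control_i`, and `sma_i` are assumptions.

```python
def leave_one_pair_out(pairs):
    """Yield (training_pairs, held_out_pair) for each control-SMA pair.

    All cells from the held-out pair's two subjects would be excluded
    from training, so the model is always tested on unseen subjects.
    """
    for i, held_out in enumerate(pairs):
        yield pairs[:i] + pairs[i + 1:], held_out


# 12 healthy controls and 12 SMA subjects, paired by index (assumed pairing).
subject_pairs = [(f"control_{i}", f"sma_{i}") for i in range(12)]
splits = list(leave_one_pair_out(subject_pairs))

for train, held_out in splits[:2]:
    print(len(train), "training pairs; held out:", held_out)
```

Splitting by subject pair rather than by individual cell keeps cells from the same subject out of both training and test sets, which matters when nuisance signals such as plate or source are detectable.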
Assessing microscope image focus quality with deep learning
D. Michael Ando
Mariya Barch
Arunachalam Narayanaswamy
Eric Christiansen
Chris Roat
Jane Hung
Curtis T. Rueden
Asim Shankar
Steven Finkbeiner
BMC Bioinformatics, 19 (2018), pp. 77
Preview abstract
Background: Large image datasets acquired on automated microscopes typically have some fraction of low quality, out-of-focus images, despite the use of hardware autofocus systems. Identification of these images using automated image analysis with high accuracy is important for obtaining a clean, unbiased image dataset. Complicating this task is the fact that image focus quality is only well-defined in foreground regions of images, and as a result, most previous approaches only enable a computation of the relative difference in quality between two or more images, rather than an absolute measure of quality.
Results: We present a deep neural network model capable of predicting an absolute measure of image focus on a single image in isolation, without any user-specified parameters. The model operates at the image-patch level, and also outputs a measure of prediction certainty, enabling interpretable predictions. The model was trained on only 384 in-focus Hoechst (nuclei) stain images of U2OS cells, which were synthetically defocused to one of 11 absolute defocus levels during training. The trained model can generalize to previously unseen real Hoechst stain images, identifying the absolute image focus to within one defocus level (approximately 3 pixel blur diameter difference) with 95% accuracy. On a simpler binary in/out-of-focus classification task, the trained model outperforms previous approaches on both Hoechst and Phalloidin (actin) stain images (F-scores of 0.89 and 0.86, respectively, over 0.84 and 0.83), despite only having been presented Hoechst stain images during training. Lastly, we observe qualitatively that the model generalizes to two additional stains, Hoechst and Tubulin, of an unseen cell type (Human MCF-7) acquired on a different instrument.
Conclusions: Our deep neural network enables classification of out-of-focus microscope images with both higher accuracy and greater precision than previous approaches via interpretable patch-level focus and certainty predictions. The use of synthetically defocused images precludes the need for a manually annotated training dataset. The model also generalizes to different image and cell types. The framework for model training and image prediction is available as a free software library and the pre-trained model is available for immediate use in Fiji (ImageJ) and CellProfiler.
In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
Eric Christiansen
Mike Ando
Ashkan Javaherian
Gaia Skibinski
Scott Lipnick
Elliot Mount
Alison O'Neil
Kevan Shah
Alicia K. Lee
Piyush Goyal
Liam Fedus
Andre Esteva
Lee Rubin
Steven Finkbeiner
Cell (2018)
Preview abstract
Imaging is a central method in life sciences, and the drive to extract information from microscopy approaches has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels to a few that can be resolved simultaneously. Here, we developed a deep learning computational approach called “in silico labeling (ISL)” that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, ISL provides biological insights from images of unlabeled samples for negligible additional cost that would be undesirable or impossible to measure directly.