Samuel J. Yang

Samuel J. Yang is a research scientist on the Google Accelerated Science team. Prior to that, he got his Ph.D. from Stanford University, working on computational imaging & display and computational microscopy for systems neuroscience applications.
Authored Publications
    We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders and rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ∼94K utterances from 100 speakers. We further found the models to generalize well (without further training) on the TORGO database (100% accuracy), UASpeech (0.93 correlation), ALS-TDI PMP (0.81 AUC) datasets as well as on a dataset of realistic unprompted speech we gathered (106 dysarthric and 76 control speakers, ∼2300 samples).
    Understanding speech in the presence of noise with hearing aids can be challenging. Here we describe our entry, submission E003, to the 2021 Clarity Enhancement Challenge Round 1 (CEC1), a machine learning challenge for improving hearing aid processing. We apply and evaluate a deep neural network speech enhancement model with a low-latency recursive least squares (RLS) adaptive beamformer, and a linear equalizer, to improve speech intelligibility in the presence of speech or noise interferers. The enhancement network is trained only on the CEC1 data, and all processing obeys the 5 ms latency requirement. We quantify the improvement using the CEC1 provided hearing loss model and Modified Binaural Short-Time Objective Intelligibility (MBSTOI) score (ranging from 0 to 1, higher being better). On the CEC1 test set, we achieve a mean of 0.644 and median of 0.652 compared to the 0.310 mean and 0.314 median for the baseline. In the CEC1 subjective listener intelligibility assessment, for scenes with noise interferers, we achieve the second highest improvement in intelligibility from 33.2% to 85.5%, but for speech interferers, we see more mixed results, potentially from listener confusion.
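As a rough illustration of the recursive least squares (RLS) update named in the abstract above, the Python sketch below adapts a single-channel FIR filter to match a known reference system. It is not the submission's beamformer: the function name `rls_identify`, the forgetting factor, the initialization constant `delta`, and the toy 3-tap system are all assumptions made for illustration.

```python
import numpy as np

def rls_identify(x, d, num_taps, forgetting=0.99, delta=1e-2):
    """Adapt FIR weights w so that w . [x[n], x[n-1], ...] tracks d[n]."""
    w = np.zeros(num_taps)
    P = np.eye(num_taps) / delta      # running inverse-correlation estimate
    buf = np.zeros(num_taps)          # most recent inputs, newest first
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        Pu = P @ buf
        gain = Pu / (forgetting + buf @ Pu)
        err = d[n] - w @ buf          # a priori error, before the update
        w = w + gain * err
        P = (P - np.outer(gain, buf @ P)) / forgetting
    return w

# Toy system identification problem (hypothetical values).
rng = np.random.default_rng(0)
true_w = np.array([0.5, -0.3, 0.1])
x = rng.standard_normal(2000)
d = np.convolve(x, true_w)[: len(x)]  # noise-free desired signal
w_hat = rls_identify(x, d, num_taps=3)
```

Each sample costs a fixed amount of work and needs no look-ahead, which is why recursive updates of this kind suit strict latency budgets. A multi-microphone beamformer would apply the same kind of update to a different signal model.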
    Discovery of complex oxides via automated experiments and data science
    Joel A Haber
    Zan Armstrong
    Kevin Kan
    Lan Zhou
    Matthias H Richter
    Christopher Roat
    Nicholas Wagner
    Patrick Francis Riley
    John M Gregoire
    Proceedings of the National Academy of Sciences (2021)
    The quest to identify materials with tailored properties is increasingly expanding into high-order composition spaces, where materials discovery efforts have been met with the dual challenges of a combinatorial explosion in the number of candidate materials and a lack of predictive computation to guide experiments. The traditional approach to predictive materials science involves establishing a model that maps composition and structure to properties. We explore an inverse approach wherein a data science workflow uses high throughput measurements of optical properties to identify the composition spaces with interesting materials science. By identifying composition regions whose optical trends cannot be explained by trivial phase behavior, the data science pipeline identifies candidate combinations of elements that form 3-cation metal oxide phases. The identification of such novel phase behavior elevates the measurement of optical properties to the discovery of materials with complex phase-dependent properties. This conceptual workflow is illustrated with Co-Ta-Sn oxides wherein a new rutile alloy is discovered via data science guidance from the high throughput optical characterization. The composition-tuned properties of the rutile oxide alloys include transparency, catalytic activity, and stability in strong acid electrolytes. In addition to the unprecedented mapping of optical properties in 108 unique 3-cation oxide composition spaces, we present a critical discussion of coupling data validation to experiment design to generate a reliable end-to-end high throughput workflow for accelerating scientific discovery.
    Scanning Electron Microscopes (SEM) and Dual Beam Focused Ion Beam Microscopes (FIB-SEM) are essential tools used in the semiconductor industry and, in relation to this work, for wafer inspection in the production of hard drives at Seagate. These microscopes provide essential metrology during the build and help determine process bias and control. However, these microscopes will naturally drift out of focus over time, and if this is not immediately detected the consequences include incorrect measurements, scrap, wasted resources, tool down time and ultimately delays in production. This paper presents an automated solution that uses deep learning to remove anomalous images and determine the degree of blurriness for SEM and FIB-SEM images. Since its first deployment, the first of its kind at Seagate, it has replaced the need for manual inspection on the covered processes and mitigated delays in production, realizing return on investment in the order of millions of US dollars annually in both cost savings and cost avoidance. The proposed solution can be broken into two deep learning steps. First, we train a deep convolutional neural network, a RetinaNet object detector, to detect and locate a Region Of Interest (ROI) containing the main feature of the image. For the second step, we train another deep convolutional neural network using the ROI, to determine the sharpness of the image. The second model identifies focus level based on a training dataset consisting of synthetically degraded in-focus images, based on work by Google Research, achieving up to 99.3% test set accuracy.
    Profiling cellular phenotypes from microscopic imaging can provide meaningful biological information resulting from various factors affecting the cells. One motivating application is drug development: morphological cell features can be captured from images, from which similarities between different drug compounds applied at different doses can be quantified. The general approach is to find a function mapping the images to an embedding space of manageable dimensionality whose geometry captures relevant features of the input images. An important known issue for such methods is separating relevant biological signal from nuisance variation. For example, the embedding vectors tend to be more correlated for cells that were cultured and imaged during the same week than for those from different weeks, despite having identical drug compounds applied in both cases. In this case, the particular batch in which a set of experiments were conducted constitutes the domain of the data; an ideal set of image embeddings should contain only the relevant biological information (e.g., drug effects). We develop a general framework for adjusting the image embeddings in order to “forget” domain-specific information while preserving relevant biological information. To achieve this, we minimize a loss function based on distances between marginal distributions (such as the Wasserstein distance) of embeddings across domains for each replicated treatment. For the dataset we present results with, the only replicated treatment happens to be the negative control treatment, for which we do not expect any treatment-induced cell morphology changes. We find that for our transformed embeddings (i) the underlying geometric structure is not only preserved but the embeddings also carry improved biological signal; and (ii) less domain-specific information is present.
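To make the idea of a distance between marginal distributions of embeddings across domains concrete, here is a minimal Python sketch. It measures a per-dimension 1D Wasserstein distance between negative-control embeddings from two batches, then shows that a very simple correction (subtracting each batch's own control mean) shrinks it. The paper describes minimizing a loss over a transformation of the embeddings; the mean-centering step, the function names, and the synthetic data here are stand-in assumptions, not the authors' method.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1D samples."""
    a, b = np.sort(a), np.sort(b)
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))

def mean_marginal_distance(emb_a, emb_b):
    """Average the per-dimension distance over all embedding dimensions."""
    return float(np.mean([wasserstein_1d(emb_a[:, j], emb_b[:, j])
                          for j in range(emb_a.shape[1])]))

# Synthetic control-well embeddings from two batches (hypothetical data).
rng = np.random.default_rng(0)
batch_a = rng.standard_normal((500, 8))
batch_b = rng.standard_normal((500, 8)) + 1.5   # batch B carries an offset

before = mean_marginal_distance(batch_a, batch_b)
# Stand-in correction: remove each batch's own control mean.
after = mean_marginal_distance(batch_a - batch_a.mean(0),
                               batch_b - batch_b.mean(0))
```

With equal sample sizes, the 1D Wasserstein-1 distance reduces to the mean absolute difference between sorted samples, which keeps the sketch dependency-free. A learned transformation would be fitted to drive the same kind of quantity down while keeping treatment-related structure intact.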
    It's easy to fool yourself: Case studies on identifying bias and confounding in bio-medical datasets
    Arunachalam Narayanaswamy
    Anton Geraschenko
    Scott Lipnick
    Nina Makhortova
    James Hawrot
    Christine Marques
    Joao Pereira
    Lee Rubin
    Brian Wainger
    NeurIPS LMRL workshop 2019 (2019)
    Confounding variables are a well-known source of nuisance in biomedical studies. They present an even greater challenge when we combine them with black-box machine learning techniques that operate on raw data. This work presents two case studies. In one, we discovered biases arising from systematic errors in the data generation process. In the other, we found a spurious source of signal unrelated to the prediction task at hand. In both cases, our prediction models performed well but under careful examination hidden confounders and biases were revealed. These are cautionary tales on the limits of using machine learning techniques on raw data from scientific experiments.
    Applying Deep Neural Network Analysis to High-Content Image-Based Assays
    Scott L. Lipnick
    Nina R. Makhortova
    Minjie Fan
    Zan Armstrong
    Thorsten M. Schlaeger
    Liyong Deng
    Wendy K. Chung
    Liadan O'Callaghan
    Anton Geraschenko
    Dosh Whye
    Jon Hazard
    Arunachalam Narayanaswamy
    D. Michael Ando
    Lee L. Rubin
    SLAS DISCOVERY: Advancing Life Sciences R&D, 0 (2019), pp. 2472555219857715
    The etiological underpinnings of many CNS disorders are not well understood. This is likely due to the fact that individual diseases aggregate numerous pathological subtypes, each associated with a complex landscape of genetic risk factors. To overcome these challenges, researchers are integrating novel data types from numerous patients, including imaging studies capturing broadly applicable features from patient-derived materials. These datasets, when combined with machine learning, potentially hold the power to elucidate the subtle patterns that stratify patients by shared pathology. In this study, we interrogated whether high-content imaging of primary skin fibroblasts, using the Cell Painting method, could reveal disease-relevant information among patients. First, we showed that technical features such as batch/plate type, plate, and location within a plate lead to detectable nuisance signals, as revealed by a pre-trained deep neural network and analysis with deep image embeddings. Using a plate design and image acquisition strategy that accounts for these variables, we performed a pilot study with 12 healthy controls and 12 subjects affected by the severe genetic neurological disorder spinal muscular atrophy (SMA), and evaluated whether a convolutional neural network (CNN) generated using a subset of the cells could distinguish disease states on cells from the remaining unseen control–SMA pair. Our results indicate that these two populations could effectively be differentiated from one another and that model selectivity is insensitive to batch/plate type. One caveat is that the samples were also largely separated by source. These findings lay a foundation for how to conduct future studies exploring diseases with more complex genetic contributions and unknown subtypes.
    Assessing microscope image focus quality with deep learning
    D. Michael Ando
    Mariya Barch
    Arunachalam Narayanaswamy
    Eric Christiansen
    Chris Roat
    Jane Hung
    Curtis T. Rueden
    Asim Shankar
    Steven Finkbeiner
    BMC Bioinformatics, 19 (2018), pp. 77
    Background: Large image datasets acquired on automated microscopes typically have some fraction of low quality, out-of-focus images, despite the use of hardware autofocus systems. Identification of these images using automated image analysis with high accuracy is important for obtaining a clean, unbiased image dataset. Complicating this task is the fact that image focus quality is only well-defined in foreground regions of images, and as a result, most previous approaches only enable a computation of the relative difference in quality between two or more images, rather than an absolute measure of quality.
    Results: We present a deep neural network model capable of predicting an absolute measure of image focus on a single image in isolation, without any user-specified parameters. The model operates at the image-patch level, and also outputs a measure of prediction certainty, enabling interpretable predictions. The model was trained on only 384 in-focus Hoechst (nuclei) stain images of U2OS cells, which were synthetically defocused to one of 11 absolute defocus levels during training. The trained model can generalize on previously unseen real Hoechst stain images, identifying the absolute image focus to within one defocus level (approximately 3 pixel blur diameter difference) with 95% accuracy. On a simpler binary in/out-of-focus classification task, the trained model outperforms previous approaches on both Hoechst and Phalloidin (actin) stain images (F-scores of 0.89 and 0.86, respectively, over 0.84 and 0.83), despite only having been presented Hoechst stain images during training. Lastly, we observe qualitatively that the model generalizes to two additional stains, Hoechst and Tubulin, of an unseen cell type (Human MCF-7) acquired on a different instrument.
    Conclusions: Our deep neural network enables classification of out-of-focus microscope images with both higher accuracy and greater precision than previous approaches via interpretable patch-level focus and certainty predictions. The use of synthetically defocused images precludes the need for a manually annotated training dataset. The model also generalizes to different image and cell types. The framework for model training and image prediction is available as a free software library and the pre-trained model is available for immediate use in Fiji (ImageJ) and CellProfiler.
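The synthetic defocus idea in the abstract above can be sketched in a few lines of Python: start from an in-focus patch, blur it at several known levels, and use the level index as a free training label. This sketch assumes a simple disk (pillbox) blur kernel and wrap-around boundaries, which may differ from the point-spread model used in the paper. The function names, the random stand-in image, and the four blur diameters are hypothetical.

```python
import numpy as np

def disk_kernel(diameter):
    """Normalized pillbox kernel; diameter is an odd number of pixels."""
    r = diameter // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    k = (xx**2 + yy**2 <= r**2).astype(float)
    return k / k.sum()

def defocus(image, diameter):
    """Blur with circular (wrap-around) convolution computed via the FFT."""
    k = disk_kernel(diameter)
    padded = np.zeros_like(image, dtype=float)
    padded[:k.shape[0], :k.shape[1]] = k
    # Move the kernel center to the origin so the image is not shifted.
    padded = np.roll(padded, (-(k.shape[0] // 2), -(k.shape[1] // 2)),
                     axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))        # stand-in for an in-focus image patch
levels = [1, 5, 9, 13]              # hypothetical blur diameters, in pixels
stack = [defocus(sharp, d) for d in levels]
labels = list(range(len(levels)))   # defocus level index used as the label
contrast = [float(img.var()) for img in stack]
```

Because each blurred copy is generated from a known diameter, every training example comes with its label attached and no manual annotation is involved. The `contrast` list is only a sanity check that larger diameters remove more detail.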
    In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
    Eric Christiansen
    Mike Ando
    Ashkan Javaherian
    Gaia Skibinski
    Scott Lipnick
    Elliot Mount
    Alison O'Neil
    Kevan Shah
    Alicia K. Lee
    Piyush Goyal
    Liam Fedus
    Andre Esteva
    Lee Rubin
    Steven Finkbeiner
    Cell (2018)
    Imaging is a central method in life sciences, and the drive to extract information from microscopy approaches has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels to a few that can be resolved simultaneously. Here, we developed a deep learning computational approach called “in silico labeling (ISL)” that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, ISL provides biological insights from images of unlabeled samples for negligible additional cost that would be undesirable or impossible to measure directly.