Stephan Hoyer

Authored Publications
    WeatherBench 2: A benchmark for the next generation of data-driven global weather models
    Alex Merose
    Peter Battaglia
    Tyler Russell
    Alvaro Sanchez
    Vivian Yang
    Rob Carver
    Matthew Chantry
    Zied Ben Bouallegue
    Peter Dueben
    Carla Bromberg
    Jared Sisk
    Luke Barrington
    Aaron Bell
    arXiv (2023) (to appear)
    Abstract: WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed with the aim to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework, publicly available training, ground truth and baseline data as well as a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. In addition, we also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting.
    Machine learning guided aptamer discovery
    Ali Bashir
    Geoff Davis
    Michelle Therese Dimon
    Qin Yang
    Scott Ferguson
    Zan Armstrong
    Nature Communications (2021)
    Abstract: Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to improve physically-derived aptamers and predict affinity in silico. These predictions were used to locally improve physically derived aptamers as well as identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation, and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents.
    Kohn-Sham equations as regularizer: building prior knowledge into machine-learned physics
    Li Li
    Ryan Pederson
    Patrick Francis Riley
    Kieron Burke
    Phys. Rev. Lett., vol. 126 (2021), pp. 036401
    Abstract: Including prior knowledge is important for effective machine learning models in physics and is usually achieved by explicitly adding loss terms or constraints on model architectures. Prior knowledge embedded in the physics computation itself rarely draws attention. We show that solving the Kohn-Sham equations when training neural networks for the exchange-correlation functional provides an implicit regularization that greatly improves generalization. Two separations suffice for learning the entire one-dimensional H$_2$ dissociation curve within chemical accuracy, including the strongly correlated region. Our models also generalize to unseen types of molecules and overcome self-interaction error.
    Distributed Data Processing for Large-Scale Simulations on Cloud
    Lily Hu
    Yi-fan Chen
    Abstract: In this work, we propose a distributed data pipeline for large-scale simulations that uses libraries and frameworks available on Cloud services. The data pipeline is designed with careful consideration of the characteristics of the simulation data, and is implemented with Apache Beam and Zarr. Beam is a unified, open-source programming model for building both batch- and streaming-data parallel-processing pipelines. By using Beam, one can simply focus on the logical composition of the data processing task and bypass the low-level details of distributed computing. The orchestration of distributed processing is fully managed by the runner, which in this work is Dataflow on Google Cloud. Beam separates the programming layer from the runtime layer such that the proposed pipeline can be executed across various runners. The output tensor of the data pipeline is stored in the Zarr format. Zarr allows concurrent reading and writing, storage on a file system, and data compression before storage. The performance of the data pipeline is analyzed with an example in which the simulation data is obtained from an in-house computational fluid dynamics solver running in parallel on Tensor Processing Unit (TPU) clusters. The performance analysis demonstrates good storage and computational efficiency of the proposed data pipeline.
    Machine learning accelerated computational fluid dynamics
    Ayya Alieva
    Dmitrii Kochkov
    Jamie Alexander Smith
    Proceedings of the National Academy of Sciences USA (2021)
    Abstract: Numerical simulation of fluids plays an essential role in modeling many physical phenomena, such as in weather, climate, aerodynamics and plasma physics. Fluids are well described by the Navier-Stokes equations, but solving these equations at scale remains daunting, limited by the computational cost of resolving the smallest spatiotemporal features. This leads to unfavorable trade-offs between accuracy and tractability. Here we use end-to-end deep learning to improve approximations inside computational fluid dynamics for modeling two-dimensional turbulent flows. For both direct numerical simulation of turbulence and large eddy simulation, our results are as accurate as baseline solvers with 8-16x finer resolution in each spatial dimension, resulting in 40-400x computational speedups. Our method remains stable during long simulations, and generalizes to forcing functions and Reynolds numbers outside of the flows where it is trained, in contrast to black box machine learning approaches. Our approach exemplifies how scientific computing can leverage machine learning and hardware accelerators to improve simulations without sacrificing accuracy or generalization.
    Abstract: Profiling cellular phenotypes from microscopic imaging can provide meaningful biological information resulting from various factors affecting the cells. One motivating application is drug development: morphological cell features can be captured from images, from which similarities between different drug compounds applied at different doses can be quantified. The general approach is to find a function mapping the images to an embedding space of manageable dimensionality whose geometry captures relevant features of the input images. An important known issue for such methods is separating relevant biological signal from nuisance variation. For example, the embedding vectors tend to be more correlated for cells that were cultured and imaged during the same week than for those from different weeks, despite having identical drug compounds applied in both cases. In this case, the particular batch in which a set of experiments were conducted constitutes the domain of the data; an ideal set of image embeddings should contain only the relevant biological information (e.g., drug effects). We develop a general framework for adjusting the image embeddings in order to “forget” domain-specific information while preserving relevant biological information. To achieve this, we minimize a loss function based on distances between marginal distributions (such as the Wasserstein distance) of embeddings across domains for each replicated treatment. For the dataset we present results with, the only replicated treatment happens to be the negative control treatment, for which we do not expect any treatment-induced cell morphology changes. We find that for our transformed embeddings (i) the underlying geometric structure is not only preserved but the embeddings also carry improved biological signal; and (ii) less domain-specific information is present.
    Free-Form Diffractive Metagrating Design Based on Generative Adversarial Networks
    Jiaqi Jiang
    David Sell
    Jason Hickey
    Jianji Yang
    Jonathan A Fan
    ACS Nano (2019)
    Abstract: A key challenge in metasurface design is the development of algorithms that can effectively and efficiently produce high-performance devices. Design methods based on iterative optimization can push the performance limits of metasurfaces, but they require extensive computational resources that limit their implementation to small numbers of microscale devices. We show that generative neural networks can train from images of periodic, topology-optimized metagratings to produce high-efficiency, topologically complex devices operating over a broad range of deflection angles and wavelengths. Further iterative optimization of these designs yields devices with enhanced robustness and efficiencies, and these devices can be utilized as additional training data for network refinement. In this manner, generative networks can be trained, with a one-time computation cost, and used as a design tool to facilitate the production of near-optimal, topologically complex device designs. We envision that such data-driven design methodologies can apply to other physical sciences domains that require the design of functional elements operating across a wide parameter space.
    Inundation Modeling in Data Scarce Regions
    Vova Anisimov
    Yusef Shafi
    Sella Nevo
    Artificial Intelligence for Humanitarian Assistance and Disaster Response Workshop (2019)
    Abstract: Flood forecasts are crucial for effective individual and governmental protective action. The vast majority of flood-related casualties occur in developing countries, where providing spatially accurate forecasts is a challenge due to scarcity of data and lack of funding. This paper describes an operational system providing flood extent forecast maps covering several flood-prone regions in India, with the goal of being sufficiently scalable and cost-efficient to facilitate the establishment of effective flood forecasting systems globally.
    Learning data-driven discretizations for partial differential equations
    Yohai bar Sinai
    Jason Hickey
    Proceedings of the National Academy of Sciences (2019), pp. 201814058
    Abstract: The numerical solution of partial differential equations (PDEs) is challenging because of the need to resolve spatiotemporal features over wide length- and timescales. Often, it is computationally intractable to resolve the finest features in the solution. The only recourse is to use approximate coarse-grained representations, which aim to accurately represent long-wavelength dynamics while properly accounting for unresolved small-scale physics. Deriving such coarse-grained equations is notoriously difficult and often ad hoc. Here we introduce data-driven discretization, a method for learning optimized approximations to PDEs based on actual solutions to the known underlying equations. Our approach uses neural networks to estimate spatial derivatives, which are optimized end to end to best satisfy the equations on a low-resolution grid. The resulting numerical methods are remarkably accurate, allowing us to integrate in time a collection of nonlinear equations in 1 spatial dimension at resolutions 4x to 8x coarser than is possible with standard finite-difference methods.
    Assessing microscope image focus quality with deep learning
    D. Michael Ando
    Mariya Barch
    Arunachalam Narayanaswamy
    Eric Christiansen
    Chris Roat
    Jane Hung
    Curtis T. Rueden
    Asim Shankar
    Steven Finkbeiner
    BMC Bioinformatics, vol. 19 (2018), pp. 77
    Abstract: Background: Large image datasets acquired on automated microscopes typically have some fraction of low quality, out-of-focus images, despite the use of hardware autofocus systems. Identification of these images using automated image analysis with high accuracy is important for obtaining a clean, unbiased image dataset. Complicating this task is the fact that image focus quality is only well-defined in foreground regions of images, and as a result, most previous approaches only enable a computation of the relative difference in quality between two or more images, rather than an absolute measure of quality. Results: We present a deep neural network model capable of predicting an absolute measure of image focus on a single image in isolation, without any user-specified parameters. The model operates at the image-patch level, and also outputs a measure of prediction certainty, enabling interpretable predictions. The model was trained on only 384 in-focus Hoechst (nuclei) stain images of U2OS cells, which were synthetically defocused to one of 11 absolute defocus levels during training. The trained model can generalize on previously unseen real Hoechst stain images, identifying the absolute image focus to within one defocus level (approximately 3 pixel blur diameter difference) with 95% accuracy. On a simpler binary in/out-of-focus classification task, the trained model outperforms previous approaches on both Hoechst and Phalloidin (actin) stain images (F-scores of 0.89 and 0.86, respectively, versus 0.84 and 0.83), despite only having been presented Hoechst stain images during training. Lastly, we observe qualitatively that the model generalizes to two additional stains, Hoechst and Tubulin, of an unseen cell type (Human MCF-7) acquired on a different instrument. Conclusions: Our deep neural network enables classification of out-of-focus microscope images with both higher accuracy and greater precision than previous approaches via interpretable patch-level focus and certainty predictions. The use of synthetically defocused images precludes the need for a manually annotated training dataset. The model also generalizes to different image and cell types. The framework for model training and image prediction is available as a free software library and the pre-trained model is available for immediate use in Fiji (ImageJ) and CellProfiler.
    Xarray: N-D labeled arrays and datasets in Python
    Joe Hamman
    Journal of Open Research Software (2017)
    Abstract: Xarray is an open source project and Python package that provides data structures for N-dimensional labeled arrays inspired by Pandas. It provides a Pandas-like and Pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data format for which Pandas excels. Our approach adopts the Common Data Model for self-describing scientific data that is widely used in the geo-science community. Xarray builds on top of and seamlessly interoperates with the core scientific Python packages, such as NumPy, SciPy, Matplotlib, and Pandas.