Rif A. Saurous
Authored Publications
Sequential Monte Carlo Learning for Time Series Structure Discovery
Feras Saad
Matthew D. Hoffman
Vikash Mansinghka
Proceedings of the 40th International Conference on Machine Learning (2023), pp. 29473-29489
This paper presents a new approach to automatically discovering accurate
models of complex time series data. Working within a Bayesian nonparametric
prior over a symbolic space of Gaussian process time series models, we
present a novel structure learning algorithm that integrates sequential
Monte Carlo (SMC) and involutive MCMC for highly effective posterior
inference. Our method can be used both in "online" settings, where new
data is incorporated sequentially in time, and in "offline" settings, by
using nested subsets of historical data to anneal the posterior. Empirical
measurements on a variety of real-world time series show that our method
can deliver 10x to 100x runtime speedups over previous MCMC and greedy-search
structure learning algorithms for the same model family. We use our method
to perform the first large-scale evaluation of Gaussian process time series
structure learning on a widely used benchmark of 1,428 monthly econometric
datasets, showing that our method discovers sensible models that deliver
more accurate point forecasts and interval forecasts over multiple horizons
as compared to prominent statistical and neural baselines that struggle on
this challenging data.
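As a rough illustration of the sequential idea in this abstract, the sketch below runs a bare-bones SMC loop that adds one observation at a time, reweights the particles, and resamples when the effective sample size falls. It is a toy under stated assumptions: the model is a coin bias rather than a Gaussian process structure, there are no MCMC rejuvenation moves, and the function name `smc_coin_bias` and its parameters are hypothetical.

```python
import random


def smc_coin_bias(observations, num_particles=2000, seed=0):
    """Toy sequential Monte Carlo: infer a coin's bias one observation at a time."""
    rng = random.Random(seed)
    # Particles are hypotheses drawn from a uniform prior over the bias.
    particles = [rng.random() for _ in range(num_particles)]
    weights = [1.0 / num_particles] * num_particles

    for y in observations:
        # Reweight each particle by the likelihood of the newly added datum.
        weights = [w * (p if y == 1 else 1.0 - p)
                   for w, p in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]

        # Resample when the effective sample size drops below half the particles.
        ess = 1.0 / sum(w * w for w in weights)
        if ess < num_particles / 2:
            particles = rng.choices(particles, weights=weights, k=num_particles)
            weights = [1.0 / num_particles] * num_particles

    # Weighted posterior-mean estimate of the bias.
    return sum(w * p for w, p in zip(weights, particles))
```

Feeding the data in growing prefixes is what lets the same loop serve both the online case (data arrives over time) and the offline case (nested subsets of a fixed dataset).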
Automatically batching control-intensive programs for modern accelerators
Alexey Radul
Dougal Maclaurin
Matthew D. Hoffman
Third Conference on Systems and Machine Learning, Austin, TX (2020)
We present a general approach to batching arbitrary computations for
GPU and TPU accelerators. We demonstrate the effectiveness of our
method with orders-of-magnitude speedups on the No U-Turn Sampler
(NUTS), a workhorse algorithm in Bayesian statistics. The central
challenge of batching NUTS and other Markov chain Monte Carlo
algorithms is data-dependent control flow and recursion. We overcome
this by mechanically transforming a single-example implementation into
a form that explicitly tracks the current program point for each batch
member, and only steps forward those in the same place. We present
two different batching algorithms: a simpler, previously published one
that inherits recursion from the host Python, and a more complex,
novel one that implements recursion directly and can batch across it.
We implement these batching methods as a general program
transformation on Python source. Both the batching system and the
NUTS implementation presented here are available as part of the
popular TensorFlow Probability software package.
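The program-point idea can be sketched by hand for a single data-dependent loop. The example below is not the paper's program transformation: it is a manually written illustration in NumPy, with a hypothetical function `batched_collatz_steps`, in which each batch member carries its own program counter and only the members waiting at the selected block step forward. It does not cover recursion.

```python
import numpy as np

HALT = 2  # program point meaning "this batch member has finished"


def batched_collatz_steps(start_values):
    """Count Collatz steps for a whole batch using per-member program counters."""
    n = np.asarray(start_values, dtype=np.int64).copy()
    steps = np.zeros_like(n)
    pc = np.zeros(n.shape, dtype=np.int64)  # every member starts at block 0

    while np.any(pc != HALT):
        # Pick the block that the most unfinished members are waiting at.
        counts = np.bincount(pc[pc != HALT], minlength=HALT)
        block = int(np.argmax(counts))
        here = pc == block  # only these members step forward

        if block == 0:
            # Block 0: loop test. Members at 1 halt, the rest enter the body.
            pc = np.where(here, np.where(n == 1, HALT, 1), pc)
        else:
            # Block 1: loop body, then jump back to the loop test.
            updated = np.where(n % 2 == 0, n // 2, 3 * n + 1)
            n = np.where(here, updated, n)
            steps = np.where(here, steps + 1, steps)
            pc = np.where(here, 0, pc)

    return steps
```

Each member runs a different number of iterations, yet every arithmetic operation is applied to the full batch at once, which is the property that makes this style suit accelerators.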
Large-Scale Weakly-Supervised Content Embeddings for Music Recommendation and Tagging
Qingqing Huang
Li Zhang
John Roberts Anderson
ICASSP 2020 (2020)
We explore content-based representation learning strategies tailored for
large-scale, uncurated music collections that afford only weak supervision
through unstructured natural language metadata and co-listen statistics. At the
core is a hybrid training scheme that uses classification and metric learning
losses to incorporate both metadata-derived text labels and aggregate co-listen
supervisory signals into a single convolutional model. The resulting joint text
and audio content embedding defines a similarity metric and supports prediction
of semantic text labels using a vocabulary of unprecedented granularity, which
we refine using a novel word-sense disambiguation procedure. As input to simple
classifier architectures, our representation achieves state-of-the-art
performance on two music tagging benchmarks.
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Proceedings of ICASSP 2020 (2020) (to appear)
Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies far more on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a novel clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate hypothesized categories into relevant semantic classes. By jointly training a single sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to 20-fold reduction in labels required to reach a desired classification performance.
Estimating the Changing Infection Rate of COVID-19 Using Bayesian Models of Mobility
Xue Ben
Shawn O'Banion
Matthew D. Hoffman
medRxiv, https://www.medrxiv.org/content/10.1101/2020.08.06.20169664v1.full (2020)
In order to prepare for and control the continued spread of the COVID-19 pandemic while minimizing its economic impact, the world needs to be able to estimate and predict COVID-19’s spread.
Unfortunately, we cannot directly observe the prevalence or growth rate of COVID-19; these must be inferred using some kind of model.
We propose a hierarchical Bayesian extension to the classic susceptible-exposed-infected-removed (SEIR) compartmental model that adds compartments to account for isolation and death and allows the infection rate to vary as a function of both mobility data collected from mobile phones and a latent time-varying factor that accounts for changes in behavior not captured by mobility data. Since confirmed-case data is unreliable, we infer the model’s parameters conditioned on deaths data. We replace the exponential-waiting-time assumption of classic compartmental models with Erlang distributions, which allows for a more realistic model of the long lag between exposure and death. The mobility data gives us a leading indicator that can quickly detect changes in the pandemic’s local growth rate and forecast changes in death rates weeks ahead of time. This is an analysis of observational data, so any causal interpretations of the model’s inferences should be treated as suggestive at best; nonetheless, the model’s inferred relationships between different kinds of trips and the infection rate do suggest some possible hypotheses about what kinds of activities might contribute most to COVID-19’s spread.
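The switch from exponential to Erlang waiting times can be illustrated with a short simulation. An Erlang waiting time is the sum of several exponential stages in series, so splitting one compartment into `num_stages` sub-compartments keeps the mean delay but narrows its spread. The sketch below assumes nothing about the paper's actual parameter values; the function names and the 10-day mean are hypothetical.

```python
import random
import statistics


def sample_waiting_time(mean, num_stages, rng):
    """Time to pass through `num_stages` exponential sub-compartments in series."""
    stage_rate = num_stages / mean  # chosen so the overall mean stays fixed
    return sum(rng.expovariate(stage_rate) for _ in range(num_stages))


def summarize(mean, num_stages, num_samples=20000, seed=0):
    """Return the empirical mean and standard deviation of the waiting time."""
    rng = random.Random(seed)
    draws = [sample_waiting_time(mean, num_stages, rng)
             for _ in range(num_samples)]
    return statistics.mean(draws), statistics.pstdev(draws)


if __name__ == "__main__":
    for k in (1, 4):
        m, s = summarize(mean=10.0, num_stages=k)
        print(f"stages={k}: mean={m:.2f}, sd={s:.2f}")
```

With one stage the delay is exponential, so its standard deviation equals its mean. With k stages the standard deviation shrinks to mean / sqrt(k), which rules out the implausibly short exposure-to-death delays that a single exponential stage permits.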
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
Jeremy Thorpe
Michael Chinen
IEEE International Conference on Acoustics, Speech, and Signal Processing (2019)
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system’s output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
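A mixture-consistency projection can be sketched in a few lines. The version below splits the residual equally across sources, which is an assumption made for illustration since the abstract does not specify a weighting. The function name `mixture_consistency` is hypothetical, and the same code works for real or complex arrays.

```python
import numpy as np


def mixture_consistency(estimates, mixture):
    """Project source estimates so that they sum exactly to the mixture.

    estimates: array of shape (num_sources, ...), real or complex.
    mixture:   array of shape (...), the observed mixture.
    """
    estimates = np.asarray(estimates)
    residual = mixture - estimates.sum(axis=0)
    # Spread the residual equally across sources; this is the smallest
    # least-squares change that satisfies the sum constraint.
    return estimates + residual / estimates.shape[0]
```

Because the operation is a plain sum and add, it is differentiable and could sit as a final layer on an enhancement network.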
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Prashant Sridhar
Ye Jia
ICASSP 2019 (2018)
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Ying Xiao
Yuxuan Wang
Joel Shor
International Conference on Machine Learning (2018)
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and 44-speaker Tacotron model on a prosody transfer task.
Fixing a Broken ELBO
Alex Alemi
Ben Poole
Josh Dillon
Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholmsmässan, Stockholm Sweden (2018), pp. 159-168
Recent work in unsupervised representation learning has focused on learning deep directed latent variable models. Fitting these models by maximizing the marginal likelihood or evidence is typically intractable, thus a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether exact or approximate) does not necessarily result in a good latent representation, as we demonstrate both theoretically and empirically. In particular, we derive variational lower and upper bounds on the mutual information between the input and the latent variable, and use these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy. Using this framework, we demonstrate that there is a family of models with identical ELBO, but different quantitative and qualitative characteristics. Our framework also suggests a simple new method to ensure that latent variable models with powerful stochastic decoders do not ignore their latent code.
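The rate-distortion view in this abstract can be written out compactly. The notation below is assumed for illustration: e(z|x) is the encoder, d(x|z) the decoder, m(z) the marginal over latents, p*(x) the data distribution, and H the data entropy.

```latex
\begin{aligned}
D &\equiv -\,\mathbb{E}_{p^*(x)}\,\mathbb{E}_{e(z \mid x)}\big[\log d(x \mid z)\big] && \text{(distortion)} \\
R &\equiv \mathbb{E}_{p^*(x)}\big[\mathrm{KL}\big(e(z \mid x)\,\|\,m(z)\big)\big] && \text{(rate)} \\
H - D &\le I(X; Z) \le R \\
\mathrm{ELBO} &= -(D + R)
\end{aligned}
```

Every point on a line D + R = constant has the same ELBO, yet the points can differ in how much information the latent carries about the input, which is why the ELBO alone does not identify a good representation.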
Simple, Distributed, and Accelerated Probabilistic Programming
Matthew D. Hoffman
Dave Moore
Christopher Gordon Suter
Srinivas Vasudevan
Alexey Radul
Matthew Johnson
NeurIPS (2018)
We describe Edward2, a low-level probabilistic programming language. Edward2 distills the core of probabilistic programming down to a single abstraction—the random variable. By blurring the line between model and computation, Edward2 enables numerous applications not shown before: a model-parallel variational auto-encoder (VAE) with tensor processing units (TPUs); a data-parallel autoregressive model (Image Transformer) with TPUs; and multi-GPU No-U-Turn Sampler (NUTS). Edward2 achieves an optimal linear speedup from 4 to 256 TPUs. With VAEs, Edward2 sees up to a 20x speedup on TPUs over Pyro and Edward on GPUs; with Bayesian neural networks, Edward2 sees up to a 51x speedup. With NUTS, Edward2 sees a 20x speedup on GPUs over Stan and 7x over PyMC3.