Rif A. Saurous
Authored Publications
Sequential Monte Carlo Learning for Time Series Structure Discovery
Feras Saad
Matthew D. Hoffman
Vikash Mansinghka
Proceedings of the 40th International Conference on Machine Learning (2023), pp. 29473-29489
This paper presents a new approach to automatically discovering accurate
models of complex time series data. Working within a Bayesian nonparametric
prior over a symbolic space of Gaussian process time series models, we
present a novel structure learning algorithm that integrates sequential
Monte Carlo (SMC) and involutive MCMC for highly effective posterior
inference. Our method can be used both in "online" settings, where new
data is incorporated sequentially in time, and in "offline" settings, by
using nested subsets of historical data to anneal the posterior. Empirical
measurements on a variety of real-world time series show that our method
can deliver 10x–100x runtime speedups over previous MCMC and greedy-search
structure learning algorithms for the same model family. We use our method
to perform the first large-scale evaluation of Gaussian process time series
structure learning on a widely used benchmark of 1,428 monthly econometric
datasets, showing that our method discovers sensible models that deliver
more accurate point forecasts and interval forecasts over multiple horizons
as compared to prominent statistical and neural baselines that struggle on
this challenging data.
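As a rough illustration of the SMC ingredient, here is a minimal Python sketch of systematic resampling plus the "offline" annealing loop over nested data subsets. The particle representation, `loglik`, and the schedule are placeholders, not the paper's symbolic Gaussian process structures, and the involutive MCMC rejuvenation moves the method interleaves between stages are omitted.

```python
import math
import random

def systematic_resample(particles, weights, rng):
    """Low-variance systematic resampling of a weighted particle set."""
    n = len(particles)
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    start = rng.random() / n
    out, j = [], 0
    for i in range(n):
        u = start + i / n
        while cdf[j] < u:
            j += 1
        out.append(particles[j])
    return out

def smc_anneal(particles, loglik, data, schedule, rng):
    """Anneal the posterior over nested subsets data[:k] for k in schedule,
    reweighting and resampling candidate model structures at each stage."""
    for k in schedule:
        subset = data[:k]
        weights = [math.exp(loglik(p, subset)) for p in particles]
        particles = systematic_resample(particles, weights, rng)
    return particles
```

In the online setting the same reweight/resample loop would run as new observations arrive rather than over nested historical subsets.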
Large-Scale Weakly-Supervised Content Embeddings for Music Recommendation and Tagging
Qingqing Huang
Li Zhang
John Roberts Anderson
ICASSP 2020 (2020)
We explore content-based representation learning strategies tailored for
large-scale, uncurated music collections that afford only weak supervision
through unstructured natural language metadata and co-listen statistics. At the
core is a hybrid training scheme that uses classification and metric learning
losses to incorporate both metadata-derived text labels and aggregate co-listen
supervisory signals into a single convolutional model. The resulting joint text
and audio content embedding defines a similarity metric and supports prediction
of semantic text labels using a vocabulary of unprecedented granularity, which
we refine using a novel word-sense disambiguation procedure. As input to simple
classifier architectures, our representation achieves state-of-the-art
performance on two music tagging benchmarks.
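A toy version of the hybrid training objective might combine a softmax cross-entropy term for the metadata-derived text labels with a triplet-style metric loss on co-listen pairs. The specific losses and the mixing weight `alpha` below are illustrative guesses, not the paper's exact formulation.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for the metadata-label classification head."""
    m = max(logits)
    logz = m + math.log(sum(math.exp(x - m) for x in logits))
    return logz - logits[label]

def colisten_triplet(anchor, positive, negative, margin=0.2):
    """Metric-learning hinge: co-listened tracks should embed closer
    than non-co-listened ones by at least `margin` (squared distance)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

def hybrid_loss(logits, label, anchor, pos, neg, alpha=0.5):
    """Weighted sum of the classification and metric supervision signals
    (the weighting scheme here is an assumption)."""
    return cross_entropy(logits, label) + alpha * colisten_triplet(anchor, pos, neg)
```

Both terms backpropagate into the same audio encoder, which is what lets one convolutional model absorb both kinds of weak supervision.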
Estimating the Changing Infection Rate of COVID-19 Using Bayesian Models of Mobility
Xue Ben
Shawn O'Banion
Matthew D. Hoffman
medRxiv, https://www.medrxiv.org/content/10.1101/2020.08.06.20169664v1.full (2020)
To prepare for and control the continued spread of the COVID-19 pandemic while minimizing its economic impact, the world needs to be able to estimate and predict COVID-19’s spread.
Unfortunately, we cannot directly observe the prevalence or growth rate of COVID-19; these must be inferred using some kind of model.
We propose a hierarchical Bayesian extension to the classic susceptible-exposed-infected-removed (SEIR) compartmental model that adds compartments to account for isolation and death and allows the infection rate to vary as a function of both mobility data collected from mobile phones and a latent time-varying factor that accounts for changes in behavior not captured by mobility data. Since confirmed-case data is unreliable, we infer the model’s parameters conditioned on deaths data. We replace the exponential-waiting-time assumption of classic compartmental models with Erlang distributions, which allows for a more realistic model of the long lag between exposure and death. The mobility data gives us a leading indicator that can quickly detect changes in the pandemic’s local growth rate and forecast changes in death rates weeks ahead of time. This is an analysis of observational data, so any causal interpretations of the model’s inferences should be treated as suggestive at best; nonetheless, the model’s inferred relationships between different kinds of trips and the infection rate do suggest some possible hypotheses about what kinds of activities might contribute most to COVID-19’s spread.
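For intuition, here is one Euler step of the classic SEIR core in Python, with the infection rate `beta` left as a free, time-varying input, a stand-in for the mobility-driven rate. The paper's extra isolation and death compartments and its Erlang waiting times are omitted, and the `sigma`/`gamma` defaults are illustrative, not fitted values.

```python
def seir_step(s, e, i, r, beta, sigma=0.2, gamma=0.1, dt=1.0):
    """One Euler step of a basic SEIR model. beta may vary per step
    (e.g. driven by mobility), unlike the constant-rate classic model.
    sigma: exposed -> infectious rate; gamma: infectious -> removed rate."""
    n = s + e + i + r
    new_exposed = beta * s * i / n * dt
    new_infectious = sigma * e * dt
    new_removed = gamma * i * dt
    return (s - new_exposed,
            e + new_exposed - new_infectious,
            i + new_infectious - new_removed,
            r + new_removed)
```

Iterating this step with a `beta` sequence derived from mobility data is the deterministic skeleton; the paper wraps such dynamics in a hierarchical Bayesian model and infers parameters from deaths data.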
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Proceedings of ICASSP 2020 (2020) (to appear)
Humans do not acquire perceptual abilities the way we train machines. While machine learning algorithms typically operate on large collections of randomly chosen, explicitly labeled examples, human acquisition relies far more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a novel clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate hypothesized categories into relevant semantic classes. By jointly training a single sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the labels required to reach a desired classification performance.
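One way to picture the consolidation step: rank clusters for label solicitation so that large clusters with no labeled member are queried first, letting a single weak label name many examples at once. This ranking heuristic is a hypothetical illustration of the idea, not the paper's actual procedure.

```python
def query_priority(clusters, labeled):
    """Order cluster ids for active label queries: clusters with no
    labeled member come first, and larger clusters before smaller ones.

    clusters: dict mapping cluster id -> list of example ids
    labeled:  set of example ids that already have a label
    """
    def score(cid):
        members = clusters[cid]
        has_label = any(m in labeled for m in members)
        # False sorts before True, so unlabeled clusters lead;
        # negative size puts bigger clusters first within each group.
        return (has_label, -len(members))
    return sorted(clusters, key=score)
```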
Automatically batching control-intensive programs for modern accelerators
Alexey Radul
Dougal Maclaurin
Matthew D. Hoffman
Third Conference on Systems and Machine Learning, Austin, TX (2020)
We present a general approach to batching arbitrary computations for
GPU and TPU accelerators. We demonstrate the effectiveness of our
method with orders-of-magnitude speedups on the No U-Turn Sampler
(NUTS), a workhorse algorithm in Bayesian statistics. The central
challenge of batching NUTS and other Markov chain Monte Carlo
algorithms is data-dependent control flow and recursion. We overcome
this by mechanically transforming a single-example implementation into
a form that explicitly tracks the current program point for each batch
member, and only steps forward those in the same place. We present
two different batching algorithms: a simpler, previously published one
that inherits recursion from the host Python, and a more complex,
novel one that implements recursion directly and can batch across it.
We implement these batching methods as a general program
transformation on Python source. Both the batching system and the
NUTS implementation presented here are available as part of the
popular TensorFlow Probability software package.
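The per-member program-counter idea can be caricatured in a few lines: represent the program as a list of step functions that return the next program point (allowing data-dependent jumps), and on each tick advance only the batch members sitting at the most common unfinished point. This toy interpreter steps members one at a time in a Python loop; the actual system transforms Python source so that each such step runs as a single vectorized accelerator operation.

```python
def batched_run(program, states):
    """program: list of functions state -> (new_state, next_pc);
    pc == len(program) marks a finished member. Each tick advances
    only the members whose program counter matches the chosen point."""
    pcs = [0] * len(states)
    while any(pc < len(program) for pc in pcs):
        # choose the program point shared by the most unfinished members
        active = [pc for pc in pcs if pc < len(program)]
        point = max(set(active), key=active.count)
        for i in range(len(states)):
            if pcs[i] == point:
                states[i], pcs[i] = program[point](states[i])
    return states
```

Because each member carries its own counter, members can take data-dependent paths of different lengths (as NUTS trajectories do) while still sharing work whenever their counters coincide.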
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
Jeremy Thorpe
Michael Chinen
IEEE International Conference on Acoustics, Speech, and Signal Processing (2019)
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system’s output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
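A sketch of the mixture-consistency projection in numpy, assuming an equal split of the residual across sources (any other fixed normalized weighting would also preserve the constraint). STFT consistency is analogous, composing inverse-STFT with STFT so the output is the transform of some time-domain signal, and is omitted here for brevity.

```python
import numpy as np

def mixture_consistency(estimates, mixture, weights=None):
    """Differentiable projection enforcing mixture consistency: shift each
    source estimate by a weighted share of the residual so that the
    estimated sources sum exactly to the input mixture.

    estimates: array of shape (num_sources, ...) of source estimates
    mixture:   array of shape (...) matching one source
    """
    k = estimates.shape[0]
    if weights is None:
        # equal split of the residual across sources
        weights = np.full((k,) + (1,) * (estimates.ndim - 1), 1.0 / k)
    residual = mixture - estimates.sum(axis=0)
    return estimates + weights * residual
```

Because the projection is a simple affine layer, it can be appended to the enhancement network and trained through end to end, which is what allows the constraint to be imposed jointly with complex-valued masking.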
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Prashant Sridhar
Ye Jia
ICASSP 2019 (2018)
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
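One simple way to realize the conditioning step is to tile the speaker embedding across time and concatenate it to every spectrogram frame before the masking network. This sketch shows only that input construction; the shapes are illustrative, and the paper's exact architecture may fuse the embedding differently.

```python
import numpy as np

def condition_on_speaker(noisy_spec, speaker_emb):
    """Build the masking network's input by attaching the (d,)-dim
    speaker embedding to each frame of a (frames, bins) spectrogram."""
    frames = noisy_spec.shape[0]
    tiled = np.tile(speaker_emb, (frames, 1))      # (frames, d)
    return np.concatenate([noisy_spec, tiled], axis=1)
```

The speaker recognition network is trained separately, so at separation time the reference signal only needs one forward pass to produce the embedding.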
Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions
Jonathan Shen
Ruoming Pang
Mike Schuster
Navdeep Jaitly
Zongheng Yang
Yu Zhang
Yuxuan Wang
Yannis Agiomyrgiannakis
ICASSP (2018)
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.
To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
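The mel spectrogram interface between the two networks is just a triangular filterbank applied to linear STFT magnitudes. A minimal numpy construction is sketched below using the common 2595·log10(1 + f/700) mel formula; the filter count, FFT size, and sample rate are illustrative parameters, not necessarily the paper's.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters mapping n_fft//2 + 1 linear STFT bins onto a
    compact mel-scale axis; multiplying a magnitude spectrogram by this
    matrix gives the kind of representation the vocoder consumes."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0),
                                    n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, ctr, hi = mel_pts[m], mel_pts[m + 1], mel_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb
```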
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Yuxuan Wang
Yu Zhang
Joel Shor
Ying Xiao
Fei Ren
Ye Jia
ICML (2018)
In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained in a completely unsupervised manner, and yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of surprising results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and modifying speaking style, independently of the text content. The labels can also be used for style transfer, replicating the speaking style of one “seed” phrase across an entire long-form text corpus. Perhaps most surprisingly, when trained on noisy, unlabelled found data, GSTs learn to factorize noise and speaker identity, providing a path toward highly scalable yet robust speech synthesis.
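The core mechanism can be sketched as softmax attention over a small token bank: a query from a reference encoder selects a convex combination of tokens, and the attention weights are the interpretable soft "labels". This uses plain dot-product attention for simplicity; the actual model attends with a multi-head mechanism inside Tacotron.

```python
import numpy as np

def style_embedding(query, tokens):
    """Attend over a (n_tokens, d) bank of style tokens with a (d,)
    query; return the combined style embedding and the soft weights."""
    scores = tokens @ query                 # (n_tokens,) similarity scores
    scores = scores - scores.max()          # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ tokens, weights
```

Because the weights sum to one, setting them by hand (rather than from a reference) gives direct, interpretable control over the synthesized style.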
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Ying Xiao
Yuxuan Wang
Joel Shor
International Conference on Machine Learning (2018)
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and 44-speaker Tacotron model on a prosody transfer task.