Fei Sha
Authored Publications
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Chris Donahue
Dima Kuzmin
Judith Li
Kun Su
Mauro Verzetti
Qingqing Huang
Yu Wang
Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
DySLIM: Dynamics Stable Learning by Invariant Measure for Chaotic Systems
Yair Schiff
Jeff Parker
Volodymyr Kuleshov
International Conference on Machine Learning (ICML) (2024)
Learning dynamics from dissipative chaotic systems is notoriously difficult due to their inherent instability, as formalized by their positive Lyapunov exponents, which exponentially amplify errors in the learned dynamics. However, many of these systems exhibit ergodicity and an attractor: a compact and highly complex manifold, to which trajectories converge in finite time, that supports an invariant measure, i.e., a probability distribution that is invariant under the action of the dynamics, which dictates the long-term statistical behavior of the system. In this work, we leverage this structure to propose a new framework that targets learning the invariant measure as well as the dynamics, in contrast with typical methods that only target the misfit between trajectories, which often leads to divergence as the trajectories’ length increases. We use our framework to propose a tractable and sample-efficient objective that can be used with any existing learning objective. Our Dynamics Stable Learning by Invariant Measure (DySLIM) objective enables model training that achieves better point-wise tracking and long-term statistical accuracy relative to other learning objectives. By targeting the distribution with a scalable regularization term, we hope that this approach can be extended to more complex systems exhibiting slowly-variant distributions, such as weather and climate models. Code to reproduce our experiments is available here: https://github.com/google-research/swirl-dynamics/tree/main/swirl_dynamics/projects/ergodic.
SEEDS: Emulation of Weather Forecast Ensembles with Diffusion Models
John Anderson
Science Advances, 10 (2024), eadk4489
Probabilistic forecasting is crucial to decision-making under uncertainty about future weather. The dominant approach is to use an ensemble of forecasts to represent and quantify uncertainty in operational numerical weather prediction. However, generating ensembles is computationally costly. In this paper, we propose to generate ensemble forecasts at scale by leveraging recent advances in generative artificial intelligence. Our approach learns a data-driven probabilistic diffusion model from the 5-member ensemble GEFS reforecast dataset. The model can then be sampled efficiently to produce realistic weather forecasts, conditioned on a few members of the operational GEFS forecasting system. The generated ensembles have predictive skill similar to that of the full GEFS 31-member ensemble, evaluated against ERA5 reanalysis, and closely emulate the statistics of large physics-based ensembles. We also apply the same methodology to developing a diffusion model for generative post-processing: the model directly learns to correct biases present in the emulated forecasting system by leveraging reanalysis data as labels during training. Ensembles from this generative post-processing model show greater reliability and accuracy, particularly in extreme event classification. In general, they are more reliable and forecast the probability of extreme weather more accurately than the GEFS operational ensemble. Our models achieve these results at less than 1/10th of the computational cost incurred by the operational GEFS system.
Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models
Ricardo Baptista
Yi-fan Chen
John Anderson
Anudhyan Boral
Advances in Neural Information Processing Systems (NeurIPS) 36 (2023)
We introduce a two-stage probabilistic framework for statistical downscaling between unpaired data. Statistical downscaling seeks a probabilistic map to transform low-resolution data from a (possibly biased) coarse-grained numerical scheme to high-resolution data that is consistent with a high-fidelity scheme. Our framework tackles the problem by tandeming two transformations: a de-biasing step that is performed by an optimal transport map, and a super-resolution step that is achieved via a probabilistic diffusion model with a posteriori conditional sampling. This approach characterizes a conditional distribution without the need for paired data, and faithfully recovers relevant physical statistics from biased samples.
We demonstrate the utility of the proposed approach on one- and two-dimensional fluid flow problems, which are representative of the core difficulties present in numerical simulations of weather and climate. Our method produces realistic high-resolution outputs from low-resolution inputs at upsampling factors of 8x and 16x. Moreover, the procedure is faithful to the correct statistics of physical quantities, even when the low-frequency energy profiles of the inputs and the desired outputs do not match; matching profiles is a crucial but difficult-to-satisfy assumption of current state-of-the-art alternatives.
Neural Ideal Large Eddy Simulation: Modeling Turbulence with Neural Stochastic Differential Equations
Anudhyan Boral
James Lottes
Yi-fan Chen
John Anderson
Advances in Neural Information Processing Systems (NeurIPS) 36 (2023)
We introduce a data-driven learning framework that assimilates two powerful ideas: ideal large-eddy-simulation (LES) from turbulence closure modeling and neural stochastic differential equations (SDE) for modeling stochastic dynamical systems. Ideal LES identifies the optimal reduced-order flow fields of the large-scale features by marginalizing out the effect of small scales in stochastic turbulent trajectories. However, ideal LES is analytically intractable. In our work, we use a latent neural SDE to model the evolution of the stochastic process and an encoder-decoder pair for transforming between the latent space and the desired optimal flow field. This stands in sharp contrast to other types of neural parameterization of the closure models where each trajectory is treated as a deterministic realization of the dynamics. We show the effectiveness of our approach (niLES – neural ideal LES) on a challenging chaotic dynamical system: Kolmogorov flow at a Reynolds number of 20,000. Compared to prior works, our method is also able to handle non-uniform geometries and unstructured meshes. In particular, niLES leads to more accurate long-term statistics, and is stable even when rolling out to long horizons.
WeatherBench 2: A benchmark for the next generation of data-driven global weather models
Alex Merose
Peter Battaglia
Tyler Russell
Alvaro Sanchez
Vivian Yang
Matthew Chantry
Zied Ben Bouallegue
Peter Dueben
Carla Bromberg
Jared Sisk
Luke Barrington
Aaron Bell
arXiv (2023) (to appear)
WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework; publicly available training, ground truth, and baseline data; and a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. We also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting.
Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems
Anudhyan Boral
International Conference on Learning Representations (ICLR) (2023)
We present a data-driven, space-time continuous framework to learn surrogate models for complex physical systems described by advection-dominated partial differential equations. Those systems have slow-decaying Kolmogorov n-width that hinders standard methods, including reduced order modeling, from producing high-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compact representation network. We leverage the expressive power of the network and a specially designed consistency-inducing regularization to obtain latent trajectories that are both low-dimensional and smooth. These properties render our surrogate models highly efficient at inference time. We show the efficacy of our framework by learning models that generate accurate multi-step rollout predictions at much faster inference speed compared to competitors, for several challenging examples.
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
Lluis Castrejon
Arushi Goel
Felipe Cadar
Vittorio Ferrari
ICCV (2023)
We propose Encyclopedic-VQA, a large-scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs, each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa.
Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing
Ice Pasupat
Joshua Ainslie
Linlu Qiu
Michiel de Jong
Yury Zemlyanskiy
Proceedings of COLING (2022)
A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the ability to retrieve informative exemplars that help produce the correct parse, which is especially challenging in low-resource settings. Existing retrieval is commonly based on similarity of query and exemplar inputs. We propose GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar. GandR first generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars with outputs similar to the preliminary prediction, which are used to generate a final prediction. GandR sets the state of the art on multiple low-resource semantic parsing tasks.
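The two-stage retrieval described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `token_overlap`, `retrieve`, and `gandr_predict` are hypothetical, the token-overlap score is a stand-in for whatever similarity measure the paper uses, and `model` is any callable that maps a query plus retrieved exemplars to a parse.

```python
from collections import Counter


def token_overlap(a, b):
    """Stand-in similarity (an assumption): multiset Jaccard overlap of whitespace tokens."""
    ta, tb = Counter(a.split()), Counter(b.split())
    inter = sum((ta & tb).values())
    union = sum((ta | tb).values())
    return inter / union if union else 0.0


def retrieve(query, exemplars, key, k):
    """Return the k exemplars whose `key` field ("input" or "output") best matches `query`."""
    ranked = sorted(exemplars, key=lambda ex: token_overlap(query, ex[key]), reverse=True)
    return ranked[:k]


def gandr_predict(query, exemplars, model, k=2):
    """Hypothetical two-stage loop: generate a draft, then retrieve by output similarity."""
    # Stage 1: input-based retrieval feeds a preliminary prediction.
    first_pass = retrieve(query, exemplars, "input", k)
    preliminary = model(query, first_pass)
    # Stage 2: retrieve exemplars whose outputs resemble the preliminary prediction,
    # then generate the final prediction from those exemplars.
    second_pass = retrieve(preliminary, exemplars, "output", k)
    return model(query, second_pass)
```

In this sketch the second stage compares the preliminary prediction against exemplar outputs only, which mirrors the wording of the abstract; any mixing of input and output similarity would be a further design choice.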
Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
Michiel de Jong
Yury Zemlyanskiy
10th International Conference on Learning Representations, ICLR 2022, Virtual Conference, April 25-29, 2022, OpenReview.net
Natural language understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose to address this problem by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge.
Specifically, our method represents knowledge as a "mention memory" containing a dense vector representation of every entity mention in a corpus. The Transformer model accesses the information through internal memory layers in which each entity mention in the passage being read attends to the mention memory. This approach enables synthesis of and reasoning over many disparate sources of information within a single Transformer model.
In experiments using a memory of ~150 million Wikipedia mentions, our model provides strong improvements in performance on several open-domain knowledge-intensive tasks, including the claim verification benchmarks FEVER and HoVeR and several entity-based QA benchmarks. We also show that the model learns to attend to informative mentions without any direct supervision. Finally, we show that the model can be adapted to generalize to new unseen entities by updating the memory, without retraining.
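The memory access described in this abstract, where a passage mention attends to a table of mention vectors, can be illustrated with a toy attention step. This is a sketch under assumptions rather than the paper's architecture: the function names are hypothetical, the attention is dense dot-product softmax attention over a tiny in-memory table, and the key and value vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend_to_memory(query, memory_keys, memory_values):
    """Attend from one passage-mention query vector over every memory entry.

    Returns the attention-weighted sum of value vectors and the weights.
    """
    weights = softmax([dot(query, key) for key in memory_keys])
    dim = len(memory_values[0])
    retrieved = [
        sum(w * value[i] for w, value in zip(weights, memory_values))
        for i in range(dim)
    ]
    return retrieved, weights
```

Because the memory is only read through attention in this sketch, appending a new key and value row makes a new mention retrievable without changing any other parameter, which is the intuition behind updating the memory without retraining.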