Fei Sha

Fei Sha

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    Chris Donahue
    Dima Kuzmin
    Judith Li
    Kun Su
    Mauro Verzetti
    Qingqing Huang
    Yu Wang
    Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press(2024), pp. 4952-4960
    Preview abstract Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow. View details
    Preview abstract Probabilistic forecasting is crucial to decision-making under uncertainty about future weather. The dominant approach is to use an ensemble of forecasts to represent and quantify uncertainty in operational numerical weather prediction. However, generating ensembles is computationally costly. In this paper, we propose to generate ensemble forecasts at scale by leveraging recent advances in generative artificial intelligence. Our approach learns a data-driven probabilistic diffusion model from the 5-member ensemble GEFS reforecast dataset. The model can then be sampled efficiently to produce realistic weather forecasts, conditioned on a few members of the operational GEFS forecasting system. The generated ensembles have similar predictive skill as the full GEFS 31-member ensemble, evaluated against ERA5 reanalysis, and emulate well the statistics of large physics-based ensembles. We also apply the same methodology to developing a diffusion model for generative post-processing: the model directly learns to correct biases present in the emulated forecasting system by leveraging reanalysis data as labels during training. Ensembles from this generative post-processing model show greater reliability and accuracy, particularly in extreme event classification. In general, they are more reliable and forecast the probability of extreme weather more accurately than the GEFS operational ensemble. Our models achieve these results at less than 1/10th of the computational cost incurred by the operational GEFS system. View details
    WeatherBench 2: A benchmark for the next generation of data-driven global weather models
    Alex Merose
    Peter Battaglia
    Tyler Russell
    Alvaro Sanchez
    Vivian Yang
    Matthew Chantry
    Zied Ben Bouallegue
    Peter Dueben
    Carla Bromberg
    Jared Sisk
    Luke Barrington
    Aaron Bell
    arXiv(2023) (to appear)
    Preview abstract WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed with the aim to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework, publicly available training, ground truth and baseline data as well as a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. In addition, we also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting. View details
    Preview abstract We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa. View details
    Preview abstract We present a data-driven, space-time continuous framework to learn surrogate models for complex physical systems described by advection-dominated partial differential equations. Those systems have slow-decaying Kolmogorov n-width that hinders standard methods, including reduced order modeling, from producing high-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compact representation network. We leverage the expressive power of the network and a specially designed consistency-inducing regularization to obtain latent trajectories that are both low-dimensional and smooth. These properties render our surrogate models highly efficient at inference time. We show the efficacy of our framework by learning models that generate accurate multi-step rollout predictions at much faster inference speed compared to competitors, for several challenging examples. View details
    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
    Michiel de Jong
    Yury Zemlyanskiy
    10th International Conference on Learning Representations, ICLR 2022, Virtual Conference , April 25-29, 2022, OpenReview.net
    Preview abstract Natural language understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose to address this problem by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge. Specifically, our method represents knowledge as a ``mention memory" containing a dense vector representation of every entity mention in a corpus. The Transformer model accesses the information through internal memory layers in which each entity mention in the passage being read attends to the mention memory. This approach enables synthesis of and reasoning over many disparate sources of information \textit{within} a single Transformer model. In experiments using a memory of ~150 million Wikipedia mentions, our model provides to strong improvements in performance on several open-domain knowledge-intensive tasks, including the claim verification benchmarks FEVER and HoVeR and several entity-based QA benchmarks. We also show that the model learns to attend to informative mentions without any direct supervision. Finally we show that the model can be adapted to generalize to new unseen entities by updating the memory, without retraining. View details
    Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing
    Ice Pasupat
    Joshua Ainslie
    Linlu Qiu
    Michiel de Jong
    Yury Zemlyanskiy
    Proceedings of COLING(2022)
    Preview abstract A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the ability to retrieve informative exemplars that help produce the correct parse, which is especially challenging in low-resource settings. Existing retrieval is commonly based on similarity of query and exemplar inputs. We propose GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar. GandR first generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars with outputs similar to the preliminary prediction which are used to generate a final prediction. GandR sets the state of the art on multiple low-resource semantic parsing tasks. View details
    When MAML Can Adapt Fast and How to Assist When It Cannot
    Sebastien Arnold
    Shariq Iqbal
    Proceedings of AISTATS 2021
    Preview abstract Model-Agnostic Meta-Learning (MAML) and its variants have achieved success in meta-learning tasks on many datasets and settings. On the other hand, we have just started to understand and analyze how they are able to adapt fast to new tasks. For example, one popular hypothesis is that the algorithms learn good representations for transfer, as in multi-task learning. In this work, we contribute by providing a series of empirical and theoretical studies, and discover several interesting yet previously unknown properties of the algorithm. We find MAML adapts better with a deep architecture even if the tasks need only a shallow one (and thus, no representation learning is needed). While echoing previous findings by others that the bottom layers in deep architectures enable representation learning, we also find that upper layers enable fast adaptation by being meta-learned to perform adaptive gradient update when generalizing to new tasks. Motivated by these findings, we study several meta-optimization approaches and propose a new one for learning to optimize adaptively. Those approaches attain stronger performance in meta-learning both shallower and deeper architectures than MAML. View details
    ReadTwice: Reading Very Large Documents with Memories
    Yury Zemlyanskiy
    Joshua Ainslie
    Michiel de Jong
    Ilya Eckstein
    Proceedings of NAACL(2021) (to appear)
    Preview abstract Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs, like books or collections of articles. We propose ReadTwice, a simple and effective approach to combine the advantages of existing approaches that modify Transformers to model long-range dependencies. The main idea is to read smaller segments of the text and summarize them into a memory table to be used in a second read of the text. We show that the model outperforms models of comparable size on several QA datasets and sets the state of the art on the challenging NarrativeQA dataset which asks questions about entire books. View details
    DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections
    Yury Zemlyanskiy
    Sudeep Gandhe
    Ruining He
    Anirudh Ravula
    Juro Gottweis
    Ilya Eckstein
    Proceedings of EACL(2021) (to appear)
    Preview abstract This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as search ranked retrieval, knowledge base completion, question answering and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include {\em any} available text related to an entity. With the breadth and depth of textual content available on the web, this approach enables a new class of powerful, high-capacity representations that can ultimately ``remember" any useful information about an entity, without the need for human annotations. We present several training strategies that jointly learn to predict words and entities --- strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by results, our models outperform competitive baselines, sometimes with little or no fine-tuning, and are also able to scale to very large corpora. Finally, we make our datasets and pre-trained models publicly available\footnote{To be released after the review period.}. This includes {\em Reviews2Movielens}, mapping the 1B word corpus of Amazon movie reviews to MovieLens tags, as well as Reddit Movie Suggestions containing natural language queries and corresponding community recommendations. View details