Aren Jansen

I am currently a Research Scientist at Google, working in the Sound Understanding Group on machine learning for speech, music, and audio processing. Before joining Google, I was a Research Scientist at the Johns Hopkins University Human Language Technology Center of Excellence, an Assistant Research Professor in the Johns Hopkins Department of Electrical and Computer Engineering, and a faculty member of the Center for Language and Speech Processing. My research has explored a wide range of ML topics, including unsupervised/semi-supervised representation learning, information retrieval, content-based recommendation, latent structure discovery, time series modeling and analysis, and scalable algorithms for big data applications. See my personal website or my Google Scholar page for a full list of publications.
Authored Publications
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    Chris Donahue
    Dima Kuzmin
    Judith Li
    Kun Su
    Mauro Verzetti
    Qingqing Huang
    Yu Wang
    Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
    Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
    MusicLM: Generating Music From Text
    Andrea Agostinelli
    Mauro Verzetti
    Antoine Caillon
    Qingqing Huang
    Neil Zeghidour
    Christian Frank
    Under review (2023)
    We introduce MusicLM, a model that generates high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems in both audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and melody: it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions provided by human experts. Further links: samples, MusicCaps dataset.
    MuLan: A Joint Embedding of Music Audio and Natural Language
    Qingqing Huang
    Ravi Ganti
    Judith Yue Li
    Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR) (2022) (to appear)
    Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
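    The two-tower objective described above can be sketched as a symmetric contrastive loss over a batch of paired embeddings. This is a minimal illustrative numpy version, not MuLan's actual implementation; the function name and the temperature value are assumptions for the sketch.

    ```python
    import numpy as np

    def contrastive_loss(audio_emb, text_emb, temperature=0.07):
        """Symmetric batch contrastive loss for paired audio/text embeddings.

        audio_emb, text_emb: (batch, dim) arrays; row i of each forms a
        positive pair, and all other rows in the batch serve as negatives.
        """
        # L2-normalize so the dot product is cosine similarity.
        a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
        t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
        logits = a @ t.T / temperature   # (batch, batch) similarity matrix
        n = len(a)

        def xent(lg):
            # Cross-entropy against the diagonal (the true pairings).
            lg = lg - lg.max(axis=1, keepdims=True)
            logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
            return -logp[np.arange(n), np.arange(n)].mean()

        # Average the audio-to-text and text-to-audio directions.
        return 0.5 * (xent(logits) + xent(logits.T))
    ```

    Minimizing this loss pulls each recording's audio embedding toward its paired text embedding while pushing it away from the other captions in the batch, which is what enables the zero-shot tagging and cross-modal retrieval uses listed in the abstract.
    
    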
    Amyotrophic Lateral Sclerosis (ALS) disease progression is usually measured using the subjective, questionnaire-based revised ALS Functional Rating Scale (ALSFRS-R). A purely objective measure for tracking disease progression would be a powerful tool for evaluating real-world drug effectiveness and efficacy in clinical trials, as well as for identifying participants for cohort studies. Here we develop a machine learning based objective measure of ALS disease progression from voice samples and accelerometer measurements. The ALS Therapy Development Institute (ALS-TDI) collected a unique dataset of voice and accelerometer samples from 584 consented people living with ALS over four years. Participants carried out prescribed speaking and limb-based tasks. 542 participants contributed 5814 voice recordings and 350 contributed 13009 accelerometer samples, while simultaneously measuring ALSFRS-R. Using the data from 475 participants, we trained machine learning (ML) models, correlating voice with bulbar-related FRS scores and accelerometer data with limb-related scores. On the test set (n=109 participants), the voice models achieved an AUC of 0.86 (95% CI, 0.847-0.884), whereas the accelerometer models achieved a median AUC of 0.73. We used the models and self-reported ALSFRS-R scores to evaluate the real-world effects of edaravone, a drug recently approved for use in ALS, on 54 test participants. In the test cohort, the digital data input into the ML models produced objective measures of progression rates over the duration of the study that were consistent with self-reported scores. This demonstrates the value of these tools for assessing both disease progression and, potentially, drug effects. In this instance, edaravone treatment outcomes, both self-reported and digital-ML, were highly variable from person to person.
    Shared computational principles for language processing in humans and deep language models
    Ariel Goldstein
    Zaid Zada
    Eliav Buchnik
    Amy Price
    Bobbi Aubrey
    Samuel A. Nastase
    Harshvardhan Gazula
    Gina Choe
    Aditi Rao
    Catherine Kim
    Colton Casto
    Lora Fanda
    Werner Doyle
    Daniel Friedman
    Patricia Dugan
    Lucia Melloni
    Roi Reichart
    Sasha Devore
    Adeen Flinker
    Liat Hasenfratz
    Omer Levy
    Kenneth A. Norman
    Orrin Devinsky
    Uri Hasson
    Nature Neuroscience (2022)
    Departing from traditional linguistic models, advances in deep learning have resulted in a new type of predictive (autoregressive) deep language models (DLMs). Using a self-supervised next-word prediction task, these models generate appropriate linguistic responses in a given context. In the current study, nine participants listened to a 30-min podcast while their brain responses were recorded using electrocorticography (ECoG). We provide empirical evidence that the human brain and autoregressive DLMs share three fundamental computational principles as they process the same natural narrative: (1) both are engaged in continuous next-word prediction before word onset; (2) both match their pre-onset predictions to the incoming word to calculate post-onset surprise; (3) both rely on contextual embeddings to represent words in natural contexts. Together, our findings suggest that autoregressive DLMs provide a new and biologically feasible computational framework for studying the neural basis of language.
    Many speech applications require understanding aspects other than content, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. Generally useful paralinguistic speech representations offer one solution to these kinds of problems. In this work, we introduce a new state-of-the-art paralinguistic speech representation based on self-supervised training of a 600M+ parameter Conformer-based architecture. Linear classifiers trained on top of our best representation outperform previous results on 7 of 8 tasks we evaluate. We perform a larger comparison than has been done previously, both in terms of number of embeddings compared and number of downstream datasets evaluated on. Our analyses into the role of time demonstrate the importance of context window size for many downstream tasks. Furthermore, while the optimal representation is extracted internally in the network, we demonstrate stable high performance across several layers, allowing a single universal representation to reach near-optimal performance on all tasks.
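    The linear-probe evaluation protocol mentioned above (a linear classifier on frozen embeddings) can be sketched as follows. This is a generic illustrative implementation, not the paper's evaluation code; the function names and hyperparameters are assumptions.

    ```python
    import numpy as np

    def train_linear_probe(embeddings, labels, lr=0.5, steps=500):
        """Fit a logistic-regression probe on frozen embeddings (binary labels).

        The representation itself is never updated; only the linear weights
        are trained, so accuracy reflects the quality of the embedding.
        """
        X = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # bias column
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid predictions
            w -= lr * X.T @ (p - labels) / len(X)   # cross-entropy gradient step
        return w

    def probe_accuracy(w, embeddings, labels):
        X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
        return ((X @ w > 0) == labels).mean()
    ```

    A stronger representation makes the downstream task more linearly separable, so the probe's accuracy rises without any change to the probe itself.
    
    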
    Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To handle larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
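    The exponential cost the abstract refers to comes from MixIT's search over binary assignments of output sources to input mixtures. A brute-force sketch of that loss (illustrative numpy code with squared error, not the paper's SI-SNR-based implementation) makes the combinatorics concrete:

    ```python
    import itertools
    import numpy as np

    def mixit_loss(est_sources, mixtures):
        """Brute-force mixture invariant training (MixIT) loss.

        est_sources: (M, T) separated outputs; mixtures: (2, T) input mixtures.
        Tries every binary assignment of sources to mixtures and returns the
        minimum total squared reconstruction error. The 2**M search is the
        exponential bottleneck that the paper's least-squares approximation
        replaces.
        """
        M = est_sources.shape[0]
        best = np.inf
        for assign in itertools.product([0, 1], repeat=M):  # each source -> mix 0 or 1
            a = np.array(assign)
            recon0 = est_sources[a == 0].sum(axis=0)  # sum of sources assigned to mix 0
            recon1 = est_sources[a == 1].sum(axis=0)  # sum of sources assigned to mix 1
            err = ((mixtures[0] - recon0) ** 2).sum() + ((mixtures[1] - recon1) ** 2).sum()
            best = min(best, err)
        return best
    ```

    Because every extra output source doubles the number of assignments to try, replacing the enumeration with a relaxed least-squares solve projected back onto the binary constraint set is what makes larger source counts feasible.
    
    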
    Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
    Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
    A Convolutional Neural Network for Automated Detection of Humpback Whale Song in a Diverse, Long-Term Passive Acoustic Dataset
    Ann N. Allen
    Matt Harvey
    Karlina P. Merkens
    Carrie C. Wall
    Erin M. Oleson
    Frontiers in Marine Science, 8 (2021), p. 165
    Passive acoustic monitoring is a well-established tool for researching the occurrence, movements, and ecology of a wide variety of marine mammal species. Advances in hardware and data collection have exponentially increased the volumes of passive acoustic data collected, such that discoveries are now limited by the time required to analyze rather than collect the data. In order to address this limitation, we trained a deep convolutional neural network (CNN) to identify humpback whale song in over 187,000 h of acoustic data collected at 13 different monitoring sites in the North Pacific over a 14-year period. The model successfully detected 75 s audio segments containing humpback song with an average precision of 0.97 and average area under the receiver operating characteristic curve (AUC-ROC) of 0.992. The model output was used to analyze spatial and temporal patterns of humpback song, corroborating known seasonal patterns in the Hawaiian and Mariana Islands, including occurrence at remote monitoring sites beyond well-studied aggregations, as well as novel discovery of humpback whale song at Kingman Reef, at 5° North latitude. This study demonstrates the ability of a CNN trained on a small dataset to generalize well to a highly variable signal type across a diverse range of recording and noise conditions. We demonstrate the utility of active learning approaches for creating high-quality models in specialized domains where annotations are rare. These results validate the feasibility of applying deep learning models to identify highly variable signals across broad spatial and temporal scales, enabling new discoveries through combining large datasets with cutting-edge tools.