Krishna Somandepalli
krishna.ai
Authored Publications
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
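A minimal sketch of the decoder-only, next-token-prediction pattern described in the abstract, with text, image/video, and audio tokens flattened into one sequence. The vocabulary size, model dimensions, and random stand-in tokens are assumptions for illustration, not the released VideoPoet model or its tokenizers.

```python
# Sketch only (NOT the released VideoPoet): a decoder-only transformer trained
# with next-token prediction over a single sequence that concatenates tokenized
# text, image/video, and audio. All sizes and tokens below are assumptions.
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=65_536, d_model=512, n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)  # causal mask makes this decoder-only
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):  # ids: (batch, seq)
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(ids.device)
        return self.head(self.blocks(x, mask=mask))

# One pretraining step: predict every next token in the mixed-modality sequence.
model = TinyMultimodalLM()
ids = torch.randint(0, 65_536, (2, 128))  # stand-in for tokenized text + video + audio
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
loss.backward()
```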
LanSER: Language-Model Supported Speech Emotion Recognition
Taesik Gong
Josh Belanich
Brian Eoff
INTERSPEECH 2023 (to appear)
Speech emotion recognition (SER) studies typically rely on costly emotion-labeled speech for training, making it difficult to scale methods to large speech datasets and nuanced emotion taxonomies. We present LanSER, a method that enables the use of unlabeled data by generating weak emotion labels via pre-trained large language models, which are then used for weakly-supervised learning. For weak label generation, we use a textual entailment approach that selects the emotion label with the highest entailment score, given a transcript extracted from speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
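A minimal sketch of the weak-labeling step, assuming an off-the-shelf NLI checkpoint exposed through Hugging Face's zero-shot-classification pipeline; the checkpoint, emotion taxonomy, and transcript below are illustrative rather than the paper's exact setup.

```python
# Sketch of entailment-based weak emotion labeling (not the paper's exact models).
# An ASR transcript is scored against each candidate emotion with an NLI model;
# the highest-scoring label becomes the weak label for weakly-supervised SER.
from transformers import pipeline

# Assumed off-the-shelf NLI checkpoint; the paper's models and taxonomy differ.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

emotion_taxonomy = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]
transcript = "I can't believe we actually pulled this off!"  # output of an ASR system

result = classifier(transcript, candidate_labels=emotion_taxonomy,
                    hypothesis_template="The speaker feels {}.")
weak_label = result["labels"][0]        # emotion with the highest entailment score
print(weak_label, result["scores"][0])  # used as a training target for the SER model
```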
Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers
Josh Belanich
Brian Eoff
ICML Expressive Vocalizations Workshop & Competition (2022)
This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes to mel-spectrogram representations of the vocal bursts, as is standard in the sound event detection literature. Results from these models show an improvement of 21.24% over the baseline system with respect to the harmonic mean of the task metrics, and they constitute our team's main submission to the MultiTask track. We then sought to characterize the headroom in the MultiTask track by applying a large pre-trained Conformer model that previously achieved state-of-the-art results on paralinguistic tasks such as speech emotion recognition and mask detection. We additionally investigated the relationship between the sub-tasks of emotional expression, country of origin, and age prediction, and found that the best-performing models are trained as single-task models, calling into question whether the problem truly benefits from a multitask setting.
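A minimal sketch of the spectrogram-as-image recipe with per-task output heads; the ResNet-18 backbone, mel parameters, and head sizes are assumptions for illustration, not the submission's exact configuration.

```python
# Sketch of the mel-spectrogram + image-classifier, multi-head recipe.
# Sizes and task dimensions below are assumptions, not the submitted system.
import torch
import torch.nn as nn
import torchaudio
import torchvision

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

backbone = torchvision.models.resnet18(weights=None)
backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel spectrograms
backbone.fc = nn.Identity()  # expose the 512-d embedding

heads = nn.ModuleDict({
    "emotion": nn.Linear(512, 10),  # assumed number of emotional-expression targets
    "country": nn.Linear(512, 4),   # country of origin
    "age":     nn.Linear(512, 1),   # age regression
})

waveform = torch.randn(8, 16_000)          # a batch of 1-second vocal bursts
spec = to_db(mel(waveform)).unsqueeze(1)   # (batch, 1, n_mels, frames), treated as an image
emb = backbone(spec)
outputs = {name: head(emb) for name, head in heads.items()}
```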
Federated Learning for Affective Computing Tasks
Brian Eoff
Alan Cowen
Josh Belanich
IEEE (2022), pp. 1-8
Federated learning mitigates the need to store user data in a central datastore for machine learning tasks, and is particularly beneficial when working with sensitive user data or tasks. Although it has been used successfully for applications such as improving keyboard query suggestions, it has not been studied systematically for modeling affective computing tasks, which are often laden with subjective labels and high variability across individuals/raters, or even for the same participant. In this paper, we study the federated averaging algorithm (FedAvg) for modeling self-reported emotional experience and perception labels on a variety of speech, video, and text datasets. We identify two learning paradigms that commonly arise in affective computing tasks: modeling of self-reports (user-as-client), and modeling of perceptual judgments, such as labeling the sentiment of online comments (rater-as-client). In the user-as-client setting, we show that FedAvg generally performs on par with a non-federated model in classifying self-reports. In the rater-as-client setting, FedAvg consistently performs worse than its non-federated counterpart. We find that the performance of FedAvg degrades for classes where the inter-rater agreement is moderate to low. To address this finding, we propose FedRater, an algorithm that learns client-specific label distributions in federated settings. Our experimental results show that FedRater not only improves the overall classification performance compared to FedAvg but also provides insights for estimating proxies of inter-rater agreement in distributed settings.
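A minimal sketch of the FedAvg aggregation step studied in the paper, in which the server averages client parameters weighted by each client's number of examples; the toy model, client count, and data sizes are illustrative, and FedRater's client-specific label distributions are not reproduced here.

```python
# Minimal FedAvg sketch: each client trains locally on its own user's or rater's
# data, and the server averages parameters weighted by client example counts.
# The model, clients, and counts below are illustrative, not the paper's setup.
import copy
import torch
import torch.nn as nn

def fed_avg(global_model, client_states, client_sizes):
    """Weighted average of client state_dicts (the FedAvg aggregation step)."""
    total = sum(client_sizes)
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = sum(
            state[key] * (n / total) for state, n in zip(client_states, client_sizes)
        )
    global_model.load_state_dict(avg_state)
    return global_model

# One communication round with two hypothetical clients.
global_model = nn.Linear(40, 5)           # e.g. 40-d features -> 5 affect classes
client_states, client_sizes = [], []
for n_examples in (120, 300):             # per-client data sizes (assumed)
    local = copy.deepcopy(global_model)
    # ... local SGD on this client's self-reports or ratings would go here ...
    client_states.append(local.state_dict())
    client_sizes.append(n_examples)
global_model = fed_avg(global_model, client_states, client_sizes)
```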