
Krishna Somandepalli

krishna.ai
Authored Publications
    VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Dan Kondratyuk
    Lijun Yu
    Xiuye Gu
    Rachel Hornung
    Hassan Akbari
    Ming-Chang Chiu
    Josh Dillon
    Agrim Gupta
    Meera Hahn
    Anja Hauth
    David Hendon
    Alonso Martinez
    Grant Schindler
    Huisheng Wang
    Jimmy Yan
    Xuan Yang
    Lu Jiang
    arXiv preprint (2023) (to appear)
    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
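    To make the architecture described in the abstract concrete, the sketch below shows a tiny decoder-only transformer trained with a next-token objective over a single shared vocabulary of discrete image/video/audio/text tokens. It is a minimal illustration of the general idea, not the VideoPoet implementation; the class name, vocabulary size, and dimensions are assumptions.

```python
# Minimal decoder-only LM over a shared multimodal token vocabulary (illustrative only).
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=16384, d_model=512, n_layers=6, n_heads=8, max_len=2048):
        super().__init__()
        # One shared vocabulary of discrete tokens produced by image/video/audio/text tokenizers.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) ids; a causal mask gives the decoder-only, next-token behaviour.
        t = tokens.shape[1]
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)  # next-token logits over the shared vocabulary

# Autoregressive pretraining objective: predict token t+1 from tokens up to t.
model = TinyMultimodalLM()
tokens = torch.randint(0, 16384, (2, 128))   # e.g. [text tokens | video tokens | audio tokens]
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
```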
    Abstract: Speech emotion recognition (SER) studies typically rely on costly emotion-labeled speech for training, making it difficult to scale methods to large speech datasets and nuanced emotion taxonomies. We present LanSER, a method that enables the use of unlabeled data by generating weak emotion labels via pre-trained large language models, which are then used for weakly supervised learning. For weak label generation, we utilize a textual entailment approach that selects the emotion label with the highest entailment score, given a transcript extracted from speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
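    The weak-labeling step described above, picking the emotion label with the highest entailment score for an ASR transcript, can be sketched with an off-the-shelf zero-shot NLI pipeline. The model name, emotion taxonomy, and hypothesis template below are illustrative assumptions, not the paper's choices.

```python
# Entailment-based weak emotion labeling from an ASR transcript (illustrative sketch).
from transformers import pipeline

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise", "neutral"]  # example taxonomy

# Zero-shot classification scores each candidate label by NLI entailment.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def weak_emotion_label(transcript: str) -> str:
    """Return the emotion whose entailment score against the transcript is highest."""
    result = classifier(
        transcript,
        candidate_labels=EMOTIONS,
        hypothesis_template="The speaker sounds {}.",
    )
    return result["labels"][0]  # labels are sorted by descending score

# In practice the transcript comes from an automatic speech recognition system.
print(weak_emotion_label("I can't believe we finally won the championship!"))
```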
    Abstract: Federated learning mitigates the need to store user data in a central datastore for machine learning tasks, and is particularly beneficial when working with sensitive user data or tasks. Although it has been used successfully for applications such as improving keyboard query suggestions, it has not been studied systematically for modeling affective computing tasks, which are often laden with subjective labels and high variability across individuals/raters, or even within the same participant. In this paper, we study the federated averaging algorithm FedAvg to model self-reported emotional experience and perception labels on a variety of speech, video and text datasets. We identify two learning paradigms that commonly arise in affective computing tasks: modeling of self-reports (user-as-client), and modeling of perceptual judgments such as labeling the sentiment of online comments (rater-as-client). In the user-as-client setting, we show that FedAvg generally performs on par with a non-federated model in classifying self-reports. In the rater-as-client setting, FedAvg consistently performed worse than its non-federated counterpart. We found that the performance of FedAvg degraded for classes where the inter-rater agreement was moderate to low. To address this finding, we propose an algorithm, FedRater, that learns client-specific label distributions in federated settings. Our experimental results show that FedRater not only improves the overall classification performance compared to FedAvg but also provides insights for estimating proxies of inter-rater agreement in distributed settings.
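    For reference, here is a minimal NumPy sketch of the FedAvg aggregation used in both learning paradigms: each client (a user in the user-as-client setting, a rater in the rater-as-client setting) runs local updates, and the server averages the resulting weights in proportion to client data size. The logistic-regression model and toy data are placeholders, not the paper's setup.

```python
# FedAvg sketch: local SGD on each client, then size-weighted averaging on the server.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=1):
    """One client's local SGD on a simple logistic-regression model."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        grad = X.T @ (p - y) / len(y)      # gradient of the log loss
        w -= lr * grad
    return w

def fed_avg(weights, client_data, rounds=10):
    """Average client updates, weighting by each client's number of examples."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in client_data:           # in practice, a sampled subset of clients
            updates.append(local_update(weights, X, y))
            sizes.append(len(y))
        sizes = np.array(sizes, dtype=float)
        weights = np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())
    return weights

# Toy setup: 3 clients (users or raters), 5 features each.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.integers(0, 2, 20).astype(float)) for _ in range(3)]
w = fed_avg(np.zeros(5), clients)
```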
    Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers
    Josh Belanich
    Brian Eoff
    ICML Expressive Vocalizations Workshop & Competition (2022)
    Abstract: This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes to mel-spectrogram representations of the vocal bursts, as is standard in the sound event detection literature. Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics, and they comprise our team's main submission to the MultiTask track. We then sought to characterize the headroom in the MultiTask track by applying a large pre-trained Conformer model that previously achieved state-of-the-art results on paralinguistic tasks such as speech emotion recognition and mask detection. We additionally investigated the relationship between the sub-tasks of emotional expression, country of origin, and age prediction, and discovered that the best-performing models are trained as single-task models, raising the question of whether the problem truly benefits from a multitask setting.
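    The first modeling approach in the abstract, image classifiers over mel-spectrograms with one output head per sub-task, can be sketched roughly as below. The torchvision ResNet-18 backbone, audio settings, and head sizes are illustrative assumptions, not the competition submission.

```python
# Multitask ResNet over log-mel spectrograms of vocal bursts (illustrative sketch).
import librosa
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VocalBurstResNet(nn.Module):
    def __init__(self, n_emotions=10, n_countries=4):
        super().__init__()
        backbone = resnet18(weights=None)
        # Spectrograms are single-channel "images", so swap the first conv layer.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.emotion_head = nn.Linear(512, n_emotions)   # emotional expression intensities
        self.country_head = nn.Linear(512, n_countries)  # country of origin
        self.age_head = nn.Linear(512, 1)                # age regression

    def forward(self, spec):                             # spec: (batch, 1, n_mels, frames)
        feats = self.backbone(spec)
        return self.emotion_head(feats), self.country_head(feats), self.age_head(feats)

def log_mel(path, sr=16000, n_mels=128):
    """Load a vocal burst and compute a log-scaled mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).float()[None, None]  # (1, 1, mels, frames)

model = VocalBurstResNet()
emotion, country, age = model(torch.randn(2, 1, 128, 160))  # dummy spectrogram batch
```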