Brendan Jou

Authored Publications
Google Publications
Other Publications
    Speech emotion recognition (SER) studies typically rely on costly emotion-labeled speech for training, making it difficult to scale methods to large speech datasets and nuanced emotion taxonomies. We present LanSER, which enables the use of unlabeled data by generating weak emotion labels with pre-trained large language models, which are then used for weakly-supervised learning. For weak label generation, we use a textual entailment approach that selects the emotion label with the highest entailment score given a transcript extracted from the speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
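    Below is a minimal sketch of the weak-labeling step described in this abstract, assuming a Hugging Face zero-shot classification pipeline with an off-the-shelf NLI model as the entailment scorer; the model choice, hypothesis template, and emotion taxonomy are illustrative, not necessarily those used in the paper.

# Sketch of NLI-based weak emotion labeling from an ASR transcript.
# The model, hypothesis template, and taxonomy below are illustrative choices.
from transformers import pipeline

# Zero-shot classification scores each candidate label by textual entailment.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

EMOTIONS = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]

def weak_emotion_label(transcript: str) -> str:
    """Return the emotion whose hypothesis is most entailed by the transcript."""
    result = classifier(
        transcript,
        candidate_labels=EMOTIONS,
        hypothesis_template="The speaker of this utterance feels {}.",
    )
    return result["labels"][0]  # labels are sorted by entailment score, highest first

print(weak_emotion_label("I can't believe we finally won the championship!"))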
    DISSECT: Disentangled Simultaneous Explanations via Concept Traversals
    Chun-Liang Li
    Brian Eoff
    Rosalind Picard
    International Conference on Learning Representations (2022)
    Explaining deep learning model inferences is a promising avenue for scientific understanding, improving safety, uncovering hidden biases, evaluating fairness, and beyond, as argued by many scholars. One of the principal benefits of counterfactual explanations is allowing users to explore "what-if" scenarios through what does not and cannot exist in the data, a quality that many other forms of explanation, such as heatmaps and influence functions, are inherently incapable of providing. However, most previous work on generative explainability cannot disentangle important concepts effectively, produces unrealistic examples, or fails to retain relevant information. We propose a novel approach, DISSECT, that jointly trains a generator, a discriminator, and a concept disentangler to overcome such challenges using little supervision. DISSECT generates Concept Traversals (CTs), defined as a sequence of generated examples with increasing degrees of concepts that influence a classifier's decision. By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts automatically rather than relying on user-predefined concepts. We show that DISSECT produces CTs that (1) disentangle several concepts, (2) are influential to a classifier's decision and coupled to its reasoning due to joint training, (3) are realistic, (4) preserve relevant information, and (5) are stable across similar inputs. We validate DISSECT on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability, and show that it performs consistently well. Finally, we present experiments showing applications of DISSECT for detecting potential biases of a classifier and identifying spurious artifacts that impact predictions.
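    As a schematic illustration of a Concept Traversal at inference time, the toy sketch below varies a single concept coefficient through a stand-in generator and records the classifier's response; G and f are placeholders, since in DISSECT the generator, discriminator, and concept disentangler are trained jointly from the classifier's signal.

# Sketch: generating a Concept Traversal (CT) at inference time.
# G and f are toy placeholders standing in for a trained generator and classifier.
import numpy as np

def G(x: np.ndarray, alpha: float) -> np.ndarray:
    """Placeholder generator: re-synthesizes x with concept strength alpha."""
    return x + alpha * np.ones_like(x)          # toy "concept" direction

def f(x: np.ndarray) -> float:
    """Placeholder classifier score in [0, 1]."""
    return float(1.0 / (1.0 + np.exp(-x.mean())))

def concept_traversal(x, alphas=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Sequence of generated examples with increasing concept strength,
    paired with the classifier's score on each (showing its influence)."""
    return [(alpha, G(x, alpha), f(G(x, alpha))) for alpha in alphas]

x0 = np.zeros(4)
for alpha, _, score in concept_traversal(x0):
    print(f"alpha={alpha:.1f}  classifier score={score:.3f}")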
    Federated learning mitigates the need to store user data in a central datastore for machine learning tasks and is particularly beneficial when working with sensitive user data or tasks. Although successfully used for applications such as improving keyboard query suggestions, it has not been studied systematically for modeling affective computing tasks, which are often laden with subjective labels and high variability across individuals/raters, or even for the same participant. In this paper, we study the federated averaging algorithm FedAvg to model self-reported emotional experience and perception labels on a variety of speech, video and text datasets. We identify two learning paradigms that commonly arise in affective computing tasks: modeling of self-reports (user-as-client), and modeling perceptual judgments such as labeling the sentiment of online comments (rater-as-client). In the user-as-client setting, we show that FedAvg generally performs on par with a non-federated model in classifying self-reports. In the rater-as-client setting, FedAvg consistently performed worse than its non-federated counterpart. We found that the performance of FedAvg degraded for classes where the inter-rater agreement was moderate to low. To address this finding, we propose an algorithm, FedRater, that learns client-specific label distributions in federated settings. Our experimental results show that FedRater not only improves the overall classification performance compared to FedAvg but also provides insights for estimating proxies of inter-rater agreement in distributed settings.
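    For reference, a minimal sketch of one round of the FedAvg aggregation studied above: client updates are averaged with weights proportional to each client's local example count. The local training loop, client sampling, and the proposed FedRater extension are omitted, and the toy data is illustrative.

# Sketch: one round of federated averaging (FedAvg) over client model weights.
# Clients are represented only by their locally trained parameter vectors and
# example counts; local training itself is omitted.
import numpy as np

def fedavg_round(client_weights, client_sizes):
    """Aggregate client parameter vectors, weighting each client by the
    number of local examples it trained on."""
    total = float(sum(client_sizes))
    stacked = np.stack(client_weights)                    # (num_clients, dim)
    coeffs = np.array(client_sizes, dtype=float) / total  # (num_clients,)
    return (coeffs[:, None] * stacked).sum(axis=0)        # weighted mean

# Three hypothetical clients with different amounts of local data.
weights = [np.array([0.1, 0.2]), np.array([0.3, 0.0]), np.array([0.2, 0.4])]
sizes = [100, 50, 25]
print(fedavg_round(weights, sizes))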
    Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers
    Josh Belanich
    Brian Eoff
    ICML Expressive Vocalizations Workshop & Competition (2022)
    This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes to mel-spectrogram representations of the vocal bursts, as is standard in the sound event detection literature. Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics, and they comprise our team's main submission to the MultiTask track. We then sought to characterize the headroom in the MultiTask track by applying a large pre-trained Conformer model that previously achieved state-of-the-art results on paralinguistic tasks like speech emotion recognition and mask detection. We additionally investigated the relationship between the sub-tasks of emotional expression, country of origin, and age prediction, and discovered that the best-performing models are trained as single-task models, calling into question whether the problem truly benefits from a multitask setting.
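    A small sketch of the feature extraction implied by the first approach: converting a vocal-burst waveform into a log-mel spectrogram that can be fed to an image classification model. The librosa-based helper and its parameters are illustrative, not the competition pipeline.

# Sketch: waveform -> log-mel spectrogram "image" for a standard image classifier.
# Sample rate and mel resolution are illustrative defaults; the classifier is omitted.
import librosa
import numpy as np

def melspec_image(path: str, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    """Load audio and return a log-mel spectrogram suitable as 2D model input."""
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time_frames)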
    BasisNet: Two-Stage Model Synthesis for Efficient Inference
    Chun-Te Chu
    Andrew Howard
    Yukun Zhu
    Rebecca Hwa
    Adriana Kovashka
    CVPR Workshop on Efficient Deep Learning for Computer Vision (ECV) (2021)
    In this work, we present BasisNet, which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach incorporates a lightweight model to preview the input and generate input-dependent combination coefficients, which then control the synthesis of a more accurate specialist model that makes the final prediction. The two-stage model synthesis strategy can be applied to any network architecture, and both stages are trained jointly. We also show that proper training recipes are critical for increasing the generalizability of such high-capacity neural networks. On the ImageNet classification benchmark, BasisNet with MobileNets as the backbone demonstrated a clear advantage in the accuracy-efficiency trade-off over several strong baselines. Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations, halving the computational cost of the previous state of the art without sacrificing accuracy. With early termination, the average cost can be further reduced to 198M MAdds while maintaining 80.0% accuracy on ImageNet.
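    A toy sketch of the two-stage synthesis idea, reduced to a single linear layer: a lightweight preview model emits combination coefficients, which mix a set of basis weight matrices into an input-dependent specialist. Shapes, the softmax choice, and the stand-in models are illustrative only.

# Sketch: two-stage, input-dependent weight synthesis for one linear layer.
# Stage 1: a lightweight "preview" model produces combination coefficients.
# Stage 2: the coefficients mix basis weights into a specialist for this input.
import numpy as np

rng = np.random.default_rng(0)
NUM_BASES, IN_DIM, OUT_DIM = 4, 8, 3
bases = rng.normal(size=(NUM_BASES, OUT_DIM, IN_DIM))   # learned basis weights
preview_w = rng.normal(size=(NUM_BASES, IN_DIM))        # lightweight preview model

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def basisnet_forward(x: np.ndarray) -> np.ndarray:
    coeffs = softmax(preview_w @ x)                      # stage 1: coefficients
    specialist = np.tensordot(coeffs, bases, axes=1)     # stage 2: synthesized weights
    return specialist @ x                                # specialist prediction

print(basisnet_forward(rng.normal(size=IN_DIM)))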
    Understanding the degree to which human facial expressions co-vary with specific social contexts across cultures is central to the theory that emotions enable adaptive responses to important challenges and opportunities. Concrete evidence linking social context to specific facial expressions is sparse and is largely based on survey-based approaches, which are often constrained by language and small sample sizes. Here, by applying machine-learning methods to real-world, dynamic behaviour, we ascertain whether naturalistic social contexts (for example, weddings or sporting competitions) are associated with specific facial expressions across different cultures. In two experiments using deep neural networks, we examined the extent to which 16 types of facial expression occurred systematically in thousands of contexts in 6 million videos from 144 countries. We found that each kind of facial expression had distinct associations with a set of contexts that were 70% preserved across 12 world regions. Consistent with these associations, regions varied in how frequently different facial expressions were produced as a function of which contexts were most salient. Our results reveal fine-grained patterns in human facial expressions that are preserved across the modern world.
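    One way to picture the cross-regional comparison described above is sketched below: given per-region matrices of expression-context association scores, average the pairwise correlations between regions. The data, variable names, and preservation metric are illustrative, not the paper's exact analysis.

# Sketch: how consistent are expression-context associations across regions?
# Each region is a (num_expressions x num_contexts) matrix of association scores;
# we correlate every pair of regions and average.
from itertools import combinations
import numpy as np

def cross_region_preservation(region_assoc: dict) -> float:
    """Mean pairwise Pearson correlation of flattened association matrices."""
    corrs = []
    for a, b in combinations(sorted(region_assoc), 2):
        x, y = region_assoc[a].ravel(), region_assoc[b].ravel()
        corrs.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(corrs))

rng = np.random.default_rng(1)
shared = rng.normal(size=(16, 20))                 # 16 expressions x 20 contexts
regions = {f"region_{i}": shared + 0.5 * rng.normal(size=shared.shape) for i in range(12)}
print(f"mean cross-region correlation: {cross_region_preservation(regions):.2f}")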
    Characterizing Sources of Uncertainty to Proxy Calibration and Disambiguate Annotator and Data Bias
    Brian Eoff
    Rosalind Picard
    ICCV Workshop on Interpreting and Explaining Visual Artificial Intelligence Models (2019)
    Supporting model interpretability for complex phenomena where annotators can legitimately disagree, such as emotion recognition, is a challenging machine learning task. In this work, we show that explicitly quantifying the uncertainty in such settings has interpretability benefits. We use a simple modification of classical network inference, Monte Carlo dropout, to obtain measures of epistemic and aleatoric uncertainty. We identify a significant correlation between aleatoric uncertainty and human annotator disagreement (r ≈ .3). Additionally, we demonstrate how difficult and subjective training samples can be identified using aleatoric uncertainty and how epistemic uncertainty can reveal data bias that could result in unfair predictions. We identify the total uncertainty as a suitable surrogate for model calibration, i.e. the degree to which we can trust the model's predicted confidence. In addition to these explainability benefits, we observe modest performance boosts from incorporating model uncertainty.
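    A compact sketch of Monte Carlo dropout uncertainty estimation as described above, using a common information-theoretic decomposition of predictive entropy into aleatoric and epistemic parts; the toy model and the exact estimator are illustrative rather than the paper's implementation.

# Sketch: Monte Carlo dropout uncertainty for a classifier. Dropout stays active
# at inference; expected entropy proxies aleatoric uncertainty and the mutual
# information (total minus expected entropy) proxies epistemic uncertainty.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 3))

def mc_dropout_uncertainty(x: torch.Tensor, num_samples: int = 50):
    model.train()  # keep dropout active during inference
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(num_samples)]
        )                                                         # (samples, batch, classes)
    mean_probs = probs.mean(dim=0)
    total = -(mean_probs * mean_probs.log()).sum(dim=-1)          # predictive entropy
    aleatoric = -(probs * probs.log()).sum(dim=-1).mean(dim=0)    # expected entropy
    epistemic = total - aleatoric                                 # mutual information
    return mean_probs, epistemic, aleatoric

x = torch.randn(4, 10)
_, epistemic, aleatoric = mc_dropout_uncertainty(x)
print(epistemic, aleatoric)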
    Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks
    Víctor Campos
    Xavier Giró-i-Nieto
    Jordi Torres
    Shih-Fu Chang
    ICLR (2018)
    Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often faces challenges such as slow inference, vanishing gradients and difficulty in capturing long-term dependencies. In backpropagation-through-time settings, these issues are tightly coupled with the large, sequential computational graph that results from unfolding the RNN in time. We introduce the Skip RNN model, which extends existing RNN models by learning to skip state updates, shortening the effective size of the computational graph. The model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of baseline RNN models.
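    The sketch below illustrates the core mechanism: a binary gate that either applies the recurrent update or copies the previous state forward, so skipped steps shorten the effective computational graph. It simplifies the paper's formulation (no straight-through estimator or budget loss), and the cell design is illustrative.

# Sketch: a GRU cell with a learned skip gate. At each step the gate decides to
# update the state or copy it forward; when skipping, the update probability is
# accumulated so an update eventually fires.
import torch
import torch.nn as nn

class SkipGRUCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.update_gate = nn.Linear(hidden_size, 1)  # probability of updating

    def forward(self, x, h, u_prob):
        u = torch.round(u_prob)                       # 1 -> update the state, 0 -> skip
        h_new = self.cell(x, h)
        h = u * h_new + (1 - u) * h                   # update or copy the previous state
        delta = torch.sigmoid(self.update_gate(h))    # increment toward the next update
        u_prob_next = u * delta + (1 - u) * torch.clamp(u_prob + delta, max=1.0)
        return h, u_prob_next

cell = SkipGRUCell(8, 16)
h = torch.zeros(2, 16)
u_prob = torch.ones(2, 1)                             # force an update at the first step
for x in torch.randn(5, 2, 8):                        # 5 time steps, batch of 2
    h, u_prob = cell(x, h, u_prob)
print(h.shape)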
    Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks
    Víctor Campos
    Xavier Giró-i-Nieto
    Jordi Torres
    Shih-Fu Chang
    NeurIPS Time Series Workshop (2017)
    Sentiment Concept Embedding for Visual Affect Recognition
    Víctor Campos
    Xavier Giró-i-Nieto
    Jordi Torres
    Shih-Fu Chang
    Multimodal Behavior Analysis in the Wild, Academic Press (2019)
    Multilingual Visual Sentiment Concept Clustering and Analysis
    Nikolaos Pappas
    Miriam Redi
    Mercan Topkara
    Hongyi Liu
    Tao Chen
    Shih-Fu Chang
    International Journal of Multimedia Information Retrieval, vol. 6 (2017)
    A Survey of Multimodal Sentiment Analysis
    Mohammad Soleymani
    David Garcia
    Björn Schuller
    Shih-Fu Chang
    Maja Pantic
    Image and Vision Computing, vol. 65 (2017)
    From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction
    Víctor Campos
    Xavier Giró-i-Nieto
    Image and Vision Computing, vol. 65 (2017)
    SentiCart: Cartography and Geo-contextualization for Multilingual Visual Sentiment
    Margaret Yuying Qian
    Shih-Fu Chang
    ACM International Conference on Multimedia Retrieval (ICMR) (2016)
    Tamp: A Library for Compact Deep Neural Networks with Structured Matrices
    Bingchen Gong
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2016)
    Multilingual Visual Sentiment Concept Matching
    Nikolaos Pappas
    Miriam Redi
    Mercan Topkara
    Hongyi Liu
    Tao Chen
    Shih-Fu Chang
    ACM International Conference on Multimedia Retrieval (ICMR) (2016)
    Deep Cross Residual Learning for Multitask Visual Recognition
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2016)
    Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology
    Tao Chen
    Nikolaos Pappas
    Miriam Redi
    Mercan Topkara
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2015)
    Predicting Viewer Perceived Emotions in Animated GIFs
    Subhabrata Bhattacharya
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2014)
    News Rover: Exploring Topical Structures and Serendipity in Heterogeneous Multimedia News
    Hongzhi Li
    Joseph G. Ellis
    Daniel Morozoff-Abegauz
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2013)
    Robust Object Co-Detection
    Xin Guo
    Dong Liu
    Mojun Zhu
    Anni Cai
    Shih-Fu Chang
    Computer Vision and Pattern Recognition (CVPR) (2013)
    Structured Exploration of Who, What, When, and Where in Heterogeneous Multimedia News Sources
    Hongzhi Li
    Joseph G. Ellis
    Daniel Morozoff-Abegauz
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2013)