Jump to Content
Ting Liu

Ting Liu

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition
    Qitong Wang
    Liangzhe Yuan
    Xi Peng
    International Conference on Computer Vision (ICCV) (2023)
    Preview abstract We are concerned with a challenging scenario in unpaired multiview video learning. In this case, the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantic information of videos. To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, to fully leverage the semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms multiple existing view-alignment methods, under the more challenging scenario than typical paired or unpaired multimodal or multiview learning. View details
    Preview abstract This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub. View details
    Learning to Generate Image Embeddings with User-level Differential Privacy
    Maxwell D. Collins
    Yuxiao Wang
    Sewoong Oh
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023) (to appear)
    Preview abstract We consider training feature extractors with user-level differential privacy to map images to embeddings from large-scale supervised data. To achieve user-level differential privacy, federated learning algorithms are extended and applied to aggregate user partitioned data, together with sensitivity control and noise addition. We demonstrate a variant of federated learning algorithm with partial aggregation and private reconstruction can achieve strong privacy utility trade-offs. When a large scale dataset is provided, it is possible to train feature extractors with both strong utility and privacy guarantees by combining techniques such as public pretraining, virtual clients, and partial aggregation. View details
    Contextualized Spatial-Temporal Contrastive Learning with Self-Supervision
    Liangzhe Yuan
    Rui Qian
    Yin Cui
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), pp. 13977-13986
    Preview abstract Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While being very effective on learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task which requires the model to transform in-stance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on 6 datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB. View details
    View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
    Jennifer Jianing Sun
    Jiaping Zhao
    Liangzhe Yuan
    Yuxiao Wang
    Liang-Chieh Chen
    International Journal of Computer Vision, vol. 130 (2022), pp. 111-135
    Preview abstract Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints that make the recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well-studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and largely reduce the embedding dimension from stacking frame-based embeddings for efficient large-scale retrieval. Furthermore, in order to enable our embeddings to work with partially visible input, we further investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task. View details
    Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
    Jingxiao Zheng
    Xinwei Shi
    Alexander Gorban
    Junhua Mao
    Charles Qi
    Visesh Chari
    Andre Cornman
    Yin Zhou
    Dragomir Anguelov
    CVPR'2022, Workshop on Autonomous Driving, IEEE
    Preview abstract 3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, absence of dense depth maps, failure modes for LiDAR, relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive. In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over camera-only 2D HPE baseline, and 6% improvement over LiDAR-only model. Finally, careful ablation studies and parts based analysis illustrate the advantages of each of our contributions. View details
    COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
    Honglu Zhou
    Asim Kadav
    Aviv Shamsian
    Shijie Geng
    Farley Lai
    Mubbasir Kapadia
    Hans Peter Graf
    European Conference on Computer Vision (ECCV) (2022), pp. 249-266
    Preview abstract Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, prior works suffer from scene biases with privacy and ethical concerns. We only use the keypoint modality which reduces scene biases and prevents acquiring detailed visual data that may contain private or biased information of users. We improve the multiscale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and data augmentations tailored to the keypoint signals to aid model training. We demonstrate the model's strength and interpretability on two widely-used datasets (Volleyball and Collective Activity). COMPOSER achieves up to +5.4% improvement with just the keypoint modality. Code is available at https://github.com/hongluzhou/composer. View details
    Surrogate Gap Minimization Improves Sharpness Aware Training
    Juntang Zhuang
    Liangzhe Yuan
    Yin Cui
    Nicha C. Dvornek
    Sekhar Tatikonda
    James S. Duncan
    International Conference on Learning Representations (ICLR) (2022)
    Preview abstract The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a perturbed loss defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a surrogate gap, a measure equivalent to the dominant eigenvalue of Hessian at a local minimum when the radius of neighborhood (to derive the perturbed loss) is small. The surrogate gap is easy to compute and feasible for direct minimization during training. Based on the above observations, we propose Surrogate Gap Guided Sharpness-Aware Minimization (GSAM), a novel improvement over SAM with negligible computation overhead. Conceptually, GSAM consists of two steps: 1) a gradient descent like SAM to minimize the perturbed loss, and 2) an ascent step in the orthogonal direction (after gradient decomposition) to minimize the surrogate gap and yet not affect the perturbed loss. GSAM seeks a region with both small loss (by step 1) and low sharpness (by step 2), giving rise to a model with high generalization capabilities. Theoretically, we show the convergence of GSAM and provably better generalization than SAM.Empirically, GSAM consistently improves generalization (e.g., +3.2% over SAM and +5.4% over AdamW on ImageNet top-1 accuracy for ViT-B/32). Code is released at https://sites.google.com/view/gsam-iclr22/home. View details
    Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization
    Yuxiao Wang
    Jiaping Zhao
    Liangzhe Yuan
    Jennifer Jianing Sun
    Xi Peng
    Dimitris N. Metaxas
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
    Preview abstract We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization (CV-MIM) which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. We further propose two regularization terms to ensure disentanglement and smoothness of the learned representations. The resulting pose representations can be used for cross-view action recognition. To evaluate the power of the learned representations, in addition to the conventional fully-supervised action recognition settings, we introduce a novel task called single-shot cross-view action recognition. This task trains models with actions from only one single viewpoint while models are evaluated on poses captured from all possible viewpoints. We evaluate the learned representations on standard benchmarks for action recognition, and show that (i) CV-MIM performs competitively compared with the state-of-the-art models in the fully-supervised scenarios;(ii) CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting;(iii) and the learned representations can significantly boost the performance when reducing the amount of supervised training data. Our code is made publicly available at https://github. com/google-research/google-research/tree/master/poem. View details
    Preview abstract Videos can evoke a range of affective responses in viewers. The ability to predict evoked affect from a video, before viewers watch the video, can help in content creation and video recommendation. We introduce the Evoked Expressions from Videos (EEV) dataset, a large-scale dataset for studying viewer responses to videos. Each video is annotated at 6 Hz with 15 continuous evoked expression labels, corresponding to the facial expression of viewers who reacted to the video. We use an expression recognition model within our data collection framework to achieve scalability. In total, there are 36.7 million annotations of viewer facial reactions to 23,574 videos (1,700 hours). We use a publicly available video corpus to obtain a diverse set of video content. We establish baseline performance on the EEV dataset using an existing multimodal recurrent model. Transfer learning experiments show an improvement in performance on the LIRIS-ACCEDE video dataset when pre-trained on EEV. We hope that the size and diversity of the EEV dataset will encourage further explorations in video understanding and affective computing. A subset of EEV is released at https://github.com/google-research-datasets/eev. View details
    Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness
    Xi Peng
    Dimitris N. Metaxas
    Advances in Neural Information Processing Systems (NeurIPS) (2020)
    Preview abstract Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin. View details
    View-Invariant Probabilistic Embedding for Human Pose
    Jennifer Jianing Sun
    Jiaping Zhao
    Liang-Chieh Chen
    European Conference on Computer Vision (ECCV) (2020)
    Preview abstract Depictions of similar human body configurations can vary with changing viewpoints. Using only 2D information, we would like to enable vision algorithms to recognize similarity in human body poses across multiple views. This ability is useful for analyzing body movements and human behaviors in images and videos. In this paper, we propose an approach for learning a compact view-invariant embedding space from 2D joint keypoints alone, without explicitly predicting 3D poses. Since 2D poses are projected from 3D space, they have an inherent ambiguity, which is difficult to represent through a deterministic mapping. Hence, we use probabilistic embeddings to model this input uncertainty. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 2D-to-3D pose lifting models. We also demonstrate the effectiveness of applying our embeddings to view-invariant action recognition and video alignment. Our code is available at https://github.com/google-research/google-research/tree/master/poem. View details
    Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
    Bowen Cheng
    Maxwell D. Collins
    Yukun Zhu
    Thomas S. Huang
    Liang-Chieh Chen
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
    Preview abstract In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve comparable performance of two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts the dual-ASPP and dual-decoder structures specific to semantic, and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first at all three Cityscapes benchmarks, setting the new state-of-art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real-time with a single 1025×2049 image (15.8 frames per second), while achieving a competitive performance on Cityscapes (54.1 PQ% on test set). On Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate a bottom-up approach could deliver state-of-the-art results on panoptic segmentation. View details
    Weakly Supervised Action Localization by Sparse Temporal Pooling Network
    Phuc Nguyen
    Bohyung Han
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 6752-6761
    Preview abstract We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision. View details
    GLA in MediaEval 2018 Emotional Impact of Movies Task
    Jennifer Jianing Sun
    MediaEval 2018 Multimedia Benchmark Workshop
    Preview abstract The visual and audio information from movies can evoke a variety of emotions in viewers. Towards a better understanding of viewer impact, we present our methods for the MediaEval 2018 Emotional Impact of Movies Task to predict the expected valence and arousal continuously in movies. This task, using the LIRIS-ACCEDE dataset, enables researchers to compare different approaches for predicting viewer impact from movies. Our approach leverages image, audio, and face based features computed using pre-trained neural networks. These features were computed over time and modeled using a gated recurrent unit (GRU) based network followed by a mixture of experts model to compute multiclass predictions. We smoothed these predictions using a Butterworth filter for our final result. Our method enabled us to achieve top performance in three evaluation metrics in the MediaEval 2018 task. View details
    No Results Found