Congcong Li
http://chenlab.ece.cornell.edu/people/congcong/
Authored Publications
Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
Jingxiao Zheng
Xinwei Shi
Alexander Gorban
Junhua Mao
Charles Qi
Visesh Chari
Andre Cornman
Yin Zhou
Dragomir Anguelov
CVPR'2022, Workshop on Autonomous Driving, IEEE
3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, the absence of dense depth maps, failure modes for LiDAR, the relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive.
In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over the camera-only 2D HPE baseline and a 6% improvement over the LiDAR-only model. Finally, careful ablation studies and parts-based analysis illustrate the advantages of each of our contributions.
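To make the weak-supervision idea concrete, here is a minimal sketch of the kind of 2D reprojection loss it implies: predicted 3D joints are projected into the image with the camera intrinsics and penalized against 2D keypoint labels. This is an illustrative NumPy toy under a pinhole-camera assumption, not the paper's actual implementation; all names and numbers below are made up.

```python
import numpy as np

def project_to_image(joints_3d, K):
    """Project Nx3 camera-frame 3D joints to Nx2 pixel coordinates
    using a 3x3 pinhole intrinsics matrix K."""
    uvw = joints_3d @ K.T            # (N, 3) homogeneous image coords
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

def weak_2d_supervision_loss(pred_joints_3d, gt_joints_2d, vis, K):
    """L1 reprojection loss against 2D keypoint labels.
    `vis` masks joints that are not annotated/visible in the image."""
    pred_2d = project_to_image(pred_joints_3d, K)
    err = np.abs(pred_2d - gt_joints_2d).sum(axis=1)  # per-joint L1
    return (err * vis).sum() / max(vis.sum(), 1.0)

# Toy example: 3 joints and simple intrinsics (values invented).
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
pred  = np.array([[0.1, -0.2, 5.0], [0.0, 0.0, 5.0], [-0.1, 0.3, 5.2]])
gt_2d = np.array([[660.0, 320.0], [640.0, 360.0], [620.0, 418.0]])
vis   = np.array([1.0, 1.0, 0.0])  # third joint unlabeled
print(weak_2d_supervision_loss(pred, gt_2d, vis, K))
```

Because the loss needs only 2D annotations and camera calibration, a 3D-output model can be trained on images that were never labeled in 3D, which is the point of the weak-supervision setup.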
Improving 3D Object Detection through Progressive Population Based Augmentation
Shuyang Cheng
Zhaoqi Leng
Barret Richard Zoph
Chunyan Bai
Jiquan Ngiam
Vijay Vasudevan
Jon Shlens
Drago Anguelov
ECCV'2020
Data augmentation has been widely adopted for object detection in 3D point clouds. To date, efforts have focused on manually designing specific data augmentation methods for individual architectures; no work has attempted to automate the design of data augmentation in 3D detection problems, as is common in 2D camera-based computer vision. In this work, we present a first attempt to automate the design of data augmentation policies for 3D object detection. We describe an algorithm termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves StarNet by substantial margins on the moderate-difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset, a dataset 20x larger than KITTI, indicate that PPBA continues to effectively improve 3D object detection. The magnitude of the improvements may be comparable to advances in 3D perception architectures, yet data augmentation incurs no cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient than baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples.
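As a rough illustration of the search described above, the toy loop below keeps a population of augmentation policies, scores each one, and has the rest of the population adopt (and lightly mutate) the best parameters found so far. The parameter names and the stub `evaluate` function are hypothetical stand-ins; a real PPBA run would train and validate a 3D detector at that step, and the paper's progressive narrowing of the search space is only approximated here by the exploit/explore update.

```python
import random

# Hypothetical augmentation parameters, each with a search range.
SEARCH_SPACE = {
    "ground_truth_paste_prob": (0.0, 1.0),
    "global_rot_max_rad":      (0.0, 0.8),
    "point_dropout_prob":      (0.0, 0.3),
}

def sample_policy():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}

def evaluate(policy):
    """Stub for 'train briefly with this policy, return validation mAP'.
    A real run would train a 3D detector here; this is a toy objective."""
    return -sum((v - 0.4) ** 2 for v in policy.values()) + random.gauss(0, 0.01)

def ppba(pop_size=8, iterations=5, explore_frac=0.25):
    population = [sample_policy() for _ in range(pop_size)]
    for _ in range(iterations):
        scored = sorted(population, key=evaluate, reverse=True)
        elite = scored[: pop_size // 2]
        # Exploit: the rest of the population adopts the best parameters
        # discovered so far; explore: mutate a random subset around them.
        population = list(elite)
        while len(population) < pop_size:
            child = dict(random.choice(elite))
            for k, (lo, hi) in SEARCH_SPACE.items():
                if random.random() < explore_frac:
                    child[k] = min(hi, max(lo, child[k] + random.gauss(0, 0.05)))
            population.append(child)
    return max(population, key=evaluate)

print(ppba())
```

The appeal of this family of methods, as the abstract notes, is that the search cost is paid once during training: the discovered policy adds nothing at inference time.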
Learning semantic relationships for better action retrieval in images
Vignesh Ramanathan
Jia Deng
Wei Han
Zhen Li
Kunlong Gu
Samy Bengio
Chuck Rosenberg
Li Fei-Fei
CVPR (2015)
Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationships between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We need a method which can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationships between actions and uses them for training better action retrieval models. Our model incorporates linguistic, visual, and logical-consistency-based cues to effectively identify these relationships. We train and test our model on a large-scale image dataset of human actions. We show a significant improvement in mean AP compared to different baseline methods, including the HEX-graph approach from Deng et al. [8].
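One of the cues mentioned above is logical consistency between actions: subsumption (one action implying another) and mutual exclusion, in the spirit of the HEX-graph formalism. The sketch below checks a predicted label set against such relationships; the action names and relationship tables are invented for illustration and are not from the paper.

```python
# Hypothetical relationships in the spirit of a HEX graph:
# IMPLIES encodes subsumption (a composite action implies its parts),
# EXCLUDES encodes mutual exclusion between actions.
IMPLIES = {
    "riding_horse": {"sitting", "outdoors"},
    "running":      {"standing"},
}
EXCLUDES = {
    frozenset({"sitting", "standing"}),
    frozenset({"running", "sitting"}),
}

def consistent(labels):
    """Return True iff a set of predicted action labels respects the
    subsumption and exclusion relationships above."""
    closure = set(labels)
    changed = True
    while changed:  # take the transitive closure of IMPLIES
        changed = False
        for a in list(closure):
            for b in IMPLIES.get(a, ()):
                if b not in closure:
                    closure.add(b)
                    changed = True
    return not any(pair <= closure for pair in EXCLUDES)

print(consistent({"riding_horse"}))             # True
print(consistent({"running", "riding_horse"}))  # False: implied
# 'sitting' and 'standing' are mutually exclusive.
```

Constraints like these let a model rule out label combinations it has never seen jointly annotated, which is how relationship reasoning compensates for sparse supervision.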