Congcong Li
            http://chenlab.ece.cornell.edu/people/congcong/
          
        
Authored Publications
        
          
            
Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Charles Qi, Visesh Chari, Andre Cornman, Yin Zhou, Dragomir Anguelov
          
          
          
            CVPR'2022, Workshop on Autonomous Driving, IEEE
          
          
        
        
        
          
          
          
3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many respects, including the 3D resolution and range of the data, the absence of dense depth maps, the failure modes of LiDAR, the relative placement of camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive.
In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over the camera-only 2D HPE baseline and a 6% improvement over the LiDAR-only model. Finally, careful ablation studies and parts-based analysis illustrate the advantages of each of our contributions.
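
The abstract does not spell out the training objective, but the core weak-supervision idea it describes (project the predicted 3D joints into the image and penalize the pixel error against the 2D labels) can be sketched briefly. Below is a minimal, hypothetical PyTorch sketch; all names, shapes, and the toy intrinsics are assumptions for illustration, not the paper's code.

import torch

def project_to_image(points_3d, K):
    # Pinhole projection: 3D keypoints in the camera frame -> 2D pixels.
    # points_3d: (N, J, 3); K: (3, 3) camera intrinsics. Returns (N, J, 2).
    cam = points_3d @ K.T
    return cam[..., :2] / cam[..., 2:3].clamp(min=1e-6)

def weak_2d_loss(pred_3d, labels_2d, visibility, K):
    # Weak supervision: project predicted 3D joints and penalize the pixel
    # distance to the 2D labels, counting only joints marked visible.
    pred_2d = project_to_image(pred_3d, K)
    err = (pred_2d - labels_2d).norm(dim=-1)          # (N, J)
    return (err * visibility).sum() / visibility.sum().clamp(min=1.0)

# Toy usage: 2 people, 14 joints each, placed ~5 m in front of the camera.
K = torch.tensor([[720.0, 0.0, 640.0],
                  [0.0, 720.0, 360.0],
                  [0.0, 0.0, 1.0]])
pred_3d = torch.rand(2, 14, 3) + torch.tensor([0.0, 0.0, 5.0])
labels_2d = project_to_image(torch.rand(2, 14, 3) + torch.tensor([0.0, 0.0, 5.0]), K)
visibility = torch.ones(2, 14)
print(weak_2d_loss(pred_3d, labels_2d, visibility, K))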
              
  
          
        
      
    
        
          
            
Improving 3D Object Detection through Progressive Population Based Augmentation
Shuyang Cheng, Zhaoqi Leng, Ekin Dogus Cubuk, Barret Richard Zoph, Chunyan Bai, Jiquan Ngiam, Vijay Vasudevan, Jon Shlens, Drago Anguelov
          
          
          
            ECCV'2020
          
          
        
        
        
          
          
          
Data augmentation has been widely adopted for object detection in 3D point clouds. Past efforts have focused on manually designing specific data augmentation methods for individual architectures; no work has attempted to automate the design of data augmentation in 3D detection problems, as is common in 2D camera-based computer vision. In this work, we present a first attempt to automate the design of data augmentation policies for 3D object detection. We describe an algorithm termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves the StarNet detector by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset, which is 20x larger than KITTI, indicate that PPBA continues to improve 3D object detection effectively. The magnitude of the improvements may be comparable to advances in 3D perception architectures, yet data augmentation incurs no cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient than baseline 3D detection models without augmentation, highlighting that 3D detection models can achieve competitive accuracy with far fewer labeled examples.
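
The abstract summarizes PPBA only at a high level. The sketch below illustrates the general shape of a progressive population-based search loop under that description: evaluate a population of augmentation policies, keep the winners, and mutate them within a shrinking range so the search space narrows around previously successful parameters. The operation names and the toy evaluate() function are hypothetical stand-ins, not the actual implementation.

import random

# Toy stand-ins for the point-cloud operations and ranges PPBA searches over.
OPS = {"rotate": (0.0, 0.5), "scale": (0.8, 1.2), "jitter": (0.0, 0.1)}

def random_policy():
    return {op: random.uniform(lo, hi) for op, (lo, hi) in OPS.items()}

def evaluate(policy):
    # Stand-in for "train a detector with this policy, return validation mAP".
    return -sum((v - 0.25) ** 2 for v in policy.values()) + random.gauss(0, 0.01)

def ppba_search(pop_size=8, iterations=10, shrink=0.7):
    population = [random_policy() for _ in range(pop_size)]
    widths = {op: hi - lo for op, (lo, hi) in OPS.items()}
    best_policy, best_score = None, float("-inf")
    for _ in range(iterations):
        scored = sorted(((evaluate(p), p) for p in population),
                        key=lambda sp: sp[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_policy = scored[0]
        # Exploit: keep the top half of the population.
        top = [p for _, p in scored[: pop_size // 2]]
        # Explore progressively: mutate the winners within a shrinking range,
        # narrowing the search space around parameters that worked before.
        widths = {op: w * shrink for op, w in widths.items()}
        population = [{op: v + random.uniform(-widths[op], widths[op])
                       for op, v in p.items()}
                      for p in top for _ in range(2)]
    return best_policy, best_score

print(ppba_search())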
              
  
          
        
      
    
        
          
            
Learning semantic relationships for better action retrieval in images
Vignesh Ramanathan, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Samy Bengio, Chuck Rosenberg, Li Fei-Fei
          
          
          
CVPR'2015
          
          
        
        
        
          
          
          
Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large, and it is difficult to obtain sufficient training examples for all actions. However, we can compensate for this sparsity in supervision by leveraging the rich semantic relationships between different actions: a single action is often composed of other smaller actions and is exclusive of certain others. We need a method that can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationships between actions and uses them to train better action retrieval models. Our model incorporates linguistic, visual, and logical-consistency-based cues to effectively identify these relationships. We train and test our model on a large-scale image dataset of human actions and show a significant improvement in mean AP compared to different baseline methods, including the HEX-graph approach of Deng et al. [8].
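
As a toy illustration of the joint setup the abstract describes, the sketch below combines a standard image-to-action ranking loss with one relationship-consistency term (an implication cue pulling a composed action toward the sub-action it contains). Everything here, including names, shapes, and the specific loss forms, is a simplified assumption, not the paper's actual model.

import torch
import torch.nn.functional as F

D, num_actions = 128, 50
img_proj = torch.nn.Linear(2048, D)              # image feature -> embedding
action_emb = torch.nn.Embedding(num_actions, D)  # one vector per action phrase

def retrieval_loss(img_feat, pos_action, margin=0.2):
    # Rank the labeled action above every other action for each image.
    img = F.normalize(img_proj(img_feat), dim=-1)    # (B, D)
    acts = F.normalize(action_emb.weight, dim=-1)    # (A, D)
    scores = img @ acts.T                            # (B, A)
    pos = scores.gather(1, pos_action[:, None])      # (B, 1)
    hinge = (margin + scores - pos).clamp(min=0)
    mask = torch.ones_like(scores).scatter_(1, pos_action[:, None], 0.0)
    return (hinge * mask).mean()                     # ignore the positive slot

def implication_loss(parent, child):
    # Consistency cue: pull a composed action ("riding a horse") toward the
    # sub-action it implies ("riding") in embedding space.
    acts = F.normalize(action_emb.weight, dim=-1)
    return (1 - (acts[parent] * acts[child]).sum(-1)).mean()

img_feat = torch.randn(32, 2048)
pos = torch.randint(0, num_actions, (32,))
parent, child = torch.tensor([0, 1]), torch.tensor([2, 3])
loss = retrieval_loss(img_feat, pos) + 0.1 * implication_loss(parent, child)
loss.backward()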
              
  