Yang Song

Authored Publications
    Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
    Jingxiao Zheng
    Xinwei Shi
    Alexander Gorban
    Junhua Mao
    Charles Qi
    Visesh Chari
    Andre Cornman
    Yin Zhou
    Dragomir Anguelov
    CVPR'2022, Workshop on Autonomous Driving, IEEE
    3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, absence of dense depth maps, failure modes for LiDAR, relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive. In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over a camera-only 2D HPE baseline, and a 6% improvement over a LiDAR-only model. Finally, careful ablation studies and part-based analysis illustrate the advantages of each of our contributions.
    Improving 3D Object Detection through Progressive Population Based Augmentation
    Shuyang Cheng
    Zhaoqi Leng
    Barret Richard Zoph
    Chunyan Bai
    Jiquan Ngiam
    Vijay Vasudevan
    Jon Shlens
    Drago Anguelov
    ECCV'2020
    Data augmentation has been widely adopted for object detection in 3-D point clouds. All efforts have focused on manually designing specific data augmentation methods for individual architectures; however, no work has attempted to automate the design of data augmentation in 3-D detection problems, as is common in 2-D camera-based computer vision. In this work, we present a first attempt to automate the design of data augmentation policies for 3-D object detection. We describe an algorithm termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space, and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves StarNet by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset, a 20x larger dataset compared to KITTI, indicate that PPBA continues to effectively improve 3D object detection. The magnitude of the improvements may be comparable to advances in 3-D perception architectures, yet data augmentation incurs no cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient than baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples.
    Class-Balanced Loss Based on Effective Number of Samples
    Yin Cui
    Menglin Jia
    Tsung-Yi Lin
    Serge Belongie
    CVPR (2019)
    With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
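    The abstract above does not spell out the "simple formula". As a minimal sketch, assuming the form given in the paper itself, the effective number of samples for a class with n examples is E_n = (1 - β^n) / (1 - β), where β in [0, 1) is a hyperparameter, and each class is weighted in proportion to 1 / E_n. The function names, the default β, and the normalization (weights summing to the number of classes) are illustrative assumptions, not part of this listing.

```python
def effective_number(n, beta):
    """Effective number of samples E_n = (1 - beta**n) / (1 - beta).

    n is the raw sample count of a class; beta must lie in [0, 1).
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    if n <= 0:
        raise ValueError("n must be a positive sample count")
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weights(counts, beta=0.999):
    """Per-class loss weights proportional to 1 / E_n.

    Assumption: weights are rescaled so they sum to the number of classes,
    which keeps the overall loss scale comparable to the unweighted case.
    """
    raw = [1.0 / effective_number(n, beta) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


if __name__ == "__main__":
    # Hypothetical long-tailed class counts, for illustration only.
    counts = [5000, 500, 50, 5]
    for n, w in zip(counts, class_balanced_weights(counts)):
        print(f"count={n:5d}  weight={w:.4f}")
```

    With β = 0 every class gets the same weight (no re-balancing), and as β approaches 1 the weights approach inverse class frequency, so β interpolates between the two.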
    Geo-Aware Networks for Fine-Grained Recognition
    Grace Chu
    Brian Potetz
    Weijun Wang
    Andrew Howard
    Fernando Andres Brucher
    ICCV 2019
    Fine-grained recognition distinguishes among categories with subtle visual differences. To help identify fine-grained categories, other information besides images has been used. However, there has been little effort on using geolocation information to improve fine-grained classification accuracy. Our contributions to this field are twofold. First, to the best of our knowledge, this is the first paper to systematically examine various ways of incorporating geolocation information into fine-grained image classification, from geolocation priors, to post-processing, to feature modulation. Second, to overcome the situation where no fine-grained dataset has complete geolocation information, we introduce, and will make public, two fine-grained datasets with geolocation by providing complementary information to existing popular datasets: iNaturalist and YFCC100M. Results on these datasets show that the best geo-aware network can achieve an 8.9% top-1 accuracy increase on iNaturalist and a 5.9% increase on YFCC100M, compared with image-only models. In addition, for small image baseline models like MobileNet V2, the best geo-aware network gives 12.6% higher top-1 accuracy than the image-only model, achieving even higher performance than Inception V3 models without geolocation. Our work gives incentives to use geolocation information to improve fine-grained recognition for both server and on-device models.
    Transferring the knowledge learned from large scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and is thus difficult to scale. In this work, we first tackle a problem in large scale FGVC. Our method won first place in the iNaturalist 2017 large scale species classification challenge. Central to the success of our approach is a training scheme that uses higher image resolution and deals with the long-tailed distribution of training data. Next, we study transfer learning via fine-tuning from large scale datasets to small scale, domain-specific FGVC datasets. We propose a measure to estimate domain similarity via Earth Mover’s Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure. Our proposed transfer learning outperforms ImageNet pre-training and obtains state-of-the-art results on multiple commonly used FGVC datasets.
    The iNaturalist Species Classification and Detection Dataset
    Grant Van Horn
    Oisin Mac Aodha
    Yin Cui
    Alex Shepard
    Pietro Perona
    Serge Belongie
    CVPR (2018)
    Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real world conditions we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, from all over the world. Images were collected with different camera types, have varying image quality, feature a large class imbalance, and have been verified by multiple citizen scientists. We discuss the collection of the dataset and present extensive baseline experiments using state-of-the-art computer vision classification and detection models. Results show that current non-ensemble based methods achieve only 67% top one classification accuracy, illustrating the difficulty of the dataset. Specifically, we observe poor results for classes with small numbers of training examples, suggesting more attention is needed in low-shot learning.
    Learning Unified Embedding for Apparel Recognition
    Yuan Li
    Bo Wu
    Xiao Zhang
    ICCV Computational Fashion Workshop, IEEE (2017)
    In apparel recognition, specialized models (e.g. models trained for a particular vertical like dresses) can significantly outperform general models (i.e. models that cover a wide range of verticals). Therefore, deep neural network models are often trained separately for different verticals (e.g. [7]). However, using specialized models for different verticals is not scalable and expensive to deploy. This paper addresses the problem of learning one unified embedding model for multiple object verticals (e.g. all apparel classes) without sacrificing accuracy. The problem is tackled from two aspects: training data and training difficulty. On the training data aspect, we find that for a single model trained with triplet loss, there is an accuracy sweet spot in terms of how many verticals are trained together. To ease the training difficulty, a novel learning scheme is proposed by using the output from specialized models as learning targets so that L2 loss can be used instead of triplet loss. This new loss makes the training easier and makes more efficient use of the feature space possible. The end result is a unified model which can achieve the same retrieval accuracy as a number of separate specialized models, while having the model complexity of one. The effectiveness of our approach is shown in experiments.
    The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN, R-FCN and SSD systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
    In this paper we address the issue of output instability of deep neural networks: small perturbations in the visual input can significantly distort the feature embeddings and output of a neural network. Such instability affects many deep architectures with state-of-the-art performance on a wide range of computer vision tasks. We present a general stability training method to stabilize deep networks against small input distortions that result from various types of common image processing, such as compression, rescaling, and cropping. We validate our method by stabilizing the state-of-the-art Inception architecture [11] against these types of distortions. In addition, we demonstrate that our stabilized model gives robust state-of-the-art performance on large-scale near-duplicate detection, similar-image ranking, and classification on noisy datasets.
    G-RMI Object Detection
    Anoop Korattikara
    Menglong Zhu
    Vivek Rathod
    Zbigniew Wojna
    2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop, Amsterdam (2016)
    We present our submission to the COCO 2016 Object Detection challenge.
    Learning semantic relationships for better action retrieval in images
    Vignesh Ramanathan
    Jia Deng
    Wei Han
    Zhen Li
    Kunlong Gu
    Samy Bengio
    Chuck Rosenberg
    Li Fei-Fei
    CVPR (2015)
    Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationship between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We need a method which can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationship between actions and uses them for training better action retrieval models. Our model incorporates linguistic, visual and logical consistency based cues to effectively identify these relationships. We train and test our model on a large-scale image dataset of human actions. We show a significant improvement in mean AP compared to different baseline methods including the HEX-graph approach from Deng et al. [8].
    Learning Fine-grained Image Similarity with Deep Ranking
    Jiang Wang
    Chuck Rosenberg
    Jingbin Wang
    James Philbin
    Bo Chen
    Ying Wu
    CVPR'2014, IEEE
    Learning fine-grained image similarity is a challenging task. It needs to capture between-class and within-class image differences. This paper proposes a deep ranking model that employs deep learning techniques to learn a similarity metric directly from images. It has higher learning capability than models based on hand-crafted features. A novel multiscale network structure has been developed to describe the images effectively. An efficient triplet sampling algorithm is proposed to learn the model with distributed asynchronized stochastic gradient. Extensive experiments show that the proposed algorithm outperforms models based on hand-crafted visual features and deep classification models.
    Improving Video Classification via YouTube Video Co-Watch Data
    John Zhang
    ACM Workshop on Social and Behavioural Networked Media Access at ACM MM 2011, ACM
    YouTubeEvent: On Large-Scale Video Event Classification
    Bingbing Ni
    The 3rd International Workshop on Video Event Categorization, Tagging and Retrieval for Real-World Applications at IEEE ICCV'2011
    Taxonomic Classification for Web-based Videos
    Xiaoyun Wu
    IEEE Conf on Computer Vision and Pattern Recognition (CVPR), IEEE (2010)
    YouTubeCat: Learning to Categorize Wild Web Videos
    Zheshen Wang
    Baoxin Li
    IEEE Conf on Computer Vision and Pattern Recognition (CVPR) (2010)
    A Large-Scale Taxonomic Classification System for Web-based Videos
    Reto Strobl
    John Zhang
    the 11th European Conference on Computer Vision (ECCV 2010)
    Tour the World: building a web-scale landmark recognition engine
    Yantao Zheng
    Ulrich Buddemeier
    Fernando Brucher
    Tat-Seng Chua
    International Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
    Tour the world: a technical demonstration of a web-scale landmark recognition engine
    Yan-Tao Zheng
    Ulrich Buddemeier
    Fernando Brucher
    Tat-Seng Chua
    MM '09: Proceedings of the seventeen ACM international conference on Multimedia, ACM, New York, NY, USA (2009), pp. 961-962
    Context-aided Human Recognition - Clustering
    Proc. of European Conference on Computer Vision (2006)
    Unsupervised Learning of Human Motion Models
    Luis Goncalves
    Pietro Perona
    IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25 (2003), pp. 814-827
    A Probabilistic Approach to Human Motion Detection and Labeling
    Ph.D. Thesis, California Institute of Technology (2003)
    Monocular Perception of Biological Motion in Johansson Displays
    Luis Goncalves
    E. Di Bernardo
    Pietro Perona
    Computer Vision and Image Understanding, vol. 81 (2001), pp. 303-327
    Unsupervised Learning of Human Motion Models
    Luis Goncalves
    Pietro Perona
    Advances in Neural Information Processing Systems 14 (2001)
    Learning Probabilistic Structure for Human Motion Detection
    Luis Goncalves
    Pietro Perona
    Proc. IEEE Conf. Computer Vision and Pattern Recognition (2001)
    A computational model for motion detection and direction discrimination in humans
    Pietro Perona
    IEEE computer society workshop on Human Motion (2000), pp. 11-16
    Towards Detection of Human Motion
    Xiaolin Feng
    Pietro Perona
    Proc. IEEE Conf. Computer Vision and Pattern Recognition (2000)
    Monocular perception of biological motion - clutter and partial occlusion
    Luis Goncalves
    Pietro Perona
    Proc. of 6th European Conference on Computer Vision (2000)
    Monocular perception of biological motion - detection and labeling
    L. Goncalves
    E. Di Bernardo
    P. Perona
    Proc. of 7th International Conference on Computer Vision (1999), pp. 805-812