Yang Song
Authored Publications
Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
Jingxiao Zheng
Xinwei Shi
Alexander Gorban
Junhua Mao
Charles Qi
Visesh Chari
Andre Cornman
Yin Zhou
Dragomir Anguelov
CVPR'2022, Workshop on Autonomous Driving, IEEE
Abstract
3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many respects, including the 3D resolution and range of the data, the absence of dense depth maps, LiDAR failure modes, the relative placement of the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates collecting and annotating a large amount of 3D data for HPE in AV, which is time-consuming and expensive.

In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach that uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over the camera-only 2D HPE baseline and a 6% improvement over the LiDAR-only model. Finally, careful ablation studies and parts-based analysis illustrate the advantages of each of our contributions.
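To make the weak-supervision idea above concrete, here is a minimal sketch of how 2D keypoint labels could supervise 3D predictions: predicted 3D joints are projected into the image with the camera calibration and penalized against the annotated 2D keypoints. The helper names, the pinhole projection, and the plain L2 penalty are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def project_to_image(joints_3d, intrinsics, extrinsics):
    """Project Nx3 joints into pixel coordinates with a pinhole camera model.

    intrinsics: 3x3 matrix K; extrinsics: 3x4 [R|t] mapping world to camera frame.
    Hypothetical helper for illustration only.
    """
    homo = np.concatenate([joints_3d, np.ones((joints_3d.shape[0], 1))], axis=1)  # Nx4
    cam = homo @ extrinsics.T        # Nx3 points in the camera frame
    pix = cam @ intrinsics.T         # Nx3 homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]  # Nx2 pixel coordinates

def weak_2d_loss(pred_joints_3d, labels_2d, visibility, intrinsics, extrinsics):
    """Average pixel error between reprojected 3D predictions and 2D keypoint
    labels, ignoring joints that are not annotated in the image."""
    reproj = project_to_image(pred_joints_3d, intrinsics, extrinsics)
    err = np.linalg.norm(reproj - labels_2d, axis=1)
    return float((err * visibility).sum() / max(visibility.sum(), 1.0))

# Toy usage with a camera looking down the z-axis.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])
joints = np.array([[0.1, -0.2, 5.0], [0.0, 0.5, 6.0]])
labels = project_to_image(joints, K, Rt) + 2.0          # pretend labels are 2 px off
print(weak_2d_loss(joints, labels, np.ones(2), K, Rt))  # ~2.83 px
```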
Improving 3D Object Detection through Progressive Population Based Augmentation
Shuyang Cheng
Zhaoqi Leng
Barret Richard Zoph
Chunyan Bai
Jiquan Ngiam
Vijay Vasudevan
Jon Shlens
Drago Anguelov
ECCV'2020
Abstract
Data augmentation has been widely adopted for object detection in 3D point clouds. However, previous efforts have focused on manually designing specific data augmentation methods for individual architectures; no work has attempted to automate the design of data augmentation in 3D detection problems, as is common in 2D camera-based computer vision. In this work, we present a first attempt to automate the design of data augmentation policies for 3D object detection. We describe an algorithm termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves StarNet by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset, a dataset 20x larger than KITTI, indicate that PPBA continues to effectively improve 3D object detection. The magnitude of the improvements may be comparable to advances in 3D perception architectures, yet data augmentation incurs no cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient than baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples.
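As a rough illustration of the search procedure described above (not the authors' implementation), the sketch below mutates a small population of augmentation parameters, keeps the best-scoring candidate, and progressively narrows each parameter range around it. The parameter space and the `evaluate` callback are hypothetical stand-ins for training a detector and measuring validation accuracy.

```python
import random

def ppba_style_search(param_space, evaluate, population=4, iterations=5, shrink=0.8):
    """Toy progressive population-based search over augmentation magnitudes.

    param_space: {name: (low, high)} continuous ranges.
    evaluate: callable(params) -> validation score (higher is better); in
              practice this would train a detector with the augmentations.
    Ranges are narrowed around the best parameters found so far.
    """
    best_params, best_score = None, float("-inf")
    space = dict(param_space)
    for _ in range(iterations):
        candidates = []
        for _ in range(population):
            params = {k: random.uniform(lo, hi) for k, (lo, hi) in space.items()}
            candidates.append((evaluate(params), params))
        score, params = max(candidates, key=lambda c: c[0])
        if score > best_score:
            best_score, best_params = score, params
        # Narrow each range around the current best value (the "progressive" part).
        space = {
            k: (max(lo, best_params[k] - shrink * (hi - lo) / 2),
                min(hi, best_params[k] + shrink * (hi - lo) / 2))
            for k, (lo, hi) in space.items()
        }
    return best_params, best_score

# Toy usage: pretend the best augmentation is rotation ~10 degrees and scale ~1.05.
target = {"rotation_deg": 10.0, "scale": 1.05}
score = lambda p: -sum((p[k] - target[k]) ** 2 for k in p)
print(ppba_style_search({"rotation_deg": (0.0, 45.0), "scale": (0.8, 1.2)}, score))
```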
Geo-Aware Networks for Fine-Grained Recognition
Grace Chu
Brian Potetz
Weijun Wang
Andrew Howard
Fernando Andres Brucher
ICCV 2019
Abstract
Fine-grained recognition distinguishes among categories with subtle visual differences. To help identify fine-grained categories, information beyond the image itself has been used. However, there has been little effort to use geolocation information to improve fine-grained classification accuracy. Our contributions to this field are twofold. First, to the best of our knowledge, this is the first paper to systematically examine various ways of incorporating geolocation information into fine-grained image classification, from geolocation priors, to post-processing, to feature modulation. Second, since no existing fine-grained dataset has complete geolocation information, we introduce, and will make public, two fine-grained datasets with geolocation by providing complementary information to the existing popular datasets iNaturalist and YFCC100M. Results on these datasets show that the best geo-aware network achieves an 8.9% top-1 accuracy increase on iNaturalist and a 5.9% increase on YFCC100M, compared with image-only models. In addition, for small baseline models like MobileNet V2, the best geo-aware network gives 12.6% higher top-1 accuracy than the image-only model, even exceeding Inception V3 models without geolocation. Our work provides an incentive to use geolocation information to improve fine-grained recognition for both server and on-device models.
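One of the strategies mentioned above is feature modulation. A minimal sketch of what such modulation could look like, assuming a FiLM-style scale-and-shift conditioned on a simple latitude/longitude encoding, is shown below; the encoding, dimensions, and weights are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def geo_features(lat_lon):
    """Wrap latitude/longitude onto the unit circle so the encoding is continuous."""
    lat, lon = np.radians(lat_lon[..., 0]), np.radians(lat_lon[..., 1])
    return np.stack([np.sin(lat), np.cos(lat), np.sin(lon), np.cos(lon)], axis=-1)

def modulate(image_feat, lat_lon, w_scale, w_shift):
    """FiLM-style modulation: geolocation predicts a per-channel scale and shift."""
    g = geo_features(lat_lon)
    scale = 1.0 + g @ w_scale          # (batch, channels)
    shift = g @ w_shift
    return image_feat * scale + shift

# Toy usage: 2 images with 8-channel features, modulated by their capture locations.
feat = rng.normal(size=(2, 8))
locs = np.array([[37.77, -122.42], [40.71, -74.01]])      # San Francisco, New York
w_scale, w_shift = rng.normal(size=(4, 8)) * 0.1, rng.normal(size=(4, 8)) * 0.1
print(modulate(feat, locs, w_scale, w_shift).shape)        # (2, 8)
```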
Abstract
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
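In the published paper, the "simple formula" for the effective number of samples has the closed form E_n = (1 − β^n) / (1 − β), with hyperparameter β ∈ [0, 1). Below is a small sketch of deriving per-class weights from it; the normalization convention used here is one common choice, not necessarily the paper's.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights from the effective number of samples.

    E_n = (1 - beta**n) / (1 - beta); weights are 1 / E_n, normalized so they
    sum to the number of classes (a common convention for re-weighting losses).
    """
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(n)

# Long-tailed example: the rare class (10 samples) gets a much larger weight.
print(class_balanced_weights([5000, 500, 10]))
```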
The iNaturalist Species Classification and Detection Dataset
Grant Van Horn
Oisin Mac Aodha
Yin Cui
Alex Shepard
Pietro Perona
Serge Belongie
CVPR (2018)
Abstract
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real-world conditions, we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, from all over the world. Images were collected with different camera types, have varying image quality, feature a large class imbalance, and have been verified by multiple citizen scientists. We discuss the collection of the dataset and present extensive baseline experiments using state-of-the-art computer vision classification and detection models. Results show that current non-ensemble based methods achieve only 67% top-1 classification accuracy, illustrating the difficulty of the dataset. Specifically, we observe poor results for classes with small numbers of training examples, suggesting more attention is needed in low-shot learning.
Abstract
Transferring the knowledge learned from large-scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and is thus difficult to scale. In this work, we first tackle a problem in large-scale FGVC: our method won first place in the iNaturalist 2017 large-scale species classification challenge. Central to the success of our approach is a training scheme that uses higher image resolution and deals with the long-tailed distribution of the training data. Next, we study transfer learning via fine-tuning from large-scale datasets to small-scale, domain-specific FGVC datasets. We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure. Our proposed transfer learning outperforms ImageNet pre-training and obtains state-of-the-art results on multiple commonly used FGVC datasets.
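As an illustration of the Earth Mover's Distance idea, the sketch below computes the EMD between two domains, each summarized by a handful of weighted class-level feature vectors, by solving the transportation problem with SciPy's linear-programming solver. The feature summaries and weights are toy stand-ins for the paper's procedure, and the step that converts the distance into a similarity score is omitted here.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def earth_movers_distance(src_feats, src_weights, tgt_feats, tgt_weights):
    """EMD between two weighted sets of class-level feature vectors, solved as a
    transportation linear program (ground cost = Euclidean distance)."""
    cost = cdist(src_feats, tgt_feats)          # (m, n) ground distances
    m, n = cost.shape
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):                          # mass leaving source class i
        a_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                          # mass arriving at target class j
        a_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([src_weights, tgt_weights])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Toy "domains", each summarized by a few class-mean features with uniform mass.
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(3, 16)), rng.normal(size=(4, 16))
print(earth_movers_distance(src, np.full(3, 1 / 3), tgt, np.full(4, 1 / 4)))
```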
Speed and accuracy trade-offs for modern convolutional object detectors
Anoop Korattikara
Jonathan Huang
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
CVPR 2017, Honolulu, Hawaii
Abstract
The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN, R-FCN, and SSD systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme of this spectrum, where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end, in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
Learning Unified Embedding for Apparel Recognition
Yuan Li
Bo Wu
Chao-Yeh Chen
Xiao Zhang
ICCV Computational Fashion Workshop, IEEE (2017)
Abstract
In apparel recognition, specialized models (e.g. models trained for a particular vertical like dresses) can significantly outperform general models (i.e. models that cover a wide range of verticals). Therefore, deep neural network models are often trained separately for different verticals (e.g. [7]). However, using specialized models for different verticals is not scalable and is expensive to deploy. This paper addresses the problem of learning one unified embedding model for multiple object verticals (e.g. all apparel classes) without sacrificing accuracy. The problem is tackled from two aspects: training data and training difficulty. On the training data aspect, we find that for a single model trained with triplet loss, there is an accuracy sweet spot in terms of how many verticals are trained together. To ease the training difficulty, a novel learning scheme is proposed that uses the output of specialized models as learning targets, so that an L2 loss can be used instead of the triplet loss. This new loss makes training easier and allows more efficient use of the feature space. The end result is a unified model that achieves the same retrieval accuracy as a number of separate specialized models, while having the complexity of a single model. The effectiveness of our approach is shown in experiments.
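A minimal sketch of the learning-target idea described above: the unified model's embedding is regressed toward the embedding produced by the (frozen) specialized model for that vertical using an L2 loss, in place of a triplet loss. Function and variable names are illustrative; the real models are deep CNNs trained end to end.

```python
import numpy as np

def l2_distillation_loss(unified_embeddings, specialized_embeddings):
    """Mean squared L2 distance between the unified model's embeddings and the
    frozen specialized models' embeddings used as regression targets."""
    diff = unified_embeddings - specialized_embeddings
    return float(np.mean(np.sum(diff * diff, axis=1)))

# Toy batch: each item would be routed to the specialized model for its vertical
# (e.g. dresses vs. shoes) to produce its target embedding.
rng = np.random.default_rng(0)
unified = rng.normal(size=(4, 64))
targets = unified + rng.normal(scale=0.05, size=(4, 64))   # stand-in targets
print(l2_distillation_loss(unified, targets))
```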
G-RMI Object Detection
Anoop Korattikara
Jonathan Huang
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop, Amsterdam (2016)
Abstract
We present our submission to the COCO 2016 Object Detection challenge.
Abstract
In this paper we address the issue of output instability of deep neural networks: small perturbations in the visual input can significantly distort the feature embeddings and output of a neural network. Such instability affects many deep architectures with state-of-the-art performance on a wide range of computer vision tasks. We present a general stability training method to stabilize deep networks against small input distortions that result from various types of common image processing, such as compression, rescaling, and cropping. We validate our method by stabilizing the state-of-the-art Inception architecture [11] against these types of distortions. In addition, we demonstrate that our stabilized model gives robust state-of-the-art performance on large-scale near-duplicate detection, similar-image ranking, and classification on noisy datasets.
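A minimal sketch of what a stability objective of this kind could look like: the task loss is augmented with a penalty on the distance between the network's embedding of a clean input and of a perturbed copy (e.g. after JPEG compression or rescaling). The weighting and the plain L2 distance are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

def stability_loss(task_loss, clean_embedding, perturbed_embedding, alpha=0.1):
    """Task loss plus an L2 penalty tying the embedding of a perturbed copy of
    the input to the embedding of the clean input."""
    drift = np.sum((clean_embedding - perturbed_embedding) ** 2)
    return task_loss + alpha * drift

# Toy usage: embeddings of a clean image and its JPEG-compressed copy.
rng = np.random.default_rng(0)
f_clean = rng.normal(size=128)
f_jpeg = f_clean + rng.normal(scale=0.01, size=128)
print(stability_loss(task_loss=0.7, clean_embedding=f_clean, perturbed_embedding=f_jpeg))
```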