Steven Hickson
Steven is also currently a PhD student at the Georgia Institute of Technology.
His work focuses on computer vision and machine learning for the semantic understanding of temporal environments. He has collaborated with both the Google Brain and Machine Intelligence and Perception teams.
Research Areas
Authored Publications
Sharing Decoders: Network Fission for Multi-task Pixel Prediction
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, IEEE/CVF (2022), pp. 3771-3780
We examine the benefits of splitting encoder-decoders for multi-task learning and showcase results on three tasks (semantics, surface normals, and depth) while adding very few FLOPS per task. Current hard parameter sharing methods for multi-task pixel-wise labeling use one shared encoder with separate decoders for each task. We generalize this notion and term the splitting of encoder-decoder architectures at different points fission. Our ablation studies on fission show that sharing most of the decoder layers in multi-task encoder-decoder networks results in improvement while adding far fewer parameters per task. Our proposed method trains faster, uses less memory, results in better accuracy, and uses significantly fewer floating point operations (FLOPS) than conventional multi-task methods, with additional tasks only requiring 0.017% more FLOPS than the single-task network.
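To make the idea of fission concrete, here is a minimal PyTorch sketch, not the paper's architecture: the layer sizes, encoder/decoder layout, and head widths are illustrative assumptions. It shares the encoder and the entire decoder, splitting only into tiny per-task 1x1 heads for semantics, surface normals, and depth.

```python
# Minimal sketch (not the paper's exact architecture) of "fission" late in the
# decoder: one shared encoder, a shared decoder, and tiny per-task heads.
import torch
import torch.nn as nn

class FissionNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # Shared encoder (downsampling).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Shared decoder (upsampling); sharing these layers is where most of
        # the per-task parameter savings come from.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        # The network "fissions" only at the final 1x1 convolutions, so each
        # extra task adds very few parameters and FLOPS.
        self.semantics_head = nn.Conv2d(32, num_classes, 1)
        self.normals_head = nn.Conv2d(32, 3, 1)
        self.depth_head = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        features = self.decoder(self.encoder(x))
        return {
            "semantics": self.semantics_head(features),
            "normals": self.normals_head(features),
            "depth": self.depth_head(features),
        }

preds = FissionNet()(torch.randn(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in preds.items()})
```

Because the split happens only at the final 1x1 convolutions in this sketch, each added task contributes just one small head, which is the intuition behind the low per-task overhead reported above.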
Floors are flat: Leveraging Semantics for Reliable and Real-Time Surface Normal Prediction
Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and fine-tuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved state-of-the-art results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
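Insight (3) amounts to masking the loss so that gradients flow only from labeled pixels. Below is a hedged PyTorch sketch of how that could look; the ignore conventions (label -1, all-zero normals), loss choices, and shapes are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of insight (3): jointly predict normals and semantics, but only
# backpropagate on pixels that carry valid labels for each task.
import torch
import torch.nn.functional as F

def masked_multitask_loss(sem_logits, sem_labels, normal_pred, normal_gt):
    # Semantic term: cross-entropy, ignoring pixels labeled -1 (assumed convention).
    sem_loss = F.cross_entropy(sem_logits, sem_labels, ignore_index=-1)

    # Normal term: cosine loss only where a valid ground-truth normal exists
    # (assumed convention: invalid pixels store an all-zero normal).
    valid = normal_gt.norm(dim=1) > 0                       # (B, H, W) mask
    cos = F.cosine_similarity(normal_pred, normal_gt, dim=1)
    normal_loss = (1.0 - cos)[valid].mean() if valid.any() else sem_loss.new_zeros(())

    return sem_loss + normal_loss

# Toy shapes: batch of 2, 21 classes, 64x64 predictions.
sem_logits = torch.randn(2, 21, 64, 64, requires_grad=True)
sem_labels = torch.randint(-1, 21, (2, 64, 64))
normals = torch.randn(2, 3, 64, 64, requires_grad=True)
gt_normals = F.normalize(torch.randn(2, 3, 64, 64), dim=1)
masked_multitask_loss(sem_logits, sem_labels, normals, gt_normals).backward()
```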
Eyemotion: Classifying facial expressions in VR using eye-tracking cameras
Nick Dufour
arXiv, https://arxiv.org/abs/1707.07204 (2017)
One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users. Hence, auxiliary means of sensing and conveying these expressions are needed. We present an algorithm to automatically infer expressions by analyzing only a partially occluded face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user's eyes captured from an IR gaze-tracking camera within a VR headset are sufficient to infer a select subset of facial expressions without the use of any fixed external camera. Using these inferences, we can generate dynamic avatars in real time that function as an expressive surrogate for the user. We propose a novel data collection pipeline as well as a novel approach for increasing CNN accuracy via personalization. Our results show a mean accuracy of 74% (F1 of 0.73) among 5 'emotive' expressions and a mean accuracy of 70% (F1 of 0.68) among 10 distinct facial action units, outperforming human raters.
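The abstract does not spell out the personalization step, so the following is only a loose PyTorch sketch of one plausible scheme, subtracting a per-user mean neutral eye image before classification; the architecture, input size, and normalization choice are all assumptions rather than the paper's method.

```python
# Loose sketch (assumptions, not the paper's pipeline): classify expressions
# from IR eye crops after removing the user's neutral appearance.
import torch
import torch.nn as nn

class EyeExpressionNet(nn.Module):
    def __init__(self, num_expressions=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),  # 1-channel IR image
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_expressions)

    def forward(self, eye_image, user_neutral_mean):
        # Personalization (assumed): subtract the user-specific neutral mean.
        x = eye_image - user_neutral_mean
        return self.classifier(self.backbone(x))

model = EyeExpressionNet()
eye = torch.rand(1, 1, 64, 64)                              # IR gaze-camera crop (size assumed)
neutral = torch.rand(8, 1, 64, 64).mean(0, keepdim=True)    # mean of the user's neutral frames
logits = model(eye, neutral)
```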
Object category learning and retrieval with weak supervision
NIPS Workshop on Learning With Limited Labeled Data (2017)
We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling, using only unlabeled object proposals. The key contributions of our work are 1) a k-means clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism, which would make the approach completely unsupervised.
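As a rough illustration of contribution 1), the sketch below stores cluster centroids as ordinary learnable parameters ("memory units") and optimizes a soft k-means objective jointly with the embedding. The exact objective, temperature, and toy embedding network are assumptions, not the paper's formulation.

```python
# Rough sketch (assumed formulation): differentiable clustering with centroids
# as network parameters, trained jointly with the feature embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepClustering(nn.Module):
    def __init__(self, feat_dim=64, num_clusters=10, temperature=0.1):
        super().__init__()
        # Toy embedding network for flattened object proposals (input size assumed).
        self.embed = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                                   nn.Linear(256, feat_dim))
        # Centroids are learnable parameters, i.e. the "memory units".
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.temperature = temperature

    def forward(self, proposals):
        z = F.normalize(self.embed(proposals.flatten(1)), dim=1)
        c = F.normalize(self.centroids, dim=1)
        # Soft assignments from scaled similarities; differentiable everywhere.
        assign = (z @ c.t() / self.temperature).softmax(dim=1)   # (N, K)
        # Soft k-means objective: expected squared distance to the centroids.
        dists = torch.cdist(z, c).pow(2)                          # (N, K)
        loss = (assign * dists).sum(dim=1).mean()
        return loss, assign

model = DeepClustering()
loss, assign = model(torch.rand(16, 3, 32, 32))
loss.backward()   # gradients flow into both the embedding and the centroids
```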
Unsupervised deep clustering for semantic object retrieval
Baylearn, http://www.baylearn.org/ (2017)
Learning a set of diverse and representative features from a large set of unlabeled data has long been an area of active research. We present a method that separates proposals of potential objects into semantic classes in an unsupervised manner. Our preliminary results show that different object categories emerge and can later be retrieved from test images. We propose a differentiable clustering approach which can be integrated with deep neural networks to learn semantic classes in an end-to-end fashion without manual class labeling.
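Once the clusters have been learned, retrieval can be as simple as assigning each embedded test proposal to its nearest centroid; the snippet below is an illustrative sketch under that assumption, not taken from the paper.

```python
# Illustrative retrieval sketch: return the test proposals whose nearest
# learned centroid matches the queried semantic class.
import torch

def retrieve(test_embeddings, centroids, query_cluster):
    # Hard assignment: nearest centroid per embedded proposal.
    assignments = torch.cdist(test_embeddings, centroids).argmin(dim=1)
    return (assignments == query_cluster).nonzero(as_tuple=True)[0]

# Toy data: 100 embedded proposals, 10 learned clusters, 64-d features.
embeddings = torch.randn(100, 64)
centroids = torch.randn(10, 64)
print(retrieve(embeddings, centroids, query_cluster=3))  # indices of retrieved proposals
```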