Karthik Raveendran
Authored Publications
Sharing Decoders: Network Fission for Multi-task Pixel Prediction
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, IEEE/CVF (2022), pp. 3771-3780
We examine the benefits of splitting encoder-decoders for multi-task learning and showcase results on three tasks (semantics, surface normals, and depth) while adding very few FLOPS per task. Current hard parameter sharing methods for multi-task pixel-wise labeling use one shared encoder with separate decoders for each task. We generalize this notion and term the splitting of encoder-decoder architectures at different points as fission. Our ablation studies on fission show that sharing most of the decoder layers in multi-task encoder-decoder networks improves accuracy while adding far fewer parameters per task. Our proposed method trains faster, uses less memory, achieves better accuracy, and uses significantly fewer floating point operations (FLOPS) than conventional multi-task methods, with each additional task requiring only 0.017% more FLOPS than the single-task network.
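For intuition, here is a minimal sketch of the fission idea, not the paper's code: the backbone, channel widths, and task heads below are illustrative assumptions. One encoder feeds one mostly shared decoder trunk, which splits only at the end into lightweight per-task heads, so each extra task adds almost no computation.

```python
import torch
import torch.nn as nn

class SharedDecoderNet(nn.Module):
    """Illustrative 'fission' model: one encoder, one shared decoder trunk,
    and tiny per-task heads. Channel sizes and depths are placeholders."""

    def __init__(self, num_classes=21):
        super().__init__()
        # Shared encoder (stand-in for a light-weight backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Shared decoder trunk: most decoder parameters and FLOPS live here.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Per-task heads: the only layers duplicated per task (1x1 convolutions).
        self.heads = nn.ModuleDict({
            "semantics": nn.Conv2d(32, num_classes, 1),
            "normals":   nn.Conv2d(32, 3, 1),
            "depth":     nn.Conv2d(32, 1, 1),
        })

    def forward(self, x):
        features = self.decoder(self.encoder(x))
        return {task: head(features) for task, head in self.heads.items()}

# Example: three dense predictions from a single shared forward pass.
outputs = SharedDecoderNet()(torch.randn(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in outputs.items()})
```

Because the per-task heads here are single 1x1 convolutions, essentially all decoder computation is shared, which is the property the abstract's 0.017% figure refers to.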
Efficient Heterogeneous Video Segmentation at the Edge
Jamie Lin
Siargey Pisarchyk
David Cong Tian
Tingbo Hou
Sixth Workshop on Computer Vision for AR/VR (CV4ARVR) (2022)
We introduce an efficient video segmentation system for resource-limited edge devices that leverages heterogeneous compute. Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures and operations on top of already light-weight backbones, targeting commercially available edge inference engines. We further analyze and optimize the heterogeneous data flows in our system across the CPU, the GPU, and the NPU. In practice, our approach integrates well into our real-time AR system, delivering remarkably higher accuracy at quadrupled effective resolution while achieving shorter end-to-end latency, higher frame rates, and lower power consumption on edge platforms.
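As a conceptual illustration only (the processor assignments, stage functions, and queue sizes below are assumptions, not the system described above), a heterogeneous data flow can be modeled as producer-consumer stages that overlap work on different processors:

```python
import queue
import threading

# Conceptual sketch of a heterogeneous, pipelined data flow: each stage is
# assumed to run on a different processor (CPU preprocessing, NPU inference,
# GPU compositing), and stages are overlapped so a new frame starts before
# the previous one has finished the whole pipeline.

def run_stage(stage_fn, in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:          # poison pill shuts the pipeline down
            out_q.put(None)
            break
        out_q.put(stage_fn(item))

def preprocess(frame):            # placeholder for CPU-side resize/normalize
    return f"tensor({frame})"

def infer(tensor):                # placeholder for the NPU segmentation model
    return f"mask({tensor})"

def render(mask):                 # placeholder for GPU compositing
    return f"composited({mask})"

frames_q, tensors_q, masks_q, out_q = (queue.Queue(maxsize=2) for _ in range(4))
stages = [(preprocess, frames_q, tensors_q),
          (infer, tensors_q, masks_q),
          (render, masks_q, out_q)]
threads = [threading.Thread(target=run_stage, args=s, daemon=True) for s in stages]
for t in threads:
    t.start()

for i in range(5):                # feed a few frames, then stop
    frames_q.put(f"frame{i}")
frames_q.put(None)

while (result := out_q.get()) is not None:
    print(result)
```

The bounded queues model backpressure between processors; in a real system each stage would dispatch to its own accelerator rather than a Python thread.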
Attention Mesh: High-fidelity Face Mesh Prediction in Real-time
We present Attention Mesh, a lightweight architecture for 3D face mesh prediction that uses attention to semantically meaningful regions. Our neural network is designed for real-time on-device inference and runs at over 50 FPS on a Pixel 2 phone. This solution enables applications like AR makeup, eye tracking, and puppeteering that rely on highly accurate landmarks for the eye and lip regions. Our main contribution is a unified network architecture that achieves the same accuracy on facial landmarks as a multi-stage cascaded approach, while being 30 percent faster.
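A rough sketch of the region-attention idea, assuming grid-sample-based crops and made-up region heads (this is not the released model): feature patches are resampled around estimated eye and lip centers and refined by small per-region heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def crop_region(features, center, size=0.25, out_hw=16):
    """Resample a square feature patch around `center` (normalized [-1, 1] xy)."""
    ys = torch.linspace(-size, size, out_hw)
    xs = torch.linspace(-size, size, out_hw)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1) + center      # (H, W, 2)
    grid = grid.unsqueeze(0).expand(features.shape[0], -1, -1, -1)
    return F.grid_sample(features, grid, align_corners=False)

class RegionHead(nn.Module):
    """Tiny head that refines landmarks inside one cropped region."""
    def __init__(self, channels, num_points):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(channels, 32, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(),
                                 nn.Linear(32, num_points * 2))
    def forward(self, patch):
        return self.net(patch)

features = torch.randn(1, 64, 64, 64)          # shared backbone feature map
left_eye_center = torch.tensor([-0.3, -0.2])   # assumed coarse-mesh estimate
patch = crop_region(features, left_eye_center)
refined = RegionHead(64, num_points=16)(patch) # 16 refined eye landmarks (x, y)
print(patch.shape, refined.shape)
```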
BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
Valentin Bazarevsky
Andrey Vakunov
CVPR Workshop on Computer Vision for Augmented and Virtual Reality 2019, IEEE, Long Beach, CA (2019)
We present BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200--1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial feature or expression classification, and face region segmentation. Our contributions include a lightweight feature extraction network inspired by, but distinct from, MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy as an alternative to non-maximum suppression.
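A minimal sketch of a blending-style tie resolution, assuming a score-weighted average of overlapping boxes and an illustrative IoU threshold and box format (the actual strategy and parameters may differ): instead of discarding overlapping detections as in non-maximum suppression, their coordinates are averaged.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def blend_detections(boxes, scores, iou_thresh=0.3):
    """Cluster overlapping boxes and replace each cluster with a score-weighted mean."""
    order = np.argsort(scores)[::-1]
    boxes, scores = boxes[order], scores[order]
    keep = np.ones(len(boxes), dtype=bool)
    blended = []
    for i in range(len(boxes)):
        if not keep[i]:
            continue
        overlap = iou(boxes[i], boxes) >= iou_thresh
        cluster = overlap & keep
        w = scores[cluster]
        blended.append((boxes[cluster] * w[:, None]).sum(0) / w.sum())
        keep &= ~overlap
    return np.array(blended)

boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [80, 80, 120, 120]], float)
scores = np.array([0.9, 0.8, 0.7])
print(blend_detections(boxes, scores))   # two blended faces instead of three boxes
```

Averaging rather than suppressing keeps the predicted face box temporally stable across video frames, which is the practical benefit such a tie resolution targets.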
Floors are flat: Leveraging Semantics for Reliable and Real-Time Surface Normal Prediction
Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
We propose four insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and finetuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved state-of-the-art results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
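A minimal sketch of insight (3), assuming an ignore label of 255 for unannotated semantics and zero vectors for missing normals (both conventions are illustrative assumptions, not taken from the paper): each task's loss is computed only over pixels that carry a valid label for that task.

```python
import torch
import torch.nn.functional as F

def masked_losses(pred_sem, pred_norm, gt_sem, gt_norm):
    # Semantic loss: ignore pixels labeled 255 (assumed "no annotation" value).
    sem_loss = F.cross_entropy(pred_sem, gt_sem, ignore_index=255)

    # Normal loss: cosine distance, only where a valid unit normal exists.
    valid = gt_norm.norm(dim=1) > 0.5                 # (N, H, W) boolean mask
    cos = F.cosine_similarity(pred_norm, gt_norm, dim=1)
    norm_loss = (1.0 - cos)[valid].mean() if valid.any() else cos.sum() * 0.0

    return sem_loss, norm_loss

# Toy example with random predictions and partially labeled ground truth.
pred_sem = torch.randn(2, 13, 32, 32)                 # 13 assumed classes
pred_norm = torch.randn(2, 3, 32, 32)
gt_sem = torch.randint(0, 13, (2, 32, 32))
gt_sem[:, :8] = 255                                   # unlabeled stripe
gt_norm = F.normalize(torch.randn(2, 3, 32, 32), dim=1)
gt_norm[:, :, :8] = 0                                 # no normals there
print(masked_losses(pred_sem, pred_norm, gt_sem, gt_norm))
```

Masking this way lets one shared model train on datasets where only a subset of pixels (or tasks) is annotated, without the unlabeled regions polluting the gradients.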