Vincent Casser

Authored Publications
    Taskology: Utilizing Task Relations at Scale
    Yao Lu
    Sören Pirk
    Jan Dlabal
    Anthony Brohan
    Ankita Pasad
    Zhao Chen
    Ariel Gordon
    Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
    Abstract: Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g., object classification, detection, scene segmentation, and depth estimation. We show that we can leverage the inherent relationships among collections of tasks as they are trained jointly, with the tasks supervising each other through their known relationships via consistency losses. Furthermore, explicitly utilizing the relationships between tasks improves their performance while dramatically reducing the need for labeled data, and allows training with additional unsupervised or simulated data. We demonstrate a distributed joint training algorithm with task-level parallelism, which affords a high degree of asynchronicity and robustness. This allows learning across multiple tasks, or with large amounts of input data, at scale. We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion and ego-motion estimation, and object tracking and 3D detection in point clouds. We observe improved performance across these tasks, especially in the low-label regime.
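    The consistency-loss idea can be illustrated with a small sketch. The snippet below is illustrative only: the function names, the finite-difference normal computation, and the cosine-distance loss are assumptions rather than the paper's exact formulation. It shows how one task's output (depth) can supervise another (surface normals) through their known geometric relationship.

    import numpy as np

    def normals_from_depth(depth):
        # Derive approximate surface normals from a depth map via finite
        # differences; camera intrinsics are ignored for brevity.
        dzdx = np.gradient(depth, axis=1)
        dzdy = np.gradient(depth, axis=0)
        normals = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
        return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

    def depth_normal_consistency(pred_depth, pred_normals):
        # Penalize disagreement between the normals implied by the predicted
        # depth and the normals predicted by a separate task head (cosine distance).
        implied = normals_from_depth(pred_depth)
        return np.mean(1.0 - np.sum(implied * pred_normals, axis=-1))

    In the distributed setup described above, each task would be trained by its own worker, with pairwise consistency terms of this kind exchanged asynchronously between them.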
    4D-Net for Learned Multi-Modal Alignment
    Michael Ryoo
    International Conference on Computer Vision (ICCV) (2021)
    Abstract: We present 4D-Net, a 3D object detection approach which utilizes 3D point cloud and RGB sensing information, both over time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines on the Waymo Open Dataset. 4D-Net is better able to use motion cues and dense image information to detect distant objects. We will open source the code.
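    A rough sketch of the kind of point cloud / image fusion the abstract refers to. This is purely illustrative: the projection-and-gather step and the scalar gate stand in for the learned dynamic connections, all names are assumptions, and the point and image features are assumed to share a channel dimension.

    import numpy as np

    def fuse_point_and_image_features(points_xyz, point_feats, image_feats, cam_mat, gate=0.5):
        # Project each 3D point into the image plane with a 3x4 camera matrix,
        # gather the image feature at that pixel, and blend it with the point
        # feature. A learned, per-connection gate would replace the scalar here.
        ones = np.ones((points_xyz.shape[0], 1))
        uvw = (cam_mat @ np.concatenate([points_xyz, ones], axis=1).T).T
        uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
        h, w, _ = image_feats.shape
        u = np.clip(uv[:, 0], 0, w - 1)
        v = np.clip(uv[:, 1], 0, h - 1)
        rgb_feats = image_feats[v, u]                 # (N, C) image features per point
        return gate * point_feats + (1.0 - gate) * rgb_feats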
    Unsupervised Monocular Depth Learning in Dynamic Scenes
    Hanhan Li
    Ariel Gordon
    Hang Zhao
    Conference on Robot Learning (CoRL) (2020)
    Abstract: We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including semantically aware methods. The code is available at https://github.com/google-research/google-research/tree/master/depth_and_motion_learning.
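    The two priors on the translation field translate directly into simple penalty terms. A minimal sketch follows; the weights, norms, and names are illustrative, and the released code linked above is the authoritative implementation.

    import numpy as np

    def translation_field_regularizer(trans_field, sparsity_weight=1.0, smooth_weight=1.0):
        # trans_field: (H, W, 3) per-pixel 3D translation of the scene relative
        # to the camera, predicted alongside depth and ego-motion.
        # Prior 1: the field should be sparse, since most of the scene is static.
        sparsity = np.mean(np.abs(trans_field))
        # Prior 2: the field should be roughly piecewise constant, i.e. constant
        # within each rigid moving object, so penalize its spatial gradients.
        dx = np.mean(np.abs(np.diff(trans_field, axis=1)))
        dy = np.mean(np.abs(np.diff(trans_field, axis=0)))
        return sparsity_weight * sparsity + smooth_weight * (dx + dy)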
    Semantically-Agnostic Unsupervised Monocular Depth Learning in Dynamic Scenes
    Hanhan Li
    Ariel Gordon
    Hang Zhao
    Workshop on Perception for Autonomous Driving, ECCV 2020 (2020)
    Abstract: We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects, suitable for dynamic scenes containing multiple moving objects. Monocular photometric consistency is the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work, including methods that require semantic input.
    Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos
    AAAI Conference on Artificial Intelligence (AAAI) (2019)
    Abstract: Learning to predict scene depth from RGB inputs is a challenging task for both indoor and outdoor robot navigation. In this work we address unsupervised learning of scene depth and robot ego-motion, where supervision is provided by monocular videos, as cameras are the cheapest, least restrictive, and most ubiquitous sensor for robotics. Previous work in unsupervised image-to-depth learning has established strong baselines in the domain. We propose a novel approach which produces higher quality results, is able to model moving objects, and is shown to transfer across data domains, e.g., from outdoor to indoor scenes. The main idea is to introduce geometric structure into the learning process by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos as input. Furthermore, an online refinement method is introduced to adapt learning on the fly to unknown domains. The proposed approach outperforms all state-of-the-art approaches, including those that handle motion, e.g., through learned flow. Our results are comparable in quality to those which used stereo as supervision, and significantly improve depth prediction on scenes and datasets which contain a lot of object motion. The approach is of practical relevance, as it allows transfer across environments, by transferring models trained on data collected for robot navigation in urban scenes to indoor navigation settings. The code associated with this paper can be found at https://sites.google.com/view/struct2depth.
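    The geometric core of this approach, reconstructing one frame from another using predicted depth and ego-motion, can be sketched as follows. This is illustrative only; variable names are assumptions, and the full implementation, including per-object motion handling and online refinement, is in the struct2depth code linked above.

    import numpy as np

    def reproject_pixels(depth, intrinsics, rotation, translation):
        # Lift every target pixel to 3D using the predicted depth, apply the
        # predicted ego-motion (rotation, translation), and project back to get
        # the coordinates at which to sample the source frame. A photometric
        # loss then compares the target frame with the resampled source frame.
        h, w = depth.shape
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T    # (3, H*W)
        cam_points = (np.linalg.inv(intrinsics) @ pixels) * depth.reshape(1, -1)   # back-project
        moved = rotation @ cam_points + translation.reshape(3, 1)                  # apply ego-motion
        proj = intrinsics @ moved
        uv = proj[:2] / np.clip(proj[2:], 1e-6, None)                              # perspective divide
        return uv.T.reshape(h, w, 2)  # per-pixel sampling coordinates in the source frame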
    Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics
    Soeren Pirk
    CVPR Workshop on Visual Odometry & Computer Vision Applications Based on Location Clues (VOCVALC) (2019)
    Abstract: We present an approach which takes advantage of both structure and semantics for unsupervised monocular learning of depth and ego-motion. More specifically, we model the motion of individual objects and learn their 3D motion vectors jointly with depth and ego-motion. We obtain more accurate results, especially for challenging dynamic scenes not addressed by previous approaches. This is an extended version of Casser et al. [AAAI'19]. Code and models have been open sourced at https://sites.google.com/view/struct2depth.
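    The per-object motion modeling mentioned above can be sketched as follows. This is a simplification under assumed names: the paper learns the motion vectors jointly with depth and ego-motion, while this sketch only shows how they would be applied given instance masks.

    import numpy as np

    def apply_object_motions(cam_points, instance_masks, object_translations):
        # cam_points: (H, W, 3) 3D points obtained from the predicted depth.
        # instance_masks: list of (H, W) boolean masks, one per segmented object.
        # object_translations: list of (3,) 3D motion vectors, one per object.
        # Each object's learned motion is applied only to the points it covers;
        # the rest of the scene is explained by ego-motion alone.
        moved = cam_points.copy()
        for mask, t in zip(instance_masks, object_translations):
            moved[mask] = moved[mask] + t
        return moved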