Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates

Shengjie Zhu
Xiaoming Liu
Vincent Chu
International Conference on 3D Vision (2026)

Abstract

Structure-from-Motion (SfM) is a classical 3D vision task that recovers camera parameters and scene geometry from multi-view images. Recent advances in deep learning enable accurate monocular depth estimation (MDE), which infers structure from a single image without depending on camera motion. However, integrating MDE into SfM remains challenging: unlike classical triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA), which uses the density of MDE depth maps to accommodate their error variance. With MBA, we show that MDE depth maps are sufficiently accurate to support state-of-the-art (SoTA) or competitive results in Structure-from-Motion and camera relocalization. Our benchmarks show consistently strong results across scales, from two-view and small few-frame multi-view settings to large multi-view systems with thousands of frames. Our method highlights the significant potential of MDE for multi-view 3D vision tasks.