Large-scale, real-time visual-inertial localization revisited
Abstract
The overarching goals in image-based localization are larger, better, and faster. In recent years, approaches based on local features and sparse 3D point-cloud models have both dominated the benchmarks and seen successful real-world deployment. Recently, end-to-end learned localization approaches have been proposed that show promising results on small and medium-scale datasets. However, the positioning accuracy, latency, and compute requirements of these approaches remain areas of active work. End-to-end learned approaches also typically require encoding the geometry of the environment in the model, which causes performance problems in large-scale scenes and results in a memory footprint that is hard to accommodate. To deploy localization at world scale, we thus continue to rely on local features and sparse 3D models. We do not look at localization in isolation: the goal is to build a scalable and robust end-to-end system comprising model building, compression, localization, and client-side pose fusion for deployment at scale. Our method compresses the appearance and geometry of the scene and allows for low-latency localization queries and efficient fusion, leading to scalability beyond what has previously been demonstrated. To further improve efficiency, we leverage a combination of priors, nearest-neighbor search, geometric match culling, and a cascaded pose candidate refinement step. This combination outperforms other approaches when working with large-scale models. We demonstrate the effectiveness of our approach on a proof-of-concept system localizing 2.5 million images against models from four cities in different regions of the world, achieving query latencies in the 200 ms range.
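To make the query pipeline concrete, the following is a minimal, self-contained sketch (not the system's actual implementation) of the steps named above: a location prior narrows the candidate map points, descriptors are matched by nearest-neighbor search, ambiguous matches are culled, and a cheap hypothesis-scoring pass precedes refinement of the best candidate. All data, names, and thresholds are illustrative assumptions; in particular, the hypothesis/refinement stage is a simplified stand-in for the cascaded pose candidate refinement described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "map": 3D points with unit-norm 32-D descriptors (illustrative only).
map_xyz = rng.uniform(-100.0, 100.0, size=(5000, 3))
map_desc = rng.normal(size=(5000, 32))
map_desc /= np.linalg.norm(map_desc, axis=1, keepdims=True)

# 1. Prior: a coarse position estimate restricts the search to nearby map points.
prior_xyz = np.array([10.0, -5.0, 0.0])
near = np.linalg.norm(map_xyz - prior_xyz, axis=1) < 50.0  # assumed 50 m radius
cand_idx = np.flatnonzero(near)
cand_desc = map_desc[cand_idx]

# Synthetic query: noisy re-observations of 200 map points near the prior.
vis = cand_idx[:200]
query_desc = map_desc[vis] + 0.05 * rng.normal(size=(len(vis), 32))

# 2. Nearest-neighbor search in descriptor space (brute force here; a real
#    system would use an approximate, scalable search structure).
dists = np.linalg.norm(query_desc[:, None, :] - cand_desc[None, :, :], axis=2)
order = np.argsort(dists, axis=1)
best, second = order[:, 0], order[:, 1]

# 3. Match culling: drop ambiguous matches via a ratio test.
rows = np.arange(len(query_desc))
keep = dists[rows, best] < 0.8 * dists[rows, second]
matched_xyz = map_xyz[cand_idx[best[keep]]]

# 4. Cascaded refinement (stand-in): score many cheap position hypotheses,
#    then refine only the best one using its inliers.
hypotheses = [matched_xyz[rng.choice(len(matched_xyz), 3, replace=False)].mean(axis=0)
              for _ in range(100)]
scores = [np.sum(np.linalg.norm(matched_xyz - h, axis=1) < 25.0) for h in hypotheses]
coarse = hypotheses[int(np.argmax(scores))]
inliers = matched_xyz[np.linalg.norm(matched_xyz - coarse, axis=1) < 25.0]
refined = inliers.mean(axis=0)
print("refined position estimate:", refined)
```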