VIGS-SLAM

Visual Inertial Gaussian Splatting SLAM

Zihan Zhu1     Wei Zhang2     Norbert Haala2     Marc Pollefeys1,3     Daniel Barath1
1ETH Zürich     2University of Stuttgart     3Microsoft

TL;DR: We present VIGS-SLAM, a tightly coupled visual–inertial 3D Gaussian Splatting SLAM system that delivers robust real-time tracking and high-fidelity reconstruction, achieving state-of-the-art results across four challenging datasets.

Abstract

We present VIGS-SLAM, a visual-inertial 3D Gaussian Splatting SLAM system that achieves robust real-time tracking and high-fidelity reconstruction. Although recent 3DGS-based SLAM methods achieve dense and photorealistic mapping, their purely visual design degrades under motion blur, low texture, and exposure variations. Our method tightly couples visual and inertial cues within a unified optimization framework, jointly refining camera poses, depths, and IMU states. It features robust IMU initialization, time-varying bias modeling, and loop closure with consistent Gaussian updates. Experiments on four challenging datasets demonstrate that our method outperforms state-of-the-art approaches.

Method

VIGS-SLAM takes as input a sequence of RGB frames and IMU readings, and simultaneously estimates camera poses while building a 3D Gaussian map G. Keyframes are selected based on optical flow, and each new keyframe is initialized using the IMU pre-integration from the previous keyframe to the current one. The keyframe is then added to the local frame graph, where visual-inertial bundle adjustment jointly optimizes camera poses, depths, and IMU parameters; visual correspondences are iteratively refined using a recurrent ConvGRU module. In parallel, a global pose graph is maintained using relative pose constraints from the frontend tracking.

For Gaussian mapping, the depth of each new keyframe is unprojected into 3D using the estimated pose, converted into initial Gaussians, and fused into the global map. Both color and depth re-rendering losses are used to refine the Gaussians.

Loop closure detection is performed based on optical flow differences between the new keyframe and all previous ones. When a loop is detected, pose graph bundle adjustment is performed, followed by an efficient Gaussian update to maintain global consistency.
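The IMU pre-integration used to initialize each new keyframe can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the `preintegrate` helper, a fixed sample interval, and constant biases over the window are assumptions, and the sketch omits the covariance propagation and the time-varying bias modeling that the full system includes.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix of a 3-vector, so that skew(a) @ b = a x b."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def so3_exp(w):
    """Rodrigues formula: map a rotation vector to a rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3) + skew(w)  # first-order approximation near zero
    k = w / theta
    K = skew(k)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt, bg, ba):
    """Accumulate the relative rotation dR, velocity dv, and position dp
    between two keyframes from bias-corrected IMU samples.

    gyro, accel : sequences of 3-vectors (rad/s, m/s^2, gravity-compensated)
    dt          : sample interval in seconds (assumed constant here)
    bg, ba      : gyroscope and accelerometer biases (held fixed here)
    """
    dR = np.eye(3)
    dv = np.zeros(3)
    dp = np.zeros(3)
    for w, a in zip(gyro, accel):
        a_corr = a - ba
        # Integrate position and velocity in the first keyframe's frame,
        # then update the accumulated rotation with the gyro increment.
        dp = dp + dv * dt + 0.5 * (dR @ a_corr) * dt**2
        dv = dv + (dR @ a_corr) * dt
        dR = dR @ so3_exp((w - bg) * dt)
    return dR, dv, dp
```

The resulting (dR, dv, dp) gives the new keyframe's initial pose relative to the previous one, which the visual-inertial bundle adjustment then refines together with depths and IMU states.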

BibTeX

@article{ZHU2025VIGS-SLAM,
      title={VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM},
      author={Zhu, Zihan and Zhang, Wei and Haala, Norbert and Pollefeys, Marc and Barath, Daniel},
      journal={arXiv preprint},
      year={2025}
}