DrivingForward: Feed-forward 3D Gaussian Splatting for Driving Scene Reconstruction from Flexible Surround-view Input

AAAI 2025

1 Shanghai Jiao Tong University
2 East China Normal University

Reconstruction results on the nuScenes dataset.
Left: rendered RGB images; Right: rendered depth maps.

Abstract

We propose DrivingForward, a feed-forward Gaussian Splatting model that reconstructs driving scenes from flexible surround-view input. Driving scene images from vehicle-mounted cameras are typically sparse, with limited overlap, and the movement of the vehicle further complicates the acquisition of camera extrinsics. To tackle these challenges and achieve real-time reconstruction, we jointly train a pose network, a depth network, and a Gaussian network to predict the Gaussian primitives that represent the driving scenes. The pose network and depth network determine the positions of the Gaussian primitives in a self-supervised manner, without using ground-truth depth or camera extrinsics during training. The Gaussian network independently predicts the remaining primitive parameters from each input image, including covariance, opacity, and spherical harmonics coefficients. At the inference stage, our model achieves feed-forward reconstruction from flexible multi-frame surround-view input. Experiments on the nuScenes dataset show that our model outperforms existing state-of-the-art feed-forward and scene-optimized reconstruction methods in terms of reconstruction quality.

Method

Overview of DrivingForward. Given sparse surround-view input from vehicle-mounted cameras (e.g., single-frame or multi-frame surround-view images), our model learns scale-aware localization for Gaussian primitives from the small overlap between spatial and temporal context views. A Gaussian network predicts the other parameters from each image individually. This feed-forward pipeline enables real-time reconstruction of driving scenes, and the independent prediction from single-frame images supports flexible input modes. At the inference stage, we include only the depth network and the Gaussian network, as shown in the lower part of the figure.
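
One common way to turn a predicted depth map into per-pixel Gaussian positions is pinhole unprojection, and the sketch below illustrates that idea only. It is not the authors' implementation: the `Intrinsics` class, the function names `unproject_depth` and `gaussian_centers`, and the zero-skew pinhole model are all assumptions made for illustration. Each view is processed on its own, which mirrors the per-image prediction described above and is why the number of input views can vary.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Hypothetical zero-skew pinhole intrinsics (focal lengths and principal point, in pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject_depth(depth: List[List[float]], k: Intrinsics) -> List[Point]:
    """Lift an H x W depth map to 3D points in the camera frame, one per pixel.

    Points are returned in row-major pixel order. Each point would serve as
    the center of one Gaussian primitive in this sketch.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, d in enumerate(row):        # u: pixel column, d: depth along the optical axis
            x = (u - k.cx) / k.fx * d
            y = (v - k.cy) / k.fy * d
            points.append((x, y, d))
    return points


def gaussian_centers(
    depth_maps: List[List[List[float]]],
    intrinsics: List[Intrinsics],
) -> List[List[Point]]:
    """Unproject each view independently, so any number of views is accepted.

    The results stay in each camera's own frame. Moving them into a shared
    frame needs a camera-to-vehicle transform, which this sketch omits.
    """
    return [unproject_depth(d, k) for d, k in zip(depth_maps, intrinsics)]
```

For example, `gaussian_centers([depth_front], [k_front])` handles a single view and `gaussian_centers(all_depths, all_intrinsics)` handles a full surround-view rig, where `depth_front`, `k_front`, `all_depths`, and `all_intrinsics` are placeholder names for the caller's data.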

Qualitative Results

BibTeX

@inproceedings{tian2025drivingforward,
      title={DrivingForward: Feed-forward 3D Gaussian Splatting for Driving Scene Reconstruction from Flexible Surround-view Input}, 
      author={Qijian Tian and Xin Tan and Yuan Xie and Lizhuang Ma},
      booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
      year={2025}
}