DreamDrive: Generative 4D Scene Modeling from Street View Images

NVIDIA Research, University of Southern California

Abstract

Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency.

In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
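To make the three-stage pipeline concrete, the following minimal Python sketch traces the data flow the abstract describes: a video diffusion model expands a street-view image into a sequence of visual references, the references are lifted into a 4D Gaussian scene, and the scene is rendered along an ego trajectory. Every function here is a hypothetical stub, not the authors' implementation; in a real system the first stage would be a pretrained video diffusion model and the last a differentiable Gaussian-splatting rasterizer.

import numpy as np

def generate_reference_video(street_view, num_frames=25):
    """Stage 1 (stub): a video diffusion model would synthesize a sequence
    of visual references from one street-view image; here we just repeat it."""
    return np.repeat(street_view[None], num_frames, axis=0)

def lift_to_4d(frames):
    """Stage 2 (stub): fit a hybrid Gaussian scene to the generated frames.
    Real fitting would optimize positions, covariances, colors, and motion."""
    n = 1024
    return {
        "static_means": np.random.randn(n, 3),         # time-invariant Gaussians
        "dynamic_means": np.random.randn(n, 3),        # dynamic Gaussians at t = 0
        "dynamic_velocities": 0.01 * np.random.randn(n, 3),
    }

def render_along_trajectory(scene, trajectory):
    """Stage 3 (stub): splat the 4D scene from each ego pose, so every frame
    is rendered from one underlying scene and stays 3D-consistent."""
    renders = []
    for t, pose in enumerate(trajectory):
        dynamic_t = scene["dynamic_means"] + t * scene["dynamic_velocities"]
        points = np.vstack([scene["static_means"], dynamic_t])
        renders.append((pose, points))  # placeholder for a rasterized image
    return renders

image = np.zeros((64, 64, 3))
references = generate_reference_video(image)
scene = lift_to_4d(references)
views = render_along_trajectory(scene, np.zeros((25, 6)))
print(len(views))  # 25 trajectory-conditioned views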

Driving Everywhere in a Generative 4D World


From Street View Images to 4D Scenes


Dynamic Video Synthesis via Neural Rendering


DreamDrive

Teaser figure.

DreamDrive tackles the challenge of synthesizing both generalizable and 3D-consistent visual observations. It enables scalable 4D scene generation directly from in-the-wild data without relying on costly data collection and annotations, making it a transformative step for training and testing autonomous driving systems in diverse real-world scenarios.

Method overview figure.

DreamDrive combines video diffusion models with 3D Gaussian splatting to elevate visual references into fully dynamic 4D scenes. A novel hybrid Gaussian representation ensures consistent modeling of static and dynamic elements, creating realistic, trajectory-conditioned driving videos.
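One way to read "hybrid Gaussian representation" is static Gaussians with fixed centers plus dynamic Gaussians with time-dependent centers, composited per frame before splatting. The sketch below also illustrates an assumed label-free decomposition rule, flagging a Gaussian as dynamic when its fitted center moves beyond a threshold; this is purely for exposition, and the paper's actual parameterization and self-supervised decomposition may differ.

import numpy as np

def decompose_by_motion(means_over_time, threshold=0.05):
    """Label-free split of fitted Gaussians into static and dynamic sets.
    means_over_time: (T, N, 3) Gaussian centers fitted per frame."""
    span = means_over_time.max(axis=0) - means_over_time.min(axis=0)  # (N, 3)
    motion = np.linalg.norm(span, axis=-1)  # total displacement per Gaussian
    dynamic = motion > threshold
    return ~dynamic, dynamic

# Toy example: 1000 fitted Gaussians over 25 frames, 50 of them moving.
T, N = 25, 1000
centers = np.cumsum(0.001 * np.random.randn(T, N, 3), axis=0)  # jitter only
centers[:, :50] += np.linspace(0.0, 1.0, T)[:, None, None]     # add real motion
static_mask, dynamic_mask = decompose_by_motion(centers)
print(dynamic_mask.sum())  # ~50 Gaussians identified as dynamic, no labels used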

From Street View Images to 4D Spatio-Temporal Driving Scenes

DreamDrive pipeline figure.

Video Generation vs. 4D Generation

Motion Planning Support

Scene Decomposition

Comparison with 4D-GS

BibTeX

@article{dreamdrive,
  author    = {Mao, Jiageng and Li, Boyi and Ivanovic, Boris and Chen, Yuxiao and Wang, Yan and You, Yurong and Xiao, Chaowei and Xu, Danfei and Pavone, Marco and Wang, Yue},
  title     = {DreamDrive: Generative 4D Scene Modeling from Street View Images},
  journal   = {arXiv preprint arXiv:2501.00601},
  year      = {2025},
}