GaussFusion: Improving 3D Reconstruction in the Wild with
Geometry-Informed Video Generator

1Stanford University   2Zillow Group
CVPR 2026
arXiv: coming soon
Code: coming soon

Teaser video: input vs. output.

Improving 3DGS reconstructions with an efficient, geometry-informed video generator.

Abstract

We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitives video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel view synthesis benchmarks, and an efficient variant runs in real time at 16 FPS while maintaining similar performance, enabling interactive 3D applications. We plan to release our code and model.

Method

Method architecture figure

GaussFusion Video Generator Architecture. Our model refines video latents using geometry-aware conditioning derived from 3D Gaussian splatting (3DGS). A Gaussian primitive buffer—comprising color, depth, normals, and uncertainty—is first encoded via a VAE and projected by a 3D convolution into a compact latent \(\mathbf{z}_{\mathcal{G}}\). The noised video latents are processed by a flow transformer built upon DiT blocks, interleaved with Geometry Adapter (GA) blocks. Each GA block fuses geometry features through self-attention and integrates textual scene descriptions via cross-attention, producing a geometry-aware feature \(\mathbf{x}_{\mathcal{G}}\) that modulates the video latent \(\mathbf{x}\). The model predicts the flow velocity \(v_\theta(\mathbf{x}_t, t)\), which is integrated to recover refined video frames decoded by the VAE.
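The final step—integrating the predicted flow velocity back to clean latents—can be illustrated with a minimal Euler sampler. This is a hedged sketch: `euler_flow_sample` and the toy velocity field are illustrative stand-ins, not the paper's trained GA-conditioned transformer or its actual solver.

```python
import numpy as np

def euler_flow_sample(v_theta, x_noise, num_steps=8):
    """Integrate a flow-matching velocity field from noise (t=1) to data (t=0).

    v_theta(x, t) is a placeholder for the trained, geometry-conditioned
    flow transformer; here we only demonstrate the integration loop.
    """
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * v_theta(x, t)  # Euler step toward t=0
    return x

# Toy linear flow: the velocity from data x0 toward noise x1 is constant
# (x1 - x0), so integrating backward recovers x0 exactly in this case.
x0 = np.array([0.5, -1.0, 2.0])   # "clean" latent
x1 = np.array([1.5, 0.0, 0.0])    # noise sample
v = lambda x, t: x1 - x0          # constant velocity field
recovered = euler_flow_sample(v, x1, num_steps=4)
```

In practice, higher-order ODE solvers or more steps trade speed for fidelity; the constant-velocity toy case is exact even with a coarse Euler discretization.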

Multi-modal input (GP-buffer)

Visualization of the multi-modal input (GP-buffer) used by our model and our prediction.
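A minimal sketch of assembling such a buffer, assuming a simple per-pixel channel stacking (the exact channel layout, resolution, and attribute set used by GaussFusion are not specified here and are our assumption):

```python
import numpy as np

def make_gp_buffer(color, depth, normals, opacity):
    """Stack per-pixel Gaussian-primitive attributes into one multi-channel
    buffer. Channel order (RGB, depth, normal, opacity) is illustrative."""
    depth = depth[..., None]      # (H, W) -> (H, W, 1)
    opacity = opacity[..., None]  # (H, W) -> (H, W, 1)
    return np.concatenate([color, depth, normals, opacity], axis=-1)

H, W = 4, 4
buf = make_gp_buffer(
    color=np.zeros((H, W, 3)),
    depth=np.ones((H, W)),
    normals=np.zeros((H, W, 3)),
    opacity=np.ones((H, W)),
)  # -> (H, W, 8) buffer fed to the VAE encoder
```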

Artifact Simulation

A key factor in training a generalizable rendering refinement model is a comprehensive data generation pipeline that accurately mimics the diverse artifacts observed in the real world. We construct paired videos of ground-truth and corrupted renderings by rendering each scene from two versions of its 3DGS reconstruction: a high-quality model optimized from dense inputs and a deliberately degraded counterpart.

🎞️
Sparse-View Simulation

We randomly retain only 5% of the original video frames to simulate under-sampled captures. This random downsampling introduces temporal irregularity that better reflects real-world conditions.

🔀
Diverse Initialization

We apply multiple 3D Gaussian initialization strategies, including SfM point initialization, random 3D point cloud initialization, and dense point maps from MapAnything.

🎥
Paired Reconstruction with New Trajectories

A clean splat model is trained on all views, while a corrupted model is trained on the sparse subset with fewer optimization steps. Rendering both along novel camera paths synthesizes realistic motion artifacts.

Feed-forward Degradation

We render degraded videos using Gaussians from a pretrained feed-forward 3DGS model, introducing geometric inconsistencies, color shifts, and semi-transparent splats.
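The sparse-view step above reduces to random frame subsampling. A minimal sketch (function name, seed, and ratio handling are illustrative; the paper states only that ~5% of frames are retained at random):

```python
import random

def sparse_subset(frame_ids, keep_ratio=0.05, seed=0):
    """Randomly retain ~keep_ratio of the frames to simulate an
    under-sampled capture. Random (rather than uniform-stride) sampling
    produces irregular temporal gaps, closer to real hand-held video."""
    rng = random.Random(seed)
    k = max(1, round(len(frame_ids) * keep_ratio))
    return sorted(rng.sample(frame_ids, k))

frames = list(range(200))
kept = sparse_subset(frames, keep_ratio=0.05)  # 10 irregularly spaced frames
```

The corrupted reconstruction is then optimized only on `kept`, while the clean counterpart uses all of `frames`.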

Novel-View Refinement

Comparison of novel-view synthesis results against baseline methods.

Refining Feed-Forward Reconstructions

Comparison of feed-forward reconstruction refinement from DepthSplat input.

Inference Efficiency Comparison

Comparison of inference speed (FPS) against baseline methods.

BibTeX

@inproceedings{zhu2026gaussfusion,
  title     = {GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator},
  author    = {Zhu, Liyuan and Narayana, Manjunath and Stary, Michal and Hutchcroft, Will and Wetzstein, Gordon and Armeni, Iro},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}