4DiM: Controlling Space and Time with Diffusion Models

2024-11-15 17:00:05

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS) of scenes, conditioned on one or more images together with camera poses and timestamps for both the known and unknown views. To overcome the limited availability of 4D training data, we advocate joint training on 3D data (poses only), 4D data (poses and timestamps), and video data (timestamps only), and propose a new architecture that enables such joint training. We further propose calibrating SfM-posed data with monocular metric depth estimators, enabling camera control at metric scale. We introduce new metrics that enrich and address shortcomings of current evaluation schemes, and we achieve state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, with the added capability of handling temporal dynamics. We also showcase zero-shot applications, including improved panorama stitching and rendering of space-time trajectories from novel viewpoints.
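The abstract mentions calibrating SfM-posed data with monocular metric depth estimators to obtain metric-scale camera control. The paper does not give an implementation here, but the general idea of such calibration can be sketched as follows: estimate a per-scene scale factor that aligns the up-to-scale SfM depths with metric depth predictions, then rescale the camera translations. The function names (`metric_scale_factor`, `rescale_poses`) and the median-ratio choice are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def metric_scale_factor(sfm_depths, metric_depths):
    """Estimate a per-scene scale aligning up-to-scale SfM point depths
    with metric depths from a monocular estimator at the same pixels.
    A median of per-point ratios is a common robust choice (assumption,
    not necessarily what 4DiM uses)."""
    sfm = np.asarray(sfm_depths, dtype=float)
    met = np.asarray(metric_depths, dtype=float)
    valid = (sfm > 0) & (met > 0)  # ignore missing/invalid depths
    return float(np.median(met[valid] / sfm[valid]))

def rescale_poses(cam_translations, scale):
    """Apply the scale to camera translations so pose units become metric;
    rotations are scale-invariant and left unchanged."""
    return [scale * np.asarray(t, dtype=float) for t in cam_translations]
```

With a known ground-truth scale, e.g. SfM depths `[1, 2, 4]` against metric depths `[2, 4, 8]`, `metric_scale_factor` recovers a factor of 2.0, and `rescale_poses` doubles every translation vector accordingly.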

Given a single image, 4DiM generates images for different 3D camera trajectories. The input images below are from our evaluation set, other public datasets, the internet, or from a text-to-image model.
