CamI2V: Camera-Controlled Image-to-Video Diffusion Model

*Under Review, WIP

Rethinking conditions in diffusion models. Diffusion models denoise along the gradient of the log probability density function. At large noise levels, the high-density region becomes the overlap of numerous noisy samples, resulting in visual blurriness. We point out that the effectiveness of a condition depends on how much uncertainty it reduces. From this new perspective, we categorize conditions into clean conditions (e.g. texts, camera extrinsics), which remain visible throughout the denoising process, and noised conditions (e.g. noised pixels in the current and other frames), whose deterministic information $\alpha_t x_0$ is gradually dominated by the randomness of the noise $\sigma_t \epsilon$.
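For reference, in the notation used above the noised sample follows the standard forward process

$$x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2},$$

so as $t$ increases and $\mathrm{SNR}(t) \to 0$, noised conditions lose their deterministic content, whereas clean conditions such as text or camera extrinsics remain fully informative at every denoising step.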

Abstract

Recently, camera pose, as a user-friendly and physics-related condition, has been introduced into text-to-video diffusion models for camera control. However, existing methods simply inject camera conditions through a side input. These approaches neglect the inherent physical knowledge of camera pose, resulting in imprecise camera control, inconsistencies, and poor interpretability. In this paper, we emphasize the necessity of integrating explicit physical constraints into model design. Epipolar attention is proposed for modeling all cross-frame relationships from a novel perspective of noised conditions. It ensures that features are aggregated from corresponding epipolar lines in all noised frames, overcoming the limitations of current attention mechanisms in tracking displaced features across frames, especially when features move significantly with the camera and become obscured by noise. Additionally, we introduce register tokens to handle cases without intersections between frames, commonly caused by rapid camera movements, dynamic objects, or occlusions. To support image-to-video, we propose multiple guidance scales that allow precise control over the image, text, and camera conditions, respectively. Furthermore, we establish a more robust and reproducible evaluation pipeline to address the inaccuracy and instability of existing camera-control measurements. We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images. With optimization, only 24 GB and 12 GB of VRAM are required for training and inference, respectively. We plan to release checkpoints, along with training and evaluation code.
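As a rough illustration of how separate guidance scales for image, text, and camera could be combined at sampling time, the sketch below uses a multi-condition classifier-free-guidance decomposition. The `model(...)` interface, the scale names, and the ordering of conditions are assumptions for illustration, not necessarily the exact formulation used in the paper.

```python
import torch

def multi_guidance_denoise(model, x_t, t, c_img, c_txt, c_cam,
                           s_img=1.5, s_txt=7.5, s_cam=1.0):
    """Illustrative multi-condition classifier-free guidance.

    `model(x_t, t, image=..., text=..., camera=...)` is a hypothetical
    interface; the actual CamI2V implementation may factorize conditions
    differently.
    """
    # Unconditional prediction (all conditions dropped).
    eps_uncond = model(x_t, t, image=None, text=None, camera=None)
    # Progressively add conditions so each scale controls one factor.
    eps_img = model(x_t, t, image=c_img, text=None, camera=None)
    eps_txt = model(x_t, t, image=c_img, text=c_txt, camera=None)
    eps_full = model(x_t, t, image=c_img, text=c_txt, camera=c_cam)

    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_txt - eps_img)
            + s_cam * (eps_full - eps_txt))
```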


Visualization on RealEstate10K

Visualization on out-of-domain images

Attention mechanisms for tracking displaced noised features

Temporal attention is limited to features at the same spatial location across frames, rendering it ineffective for significant camera movements. In contrast, 3D full attention facilitates cross-frame tracking due to its broad receptive field. However, high noise levels can obscure deterministic information, hindering consistent tracking. Our proposed epipolar attention aggregates features along the epipolar line, effectively modeling cross-frame relationships even under high noise conditions.
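A minimal sketch of cross-frame attention restricted by an epipolar mask, assuming the mask is a precomputed boolean tensor over query/key positions (its construction is outlined in the sections below); the function and tensor names are illustrative, not the actual implementation.

```python
import torch

def masked_cross_frame_attention(q, k, v, epipolar_mask):
    """Scaled dot-product attention restricted to epipolar correspondences.

    q:             (B, Lq, D) queries from the i-th frame
    k, v:          (B, Lk, D) keys/values gathered from all noised frames
    epipolar_mask: (Lq, Lk) boolean, True where attention is allowed
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale
    # Disallowed positions get -inf so they vanish after softmax.
    scores = scores.masked_fill(~epipolar_mask, float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)
```

Register tokens (described below) guarantee that no query row is fully masked, so the softmax is always well defined.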

Camera parameterization

Left: Camera representation and trajectory visualization in the world coordinate system.
Right: The transformation from camera representations to 3D ray representations (Plücker coordinates), given pixel coordinates.
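A minimal sketch of this transformation, assuming a pinhole intrinsic matrix `K`, a camera-to-world matrix `c2w`, and a (moment, direction) ordering of the Plücker coordinates; names, shapes, and ordering are illustrative assumptions rather than the exact implementation.

```python
import torch

def plucker_embedding(K, c2w, H, W):
    """Per-pixel Plücker coordinates (o x d, d) for one camera.

    K:   (3, 3) intrinsics
    c2w: (4, 4) camera-to-world extrinsics
    Returns: (H, W, 6) ray embedding.
    """
    # Pixel grid in homogeneous coordinates (pixel centers).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3)

    # Ray directions in camera space, then rotated into world space.
    dirs_cam = pix @ torch.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # Ray origin is the camera center; the moment is o x d.
    origin = c2w[:3, 3].expand_as(dirs_world)
    moment = torch.cross(origin, dirs_world, dim=-1)
    return torch.cat([moment, dirs_world], dim=-1)                     # (H, W, 6)
```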

Epipolar line and mask

Left: Epipolar constraint of the $j$-th frame from one pixel at $(u,v)$ on the $i$-th frame.
Middle: Epipolar mask discretized by the distance threshold $\delta$, so that only neighboring pixels (in green) are allowed to attend, while those marked in red are not.
Right: Multi-resolution epipolar mask adaptive to the feature size in U-Net layers.
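A minimal sketch of building a single-frame epipolar mask from a fundamental matrix `F_ij` and the distance threshold $\delta$; the multi-resolution handling is omitted, and all names are illustrative assumptions.

```python
import torch

def epipolar_mask(F_ij, H, W, delta):
    """Boolean mask (H*W, H*W): query pixels of frame i vs key pixels of frame j.

    F_ij:  (3, 3) fundamental matrix mapping pixels of frame i to epipolar
           lines in frame j
    delta: distance threshold (in pixels) to the epipolar line
    """
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # (N, 3)

    # Epipolar line l = F x for every query pixel x in frame i.
    lines = pix @ F_ij.T                                                  # (N, 3)

    # Point-to-line distance |l . x'| / sqrt(a^2 + b^2) for every key pixel x'.
    num = (lines @ pix.T).abs()                                           # (N, N)
    den = lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
    dist = num / den
    return dist <= delta
```

Lowering the feature resolution in deeper U-Net layers simply means evaluating the same construction on a coarser pixel grid with a correspondingly rescaled $\delta$.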

Epipolar attention mask with register tokens

We mark the query pixel with a red point in the $i$-th frame for clarity. The epipolar attention mask is constructed by concatenating the epipolar masks of all frames. We insert register tokens into the key/value sequence to deal with zero-epipolar scenarios, where a query's epipolar line does not intersect any frame.
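A minimal sketch of assembling the full attention mask with register tokens, assuming the per-frame epipolar masks have already been computed; the helper name and the number of register tokens are assumptions for illustration.

```python
import torch

def attention_mask_with_registers(per_frame_masks, num_registers):
    """Concatenate per-frame epipolar masks and append register-token columns.

    per_frame_masks: list of (Lq, Lk) boolean masks, one per noised frame
    num_registers:   number of register tokens appended to the key/value sequence
    """
    mask = torch.cat(per_frame_masks, dim=-1)                   # (Lq, F * Lk)
    registers = torch.ones(mask.shape[0], num_registers,
                           dtype=torch.bool, device=mask.device)
    # Register tokens are always attendable, so queries whose epipolar
    # lines miss every frame still have valid attention targets.
    return torch.cat([mask, registers], dim=-1)                 # (Lq, F*Lk + R)
```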

Pipeline of CamI2V