CamI2V: Camera-Controlled Image-to-Video Diffusion Model

*Under Review, WIP

Rethinking conditions in diffusion models. Diffusion models denoise along the gradient of the log probability density function (the score). At large noise levels, the high-density region becomes the overlap of numerous noised samples, so denoising toward it yields visually blurry results. We point out that the effectiveness of a condition depends on how much uncertainty it reduces. From this new perspective, we categorize conditions into clean conditions (e.g. texts, camera extrinsics) that remain visible throughout the denoising process, and noised conditions (e.g. noised pixels in the current and other frames) whose deterministic information $\alpha_t x_0$ is gradually dominated by the randomness of the noise $\sigma_t \epsilon$.
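Below is a toy sketch (not the paper's code, assuming generic variance-preserving-style coefficients) of this distinction: under the forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$, the signal-to-noise ratio of a noised condition collapses at large noise levels, whereas a clean condition such as text or camera extrinsics never degrades.

```python
# Toy illustration: the deterministic part alpha_t * x0 of a noised condition is
# progressively drowned out by the random part sigma_t * eps as noise grows.
import torch

x0 = torch.randn(1, 4, 32, 32)  # a clean latent frame
for alpha_t, sigma_t in [(0.99, 0.14), (0.71, 0.71), (0.14, 0.99)]:
    x_t = alpha_t * x0 + sigma_t * torch.randn_like(x0)   # noised condition x_t
    snr = (alpha_t / sigma_t) ** 2                         # signal-to-noise ratio of x_t
    print(f"alpha_t={alpha_t:.2f}  sigma_t={sigma_t:.2f}  SNR={snr:.2f}")
```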

Abstract

Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to the added noise, we propose applying epipolar attention to aggregate features only along the corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera movements, dynamic objects, or occlusions, ensuring robust performance in diverse environments. Furthermore, we develop a more robust and reproducible evaluation pipeline to address the inaccuracies and instabilities of existing camera control metrics. Our method achieves a 25.64% improvement in camera controllability on the RealEstate10K dataset without compromising dynamics or generation quality, and demonstrates strong generalization to out-of-domain images. Training and inference require only 24GB and 12GB of memory, respectively, for 16-frame sequences at 256\(\times\)256 resolution. We will release all checkpoints, along with training and evaluation code.

Visualization on Out-of-domain Images (1024\(\times\)576)

Coming Soon ...

More Visualization on Out-of-domain Images (512\(\times\)320)

*Generated by the 512\(\times\)320 model (50k training steps), compatible with input images of arbitrary aspect ratio.

Pan Left

Pan Right

Pan Up

Pan Down


Look Left

Look Right

Orbit Left

Orbit Right


Zoom In & Rotate

Pan Left & Zoom

Forward → Backward

Walking

Visualization on Out-of-domain Images (512\(\times\)320)

*Original outputs from 512\(\times\)320 model, no padding removed.

Visualization on Out-of-domain Images (256\(\times\)256)

Orbit Left

Orbit Right


Zoom In

Zoom Out

Visualization on RealEstate10K (256\(\times\)256)

Quantitative Comparison & Ablation Study

| Method | RotErr$\downarrow$ | TransErr$\downarrow$ | CamMC$\downarrow$ | FVD$\downarrow$ (VideoGPT) | FVD$\downarrow$ (StyleGAN) |
| --- | --- | --- | --- | --- | --- |
| DynamiCrafter | 3.3415 | 9.8024 | 11.625 | 106.02 | 92.196 |
| + MotionCtrl | 0.8636 | 2.5068 | 2.9536 | 70.820 | 60.363 |
| + Plucker Embedding (Baseline, CameraCtrl) | 0.7098 | 1.8877 | 2.2557 | 66.077 | 55.889 |
| + Plucker Embedding + Epipolar Attention Only on Reference Frame (CamCo-like) | 0.5738 | 1.6014 | 1.8851 | 66.439 | 56.778 |
| + Plucker Embedding + Epipolar Attention (Our CamI2V) | 0.4758 | 1.4955 | 1.7153 | 66.090 | 55.701 |
| + Plucker Embedding + 3D Full Attention | 0.6299 | 1.8215 | 2.1315 | 71.026 | 60.00 |
| + Plucker Embedding + No Register Tokens + Epipolar Attention | NaN | NaN | NaN | NaN | NaN |

Adding more conditions to generative models typically reduces uncertainty and improves generation quality (e.g. providing detailed text conditions through recaptioning). In this paper, we argue that it is also crucial to consider noisy conditions such as latent features $z_t$, which contain valuable information alongside random noise. For instance, in SDEdit for image-to-image translation, random noise is added to the input $z_0$ to produce a noisy $z_t$. The clean component $z_0$ preserves overall similarity, while the introduced noise brings uncertainty, enabling diverse and varied generations.
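As a concrete sketch of how such a noised condition is used, the snippet below mimics SDEdit-style image-to-image translation with a hypothetical pretrained sampler; `denoise_step` is an illustrative stand-in, not an API of this work.

```python
# A minimal, hedged sketch of SDEdit-style editing: the noised latent z_t serves
# as a noisy condition whose clean part alpha_t * z0 anchors the overall layout,
# while the injected noise leaves room for diverse generations.
import torch

def sdedit(z0, alphas, sigmas, denoise_step, t0: int):
    """Noise z0 up to an intermediate step t0, then denoise back to step 0."""
    z_t = alphas[t0] * z0 + sigmas[t0] * torch.randn_like(z0)  # construct the noisy condition
    for t in range(t0, 0, -1):
        z_t = denoise_step(z_t, t)  # hypothetical one-step reverse-diffusion update
    return z_t
```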

In this paper, we argue that providing the model with more noisy conditions, especially at high noise levels, does not necessarily reduce more uncertainty, as the noise also introduces randomness and potential misguidance. This is the key insight we aim to convey. To validate this point, we design experiments with the following setups:

  1. Plücker Embedding (Baseline): This setup, akin to CameraCtrl, accesses minimal noisy conditions from other frames because cross-frame interaction is only indirect, through spatial and temporal attention.
  2. Plücker Embedding + Epipolar Attention only on reference frame: Similar to CamCo, this setup treats the reference frame as the source view, enabling each target frame to refer to it. It accesses a small amount of noisy conditions from the reference frame. However, some pixels of the current frame may have no epipolar line intersecting the reference frame, in which case the setup degenerates to a CameraCtrl-like model without epipolar attention.
  3. Plücker Embedding + Epipolar Attention (Our CamI2V): This setup imposes epipolar constraints across all frames, including adjacent frames whose epipolar lines intersect in most cases, ensuring a sufficient amount of noisy conditions.
  4. Plücker Embedding + 3D Full Attention: This configuration allows the model to directly interact with features of all other frames, accessing the largest amount of noisy conditions.

The amount of accessible noisy conditions in the above four setups increases progressively. One might expect that 3D full attention, which accesses the most noisy conditions, would achieve the best performance. However, as shown in the table, 3D full attention performs only slightly better than CameraCtrl and is inferior to the CamCo-like setup, which applies epipolar attention only on the reference frame. Notably, our method achieves the best results by interacting with noisy conditions along the epipolar lines. As can be clearly seen below, the CamCo-like setup relies too heavily on the first frame and cannot generate new objects. The 3D full attention setup generates objects with large movements because it accesses pixels from all frames, yet the colors of individual pixels are contaminated by pixels at incorrect positions. These findings confirm our insight that an optimal amount of noisy conditions leads to better uncertainty reduction, rather than merely increasing the quantity of noisy conditions.

CamI2V (Ours)

CamI2V - 3D full attention

CamI2V - epipolar attention
only on reference frame
(similar to CamCo)

CameraCtrl

MotionCtrl

Attention Mechanisms for Tracking Displaced Noised Features

Temporal attention is limited to features at the same spatial location across frames, rendering it ineffective under significant camera movement. In contrast, 3D full attention facilitates cross-frame tracking thanks to its broad receptive field. However, high noise levels can obscure deterministic information, hindering consistent tracking. Our proposed epipolar attention aggregates features along the epipolar line, effectively modeling cross-frame relationships even under high noise levels.

Camera parameterization

Left: Camera representation and trajectory visualization in the world coordinate system.
Right: The transformation from camera representations to 3D ray representations as Plücker coordinates given pixel coordinates.
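As a reference for this transformation, the following is a minimal sketch (following the standard Plücker-ray parameterization rather than the released code) of computing a per-pixel Plücker embedding from intrinsics $K$ and a camera-to-world matrix:

```python
# Hedged sketch: per-pixel Plücker coordinates (o x d, d) from camera parameters.
import torch

def plucker_embedding(K, c2w, H: int, W: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world extrinsics. Returns (6, H, W)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # homogeneous pixel coords (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T                             # ray directions in the camera frame
    dirs = dirs_cam @ c2w[:3, :3].T                                    # rotate into the world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)                      # unit ray direction d
    origin = c2w[:3, 3].expand(H, W, 3)                                # camera center o
    moment = torch.cross(origin, dirs, dim=-1)                         # moment o x d
    return torch.cat([moment, dirs], dim=-1).permute(2, 0, 1)          # (6, H, W) Plücker map
```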

Epipolar line and mask

Left: Epipolar constraint of the $j$-th frame from one pixel at $(u,v)$ on the $i$-th frame.
Middle: Epipolar mask discretized by the distance threshold $\delta$, so that only the neighboring pixels marked in green may be attended, while those marked in red may not.
Right: Multi-resolution epipolar mask adaptive to the feature size in U-Net layers.
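A hedged sketch of how such a mask could be built from a fundamental matrix $F_{ij}$ and the distance threshold $\delta$ is given below; the names and tensor layout are illustrative assumptions, not the released implementation.

```python
# Binary epipolar attention mask between frame i (queries) and frame j (keys).
import torch

def epipolar_mask(F_ij: torch.Tensor, H: int, W: int, delta: float) -> torch.Tensor:
    """F_ij: (3, 3) fundamental matrix mapping frame-i pixels to frame-j epipolar lines.
    Returns a (H*W, H*W) boolean mask: True where a key pixel may be attended."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # (HW, 3)
    lines = pix @ F_ij.T                       # (HW, 3): epipolar line l = F x for each query pixel
    num = (lines @ pix.T).abs()                # |a u' + b v' + c| for every (query line, key pixel) pair
    den = lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
    dist = num / den                           # point-to-line distance
    return dist <= delta                       # keep only keys near the epipolar line
```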

Epipolar attention mask with register tokens

We mark the query pixel in the $i$-th frame with a red point for clarity. The epipolar attention mask is constructed by concatenating the epipolar masks over all frames. We append register tokens to the key/value sequence to handle cases where no valid epipolar line exists.
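The sketch below illustrates the register-token mechanism under assumed shapes and names (not the released code): a few learnable tokens are appended to the key/value sequence and always left unmasked, so a query whose epipolar line misses every frame still has valid tokens to attend to.

```python
# Epipolar attention with register tokens guarding against all-masked rows.
import torch
import torch.nn.functional as F

def epipolar_attention(q, k, v, mask, registers):
    """q: (B, Lq, C); k, v: (B, Lk, C); mask: (B, Lq, Lk) bool; registers: (R, C)."""
    B, Lq, C = q.shape
    reg = registers.unsqueeze(0).expand(B, -1, -1)               # (B, R, C)
    k = torch.cat([k, reg], dim=1)                               # extend keys with registers
    v = torch.cat([v, reg], dim=1)                               # extend values with registers
    reg_mask = torch.ones(B, Lq, registers.shape[0], dtype=torch.bool, device=mask.device)
    mask = torch.cat([mask, reg_mask], dim=-1)                   # registers are always visible
    attn = (q @ k.transpose(-2, -1)) / C ** 0.5                  # (B, Lq, Lk + R) attention logits
    attn = attn.masked_fill(~mask, float("-inf"))                # block off-epipolar keys
    return F.softmax(attn, dim=-1) @ v
```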

Pipeline of CamI2V