Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to added noise, we propose applying epipolar attention to only aggregate features along corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera movements, dynamic objects, or occlusions, ensuring robust performance in diverse environments. Furthermore, we develop a more robust and reproducible evaluation pipeline to address the inaccuracies and instabilities of existing camera control metrics. Our method achieves a 25.64% improvement in camera controllability on the RealEstate10K dataset without compromising dynamics or generation quality and demonstrates strong generalization to out-of-domain images. Training and inference require only 24GB and 12GB of memory, respectively, for 16-frame sequences at 256\(\times\)256 resolution. We will release all checkpoints, along with training and evaluation code.
| Method | RotErr$\downarrow$ | TransErr$\downarrow$ | CamMC$\downarrow$ | FVD$\downarrow$ (VideoGPT) | FVD$\downarrow$ (StyleGAN) |
| --- | --- | --- | --- | --- | --- |
| DynamiCrafter | 3.3415 | 9.8024 | 11.625 | 106.02 | 92.196 |
| + MotionCtrl | 0.8636 | 2.5068 | 2.9536 | 70.820 | 60.363 |
| + Plucker Embedding (Baseline, CameraCtrl) | 0.7098 | 1.8877 | 2.2557 | 66.077 | 55.889 |
| + Plucker Embedding + Epipolar Attention Only on Reference Frame (CamCo-like) | 0.5738 | 1.6014 | 1.8851 | 66.439 | 56.778 |
| + Plucker Embedding + Epipolar Attention (Our CamI2V) | 0.4758 | 1.4955 | 1.7153 | 66.090 | 55.701 |
| + Plucker Embedding + 3D Full Attention | 0.6299 | 1.8215 | 2.1315 | 71.026 | 60.00 |
| + Plucker Embedding + No Register Tokens + Epipolar Attention | NaN | NaN | NaN | NaN | NaN |

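
For readers unfamiliar with the camera-control columns, the following is a minimal sketch of how such pose errors are commonly computed from ground-truth and estimated per-frame poses. The definitions are an assumption based on common CameraCtrl-style evaluation (summed geodesic rotation angle, summed L2 translation distance, summed L2 distance between `[R|t]` matrices); this section does not define them, and the function names `rot_err`, `trans_err`, and `cam_mc` are hypothetical.

```python
import numpy as np

def rot_err(R_gt, R_gen):
    """Sum over frames of the geodesic angle (radians) between rotations.

    R_gt, R_gen: arrays of shape (n, 3, 3).
    """
    # trace(R_gt^T R_gen) equals the elementwise (Frobenius) inner product.
    tr = np.einsum("nij,nij->n", R_gt, R_gen)
    cos = np.clip((tr - 1.0) / 2.0, -1.0, 1.0)  # clip guards against round-off
    return float(np.arccos(cos).sum())

def trans_err(t_gt, t_gen):
    """Sum over frames of the L2 distance between translations, shape (n, 3)."""
    return float(np.linalg.norm(t_gt - t_gen, axis=-1).sum())

def cam_mc(R_gt, t_gt, R_gen, t_gen):
    """Sum over frames of the L2 norm of the difference between [R|t] matrices."""
    Rt_gt = np.concatenate([R_gt, t_gt[..., None]], axis=-1)     # (n, 3, 4)
    Rt_gen = np.concatenate([R_gen, t_gen[..., None]], axis=-1)  # (n, 3, 4)
    diff = (Rt_gt - Rt_gen).reshape(len(Rt_gt), -1)
    return float(np.linalg.norm(diff, axis=-1).sum())
```

In practice the estimated poses come from a structure-from-motion tool run on the generated video, so translations are only defined up to scale; any scale alignment step is outside this sketch.
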
Adding more conditions to generative models typically reduces uncertainty and improves generation quality (e.g., providing more detailed text conditions through recaptioning). We argue that it is also crucial to consider noisy conditions such as the latent features $z_t$, which carry valuable information mixed with random noise. For instance, in SDEdit for image-to-image translation, random noise is added to the input $z_0$ to produce a noisy $z_t$. The clean component $z_0$ preserves overall similarity to the input, while the added noise introduces uncertainty, enabling diverse generations.
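
The SDEdit example can be made concrete with the standard DDPM-style forward process $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The sketch below is illustrative only: the linear beta schedule and its endpoints are assumed values, not taken from this work, and `alpha_bar`, `add_noise`, and `correlation` are hypothetical helper names. It measures how much of $z_0$ survives in $z_t$ by their correlation at a few noise levels.

```python
import numpy as np

def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficient for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, abar, rng):
    """Forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(abar[t]) * z0 + np.sqrt(1.0 - abar[t]) * eps

def correlation(a, b):
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(0)
abar = alpha_bar()
z0 = rng.standard_normal((4, 32, 32))  # stand-in for a clean latent

# The clean component dominates at low t and is almost gone at high t.
corr_by_t = {t: correlation(z0, add_noise(z0, t, abar, rng)) for t in (50, 500, 950)}
for t, c in corr_by_t.items():
    print(f"t={t:4d}  corr(z_0, z_t)={c:.3f}")
```

For a unit-variance $z_0$ the correlation is approximately $\sqrt{\bar\alpha_t}$, so a noisy condition at a high noise level carries little deterministic information about $z_0$.
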
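In this paper, we argue that providing the model with more noisy conditions, especially at high noise levels, does not necessarily reduce uncertainty further, because the noise also introduces randomness and can mislead the model. This is the key insight we aim to convey. To validate this point, we designed an experiment with four setups, corresponding to rows of the table above: (1) Plucker embedding only (the CameraCtrl baseline); (2) Plucker embedding with epipolar attention applied only to the reference frame (CamCo-like); (3) Plucker embedding with epipolar attention (our CamI2V); and (4) Plucker embedding with 3D full attention.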
The amount of noisy conditions accessible to the four setups above increases progressively. One might expect 3D full attention, which accesses the most noisy conditions, to achieve the best performance. However, as shown in the table, 3D full attention performs only slightly better than CameraCtrl on the camera metrics (CamMC 2.1315 vs. 2.2557) and is inferior to the CamCo-like setup (CamMC 1.8851), which applies epipolar attention only to the reference frame. Notably, our method achieves the best result (CamMC 1.7153) by interacting with noisy conditions mainly along the epipolar lines. In the visual comparisons below, the CamCo-like setup relies too heavily on the first frame and fails to generate new objects. The 3D full attention setup generates objects with large motion because it can access pixels from all frames, but the colors of some pixels are contaminated by pixels at incorrect positions. These findings support our insight that an optimal amount of noisy conditions, rather than simply a larger quantity, leads to better uncertainty reduction.
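
The epipolar restriction discussed above can be sketched as a boolean attention mask built from the fundamental matrix $F = K^{-\top}[t]_\times R K^{-1}$: a key pixel in frame $j$ is visible to a query pixel in frame $i$ only if it lies within a threshold distance of the query's epipolar line. This is a minimal NumPy illustration, not the authors' implementation. The pinhole intrinsics `K`, the pose convention $X_j = R X_i + t$, the pixel threshold, and all function names are assumptions. The `register_kv` argument models an always-visible key/value token; treating that as the role of register tokens, and reading the NaN row of the table as a softmax over a fully masked row, is likewise an assumption.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def pixel_grid(h, w):
    """Homogeneous pixel centers, shape (h*w, 3), in row-major order."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs.ravel() + 0.5, ys.ravel() + 0.5, np.ones(h * w)], axis=-1)

def epipolar_mask(K, R, t, h, w, thresh):
    """mask[q, k] is True if key pixel k (frame j) lies within `thresh` pixels
    of the epipolar line of query pixel q (frame i)."""
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ skew(t) @ R @ K_inv
    pix = pixel_grid(h, w)
    lines = pix @ F.T                                   # row q is the line F @ x_q
    norm = np.linalg.norm(lines[:, :2], axis=-1, keepdims=True)
    dist = np.abs(lines @ pix.T) / np.maximum(norm, 1e-8)
    return dist < thresh

def masked_attention(q, k, v, mask, register_kv=None):
    """Single-head softmax attention restricted to mask == True entries."""
    if register_kv is not None:
        rk, rv = register_kv                            # each of shape (r, d)
        k, v = np.concatenate([rk, k]), np.concatenate([rv, v])
        always = np.ones((len(q), len(rk)), dtype=bool)  # registers are never masked
        mask = np.concatenate([always, mask], axis=1)
    scores = np.where(mask, q @ k.T / np.sqrt(q.shape[-1]), -np.inf)
    with np.errstate(invalid="ignore"):                 # a fully masked row yields NaN
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Sideways camera translation: epipolar lines are horizontal image rows.
K = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
mask = epipolar_mask(K, np.eye(3), np.array([1.0, 0.0, 0.0]), h=4, w=4, thresh=0.5)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
mask[0] = False  # simulate a query whose epipolar line has left the frame
without_reg = masked_attention(q, k, v, mask)
with_reg = masked_attention(q, k, v, mask, register_kv=(np.zeros((1, 8)), np.zeros((1, 8))))
print(np.isnan(without_reg[0]).all(), np.isfinite(with_reg).all())
```

When every key for a query is masked, the softmax has nothing to normalize over and the output becomes NaN; an always-visible token gives such queries a well-defined fallback. In a trained model the register key and value would be learned parameters rather than zeros.
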