Frame-Level Captions for Long Video Generation with Complex Multi Scenes

Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan

Abstract

Generating long videos that tell complex stories, such as movie scenes rendered from scripts, holds great promise and offers far more than short clips. However, current methods that combine autoregression with diffusion models often struggle because their step-by-step generation naturally leads to serious error accumulation (drift). Moreover, many existing long-video methods focus on single, continuous scenes, which limits their usefulness for stories with many events and scene changes. This paper introduces a new approach to address these problems. First, we propose a novel way to annotate datasets at the frame level, providing the detailed text guidance needed to generate complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism that keeps text and video precisely aligned. A key feature is that each frame within a denoising window can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to give the model the ability to handle time flexibly. We evaluated our approach on the challenging "Complex Plots" and "Complex Landscapes" tasks of VBench 2.0, building on the WanX2.1-T2V-1.3B model. The results show that our method follows instructions better in complex, changing scenes and produces high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community.
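To make the frame-level conditioning concrete, the sketch below is our own illustration (not the authors' released code) of one way such a Frame-Level Attention Mechanism could be masked in PyTorch: the video tokens of each frame cross-attend only to the text tokens of that frame's caption. The function name, token counts, and shapes are illustrative assumptions.

```python
import torch

def frame_level_cross_attention_mask(num_frames, vid_tokens_per_frame, txt_tokens_per_caption):
    """Boolean mask (True = attend) of shape
    (num_frames * vid_tokens_per_frame, num_frames * txt_tokens_per_caption)."""
    mask = torch.zeros(num_frames * vid_tokens_per_frame,
                       num_frames * txt_tokens_per_caption, dtype=torch.bool)
    for f in range(num_frames):
        q = slice(f * vid_tokens_per_frame, (f + 1) * vid_tokens_per_frame)
        k = slice(f * txt_tokens_per_caption, (f + 1) * txt_tokens_per_caption)
        mask[q, k] = True   # frame f's video tokens see only caption f's text tokens
    return mask

# Example: a 4-frame window with 16 video tokens per frame and 8 text tokens per caption.
mask = frame_level_cross_attention_mask(4, 16, 8)
# The mask can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention
# inside the cross-attention layers, replacing a single video-level prompt with per-frame ones.
```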

Issues of Semantic Confusion and Error Accumulation



Frame-level Caption and Attention



A. Keyframe Generation: quick changes of multiple scenes (up to 6 scenes) in a 5s short video








B. First-Frame-Last-Frame-to-Video (5s short video)







C. Long Video in Complex Settings (5s video-level prompt vs. 30s video-level prompt vs. our 30s frame-level prompt)

Dynamic prompts, and frame-level prompts in particular, are convenient at inference time because each latent maps directly to a corresponding prompt. This makes them well suited to methods such as FIFO, which slides the window by a single latent at each denoising step, and to chunk-level auto-regressive approaches that support variable chunk sizes. Our proposed Parallel Multi-Window Denoising (PMWD) method also benefits from this property: when generating very long sequences, each latent inside every parallel denoising window aligns naturally with its specific prompt, which facilitates effective information exchange in the overlapping regions. That said, all three approaches (FIFO, chunk-level auto-regression, and PMWD) rely on fixed-length sliding windows, which has limitations. For instance, ID consistency can be hard to maintain when the historical frames inside the window are insufficient, or when objects disappear or are occluded in complex scenes. The primary focus of this paper, however, is the flexibility that frame-level prompts bring to the entire pipeline, from dataset collection and construction through training and inference, so we leave this issue for future work. Should stronger ID preservation be needed, our frame-level prompt system can be readily combined with techniques that expand the historical context, such as FramePack or KV-caching methods.
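A minimal sketch of the latent-to-prompt bookkeeping described above follows. It is our illustration rather than the paper's implementation: `denoise_window` stands in for one denoising step of the base model over a fixed-length window, and the window length, stride, and overlap averaging are assumptions made for exposition.

```python
import torch

def pmwd_step(latents, prompt_embs, t, denoise_window, window=16, stride=8):
    """One parallel denoising step over overlapping fixed-length windows.

    latents:     (T, C, H, W) latents for the full long sequence at noise level t
    prompt_embs: (T, L, D) one text embedding per latent, i.e. frame-level prompts
    denoise_window(latents, prompts, t) -> denoised latents, same shape (assumed helper)
    """
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    weight = torch.zeros(T, *([1] * (latents.dim() - 1)), device=latents.device)
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if starts[-1] + window < T:          # make sure the tail of the sequence is covered
        starts.append(T - window)
    for s in starts:                     # windows are independent and could run in parallel
        e = min(s + window, T)
        denoised = denoise_window(latents[s:e], prompt_embs[s:e], t)
        out[s:e] += denoised             # accumulate predictions from every window
        weight[s:e] += 1.0
    return out / weight.clamp(min=1.0)   # average in the overlapping regions
```

Because every latent keeps its own prompt embedding, slicing `prompt_embs[s:e]` alongside `latents[s:e]` is all that is needed to condition each window correctly, regardless of where the window boundaries fall.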










1. The camera enters a golden autumn forest, where the leaves have turned brilliant shades from gold to orange-red. A few leaves drift down with the wind. Sunlight filters through the gaps in the trees, casting dappled spots of light on the forest floor. The scene shifts to a field outside the forest, where ripe rice stalks sway in the breeze, their golden heads bowing under the weight of the grain. A few wild rabbits dart between the rows, occasionally pausing to nibble on the grass. The camera moves again to a winding stream, its water crystal clear, with a few fallen leaves floating gently on the surface. A soft breeze ripples the water, creating subtle waves. Continuing onward, the scene shifts to a hillside, where the distant mountain range is bathed in the warm light of autumn, and a village can be seen nestled at the foot of the mountains, with smoke curling from chimneys.



2. A young adventurer set off in search of five mysterious trial towers. He first arrived at the fire tower, solved the riddle of the fire giant, and earned the flame symbol. He then crossed the icy lands, faced the ice queen, and earned the frost symbol. Next, he entered an ancient temple, defeated the necromancer, and obtained the soul symbol. He then ventured into the thunder mountain range, underwent the lightning god's trial, and earned the thunder symbol. Finally, he reached the tower of light, faced the mysterious celestial being, and earned the light symbol. The adventurer combined the five symbols, uncovered a hidden world, and restored the lost glory.