C. Long Video in Complex Settings (5s video-level prompt vs. 30s video-level prompt vs. our 30s frame-level prompt)
Dynamic prompts, particularly frame-level prompts, are convenient at inference time because each latent unit maps directly to a corresponding prompt. This makes them well suited to methods such as FIFO, in which the window slides by a single latent per denoising step, and to chunk-level auto-regressive approaches that support variable chunk sizes. Our proposed Parallel Multi-Window Denoising (PMWD) method also exploits this property: when generating very long sequences, each latent within every parallel denoising window is easily aligned with its specific prompt, which facilitates effective information exchange in the overlapping regions.
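The per-latent prompt lookup described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function names (`expand_prompts`, `sliding_windows`, `prompts_per_window`), the `(start_latent, prompt)` segment format, and the window and stride values are all hypothetical, chosen only to show that a latent shared by two overlapping fixed-length windows receives the same prompt in both.

```python
from typing import List, Tuple


def expand_prompts(segments: List[Tuple[int, str]], num_latents: int) -> List[str]:
    """Expand (start_latent, prompt) pairs into one prompt per latent index.

    Assumption: segments are sorted by start index and the first starts at 0.
    """
    prompts: List[str] = []
    for i, (start, text) in enumerate(segments):
        # A segment runs until the next segment starts, or to the end of the video.
        end = segments[i + 1][0] if i + 1 < len(segments) else num_latents
        prompts.extend([text] * (end - start))
    if len(prompts) != num_latents:
        raise ValueError("segments do not cover every latent exactly once")
    return prompts


def sliding_windows(num_latents: int, window: int, stride: int) -> List[range]:
    """Fixed-length windows over latent indices.

    Consecutive windows overlap by (window - stride) latents; stride=1
    corresponds to sliding by a single latent.
    """
    last_start = max(num_latents - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the sequence is covered
    return [range(s, min(s + window, num_latents)) for s in starts]


def prompts_per_window(prompts: List[str], windows: List[range]) -> List[List[str]]:
    """Look up the prompt of every latent in every window by its global index."""
    return [[prompts[i] for i in w] for w in windows]


# Hypothetical example: 12 latents, three scene descriptions, windows of 6 with stride 3.
segments = [(0, "forest"), (4, "field"), (8, "stream")]
prompts = expand_prompts(segments, num_latents=12)
windows = sliding_windows(num_latents=12, window=6, stride=3)
for w, p in zip(windows, prompts_per_window(prompts, windows)):
    print(list(w), p)
```

Because the lookup is keyed on the global latent index rather than on the position inside a window, changing the window length or stride does not require re-deriving the prompts.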
All three approaches (FIFO, chunk-level auto-regression, and PMWD) rely on fixed-length sliding windows, which can be limiting.
For instance, maintaining ID consistency can be challenging when the window contains too few historical frames, or when objects disappear or are occluded in complex scenes.
However, the primary focus of this paper is the flexibility and potential that frame-level prompts offer across the entire pipeline, from dataset collection and construction through training and inference.
Thus, we leave this issue for future work.
Should enhanced ID preservation be a priority, our frame-level prompt system can be readily augmented with techniques that expand the historical context, such as those employed by FramePack or KV caching methods.
1. The camera enters a golden autumn forest, where the leaves have turned brilliant shades from gold to orange-red. A few leaves drift down with the wind. Sunlight filters through the gaps in the trees, casting dappled spots of light on the forest floor. The scene shifts to a field outside the forest, where ripe rice stalks sway in the breeze, their golden heads bowing under the weight of the grain. A few wild rabbits dart between the rows, occasionally pausing to nibble on the grass. The camera moves again to a winding stream, its water crystal clear, with a few fallen leaves floating gently on the surface. A soft breeze ripples the water, creating subtle waves. Continuing onward, the scene shifts to a hillside, where the distant mountain range is bathed in the warm light of autumn, and a village can be seen nestled at the foot of the mountains, with smoke curling from chimneys.
2. A young adventurer set off in search of five mysterious trial towers. He first arrived at the fire tower, solved the riddle of the fire giant, and earned the flame symbol. He then crossed the icy lands, faced the ice queen, and earned the frost symbol. Next, he entered an ancient temple, defeated the necromancer, and obtained the soul symbol. He then ventured into the thunder mountain range, underwent the lightning god's trial, and earned the thunder symbol. Finally, he reached the tower of light, faced the mysterious celestial being, and earned the light symbol. The adventurer combined the five symbols, uncovered a hidden world, and restored the lost glory.