LoL: Longer than Longer, Scaling Video Generation to Hour
Abstract
Researchers developed a method to overcome sink-collapse in autoregressive video generation by addressing the conflict between Rotary Position Embedding and multi-head attention mechanisms, enabling real-time streaming of videos up to 12 hours long.
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address this, we propose a lightweight, training-free approach: multi-head RoPE jitter, which breaks inter-head attention homogenization and thereby suppresses long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, among the longest publicly demonstrated results in streaming video generation.
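To make the idea concrete, below is a minimal PyTorch sketch of per-head RoPE jitter. It assumes the jitter is a small multiplicative perturbation of each head's rotary phase, drawn once from a fixed seed so cached keys stay consistent with new queries during streaming; the function and parameter names (`jittered_rope`, `jitter_scale`) are illustrative, not the paper's API, and the exact form of the perturbation may differ from the authors' method.

```python
import torch

def rope_angles(positions, head_dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    return positions.float()[:, None] * inv_freq[None, :]   # (seq, head_dim/2)

def jittered_rope(q, k, positions, jitter_scale=0.02, seed=0):
    """Rotate q/k with RoPE whose phase is jittered independently per head.

    q, k: (batch, heads, seq, head_dim); positions: (seq,) integer positions.
    """
    _, num_heads, _, head_dim = q.shape
    ang = rope_angles(positions, head_dim)                   # (seq, head_dim/2)

    # Per-head multiplicative phase jitter, fixed by the seed so the same
    # offsets are reused across autoregressive steps (KV-cache safe).
    g = torch.Generator().manual_seed(seed)
    jitter = 1.0 + jitter_scale * torch.randn(num_heads, 1, 1, generator=g)
    ang = ang[None] * jitter                                 # (heads, seq, head_dim/2)
    cos, sin = ang.cos()[None], ang.sin()[None]              # (1, heads, seq, head_dim/2)

    def rotate(x):
        # Rotate each (even, odd) channel pair by the per-head angle.
        x1, x2 = x[..., ::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., ::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    return rotate(q), rotate(k)

# Usage: batch 2, 8 heads, 16 tokens, head_dim 64
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
q_rot, k_rot = jittered_rope(q, k, torch.arange(16))
```

Because each head now rotates with a slightly different period, the heads no longer share a common periodic structure with the sink frame's positions, which is one plausible way to break the inter-head homogenization the abstract describes; `jitter_scale` trades off decorrelating the heads against preserving positional fidelity.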
Community
Scaling up video generation to hour-long outputs. Please check out our paper at: https://arxiv.org/abs/2601.16914. The project page and code will be released at: https://github.com/justincui03/LoL
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression (2025)
- JoyAvatar-Flash: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion (2025)
- VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory (2025)
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation (2025)
- Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length (2025)
- Endless World: Real-Time 3D-Aware Long Video Generation (2025)
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache (2026)