Playing with Transformer at 30+ FPS via Next-Frame Diffusion


Xinle Cheng🏮🌲 Tianyu He🌲+ Jiayi Xu🏮 Junliang Guo🌲 Di He🏮 Jiang Bian🌲

🏮Peking University 🌲Microsoft

Paper arXiv Code (Coming Soon)

TL;DR: we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling efficient inference via few-step sampling and parallel token generation.



Videos generated by NFD+ 310M at 31 FPS on an A100 GPU.



Abstract

Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation.

To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully leverage parallel computation, motivated by the observation that adjacent frames often share the same action input, we propose speculative sampling: the model generates the next few frames using the current action input, and discards the speculatively generated frames if the actual action input differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD outperforms autoregressive baselines in both visual quality and sampling efficiency. For the first time, we achieve autoregressive video generation at over 30 frames per second (FPS) on an A100 GPU using a 310M model.

Architecture

The architecture of NFD consists of a tokenizer that transforms raw visual signals into latent representations, and a Diffusion Transformer (DiT) that generates these latents.

We propose a Block-wise Causal Attention mechanism that combines bidirectional attention within each frame with causal dependencies across frames, modeling spatio-temporal dependencies efficiently. In contrast to computationally intensive 3D full attention, our approach reduces the overall attention cost by 50%, enabling hardware-efficient, streaming prediction of all tokens in the next frame in parallel.
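For concreteness, here is a minimal PyTorch sketch of a block-wise causal mask; the frame and token counts are illustrative, and this is not the paper's released implementation:

```python
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask: True where attention is allowed.

    Tokens attend bidirectionally within their own frame and causally
    to all tokens of earlier frames (block lower-triangular pattern).
    """
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame  # frame index of each token
    # Query i may attend to key j iff frame(j) <= frame(i).
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Example: 3 frames x 4 tokens gives a 12x12 mask that can be passed to
# torch.nn.functional.scaled_dot_product_attention via attn_mask.
mask = block_causal_mask(num_frames=3, tokens_per_frame=4)
```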

Training and Sampling

Given a video frame $x_i$, we assign an independent timestep $t_i$ and generate a noised version via linear interpolation:

$$x_i^{t_i} = (1 - t_i)\, x_i + t_i\, \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, I).$$

Training minimizes the following Flow Matching loss, where $v_\theta$ predicts the velocity of the interpolation path:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{x_i,\, t_i,\, \epsilon_i}\Big[\, \big\| v_\theta\big(x_i^{t_i}, t_i\big) - \big(\epsilon_i - x_i\big) \big\|_2^2 \,\Big].$$
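A minimal PyTorch sketch of one training step under this formulation follows; the model interface and tensor shapes are placeholder assumptions, not the released code:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, frames: torch.Tensor) -> torch.Tensor:
    """One flow-matching training step on latent frames.

    frames: (batch, num_frames, dim) latents; each frame receives its
    own timestep, matching the per-frame noising described above.
    """
    b, f, _ = frames.shape
    t = torch.rand(b, f, 1, device=frames.device)   # independent timestep per frame
    eps = torch.randn_like(frames)                  # Gaussian noise
    x_t = (1.0 - t) * frames + t * eps              # linear interpolation
    target = eps - frames                           # velocity of the interpolation path
    v_pred = model(x_t, t)                          # predicted velocity
    return F.mse_loss(v_pred, target)
```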

For sampling, we adopt DPM-Solver++. Under the velocity parameterization above, the denoised frame is recovered as:

$$\hat{x}_i = x_i^{t} - t\, v_\theta\big(x_i^{t}, t\big).$$
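The sketch below shows this recovery rule inside a simple first-order sampling loop; the paper uses DPM-Solver++, a higher-order multistep solver, so this Euler-style loop is an illustration only:

```python
import torch

@torch.no_grad()
def sample_frame(model, shape, num_steps: int = 4, device: str = "cuda"):
    """Euler-style sampler built on the denoised-frame recovery above.

    Not DPM-Solver++ itself: it only shows how the clean-frame estimate
    x0_hat drives each update from noise level t_cur down to t_next.
    """
    x = torch.randn(shape, device=device)               # pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = model(x, t_cur)                             # predicted velocity
        x0_hat = x - t_cur * v                          # recover the denoised frame
        eps_hat = (x - (1.0 - t_cur) * x0_hat) / t_cur  # implied noise
        x = (1.0 - t_next) * x0_hat + t_next * eps_hat  # re-noise to level t_next
    return x
```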

Accelerated Sampling

We introduce a set of methodological advancements aimed at improving the sampling efficiency of NFD, while preserving high visual fidelity in the generated video content.

Consistency Distillation

We extend sCM to the video domain and adapt its training objective to the specific characteristics of video data, allowing the distilled model to generate high-quality frames with only a few sampling steps.
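As a rough illustration, the sketch below shows the standard consistency-distillation objective (Song et al.) from which sCM descends; NFD's actual video-adapted sCM loss uses a different parameterization and weighting, and the teacher step here is a plain Euler update:

```python
import torch
import torch.nn.functional as F

def consistency_distillation_step(student, ema_student, teacher, x0, t_hi, t_lo):
    """Sketch of classic consistency distillation, not NFD's exact sCM loss.

    student / ema_student map a noisy frame at time t to a clean-frame
    estimate; teacher is the pretrained velocity model from above.
    """
    eps = torch.randn_like(x0)
    x_hi = (1.0 - t_hi) * x0 + t_hi * eps  # noisy sample at the higher level
    with torch.no_grad():
        v = teacher(x_hi, t_hi)
        x_lo = x_hi - (t_hi - t_lo) * v    # one Euler step of the teacher ODE
        target = ema_student(x_lo, t_lo)   # consistency target at the lower level
    return F.mse_loss(student(x_hi, t_hi), target)
```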

Speculative Sampling

We introduce a speculative sampling technique that accelerates inference by predicting multiple future frames in parallel, under the assumption that the current action input repeats. After this speculative generation, the assumed actions are compared with the actual subsequent action inputs in the sequence. Once a discrepancy is detected, all speculative frames from that point onward are discarded, and generation resumes from the last verified frame.
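A minimal sketch of this loop, with a hypothetical `model.generate` interface that produces several frames in parallel (names and window size are illustrative):

```python
def speculative_generate(model, actions, window: int = 3):
    """Action-speculative generation sketch; `model.generate` is hypothetical.

    Speculates that the current action repeats for the next `window`
    frames, then keeps only the frames whose assumed action matches
    the true action sequence.
    """
    frames, i = [], 0
    while i < len(actions):
        k = min(window, len(actions) - i)
        assumed = [actions[i]] * k                    # speculate: action repeats
        candidates = model.generate(frames, assumed)  # k frames in parallel
        accepted = 0
        for j in range(k):                            # verify against true actions
            if actions[i + j] != assumed[j]:
                break                                 # discard frames past the mismatch
            accepted += 1
        frames.extend(candidates[:accepted])          # accepted >= 1: first action matches
        i += accepted
    return frames
```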

Main Results

We present a comparative analysis of our proposed method against state-of-the-art baselines in the following table, highlighting both sampling efficiency and visual quality of the generated videos.

Visualization: Details Aligned with Physical Properties.

Videos generated by NFD+ and MineWorld, respectively, illustrating a door-opening sequence. NFD+ accurately captures the door's geometry, maintaining its shape and structural integrity. In contrast, MineWorld introduces an artificial line between the two doors and fails to retain detail in the right portion of the door.

Visualization: Consistency Across Frames.

Videos generated by NFD+ and MineWorld, respectively, illustrating the superior temporal consistency achieved by NFD+. Despite significant camera movement, NFD+ preserves a stable and coherent ground, whereas MineWorld introduces visible artifacts and distortions.

Citation

@misc{cheng2025playingtransformer30fps,
    title={Playing with Transformer at 30+ FPS via Next-Frame Diffusion}, 
    author={Xinle Cheng and Tianyu He and Jiayi Xu and Junliang Guo and Di He and Jiang Bian},
    year={2025},
    eprint={2506.01380},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.01380}, 
}