arxiv:2606.29215

Multi-Block Diffusion Language Models

Published on Jun 30 · Submitted by Yijie Jin on Jul 1

Abstract

Multi-Block Diffusion Language Models extend single-block diffusion to concurrent block decoding with improved training strategies and optimized decoding algorithms.

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
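To make the static-shape idea concrete, here is a minimal toy sketch of a fixed-size buffer of block slots, written in Python. It is not the paper's decoding algorithm: the function name `simulate_block_buffer`, the `MASK` and `PAD` ids, the per-slot unmasking budget, and the fake "decoded" token values are all assumptions made for illustration, and no real model or KV cache is involved. It only shows how a running-set of consecutive blocks can advance while every forward pass sees the same `num_slots x block_size` input.

```python
MASK = -1  # hypothetical id for a not-yet-decoded token
PAD = -2   # hypothetical id filling slots that hold no logical block


def simulate_block_buffer(total_blocks, num_slots, block_size, rate):
    """Toy multi-block decoding loop over a fixed-shape buffer.

    Returns (committed_tokens, forward_passes, observed_shapes).
    """
    next_block = 0   # next logical block to admit into the running set
    prefix = []      # committed tokens; stands in for the cached clean prefix
    shapes = set()   # every input shape the fake "model" was called with

    def new_slot():
        nonlocal next_block
        if next_block < total_blocks:
            slot = (next_block, [MASK] * block_size)
            next_block += 1
        else:
            slot = (None, [PAD] * block_size)  # inactive slot, shape preserved
        return slot

    slots = [new_slot() for _ in range(num_slots)]
    passes = 0
    while any(MASK in tokens for _, tokens in slots):
        # One forward pass always covers num_slots x block_size positions.
        shapes.add((len(slots), block_size))
        passes += 1
        for k, (block_id, tokens) in enumerate(slots):
            if block_id is None:
                continue
            budget = max(1, rate - k)  # assumed: older blocks unmask more per pass
            for i, tok in enumerate(tokens):
                if tok == MASK and budget > 0:
                    tokens[i] = block_id * block_size + i  # fake "decoded" token
                    budget -= 1
        # Commit fully decoded leading blocks and refill the buffer from the back.
        while slots[0][0] is not None and MASK not in slots[0][1]:
            prefix.extend(slots.pop(0)[1])
            slots.append(new_slot())
    return prefix, passes, shapes


tokens, passes, shapes = simulate_block_buffer(
    total_blocks=4, num_slots=2, block_size=4, rate=2
)
print(f"tokens={len(tokens)} passes={passes} TPF={len(tokens) / passes:.2f} shapes={shapes}")
```

With these toy settings, two slots commit all 16 tokens in 6 passes, while the same call with `num_slots=1` needs 8. Only one input shape is ever observed in either case. Finished blocks leave from the front of the buffer, so the committed prefix only grows, which is the property that prefix-cache reuse relies on.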

Community

Paper submitter

We introduce Multi-Block Diffusion Language Models (MBD-LMs), a unified framework that bridges the training-inference gap for practical multi-block diffusion in block diffusion language models (BD-LMs). We identify that the existing teacher forcing and diffusion forcing (D2F) paradigms fail to align with the bounded running-set and heterogeneous slot-wise noise patterns required by Multi-Block Diffusion (MultiBD) inference. To address this, we propose Multi-block Teacher Forcing (MultiTF), a lightweight post-training method that constructs bounded noise groups with randomized chain-uniform scheduling, so that any BD-LM can be upgraded into an MBD-LM. On the inference side, we design the Block Buffer mechanism to decouple dynamic running-sets from static physical shapes, enabling CUDA Graph capture and prefix KV cache reuse. Empirically, MBD-LLaDA2-Mini achieves a 78.4% improvement in Tokens Per Forward pass (TPF), from 3.47 to 6.19, while improving average accuracy from 79.95% to 81.03%. Combined with DMax, TPF reaches 9.34 with strong throughput gains. We also release Diffulex, a unified serving engine that supports MBD-LMs and various BD-LM strategies (SingleBD, MultiBD, Dual Cache, DMax, etc.) under a single backend.
Project page: https://sjtu-deng-lab.github.io/mbd-lms/
Training code: https://github.com/SJTU-DENG-Lab/mbd-lms
Inference engine (Diffulex): https://github.com/SJTU-DENG-Lab/Diffulex
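
The headline percentages follow from the reported averages by simple arithmetic. The short Python check below recomputes them. The helper name `relative_gain` is introduced here for illustration, and the inputs are the figures quoted on this page.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Average Tokens Per Forward pass (TPF) reported for MBD-LLaDA2-Mini.
tpf_gain = relative_gain(3.47, 6.19)

# Average accuracy change, in percentage points.
acc_delta = 81.03 - 79.95

print(f"TPF gain: {tpf_gain:.1f}%")
print(f"Accuracy change: {acc_delta:+.2f} points")
```

The TPF gain is a relative change, while the accuracy change is an absolute difference in percentage points, so the two numbers are not directly comparable.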

