
LLaDA2.0-Uni unifies multimodal understanding and image generation in a single diffusion language model

2026-04-25 01:06

Researchers at InclusionAI (Ant Group) posted LLaDA2.0-Uni to arXiv on April 22, presenting a unified discrete diffusion LLM that handles both multimodal understanding and image generation through a single mask-token-prediction framework. The architecture pairs an MoE backbone with a dedicated diffusion decoder distilled down to 8-step inference. The model matches specialized vision-language models on multimodal understanding benchmarks while also producing high-fidelity images, and it natively supports interleaved generation and reasoning. Weights and code are available on GitHub under an open license.
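For readers unfamiliar with mask-token-prediction diffusion: generation starts from a fully masked response, and the model fills positions in over a fixed number of denoising steps, re-masking its least confident predictions between steps so they can be revised. The sketch below illustrates that loop in PyTorch with an 8-step budget. It is not the authors' released code; `model`, `MASK_ID`, `resp_len`, and the linear confidence-based re-masking schedule are all assumptions made for illustration.

```python
import torch

# Toy values; the real ones depend on the released tokenizer/config (assumptions).
MASK_ID = 0        # id of the [MASK] token (hypothetical)
NUM_STEPS = 8      # matches the distilled 8-step inference mentioned above

@torch.no_grad()
def diffusion_generate(model, prompt_ids, resp_len=64, num_steps=NUM_STEPS):
    """Unmask a fully masked response over `num_steps` denoising steps.

    `model(ids)` is assumed to return token logits of shape [1, T, V].
    """
    # Start from the prompt followed by an all-[MASK] response.
    response = torch.full((1, resp_len), MASK_ID, dtype=torch.long)
    ids = torch.cat([prompt_ids, response], dim=1)
    prompt_len = prompt_ids.shape[1]

    for step in range(num_steps):
        logits = model(ids)                                  # [1, T, V]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)      # confidence + argmax

        masked = ids == MASK_ID
        masked[:, :prompt_len] = False                       # never rewrite the prompt

        # Fill every masked position with its argmax prediction.
        ids = torch.where(masked, pred, ids)

        if step < num_steps - 1:
            # Linear schedule: re-mask the lowest-confidence predictions so
            # later steps can revise them; committed tokens stay fixed.
            conf = conf.masked_fill(~masked, float("inf"))
            n_remask = int(masked.sum()) * (num_steps - 1 - step) // (num_steps - step)
            if n_remask > 0:
                flat = ids.flatten()
                flat[conf.flatten().argsort()[:n_remask]] = MASK_ID
                ids = flat.view_as(ids)

    return ids[:, prompt_len:]
```

Note that in the actual model, image tokens would route through the dedicated diffusion decoder rather than a plain text sampler like this; the sketch only illustrates the shared mask-prediction loop that lets one backbone serve both understanding and generation.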
