MeDiM: Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
This repository contains the official implementation of MeDiM, presented in the paper Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation.
🔥 Introduction
We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across different medical modalities without requiring modality-specific components. MeDiM unifies multiple generative tasks: it flexibly translates between images and text or jointly produces image-report pairs across domains in response to user prompts. It builds on a discrete diffusion framework that unifies vision and language representations by modeling their shared probabilistic distribution. To empower the diffusion process to support unified and versatile medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its rich prior knowledge and cross-modal reasoning abilities. Because MLLMs are trained with causal (autoregressive) masking while diffusion denoising benefits from bidirectional context, MeDiM introduces two key designs: 1) removing the causal attention mask to enable the fully bidirectional information flow essential for mutual alignment, and 2) injecting continuous timestep embeddings to make the MLLM aware of the diffusion steps. Extensive experiments validate MeDiM as a unified foundation model capable of high-fidelity medical generation across various modalities, including medical image generation (16.60 FID on MIMIC-CXR; 24.19 FID on PathGen) and report generation (0.2650 METEOR on MIMIC-CXR; 0.2580 METEOR on PathGen). In addition, the jointly generated medical image-report pairs improve downstream task performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, and +4.80% METEOR on PathGen), enabling the use of multimodal inputs and the production of coherent, clinically grounded outputs.
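The two backbone adaptations can be pictured with a small PyTorch sketch. This is illustrative only, not the released MeDiM code: attention is run without a causal mask, and a sinusoidal timestep embedding is projected and added to the token embeddings. All module and function names below (e.g., `BidirectionalDenoiserBlock`, `sinusoidal_timestep_embedding`) are assumptions made for the example.

```python
# Minimal sketch (not the official MeDiM implementation) of the two adaptations
# described above: (1) bidirectional self-attention instead of a causal mask,
# and (2) a continuous timestep embedding injected so the backbone is aware of
# the current diffusion step. Names and conditioning choices are assumptions.
import math
import torch
import torch.nn as nn


def sinusoidal_timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of timesteps t with shape (B,)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)


class BidirectionalDenoiserBlock(nn.Module):
    """One transformer block with full (non-causal) attention, conditioned on the timestep."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.time_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Inject the timestep embedding additively (one simple conditioning choice).
        x = x + self.time_proj(t_emb)[:, None, :]
        # No attn_mask is passed, so attention is fully bidirectional (no causal mask).
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    B, L, D = 2, 16, 512
    block = BidirectionalDenoiserBlock(D)
    tokens = torch.randn(B, L, D)                                   # embedded (noised) tokens
    t_emb = sinusoidal_timestep_embedding(torch.tensor([10.0, 250.0]), D)
    print(block(tokens, t_emb).shape)                               # torch.Size([2, 16, 512])
```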
🧑‍⚕️ Framework
Overview of the MeDiM architecture. The framework integrates an MLLM backbone within a discrete diffusion process for unified medical multimodal generation. During the forward process, data is tokenized and progressively corrupted over timesteps; the MLLM is then trained to reverse this process. Key architectural changes, including causal attention removal, timestep embeddings, and AdaLN, adapt the autoregressive MLLM for the bidirectional denoising required for unified medical generation.
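As a rough illustration of the two pieces mentioned above, the sketch below shows a mask-based forward corruption for discrete tokens and an AdaLN-style layer that modulates features with the timestep embedding. The linear masking schedule, the `[MASK]` id handling, and all names here are assumptions for the example, not the repository's implementation.

```python
# Minimal sketch (assumptions, not the official code) of a discrete diffusion
# forward process and AdaLN-style timestep conditioning. Token IDs are replaced
# by a special [MASK] id with a probability that grows with the timestep; the
# network is trained to recover the original tokens.
import torch
import torch.nn as nn


def forward_mask_diffusion(tokens: torch.Tensor, t: torch.Tensor, num_steps: int, mask_id: int):
    """Corrupt discrete tokens (B, L) at timesteps t (B,) by random masking.

    Uses a simple linear schedule: mask probability = t / num_steps.
    """
    mask_prob = (t.float() / num_steps)[:, None]                     # (B, 1)
    corrupt = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    return torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)


class AdaLN(nn.Module):
    """Adaptive LayerNorm: scale and shift are predicted from the timestep embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)   # each (B, dim)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]


if __name__ == "__main__":
    B, L, V, D, T = 2, 12, 1000, 512, 1000
    mask_id = V                                   # reserve one extra id for [MASK]
    tokens = torch.randint(0, V, (B, L))
    t = torch.randint(1, T + 1, (B,))
    noisy = forward_mask_diffusion(tokens, t, T, mask_id)

    ada = AdaLN(D)
    x = torch.randn(B, L, D)                      # embeddings of the noised tokens
    t_emb = torch.randn(B, D)                     # timestep embedding (e.g., sinusoidal)
    print(noisy.shape, ada(x, t_emb).shape)
```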
Getting Started and Training
For detailed instructions on installation, data preparation, and training procedures, please refer to the official GitHub repository. The repository also provides information on downloading pretrained weights.
🩺 Citation
If you find this repository useful, please consider giving it a star ⭐ and a citation 📝:
@misc{mao2025discretediffusionmodelsmllms,
  title={Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation},
  author={Jiawei Mao and Yuhan Wang and Lifeng Chen and Can Zhao and Yucheng Tang and Dong Yang and Liangqiong Qu and Daguang Xu and Yuyin Zhou},
  year={2025},
  eprint={2510.06131},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.06131},
}