Abstract
A unified visual tokenizer pre-training framework (VTP) improves generative performance by jointly optimizing image-text contrastive, self-supervised, and reconstruction losses, leading to better scaling properties, higher zero-shot accuracy, and faster convergence.
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generation performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) VTP exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2% zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1x faster convergence in generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while the conventional autoencoder stagnates very early, at 1/10 of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
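The abstract's key technical idea is that the tokenizer is pre-trained against three objectives at once: an image-text contrastive term, a self-supervised term, and a pixel reconstruction term. The sketch below illustrates one plausible way such a joint loss could be combined in PyTorch. The function name, the loss weights, and the choice of a cosine-similarity self-supervised term are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def vtp_joint_loss(image_embed, text_embed, student_feats, teacher_feats,
                   recon, target, w_clip=1.0, w_ssl=1.0, w_rec=1.0,
                   temperature=0.07):
    """Sketch of a joint tokenizer loss: contrastive + self-supervised + reconstruction."""
    # (1) Image-text contrastive term (CLIP-style symmetric InfoNCE over the batch).
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    logits = image_embed @ text_embed.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels))

    # (2) Self-supervised term: align student features with a stop-gradient
    #     teacher view (here a simple cosine objective as a stand-in).
    loss_ssl = 1.0 - F.cosine_similarity(
        student_feats, teacher_feats.detach(), dim=-1).mean()

    # (3) Reconstruction term: the classic autoencoder pixel loss.
    loss_rec = F.mse_loss(recon, target)

    # Weighted sum; the relative weights are hypothetical and would be tuned.
    return w_clip * loss_clip + w_ssl * loss_ssl + w_rec * loss_rec
```

The intent of such a formulation is that the reconstruction term alone would reproduce the standard VAE bias toward low-level detail, while the contrastive and self-supervised terms push the latent space toward the high-level semantics the abstract argues generation needs.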
Community
GitHub code: https://github.com/MiniMax-AI/VTP
Huggingface weights: https://huggingface.co/collections/MiniMaxAI/vtp
In collaboration with the HUST Vision Lab: https://github.com/hustvl
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks (2025)
- DINO-Tok: Adapting DINO for Visual Tokenizers (2025)
- InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision (2025)
- Visual Generation Tuning (2025)
- Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank (2025)
- RecTok: Reconstruction Distillation along Rectified Flow (2025)
This gets at a core bottleneck in generative vision: tokenizers optimized for pixels don’t scale cognition. VTP’s shift toward semantic-first latent spaces mirrors what we’ve already learned in language — understanding must precede generation. The fact that generation quality now scales with tokenizer pretraining FLOPs is the real breakthrough here. This feels like a necessary correction to the “just reconstruct better” era of VAEs and a strong signal that vision models are finally being trained to think, not just compress.