Why SD VAE in 2025?

#3
by Luke2642 - opened

I'm curious, why did you spend compute training an SD-VAE-based model in 2025?

The EQ-VAE trains faster, the DC-AE gives higher resolution for a smaller latent. Flux AE gives higher quality.

Apple org

This is for research purposes. We start with SD VAE for a fair comparison on ImageNet as well.

Thanks for the reply, I didn't expect it, I'd love to have a more detailed chat about it, if you're interested!

I understand comparisons and baselines are very important, but in the nicest possible framing, it's training a model with one hand tied behind your back! The problems with SD VAE are now much better understood than three years ago. Lots of research has shown that downstream models learn faster when a) natural image priors survive the dimensionality reduction step, and b) the latent manifold is well behaved.

There's far more work to be done there: preserving scale, rotation, mirroring, and translation in the reduced manifold like EQ-VAE, while also using the strengths of residual encoding like DC-AE.
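To make the equivariance idea concrete, here's a minimal sketch of that kind of penalty (a rough illustration of the general idea, not EQ-VAE's actual implementation): encode a transformed image and compare it to the same transform applied in latent space, then add this to the usual reconstruction/KL objective with a small weight.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(encoder, x):
    """Sketch of an equivariance penalty: encode(T(x)) should match T(encode(x))
    for simple spatial transforms T. Not any paper's exact recipe."""
    # pick a random transform: horizontal flip or 90-degree rotation
    if torch.rand(()) < 0.5:
        t = lambda img: torch.flip(img, dims=[-1])            # mirror
    else:
        t = lambda img: torch.rot90(img, k=1, dims=[-2, -1])  # rotate 90 degrees

    z = encoder(x)          # latent of the original image, shape (B, C, h, w)
    z_t = encoder(t(x))     # latent of the transformed image

    # apply the same transform directly in latent space and penalise the mismatch
    return F.mse_loss(z_t, t(z))
```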


Another "downstream thing to optimize for" is tokenisation for transformers, like this:

https://github.com/facebookresearch/SSDD

However, IMHO there's an oversight in the community: the "discovery" of the pixel -> latent space mapping has thus far mostly been done with relatively small models. Sticking with the SD-VAE example, OpenAI addressed this with:

https://github.com/openai/consistencydecoder

But the consistency decoder had the training objective priorities backwards: it keeps the same deeply flawed "latent API" of SD-VAE and attempts to patch it in the decoder, which ends up producing a very clever general reconstruction model that accepts a sort of "noisy input" in the form of the SD-VAE latent.

I have a hunch we need to do the exact opposite: train an enormous model, a genius in semantic understanding and high perceptual reconstruction quality, but strongly regularized to a very well behaved equivariant latent space. Then distill it to a smaller model for all downstream model training and inference. Find the patterns separately from utilizing the knowledge! Like this:

https://huggingface.co/lightx2v/Autoencoders
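A crude sketch of what I mean by the distillation step (the teacher/student setup and weights here are hypothetical, not necessarily what that repo does): freeze the big, well-regularized autoencoder and train a small one to reproduce its latents and its reconstructions.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher_enc, teacher_dec, student_enc, student_dec, x, opt,
                 w_latent=1.0, w_recon=1.0):
    """One hypothetical distillation step: a small student autoencoder learns to
    match the frozen teacher's well-behaved latent space and reconstructions."""
    with torch.no_grad():
        z_t = teacher_enc(x)    # target latents from the big regularized model
        x_t = teacher_dec(z_t)  # target reconstructions

    z_s = student_enc(x)
    x_s = student_dec(z_s)

    loss = w_latent * F.mse_loss(z_s, z_t) + w_recon * F.mse_loss(x_s, x_t)

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```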

Another low hanging fruit that I haven't seen any project take up yet is switching to a perceptually uniform colourspace before the AE/VAE. Basically, RGB, HSV, LAB etc. all imply the "wrong" colour difference formula for human perception. It's like every paper reporting PSNR with the wrong "measuring stick", just because everyone else uses the wrong measuring stick!

https://bottosson.github.io/posts/oklab/
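For reference, the sRGB -> Oklab conversion from that post is tiny. The coefficients below are transcribed from Ottosson's write-up, so treat this as a sketch and double-check against the source before relying on it:

```python
import numpy as np

def srgb_to_oklab(rgb):
    """Convert sRGB values in [0, 1], shape (..., 3), to Oklab.
    Coefficients transcribed from https://bottosson.github.io/posts/oklab/."""
    rgb = np.asarray(rgb, dtype=np.float64)

    # sRGB transfer function -> linear light
    lin = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)

    # linear sRGB -> LMS-like cone response
    m1 = np.array([[0.4122214708, 0.5363325363, 0.0514459929],
                   [0.2119034982, 0.6806995451, 0.1073969566],
                   [0.0883024619, 0.2817188376, 0.6299787005]])
    lms = lin @ m1.T

    # cube-root nonlinearity, then mix into L, a, b
    lms_ = np.cbrt(lms)
    m2 = np.array([[0.2104542553,  0.7936177850, -0.0040720468],
                   [1.9779984951, -2.4285922050,  0.4505937099],
                   [0.0259040371,  0.7827717662, -0.8086757660]])
    return lms_ @ m2.T

# white should land at roughly L=1, a=0, b=0
print(srgb_to_oklab([1.0, 1.0, 1.0]))
```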

Finally, I also think there is huge potential in baking more semantics into the latent space in the form of layering and depth, going far beyond just an alpha channel. There have been enormous strides in foundation vision models for monocular depth extraction, as well as image matting. With your resources, every training image could be decomposed into meaningful depth layers, or even just foreground and background, with excellent image matting around hair, fur, translucency, etc.; the latent representation would then be semantically far richer. One well designed representation could work across vector graphic alpha as well as natural images.

It does need careful attention though; naively adding an alpha channel to any colourspace is illogical: 0% alpha black and 0% alpha white are the same 'singularity' on the manifold. I haven't found a satisfying solution to that, though, without invoking complexity like inverse rendering or some really nasty maths.
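A tiny numerical illustration of that singularity (toy numbers, just to show the many-to-one collapse): at 0% alpha every RGB value composites to the identical visible pixel, so the reconstruction target for those channels is ill-defined.

```python
import numpy as np

def composite_over(fg_rgb, alpha, bg_rgb):
    """Standard 'over' compositing with straight (non-premultiplied) alpha."""
    fg_rgb, bg_rgb = np.asarray(fg_rgb, float), np.asarray(bg_rgb, float)
    return alpha * fg_rgb + (1.0 - alpha) * bg_rgb

bg = [0.3, 0.6, 0.9]
black_at_zero = composite_over([0.0, 0.0, 0.0], 0.0, bg)
white_at_zero = composite_over([1.0, 1.0, 1.0], 0.0, bg)

# Both print the same pixel: at alpha=0 the RGB channels carry no visible
# information, so infinitely many (R, G, B, 0) inputs map to one output.
print(black_at_zero, white_at_zero)
```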

Anyway, I'm working on a paper on this exact thing, so it's a bit of a brain dump, a sneak preview!
