---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
- adaptive-cantor
- geometric-fusion
license: mit
---

# VAE Lyra 🎵 - Adaptive Cantor Edition

Multi-modal Variational Autoencoder for SDXL text embedding transformation using adaptive Cantor fractal fusion with learned alpha (visibility) and beta (capacity) parameters.

Fuses CLIP-L, CLIP-G, and decoupled T5-XL scales into a unified latent space.

## Model Details

- **Fusion Strategy**: adaptive_cantor
- **Latent Dimension**: 2048
- **Training Steps**: 78,750
- **Best Loss**: 0.2336

## Learned Parameters

**Alpha (Visibility):**
- clip_g: 0.7291
- clip_l: 0.7280
- t5_xl_g: 0.7244
- t5_xl_l: 0.7161

**Beta (Capacity):**
- clip_l_t5_xl_l: 0.5726
- clip_g_t5_xl_g: 0.5744


## Architecture

- **Modalities** (with sequence lengths): 
  - CLIP-L (768d @ 77 tokens) - SDXL text_encoder
  - CLIP-G (1280d @ 77 tokens) - SDXL text_encoder_2  
  - T5-XL-L (2048d @ 512 tokens) - Auxiliary for CLIP-L
  - T5-XL-G (2048d @ 512 tokens) - Auxiliary for CLIP-G
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
- **Cantor Depth**: 8
- **Local Window**: 3

## Key Features

### Adaptive Cantor Fusion
- **Cantor Fractal Routing**: Sparse attention based on fractal coordinate mapping
- **Learned Alpha (Visibility)**: Per-modality parameters controlling latent space usage (tied to KL divergence)
- **Learned Beta (Capacity)**: Per-binding-pair parameters controlling source influence strength
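The fractal routing can be pictured as a boolean attention mask: each token keeps a dense local window and additionally reaches tokens at power-of-3 (Cantor-style) strides. This is a hypothetical sketch, not the repository's implementation; `cantor_mask` is an illustrative name whose defaults mirror the `Cantor Depth` and `Local Window` settings listed under Architecture.

```python
import torch

def cantor_mask(seq_len: int, depth: int = 8, local_window: int = 3) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True = attention allowed.

    Illustrative sparse routing: a dense local window plus long-range
    links at power-of-3 offsets (3^0 .. 3^(depth-1)).
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        # Dense local neighborhood around token i.
        lo, hi = max(0, i - local_window), min(seq_len, i + local_window + 1)
        mask[i, lo:hi] = True
        # Sparse long-range links at fractal (power-of-3) strides.
        for k in range(depth):
            stride = 3 ** k
            for j in (i - stride, i + stride):
                if 0 <= j < seq_len:
                    mask[i, j] = True
    return mask
```

At 77 tokens this keeps attention roughly O(n · (window + depth)) sparse instead of the dense O(n²) pattern.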

### Decoupled T5 Scales
- T5-XL-L binds specifically to CLIP-L (weight: 0.3)
- T5-XL-G binds specifically to CLIP-G (weight: 0.3)
- Independent T5 representations allow specialized semantic enrichment per CLIP encoder
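As a hedged sketch, the binding can be pictured as a convex blend of each CLIP latent with its bound T5 latent, using the 0.3 weight above. The function and variable names here are illustrative, not the repository's API.

```python
import torch

def bind(clip_mu: torch.Tensor, t5_mu: torch.Tensor, weight: float = 0.3) -> torch.Tensor:
    """Blend a CLIP latent toward its bound T5 latent (illustrative only)."""
    return (1.0 - weight) * clip_mu + weight * t5_mu

# Toy latents in place of the encoder outputs.
mu_clip_l = torch.zeros(1, 2048)
mu_t5_l = torch.ones(1, 2048)
bound = bind(mu_clip_l, mu_t5_l)  # each entry = 0.3
```

Because each T5 stream binds to exactly one CLIP encoder, the two blends stay independent, which is what allows the per-encoder semantic enrichment described above.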

### Variable Sequence Lengths
- CLIP: 77 tokens (standard)
- T5: 512 tokens (extended context for richer semantic capture)

## SDXL Compatibility

This model outputs both CLIP embeddings needed for SDXL:
- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output

T5 information is encoded into the latent space and influences both CLIP outputs through learned binding weights.
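For downstream SDXL use, the two decoded tensors are typically concatenated along the feature axis to form the UNet's cross-attention context (768 + 1280 = 2048). A minimal shape check, using random tensors in place of the VAE's outputs:

```python
import torch

# Stand-ins for recons["clip_l"] and recons["clip_g"] from the VAE.
clip_l = torch.randn(1, 77, 768)
clip_g = torch.randn(1, 77, 1280)

# SDXL's UNet consumes the two encoder outputs concatenated feature-wise.
prompt_embeds = torch.cat([clip_l, clip_g], dim=-1)
assert prompt_embeds.shape == (1, 77, 2048)
```

Note that an SDXL pipeline additionally expects pooled `text_encoder_2` embeddings, which this sketch does not cover.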

## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-xl-adaptive-cantor",
    filename="model.pt"
)

# Load checkpoint (CPU-safe; move to GPU after loading if needed)
checkpoint = torch.load(model_path, map_location="cpu")

# Create model
config = MultiModalVAEConfig(
    modality_dims={
        "clip_l": 768,
        "clip_g": 1280,
        "t5_xl_l": 2048,
        "t5_xl_g": 2048
    },
    modality_seq_lens={
        "clip_l": 77,
        "clip_g": 77,
        "t5_xl_l": 512,
        "t5_xl_g": 512
    },
    binding_config={
        "clip_l": {"t5_xl_l": 0.3},
        "clip_g": {"t5_xl_g": 0.3},
        "t5_xl_l": {},
        "t5_xl_g": {}
    },
    latent_dim=2048,
    fusion_strategy="adaptive_cantor",
    cantor_depth=8,
    cantor_local_window=3
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Build inputs - the model expects all four modalities it was trained on
# (each tensor below is a precomputed encoder output)
inputs = {
    "clip_l": clip_l_embeddings,     # [batch, 77, 768]
    "clip_g": clip_g_embeddings,     # [batch, 77, 1280]
    "t5_xl_l": t5_xl_l_embeddings,   # [batch, 512, 2048]
    "t5_xl_g": t5_xl_g_embeddings    # [batch, 512, 2048]
}

# For SDXL inference - only decode CLIP outputs
recons, mu, logvar, per_mod_mus = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
```

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
- Alpha Init: 1.0
- Beta Init: 0.3

## Citation
```bibtex
@software{vae_lyra_adaptive_cantor_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Adaptive Cantor Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-xl-adaptive-cantor}
}
```