---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
- adaptive-cantor
- geometric-fusion
license: mit
---
# VAE Lyra 🎵 - Adaptive Cantor Edition
A multi-modal Variational Autoencoder for transforming SDXL text embeddings, built on adaptive Cantor fractal fusion with learned alpha (visibility) and beta (capacity) parameters.
It fuses CLIP-L, CLIP-G, and two decoupled T5-XL scales into a unified latent space.
## Model Details
- **Fusion Strategy**: adaptive_cantor
- **Latent Dimension**: 2048
- **Training Steps**: 78,750
- **Best Loss**: 0.2336
## Learned Parameters
**Alpha (Visibility):**
- clip_g: 0.7291
- clip_l: 0.7280
- t5_xl_g: 0.7244
- t5_xl_l: 0.7161
**Beta (Capacity):**
- clip_l_t5_xl_l: 0.5726
- clip_g_t5_xl_g: 0.5744
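As a minimal sketch of how such parameters can be held, alpha can be one learnable scalar per modality and beta one per bound pair, initialized to the values listed under Training Details (alpha = 1.0, beta = 0.3). The class and attribute names below are illustrative, not the actual `geovocab2` internals.

```python
import torch
import torch.nn as nn

class AdaptiveFusionParams(nn.Module):
    """Hypothetical container for the learned fusion scalars."""
    def __init__(self, modalities, binding_pairs, alpha_init=1.0, beta_init=0.3):
        super().__init__()
        # One visibility scalar per modality (tied to the KL term).
        self.alpha = nn.ParameterDict({
            m: nn.Parameter(torch.tensor(alpha_init)) for m in modalities
        })
        # One capacity scalar per target->source binding pair.
        self.beta = nn.ParameterDict({
            f"{target}_{source}": nn.Parameter(torch.tensor(beta_init))
            for target, source in binding_pairs
        })

params = AdaptiveFusionParams(
    modalities=["clip_l", "clip_g", "t5_xl_l", "t5_xl_g"],
    binding_pairs=[("clip_l", "t5_xl_l"), ("clip_g", "t5_xl_g")],
)
```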
## Architecture
- **Modalities** (with sequence lengths):
- CLIP-L (768d @ 77 tokens) - SDXL text_encoder
- CLIP-G (1280d @ 77 tokens) - SDXL text_encoder_2
- T5-XL-L (2048d @ 512 tokens) - Auxiliary for CLIP-L
- T5-XL-G (2048d @ 512 tokens) - Auxiliary for CLIP-G
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
- **Cantor Depth**: 8
- **Local Window**: 3
## Key Features
### Adaptive Cantor Fusion
- **Cantor Fractal Routing**: Sparse attention based on fractal coordinate mapping
- **Learned Alpha (Visibility)**: Per-modality parameters controlling latent space usage (tied to KL divergence)
- **Learned Beta (Capacity)**: Per-binding-pair parameters controlling source influence strength
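The exact routing rule lives inside `geovocab2`; as a rough illustration of the idea only, the hypothetical sketch below maps token positions to truncated base-3 (Cantor) coordinates and lets a pair of positions attend when their coordinates nearly coincide or when they sit inside the local window. The function names, the tolerance, and the rule itself are assumptions.

```python
import torch

def cantor_coordinate(pos, length, depth=8):
    """Map a token position to a truncated base-3 coordinate in [0, 1)."""
    x = pos / length
    digits = []
    for _ in range(depth):
        x *= 3
        d = int(x)
        digits.append(d)
        x -= d
    return sum(d * 3.0 ** -(i + 1) for i, d in enumerate(digits))

def cantor_attention_mask(length, depth=8, local_window=3, tol=3.0 ** -4):
    """Boolean mask: attend when Cantor coordinates are close OR positions
    fall inside a small local window (hypothetical routing rule)."""
    coords = torch.tensor([cantor_coordinate(i, length, depth) for i in range(length)])
    frac_close = (coords[:, None] - coords[None, :]).abs() <= tol
    idx = torch.arange(length)
    local = (idx[:, None] - idx[None, :]).abs() <= local_window
    return frac_close | local
```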
### Decoupled T5 Scales
- T5-XL-L binds specifically to CLIP-L (weight: 0.3)
- T5-XL-G binds specifically to CLIP-G (weight: 0.3)
- Independent T5 representations allow specialized semantic enrichment per CLIP encoder
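One way to picture the binding is the hedged sketch below: each target latent is nudged by its bound auxiliary source, scaled by the static binding weight (0.3) times the learned beta capacity. The function is illustrative, not the model's actual forward pass.

```python
import torch

def bind_latents(mu, binding_config, beta):
    """Blend per-modality latent means [batch, latent_dim] (hypothetical)."""
    bound = {}
    for target, sources in binding_config.items():
        z = mu[target]
        for source, weight in sources.items():
            # Auxiliary source pulls on its target with strength weight * beta.
            z = z + weight * beta[f"{target}_{source}"] * mu[source]
        bound[target] = z
    return bound

mu = {m: torch.randn(2, 2048) for m in ["clip_l", "clip_g", "t5_xl_l", "t5_xl_g"]}
beta = {"clip_l_t5_xl_l": 0.5726, "clip_g_t5_xl_g": 0.5744}  # learned values above
binding = {"clip_l": {"t5_xl_l": 0.3}, "clip_g": {"t5_xl_g": 0.3}}
bound = bind_latents(mu, binding, beta)
```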
### Variable Sequence Lengths
- CLIP: 77 tokens (standard)
- T5: 512 tokens (extended context for richer semantic capture)
## SDXL Compatibility
This model outputs both CLIP embeddings needed for SDXL:
- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output
T5 information is encoded into the latent space and influences both CLIP outputs through learned binding weights.
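For reference, SDXL's UNet consumes the two hidden-state streams concatenated along the feature axis. A shape-only sketch, with random tensors standing in for the VAE outputs:

```python
import torch

clip_l = torch.randn(1, 77, 768)    # stands in for recons["clip_l"]
clip_g = torch.randn(1, 77, 1280)   # stands in for recons["clip_g"]

# SDXL conditioning: concatenate on the feature axis -> [1, 77, 2048].
prompt_embeds = torch.cat([clip_l, clip_g], dim=-1)
# Note: SDXL also needs pooled_prompt_embeds, which this VAE does not
# produce; take them from text_encoder_2 as usual.
```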
## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch
# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-xl-adaptive-cantor",
    filename="model.pt"
)
# Load checkpoint (map to CPU; move to GPU after construction if desired)
checkpoint = torch.load(model_path, map_location="cpu")
# Create model
config = MultiModalVAEConfig(
    modality_dims={
        "clip_l": 768,
        "clip_g": 1280,
        "t5_xl_l": 2048,
        "t5_xl_g": 2048
    },
    modality_seq_lens={
        "clip_l": 77,
        "clip_g": 77,
        "t5_xl_l": 512,
        "t5_xl_g": 512
    },
    binding_config={
        "clip_l": {"t5_xl_l": 0.3},
        "clip_g": {"t5_xl_g": 0.3},
        "t5_xl_l": {},
        "t5_xl_g": {}
    },
    latent_dim=2048,
    fusion_strategy="adaptive_cantor",
    cantor_depth=8,
    cantor_local_window=3
)
model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
# Run the model on all four modality embeddings
# (see the sketch after this block for one way to obtain them)
inputs = {
    "clip_l": clip_l_embeddings,    # [batch, 77, 768]
    "clip_g": clip_g_embeddings,    # [batch, 77, 1280]
    "t5_xl_l": t5_xl_l_embeddings,  # [batch, 512, 2048]
    "t5_xl_g": t5_xl_g_embeddings   # [batch, 512, 2048]
}
# For SDXL inference - only decode CLIP outputs
recons, mu, logvar, per_mod_mus = model(inputs, target_modalities=["clip_l", "clip_g"])
# Use recons["clip_l"] and recons["clip_g"] with SDXL
```
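The snippet above assumes the four embedding tensors already exist. A hedged sketch of producing them with `transformers` follows; the exact encoder checkpoints used in training are not stated in this card, so SDXL's bundled text encoders and `google/flan-t5-xl` (d_model = 2048) are assumptions, as is feeding one shared T5 pass to both T5 streams.

```python
import torch
from transformers import (AutoTokenizer, CLIPTextModel,
                          CLIPTextModelWithProjection, T5EncoderModel)

base = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed encoder source
tok_l = AutoTokenizer.from_pretrained(base, subfolder="tokenizer")
tok_g = AutoTokenizer.from_pretrained(base, subfolder="tokenizer_2")
enc_l = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
enc_g = CLIPTextModelWithProjection.from_pretrained(base, subfolder="text_encoder_2")
tok_t5 = AutoTokenizer.from_pretrained("google/flan-t5-xl")  # assumed T5-XL
enc_t5 = T5EncoderModel.from_pretrained("google/flan-t5-xl")

prompt = "a watercolor fox in a misty forest"
with torch.no_grad():
    ids_l = tok_l(prompt, padding="max_length", max_length=77,
                  truncation=True, return_tensors="pt")
    ids_g = tok_g(prompt, padding="max_length", max_length=77,
                  truncation=True, return_tensors="pt")
    ids_t5 = tok_t5(prompt, padding="max_length", max_length=512,
                    truncation=True, return_tensors="pt")
    clip_l_embeddings = enc_l(**ids_l).last_hidden_state   # [1, 77, 768]
    clip_g_embeddings = enc_g(**ids_g).last_hidden_state   # [1, 77, 1280]
    t5_hidden = enc_t5(**ids_t5).last_hidden_state         # [1, 512, 2048]
    # Assumption: both T5 streams start from the same encoder pass.
    t5_xl_l_embeddings, t5_xl_g_embeddings = t5_hidden, t5_hidden.clone()
```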
## Training Details
- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
- Alpha Init: 1.0
- Beta Init: 0.3
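KL annealing presumably ramps the KL weight up from zero early in training; the schedule below is a generic linear sketch, with the warmup length chosen arbitrarily since the card does not state it.

```python
def kl_weight(step: int, warmup_steps: int = 10_000, max_weight: float = 1.0) -> float:
    """Linear KL annealing: ramp the KL term's weight from 0 to max_weight."""
    return max_weight * min(1.0, step / warmup_steps)

# total_loss = reconstruction_loss + kl_weight(global_step) * kl_divergence
```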
## Citation
```bibtex
@software{vae_lyra_adaptive_cantor_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Adaptive Cantor Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-xl-adaptive-cantor}
}
```