---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
- adaptive-cantor
- geometric-fusion
license: mit
---

# VAE Lyra 🎵 - Adaptive Cantor Edition

Multi-modal Variational Autoencoder for SDXL text embedding transformation using adaptive Cantor fractal fusion with learned alpha (visibility) and beta (capacity) parameters.

Fuses CLIP-L, CLIP-G, and decoupled T5-XL scales into a unified latent space.

## Model Details

- **Fusion Strategy**: adaptive_cantor
- **Latent Dimension**: 2048
- **Training Steps**: 78,750
- **Best Loss**: 0.2336

## Learned Parameters

**Alpha (Visibility):**
- clip_g: 0.7291
- clip_l: 0.7280
- t5_xl_g: 0.7244
- t5_xl_l: 0.7161

**Beta (Capacity):**
- clip_l_t5_xl_l: 0.5726
- clip_g_t5_xl_g: 0.5744


## Architecture

- **Modalities** (with sequence lengths): 
  - CLIP-L (768d @ 77 tokens) - SDXL text_encoder
  - CLIP-G (1280d @ 77 tokens) - SDXL text_encoder_2  
  - T5-XL-L (2048d @ 512 tokens) - Auxiliary for CLIP-L
  - T5-XL-G (2048d @ 512 tokens) - Auxiliary for CLIP-G
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
- **Cantor Depth**: 8
- **Local Window**: 3

## Key Features

### Adaptive Cantor Fusion
- **Cantor Fractal Routing**: Sparse attention based on fractal coordinate mapping
- **Learned Alpha (Visibility)**: Per-modality parameters controlling latent space usage (tied to KL divergence)
- **Learned Beta (Capacity)**: Per-binding-pair parameters controlling source influence strength
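The fractal routing can be pictured as a boolean attention mask: each token keeps a dense local window and additionally reaches tokens at power-of-3 (Cantor-style) strides. This is a hypothetical sketch, not the repository's implementation; `cantor_mask` is an illustrative name whose defaults mirror the `Cantor Depth` and `Local Window` settings listed under Architecture.

```python
import torch

def cantor_mask(seq_len: int, depth: int = 8, local_window: int = 3) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True = attention allowed.

    Illustrative sparse routing: a dense local window plus long-range
    links at power-of-3 offsets (3^0 .. 3^(depth-1)).
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        # Dense local neighborhood around token i.
        lo, hi = max(0, i - local_window), min(seq_len, i + local_window + 1)
        mask[i, lo:hi] = True
        # Sparse long-range links at fractal (power-of-3) strides.
        for k in range(depth):
            stride = 3 ** k
            for j in (i - stride, i + stride):
                if 0 <= j < seq_len:
                    mask[i, j] = True
    return mask
```

At 77 tokens this keeps attention roughly O(n · (window + depth)) sparse instead of the dense O(n²) pattern.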

### Decoupled T5 Scales
- T5-XL-L binds specifically to CLIP-L (weight: 0.3)
- T5-XL-G binds specifically to CLIP-G (weight: 0.3)
- Independent T5 representations allow specialized semantic enrichment per CLIP encoder
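As a hedged sketch, the binding can be pictured as a convex blend of each CLIP latent with its bound T5 latent, using the 0.3 weight above. The function and variable names here are illustrative, not the repository's API.

```python
import torch

def bind(clip_mu: torch.Tensor, t5_mu: torch.Tensor, weight: float = 0.3) -> torch.Tensor:
    """Blend a CLIP latent toward its bound T5 latent (illustrative only)."""
    return (1.0 - weight) * clip_mu + weight * t5_mu

# Toy latents in place of the encoder outputs.
mu_clip_l = torch.zeros(1, 2048)
mu_t5_l = torch.ones(1, 2048)
bound = bind(mu_clip_l, mu_t5_l)  # each entry = 0.3
```

Because each T5 stream binds to exactly one CLIP encoder, the two blends stay independent, which is what allows the per-encoder semantic enrichment described above.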

### Variable Sequence Lengths
- CLIP: 77 tokens (standard)
- T5: 512 tokens (extended context for richer semantic capture)

## SDXL Compatibility

This model outputs both CLIP embeddings needed for SDXL:
- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output

T5 information is encoded into the latent space and influences both CLIP outputs through learned binding weights.
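For downstream SDXL use, the two decoded tensors are typically concatenated along the feature axis to form the UNet's cross-attention context (768 + 1280 = 2048). A minimal shape check, using random tensors in place of the VAE's outputs:

```python
import torch

# Stand-ins for recons["clip_l"] and recons["clip_g"] from the VAE.
clip_l = torch.randn(1, 77, 768)
clip_g = torch.randn(1, 77, 1280)

# SDXL's UNet consumes the two encoder outputs concatenated feature-wise.
prompt_embeds = torch.cat([clip_l, clip_g], dim=-1)
assert prompt_embeds.shape == (1, 77, 2048)
```

Note that an SDXL pipeline additionally expects pooled `text_encoder_2` embeddings, which this sketch does not cover.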

## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-xl-adaptive-cantor",
    filename="model.pt"
)

# Load checkpoint (CPU-safe; move to GPU after loading if needed)
checkpoint = torch.load(model_path, map_location="cpu")

# Create model
config = MultiModalVAEConfig(
    modality_dims={
        "clip_l": 768,
        "clip_g": 1280,
        "t5_xl_l": 2048,
        "t5_xl_g": 2048
    },
    modality_seq_lens={
        "clip_l": 77,
        "clip_g": 77,
        "t5_xl_l": 512,
        "t5_xl_g": 512
    },
    binding_config={
        "clip_l": {"t5_xl_l": 0.3},
        "clip_g": {"t5_xl_g": 0.3},
        "t5_xl_l": {},
        "t5_xl_g": {}
    },
    latent_dim=2048,
    fusion_strategy="adaptive_cantor",
    cantor_depth=8,
    cantor_local_window=3
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Build inputs - the model expects all four modalities it was trained on
# (each tensor below is a precomputed encoder output)
inputs = {
    "clip_l": clip_l_embeddings,     # [batch, 77, 768]
    "clip_g": clip_g_embeddings,     # [batch, 77, 1280]
    "t5_xl_l": t5_xl_l_embeddings,   # [batch, 512, 2048]
    "t5_xl_g": t5_xl_g_embeddings    # [batch, 512, 2048]
}

# For SDXL inference - only decode CLIP outputs
recons, mu, logvar, per_mod_mus = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
```

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
- Alpha Init: 1.0
- Beta Init: 0.3

## Citation
```bibtex
@software{vae_lyra_adaptive_cantor_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Adaptive Cantor Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-xl-adaptive-cantor}
}
```