Update README.md
Browse files
README.md
CHANGED
|
@@ -1,3 +1,115 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
---
|
| 4 |
+
# Llama 3.2 1B MLA - Multi-head Latent Attention Model
|
| 5 |
+
|
| 6 |
+
This repository contains a version of Llama 3.2 1B converted to use Multi-head Latent Attention (MLA) instead of Group Query Attention (GQA).
|
| 7 |
+
|
| 8 |
+
## Model Details
|
| 9 |
+
|
| 10 |
+
- **Base Model**: [Meta-Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
|
| 11 |
+
- **Attention Mechanism**: Multi-head Latent Attention (MLA)
|
| 12 |
+
- **Performance Improvement**: Approximately 70% faster inference than GQA with the same KV cache size
|
| 13 |
+
|
| 14 |
+
## What is MLA?
|
| 15 |
+
|
| 16 |
+
Multi-head Latent Attention (MLA) is an attention mechanism introduced in the DeepSeek-V2 paper and further explored in the [TransMLA paper](https://arxiv.org/abs/2502.07864). MLA uses low-rank factorization to compress Key (K) and Value (V) representations during attention, significantly reducing the KV cache size while maintaining or even improving model expressivity.
|
| 17 |
+
|
| 18 |
+
Unlike Group Query Attention (GQA), which simply reduces the number of KV heads, MLA maintains the expressivity of having unique K and V representations for each query head by using factorized projection matrices.
|
| 19 |
+
|
| 20 |
+
## Advantages over GQA
|
| 21 |
+
|
| 22 |
+
- **Same KV Cache Size**: MLA maintains the same KV cache size as GQA
|
| 23 |
+
- **Greater Expressivity**: Each Q head can have its own K and V representation (unlike GQA)
|
| 24 |
+
- **Better Performance**: Significantly faster generation due to better memory utilization
|
| 25 |
+
- **No Retraining Required**: Conversion can be performed post-training using SVD
|
| 26 |
+
|
| 27 |
+
## Implementation Details
|
| 28 |
+
|
| 29 |
+
The model was converted using SVD (Singular Value Decomposition) to factorize the weight matrices. The process:
|
| 30 |
+
|
| 31 |
+
1. Decomposes the original K and V matrices into low-rank approximations
|
| 32 |
+
2. Creates compression and decompression layers that maintain the same KV cache size as GQA
|
| 33 |
+
3. Preserves the original model's knowledge while improving inference efficiency
|
| 34 |
+
|
| 35 |
+
## Usage
|
| 36 |
+
|
| 37 |
+
```python
|
| 38 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 39 |
+
import torch
|
| 40 |
+
|
| 41 |
+
# Load the model
|
| 42 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 43 |
+
"BarraHome/llama3.2-1b-mla",
|
| 44 |
+
torch_dtype=torch.bfloat16,
|
| 45 |
+
device_map="auto"
|
| 46 |
+
)
|
| 47 |
+
tokenizer = AutoTokenizer.from_pretrained("BarraHome/llama3.2-1b-mla")
|
| 48 |
+
|
| 49 |
+
# Example chat prompt
|
| 50 |
+
prompt = """<|begin_of_text|><|system|>
|
| 51 |
+
You are a helpful, respectful, and honest assistant.
|
| 52 |
+
<|user|>
|
| 53 |
+
What is Multi-head Latent Attention (MLA)?
|
| 54 |
+
<|assistant|>"""
|
| 55 |
+
|
| 56 |
+
# Generate response
|
| 57 |
+
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
| 58 |
+
outputs = model.generate(
|
| 59 |
+
**inputs,
|
| 60 |
+
max_new_tokens=200,
|
| 61 |
+
do_sample=True,
|
| 62 |
+
temperature=0.7,
|
| 63 |
+
top_p=0.9
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
# Print response
|
| 67 |
+
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
## Performance Benchmarks
|
| 71 |
+
|
| 72 |
+
When compared to the original Llama 3.2 1B model with GQA:
|
| 73 |
+
|
| 74 |
+
- **Generation Speed**: ~70% faster (tokens per second)
|
| 75 |
+
- **Memory Usage**: Same KV cache memory footprint
|
| 76 |
+
- **Quality**: Maintains the same quality as the original model
|
| 77 |
+
|
| 78 |
+
## Conversion Method
|
| 79 |
+
|
| 80 |
+
The conversion from GQA to MLA was performed using the approach described in the [TransMLA: Multi-Head Latent Attention Is All You Need](https://arxiv.org/abs/2502.07864) paper. The key insight is that GQA can always be represented by MLA with the same KV cache overhead, but MLA offers greater expressivity.
|
| 81 |
+
|
| 82 |
+
## Citation
|
| 83 |
+
|
| 84 |
+
If you use this model in your research or projects, please cite:
|
| 85 |
+
|
| 86 |
+
```bibtex
|
| 87 |
+
@misc{ferrer2025llama32mla,
|
| 88 |
+
title={Llama 3.2 1B MLA - Multi-head Latent Attention},
|
| 89 |
+
author={Ferrer, Alberto},
|
| 90 |
+
year={2025},
|
| 91 |
+
howpublished={\url{https://huggingface.co/BarraHome/llama3.2-1b-mla}}
|
| 92 |
+
}
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
Also consider citing the underlying TransMLA methodology:
|
| 96 |
+
|
| 97 |
+
```bibtex
|
| 98 |
+
@article{meng2025transmla,
|
| 99 |
+
title={TransMLA: Multi-Head Latent Attention Is All You Need},
|
| 100 |
+
author={Meng, Fanxu and Yao, Zengwei and Zhang, Muhan},
|
| 101 |
+
journal={arXiv preprint arXiv:2502.07864},
|
| 102 |
+
year={2025}
|
| 103 |
+
}
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
## License
|
| 107 |
+
|
| 108 |
+
This model is subject to the same license as the original Meta-Llama-3.2-1B model. Please refer to Meta's licensing terms for usage restrictions.
|
| 109 |
+
|
| 110 |
+
## Acknowledgements
|
| 111 |
+
|
| 112 |
+
- Developed by Alberto Ferrer (BarraHome)
|
| 113 |
+
- Thanks to the authors of the TransMLA paper for their insights on converting GQA to MLA
|
| 114 |
+
- Thanks to DeepSeek AI for the original introduction of MLA in their DeepSeek-V2 model
|
| 115 |
+
- Thanks to Meta for releasing the Llama 3.2 models
|