BarraHome committed
Commit 9e40f24 · verified · 1 Parent(s): 8c29641

Update README.md

Files changed (1): README.md (+7 -4)
README.md CHANGED

```diff
@@ -69,11 +69,14 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
 ## Performance Benchmarks
 
-When compared to the original Llama 3.2 1B model with GQA:
+When compared to the original Llama 3.2 1B model with GQA, our performance tests show:
 
-- **Generation Speed**: ~70% faster (tokens per second)
-- **Memory Usage**: Same KV cache memory footprint
-- **Quality**: Maintains the same quality as the original model
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b6afa756f1af7b46f1b513/ODqCFMR_hNH_EgfV6zsPu.png)
+
+These variations in performance likely depend on various factors including GPU utilization, batch size, and system load. In general, the MLA version provides at least comparable performance to the GQA version, with significant speed improvements possible under certain conditions.
+
+Both models maintain the same KV cache memory footprint while the MLA version provides greater expressivity by allowing each query head to have its own unique key and value representations.
 
 ## Conversion Method
 
```
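The updated text's two claims (same KV cache footprint, but unique per-head keys/values under MLA) can be sketched with a small NumPy shape experiment. This is an illustrative sketch, not the repository's conversion code: the head counts follow the Llama 3.2 1B configuration (32 query heads, 8 KV heads, head dim 64), while `d_latent`, `W_uk`, and `k_for_head` are hypothetical names chosen here, with the latent width set so the cache size matches GQA's.

```python
import numpy as np

n_heads, n_kv_heads, d_head, seq = 32, 8, 64, 16
# Hypothetical latent width, chosen to equal GQA's per-token K+V cache size.
d_latent = 2 * n_kv_heads * d_head

# GQA: the cache stores only n_kv_heads key heads; each group of
# n_heads // n_kv_heads query heads shares one cached key head.
gqa_k_cache = np.random.randn(seq, n_kv_heads, d_head)
k_for_head = lambda h: gqa_k_cache[:, h // (n_heads // n_kv_heads)]
assert np.array_equal(k_for_head(0), k_for_head(3))  # same group, identical K

# MLA-style: the cache stores one compressed latent per token; per-head keys
# are decoded with head-specific up-projections, so each query head sees a
# unique key while the cached tensor itself stays compact.
latent_cache = np.random.randn(seq, d_latent)
W_uk = np.random.randn(n_heads, d_latent, d_head)     # head-specific decoders
mla_k = np.einsum("sl,hld->shd", latent_cache, W_uk)  # (seq, n_heads, d_head)
assert not np.array_equal(mla_k[:, 0], mla_k[:, 3])   # unique per head

# Cached elements per token are identical under this choice of d_latent.
print("GQA cache/token:", 2 * n_kv_heads * d_head)
print("MLA cache/token:", d_latent)
```

With `d_latent` set this way both variants cache 1024 elements per token, while MLA still yields 32 distinct key tensors instead of 8 shared ones.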