BarraHome committed
Commit 9e40f24 · verified · 1 Parent(s): 8c29641

Update README.md

Files changed (1): README.md (+7 -4)
README.md CHANGED

```diff
@@ -69,11 +69,14 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
 ## Performance Benchmarks
 
-When compared to the original Llama 3.2 1B model with GQA:
+When compared to the original Llama 3.2 1B model with GQA, our performance tests show:
 
-- **Generation Speed**: ~70% faster (tokens per second)
-- **Memory Usage**: Same KV cache memory footprint
-- **Quality**: Maintains the same quality as the original model
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b6afa756f1af7b46f1b513/ODqCFMR_hNH_EgfV6zsPu.png)
+
+These variations in performance likely depend on various factors including GPU utilization, batch size, and system load. In general, the MLA version provides at least comparable performance to the GQA version, with significant speed improvements possible under certain conditions.
+
+Both models maintain the same KV cache memory footprint while the MLA version provides greater expressivity by allowing each query head to have its own unique key and value representations.
 
 ## Conversion Method
 
```
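The updated text's two claims (same KV cache footprint, but unique per-head keys/values under MLA) can be sketched with a small NumPy shape experiment. This is an illustrative sketch, not the repository's conversion code: the head counts follow the Llama 3.2 1B configuration (32 query heads, 8 KV heads, head dim 64), while `d_latent`, `W_uk`, and `k_for_head` are hypothetical names chosen here, with the latent width set so the cache size matches GQA's.

```python
import numpy as np

n_heads, n_kv_heads, d_head, seq = 32, 8, 64, 16
# Hypothetical latent width, chosen to equal GQA's per-token K+V cache size.
d_latent = 2 * n_kv_heads * d_head

# GQA: the cache stores only n_kv_heads key heads; each group of
# n_heads // n_kv_heads query heads shares one cached key head.
gqa_k_cache = np.random.randn(seq, n_kv_heads, d_head)
k_for_head = lambda h: gqa_k_cache[:, h // (n_heads // n_kv_heads)]
assert np.array_equal(k_for_head(0), k_for_head(3))  # same group, identical K

# MLA-style: the cache stores one compressed latent per token; per-head keys
# are decoded with head-specific up-projections, so each query head sees a
# unique key while the cached tensor itself stays compact.
latent_cache = np.random.randn(seq, d_latent)
W_uk = np.random.randn(n_heads, d_latent, d_head)     # head-specific decoders
mla_k = np.einsum("sl,hld->shd", latent_cache, W_uk)  # (seq, n_heads, d_head)
assert not np.array_equal(mla_k[:, 0], mla_k[:, 3])   # unique per head

# Cached elements per token are identical under this choice of d_latent.
print("GQA cache/token:", 2 * n_kv_heads * d_head)
print("MLA cache/token:", d_latent)
```

With `d_latent` set this way both variants cache 1024 elements per token, while MLA still yields 32 distinct key tensors instead of 8 shared ones.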