# deepseek-v3.2_512M
A DeepSeek-style language model with ~512M active parameters (~1.34B total), trained from scratch and exported to GGUF format.
## Model Details
- Architecture: DeepSeek V3.2 style (MLA attention, MoE, MTP); a simplified sketch of the MLA idea follows this list
- Parameters: ~1.34B total, ~512M active (MoE)
- Training: 9,900 steps on Modal (8x A100-40GB)
- Final Loss: 10.385
- Format: GGUF (Q8_0 quantization)
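
The MLA (Multi-head Latent Attention) component mentioned above caches a compressed per-token latent instead of full keys and values. The PyTorch sketch below illustrates only that core idea; it omits DeepSeek's decoupled RoPE branch and other details, and `d_latent=256` and the class name are illustrative assumptions, not this model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    """Simplified Multi-head Latent Attention: KV are reconstructed from a small latent."""

    def __init__(self, d_model=2048, n_heads=32, d_latent=256):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Down-project hidden states to a compact KV latent (the only KV state cached at inference).
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the latent back to full-width keys and values.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q = self.q_proj(x)
        c_kv = self.kv_down(x)                  # (batch, seq, d_latent) -- cached latent
        k = self.k_up(c_kv)
        v = self.v_up(c_kv)
        # Reshape into heads and run standard causal attention.
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```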
## Architecture Configuration
| Parameter | Value |
|---|---|
| d_model | 2048 |
| n_heads | 32 |
| n_layers | 24 |
| vocab_size | 32000 |
| d_ff | 8192 |
| max_seq_len | 1024 |
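
As a sanity check, these hyperparameters can be read back from the GGUF header. A minimal sketch, assuming the `gguf` Python package that accompanies llama.cpp (`pip install gguf`) and the file name used elsewhere in this card:

```python
from gguf import GGUFReader  # reader utility shipped with llama.cpp's gguf-py package

reader = GGUFReader("deepseek-512M-q8_0.gguf")

# Metadata keys (embedding width, layer count, head count, etc.) live in the
# GGUF header under architecture-specific names.
for name in reader.fields:
    print(name)

# Tensor names and shapes can be cross-checked against d_model, n_layers, and vocab_size.
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)
```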
## Usage
### With llama.cpp
```bash
./main -m deepseek-512M-q8_0.gguf -p "Once upon a time" -n 128
```
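
Note that newer llama.cpp builds name this binary `llama-cli`; the flags are the same. The same file can also be used from Python via the llama-cpp-python bindings. A minimal sketch; the model path and sampling settings mirror the examples in this card:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="deepseek-512M-q8_0.gguf", n_ctx=1024)

# Plain text completion; max_seq_len is 1024, so keep prompt + generation within that budget.
out = llm("Once upon a time", max_tokens=128, temperature=0.7, top_p=0.9)
print(out["choices"][0]["text"])
```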
### With Ollama
Create a Modelfile:
```
FROM ./deepseek-512M-q8_0.gguf
TEMPLATE "{{.Prompt}}"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```
Then run:
```bash
ollama create deepseek-512m -f Modelfile
ollama run deepseek-512m
```
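
Once the model has been created, the local Ollama server can also be queried over its HTTP API. A minimal sketch using Python's `requests`; the default port 11434 is assumed:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-512m", "prompt": "Once upon a time", "stream": False},
)
print(resp.json()["response"])
```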
### With LM Studio
- Download the GGUF file
- Import into LM Studio
- Start chatting!
## Training Details
This model was trained as part of the DeepSeek-From-Scratch project, which implements the DeepSeek V3 architecture from scratch.
### Training Infrastructure
- Platform: Modal Cloud
- GPUs: 8x NVIDIA A100-40GB
- Parallelism: TP=2, PP=2, DP=2 across the 8 GPUs (tensor, pipeline, and data parallelism); see the rank-layout sketch after this list
- Precision: BF16
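
For reference, the sketch below shows one way the 8 GPU ranks could decompose into the TP=2 × PP=2 × DP=2 grid listed above. The axis ordering is an assumption for illustration, not taken from the project's training code.

```python
TP, PP, DP = 2, 2, 2  # tensor-, pipeline-, and data-parallel degrees (2 * 2 * 2 = 8 GPUs)

for rank in range(TP * PP * DP):
    tp = rank % TP               # fastest-varying axis (assumed): tensor-parallel shard
    pp = (rank // TP) % PP       # middle axis: pipeline stage
    dp = rank // (TP * PP)       # slowest axis: data-parallel replica
    print(f"rank {rank}: tp={tp}, pp={pp}, dp={dp}")
```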
## Limitations
This is a research/educational model trained on synthetic data. It is not intended for production use and may generate nonsensical or harmful content.
## License
MIT License - See the repository for details.