SlimMoE-250M-instruct
SlimMoE-250M-instruct is the final instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases. The objective is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.
Motivation
This work explores the following research question:
Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?
SlimMoE-250M was designed to study:
- MoE routing behavior at small scales (see the routing sketch after this list)
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
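To make the first research question concrete, the sketch below shows a generic top-k expert routing layer with the dimensions reported in the table that follows (4 experts, hidden size 768, FFN size 1536). The class name, the top-k value, and the expert structure are illustrative assumptions, not the actual SlimMoE "Adaptive MoE Routing" implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (not the exact SlimMoE code)."""

    def __init__(self, hidden_size=768, ffn_size=1536, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k  # top_k=2 is an assumption; the card only says "Adaptive MoE Routing"
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq, hidden) -> flatten tokens so routing is per token
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                      # (tokens, experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # pick the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```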
Model Summary
| Property | Value |
|---|---|
| Parameters | 250M |
| Architecture | SlimMoEForCausalLM |
| Experts | 4 |
| Layers | 16 |
| Hidden Size | 768 |
| FFN Size | 1536 |
| Attention Heads | 12 |
| Max Context Length | 2048 |
| Routing | Adaptive MoE Routing |
| Dropout | 0.1 |
| Precision | float32 |
| Vocabulary Size | 50,257 |
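A minimal usage sketch is given below. Because SlimMoEForCausalLM is a custom architecture class, loading with `trust_remote_code=True` is assumed here; the generation settings are illustrative rather than recommended values from the authors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SlimFactoryHub/SlimMoE-250M-instruct"

# trust_remote_code is assumed to be required for the custom SlimMoEForCausalLM class.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # the card reports float32 precision
    trust_remote_code=True,
)

prompt = "Explain mixture-of-experts routing in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```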
Training Details
Pretraining
This phase focused on general language modeling using high-quality educational data.
- Dataset: HuggingFaceFW/fineweb-edu
- Split: sample-10BT
- Tokens Used: 5.2B
- Duration: 7 days 16 hours
- GPU: 48GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
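For reference, the pretraining corpus can be streamed as sketched below. The exact preprocessing used for SlimMoE is not published, so this only illustrates how the sample-10BT configuration is typically loaded.

```python
from datasets import load_dataset

# Stream the sample-10BT configuration of FineWeb-Edu instead of downloading it fully.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for i, example in enumerate(stream):
    print(example["text"][:200])  # FineWeb-Edu documents expose a "text" field
    if i == 2:
        break
```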
Fine-Tuning Phase-1 (SFT - Instruction Tuning)
This stage introduces instruction supervision and conversational alignment.
- Dataset: HuggingFaceH4/ultrachat_200k
- Split: train_sft
- Duration: 8 days 8 hours
- GPU: 80GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
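The sketch below shows one common way to turn UltraChat-style conversations (a "messages" list of role/content turns) into training text with a chat template. It assumes the SlimMoE tokenizer defines a chat template, which this card does not confirm.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
tokenizer = AutoTokenizer.from_pretrained(
    "SlimFactoryHub/SlimMoE-250M-instruct", trust_remote_code=True
)

def to_text(example):
    # Each UltraChat example stores the conversation as a list of {"role", "content"} messages.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False, add_generation_prompt=False
        )
    }

dataset = dataset.map(to_text, remove_columns=dataset.column_names)
print(dataset[0]["text"][:300])
```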
Fine-Tuning Phase-2 (SFT - Knowledge & Reasoning)
This stage targets domain knowledge and reasoning performance.
- Dataset: cais/mmlu
- Split: auxiliary_train
- Duration: 8 days 11 hours
- GPU: 48GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
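The following sketch renders an MMLU auxiliary_train row as an instruction-style example, assuming the standard question/choices/answer fields. The prompt wording is an assumption for illustration, not the format used during SlimMoE training.

```python
from datasets import load_dataset

# "auxiliary_train" is loaded here as a configuration of cais/mmlu with a single "train" split.
dataset = load_dataset("cais/mmlu", "auxiliary_train", split="train")

def format_example(row):
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, row["choices"]))
    prompt = f"Question: {row['question']}\n{options}\nAnswer:"
    return {"prompt": prompt, "target": letters[row["answer"]]}

print(format_example(dataset[0])["prompt"])
```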
Fine-Tuning Phase-3 (SFT - Instruction Refinement)
Focused on response quality, instruction clarity, and consistency.
- Dataset: HuggingFaceTB/OpenHermes-2.5-H4
- Duration: 5 days 1 hour
- GPU: 48GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
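As a general illustration of supervised fine-tuning at this stage, the sketch below masks prompt tokens out of the loss so only the response is learned. This is a common SFT pattern, not a confirmed detail of the SlimMoE training code.

```python
import torch

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def build_labels(tokenizer, prompt, response):
    """Tokenize prompt + response and mask the prompt so the loss covers only the response."""
    eos = tokenizer.eos_token or ""  # some tokenizers may not define an EOS token
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + eos, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```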
VGQA & Positional Encoding Experiments
- The model was trained using a VGQA-style attention mechanism.
- Experiments were conducted with NoPE / RoPE positional strategies within a small MoE architecture.
- The objective was to evaluate training stability and output quality, not to optimize benchmark performance.
Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.
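To make the NoPE / RoPE comparison concrete, the sketch below applies rotary embeddings to a query (or key) tensor, while the NoPE setting simply skips this step and lets position information come only from the causal mask. The dimensions follow the table above, but the code is illustrative rather than the model's actual implementation.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply RoPE to a (batch, heads, seq, head_dim) tensor; NoPE omits this call."""
    b, h, seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate channel pairs by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: 12 heads with head_dim = 768 / 12 = 64, as in the table above.
q = torch.randn(1, 12, 16, 64)
q_rope = rotary_embed(q)   # RoPE variant
q_nope = q                 # NoPE variant: positions come only from the causal mask
```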
Known Issues & Constraints
- Dataset limitations: Limited diversity and scale compared to large foundation models
- GPU constraints: Training conducted under restricted GPU availability and memory budgets
- Loss fluctuations: Training loss showed fluctuations rather than a perfectly smooth curve
- No RLHF applied: The model has not undergone reinforcement learning from human feedback
- English-centric data distribution: Training data is predominantly English
These factors directly influenced training duration and final model behavior.
Intended Use
- Studying small-scale MoE architectures
- Exploring VGQA-style attention mechanisms
- Evaluating NoPE / RoPE behavior in MoE models
- Educational and exploratory research
Acknowledgements
We would like to thank the dataset providers and the open-source community whose contributions made this work possible.
- Hugging Face for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- HuggingFaceFW for the FineWeb-Edu dataset used during pretraining.
- HuggingFaceH4 for the UltraChat 200K dataset used in supervised fine-tuning.
- CAIS for the MMLU dataset used for auxiliary knowledge and reasoning supervision.
- HuggingFaceTB for the OpenHermes-2.5-H4 dataset used in the final instruction refinement phase.
- Weights & Biases (W&B) for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from The Smol Training Playbook: The Secrets to Building World-Class LLMs, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow.
Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf
We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
Contact
Please use the Hugging Face Discussions tab to connect.
Model tree for SlimFactoryHub/SlimMoE-250M-instruct
- Base model: SlimFactoryHub/SlimMoE-250M-base