SlimMoE-250M-instruct
SlimMoE-250M-instruct is the final instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases. The objective is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.
Motivation
This work explores the following research question:
Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?
SlimMoE-250M was designed to study:
- MoE routing behavior at small scales (see the routing sketch after this list)
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
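To make the first research question concrete, the sketch below shows a generic top-k expert routing layer with the dimensions reported in the table that follows (4 experts, hidden size 768, FFN size 1536). The class name, the top-k value, and the expert structure are illustrative assumptions, not the actual SlimMoE "Adaptive MoE Routing" implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (not the exact SlimMoE code)."""

    def __init__(self, hidden_size=768, ffn_size=1536, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k  # top_k=2 is an assumption; the card only says "Adaptive MoE Routing"
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq, hidden) -> flatten tokens so routing is per token
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                      # (tokens, experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # pick the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```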
Model Summary
| Property | Value |
|---|---|
| Parameters | 250M |
| Architecture | SlimMoEForCausalLM |
| Experts | 4 |
| Layers | 16 |
| Hidden Size | 768 |
| FFN Size | 1536 |
| Attention Heads | 12 |
| Max Context Length | 2048 |
| Routing | Adaptive MoE Routing |
| Dropout | 0.1 |
| Precision | float32 |
| Vocabulary Size | 50,257 |
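A minimal usage sketch is given below. Because SlimMoEForCausalLM is a custom architecture class, loading with `trust_remote_code=True` is assumed here; the generation settings are illustrative rather than recommended values from the authors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SlimFactoryHub/SlimMoE-250M-instruct"

# trust_remote_code is assumed to be required for the custom SlimMoEForCausalLM class.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # the card reports float32 precision
    trust_remote_code=True,
)

prompt = "Explain mixture-of-experts routing in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```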
Training Details
Pretraining
This phase focused on general language modeling using high-quality educational data.
- Dataset: HuggingFaceFW/fineweb-edu
- Split: sample-10BT
- Tokens Used: 5.2B
- Duration: 7 days 16 hours
- GPU: 48GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
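For reference, the pretraining corpus can be streamed as sketched below. The exact preprocessing used for SlimMoE is not published, so this only illustrates how the sample-10BT configuration is typically loaded.

```python
from datasets import load_dataset

# Stream the sample-10BT configuration of FineWeb-Edu instead of downloading it fully.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for i, example in enumerate(stream):
    print(example["text"][:200])  # FineWeb-Edu documents expose a "text" field
    if i == 2:
        break
```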
Fine-Tuning Phase-1 (SFT - Instruction Tuning)
This stage introduces instruction supervision and conversational alignment.
- Dataset: HuggingFaceH4/ultrachat_200k
- Split: train_sft
- Duration: 8 days 8 hours
- GPU: 80GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
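The sketch below shows one common way to turn UltraChat-style conversations (a "messages" list of role/content turns) into training text with a chat template. It assumes the SlimMoE tokenizer defines a chat template, which this card does not confirm.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
tokenizer = AutoTokenizer.from_pretrained(
    "SlimFactoryHub/SlimMoE-250M-instruct", trust_remote_code=True
)

def to_text(example):
    # Each UltraChat example stores the conversation as a list of {"role", "content"} messages.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False, add_generation_prompt=False
        )
    }

dataset = dataset.map(to_text, remove_columns=dataset.column_names)
print(dataset[0]["text"][:300])
```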
Fine-Tuning Phase-2 (SFT - Knowledge & Reasoning)
This stage targets domain knowledge and reasoning performance.
- Dataset: cais/mmlu
- Split: auxiliary_train
- Duration: 8 days 11 hours
- GPU: 48GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
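The following sketch renders an MMLU auxiliary_train row as an instruction-style example, assuming the standard question/choices/answer fields. The prompt wording is an assumption for illustration, not the format used during SlimMoE training.

```python
from datasets import load_dataset

# "auxiliary_train" is loaded here as a configuration of cais/mmlu with a single "train" split.
dataset = load_dataset("cais/mmlu", "auxiliary_train", split="train")

def format_example(row):
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, row["choices"]))
    prompt = f"Question: {row['question']}\n{options}\nAnswer:"
    return {"prompt": prompt, "target": letters[row["answer"]]}

print(format_example(dataset[0])["prompt"])
```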
Fine-Tuning Phase-3 (SFT - Instruction Refinement)
Focused on response quality, instruction clarity, and consistency.
- Dataset: HuggingFaceTB/OpenHermes-2.5-H4
- Duration: 5 days 1 hour
- GPU: 48GB NVIDIA A100
- Training Logs: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
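As a general illustration of supervised fine-tuning at this stage, the sketch below masks prompt tokens out of the loss so only the response is learned. This is a common SFT pattern, not a confirmed detail of the SlimMoE training code.

```python
import torch

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def build_labels(tokenizer, prompt, response):
    """Tokenize prompt + response and mask the prompt so the loss covers only the response."""
    eos = tokenizer.eos_token or ""  # some tokenizers may not define an EOS token
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + eos, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```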
VGQA & Positional Encoding Experiments
- The model was trained using a VGQA-style attention mechanism.
- Experiments were conducted with NoPE / RoPE positional strategies within a small MoE architecture.
- The objective was to evaluate training stability and output quality, not to optimize benchmark performance.
Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.
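To make the NoPE / RoPE comparison concrete, the sketch below applies rotary embeddings to a query (or key) tensor, while the NoPE setting simply skips this step and lets position information come only from the causal mask. The dimensions follow the table above, but the code is illustrative rather than the model's actual implementation.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply RoPE to a (batch, heads, seq, head_dim) tensor; NoPE omits this call."""
    b, h, seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate channel pairs by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: 12 heads with head_dim = 768 / 12 = 64, as in the table above.
q = torch.randn(1, 12, 16, 64)
q_rope = rotary_embed(q)   # RoPE variant
q_nope = q                 # NoPE variant: positions come only from the causal mask
```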
Known Issues & Constraints
- Dataset limitations: Limited diversity and scale compared to large foundation models
- GPU constraints: Training conducted under restricted GPU availability and memory budgets
- Loss fluctuations: Training loss showed fluctuations rather than a perfectly smooth curve
- No RLHF applied: The model has not undergone reinforcement learning from human feedback
- English-centric data distribution: Training data is predominantly English
These factors directly influenced training duration and final model behavior.
Intended Use
- Studying small-scale MoE architectures
- Exploring VGQA-style attention mechanisms
- Evaluating NoPE / RoPE behavior in MoE models
- Educational and exploratory research
Acknowledgements
We would like to thank the dataset providers and the open-source community whose contributions made this work possible.
- Hugging Face for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- HuggingFaceFW for the FineWeb-Edu dataset used during pretraining.
- HuggingFaceH4 for the UltraChat 200K dataset used in supervised fine-tuning.
- CAIS for the MMLU dataset used for auxiliary knowledge and reasoning supervision.
- HuggingFaceTB for the OpenHermes-2.5-H4 dataset used in the final instruction refinement phase.
- Weights & Biases (W&B) for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from The Smol Training Playbook: The Secrets to Building World-Class LLMs, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow.
Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf
We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
Contact
Please use the Hugging Face Discussions tab to connect.
Model tree for SlimFactoryHub/SlimMoE-250M-instruct
- Base model: SlimFactoryHub/SlimMoE-250M-base