# RnJ-1-Instruct-FP8

## Model Description
This is an FP8-quantized version of [DoradusAI/RnJ-1-Instruct](https://huggingface.co/DoradusAI/RnJ-1-Instruct), created with [llmcompressor](https://github.com/vllm-project/llm-compressor) (Neural Magic).
**Key Benefits:**
- ~50% smaller model size (~8 GB vs ~16 GB)
- Native FP8 inference on Ada Lovelace, Hopper, and Blackwell GPUs
- Single-GPU deployment on 12 GB+ consumer cards (RTX 4070 and newer; pre-Ada cards such as the RTX 3060 can still run the checkpoint, but via weight-only fallback kernels rather than native FP8)
- Native vLLM and SGLang support
- Minimal quality loss with FP8 dynamic quantization
## Key Features

RnJ-1-Instruct is a Gemma3-based instruction-following model with:
- **Strong Math Performance**: 87.19% on GSM8K (5-shot)
- **Multi-Domain Knowledge**: 44.45% on MMLU-Pro
- **Efficient Architecture**: only 8B parameters, fast inference
- **32K Context**: extended context window for long documents
## Quantization Details
| Property | Value |
|---|---|
| Quantization Method | FP8 Dynamic (W8A8) |
| Weights Precision | FP8 E4M3 (8-bit) |
| Activations Precision | FP8 E4M3 (8-bit, dynamic) |
| Ignored Layers | lm_head (kept in BF16) |
| Quantization Tool | llmcompressor 0.12.2 |
| Original Model Size | ~16GB |
| Quantized Model Size | ~8GB |
### Quantization Recipe

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
```
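The same recipe is serialized into the checkpoint's `config.json` under `quantization_config` (compressed-tensors format), so it can be inspected without loading the model. A minimal sketch, assuming the standard Hugging Face Hub file layout:

```python
import json

from huggingface_hub import hf_hub_download

# Download only config.json and print the quantization settings
# (targets, ignored modules, FP8 scheme) written by llmcompressor.
cfg_path = hf_hub_download("Doradus/RnJ-1-Instruct-FP8", "config.json")
with open(cfg_path) as f:
    quant_cfg = json.load(f).get("quantization_config", {})
print(json.dumps(quant_cfg, indent=2))
```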
## Quick Start with Docker

The easiest way to run this model: no setup is required beyond Docker with the NVIDIA container runtime.
### Docker Compose (Recommended)

```bash
# Download docker-compose.yml
wget https://huggingface.co/Doradus/RnJ-1-Instruct-FP8/raw/main/docker/docker-compose.yml

# Run on a single GPU (12 GB+ recommended)
docker compose up

# Or pin a specific GPU
GPU_ID=0 docker compose up
```
### Docker Run

```bash
# Single GPU (12 GB+ VRAM recommended)
docker run --gpus '"device=0"' -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  --shm-size=4g \
  vllm/vllm-openai:v0.12.0 \
  --model Doradus/RnJ-1-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```
### Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Doradus/RnJ-1-Instruct-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
## Usage

### vLLM (Recommended)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Doradus/RnJ-1-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code
```
### SGLang

```bash
python -m sglang.launch_server \
  --model-path Doradus/RnJ-1-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tp 1
```
### Python Example

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    # Must match the name the server registered, i.e. the --model value above
    model="Doradus/RnJ-1-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```
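For batch or offline work, the same checkpoint can be loaded through vLLM's Python API instead of a server. A minimal sketch (the prompt and sampling parameters are illustrative):

```python
from vllm import LLM, SamplingParams

# vLLM detects the FP8 compressed-tensors config from the model files;
# no extra quantization flags are needed.
llm = LLM(
    model="Doradus/RnJ-1-Instruct-FP8",
    max_model_len=8192,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.chat(
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```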
## Architecture Details

This is a dense transformer model based on the Gemma3 architecture:
| Property | Value |
|---|---|
| Total Parameters | ~8B |
| Hidden Size | 4096 |
| Attention Heads | 32 |
| KV Heads (GQA) | 8 |
| Layers | 32 |
| Intermediate Size | 16384 |
| Max Context | 32,768 tokens |
| Vocabulary | 128,256 tokens |
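These values can be cross-checked against the checkpoint's config. A quick sketch, assuming the standard transformers attribute names for Gemma3/Llama-style configs:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Doradus/RnJ-1-Instruct-FP8", trust_remote_code=True)
# getattr with a default hedges against attribute-name differences
for name in (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "intermediate_size",
    "max_position_embeddings",
    "vocab_size",
):
    print(f"{name}: {getattr(cfg, name, 'n/a')}")
```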
## Hardware Requirements

### VRAM Analysis

Model weights: ~8 GB (vs ~16 GB for the BF16 original)

| Context Length | KV Cache (FP16) | Total VRAM | Example GPU |
|---|---|---|---|
|---|---|---|---|
| 4K tokens | ~0.3 GB | ~9 GB | RTX 3060 12GB |
| 8K tokens | ~0.6 GB | ~9 GB | RTX 4070 12GB |
| 16K tokens | ~1.2 GB | ~10 GB | RTX 4080 16GB |
| 32K tokens | ~2.4 GB | ~11 GB | RTX 4090 24GB |
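The KV-cache column follows from the architecture table above. A back-of-the-envelope sketch (head dim assumed to be hidden_size / heads = 128); a full FP16 cache computed this way comes out somewhat larger than the rounded estimates in the table, which appear to assume additional savings, but the scaling with context length is the point:

```python
# Per-token KV cache for a GQA model:
# 2 (K and V) x layers x kv_heads x head_dim x bytes_per_value
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2  # FP16; use 1 for an FP8 KV cache

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
for ctx in (4096, 8192, 16384, 32768):
    gib = per_token * ctx / 1024**3
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB KV cache")
```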
### Recommended Configurations
| GPU Setup | Max Context | Performance | Notes |
|---|---|---|---|
| 1x RTX 3060 (12GB) | ~8K tokens | ~50 tok/s | Consumer budget |
| 1x RTX 4070 (12GB) | ~8K tokens | ~80 tok/s | Consumer mid-range |
| 1x RTX 4080 (16GB) | ~16K tokens | ~100 tok/s | Recommended consumer |
| 1x RTX 4090 (24GB) | ~32K tokens | ~120 tok/s | Full context |
| 1x RTX 6000 Ada (48GB) | 32K tokens | ~150 tok/s | Professional |
**Note**: Native FP8 inference requires CUDA compute capability 8.9 or higher (Ada Lovelace, Hopper, Blackwell); older GPUs fall back to slower weight-only paths.
## Quality & Performance

### FP8 Quantized Benchmarks (lm-evaluation-harness)
| Benchmark | Score | Notes |
|---|---|---|
| GSM8K (5-shot strict) | 87.19% | Math reasoning |
| MMLU-Pro | 44.45% | Multi-domain knowledge |
| IFEval (prompt-strict) | 55.27% | Instruction following |
*Benchmarked 2025-12-07 on an RTX PRO 6000 Blackwell (96 GB) using lm-evaluation-harness with vLLM 0.12.0.*
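These rows should be reproducible with the harness's Python API. A sketch for the GSM8K entry (task names and `model_args` follow standard lm-evaluation-harness conventions; the exact flags of the published run are not recorded here):

```python
import lm_eval

# GSM8K, 5-shot, via the vLLM backend; swap the task for "mmlu_pro"
# or "ifeval" to cover the other rows.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=Doradus/RnJ-1-Instruct-FP8,"
        "max_model_len=8192,trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```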
### MMLU-Pro Category Breakdown
| Category | Score |
|---|---|
| Biology | 63.18% |
| Psychology | 56.64% |
| Economics | 54.98% |
| Math | 54.92% |
| Computer Science | 47.56% |
| Business | 46.89% |
| Physics | 45.11% |
| Philosophy | 41.88% |
| Health | 39.61% |
| History | 37.80% |
| Chemistry | 37.72% |
| Engineering | 37.67% |
| Law | 21.98% |
## Reproduction

To reproduce this quantization:

```python
#!/usr/bin/env python3
"""
Quantize RnJ-1-Instruct to FP8 using llmcompressor (Neural Magic).

FP8 dynamic quantization needs no calibration data, so conversion is fast,
and the output checkpoint loads directly in vLLM.
"""
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_PATH = "DoradusAI/RnJ-1-Instruct"
OUTPUT_PATH = "./RnJ-1-Instruct-FP8"

# Quantize every Linear layer to FP8 (E4M3) weights with dynamic FP8
# activations; keep lm_head in BF16 to protect output quality.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model=MODEL_PATH,
    recipe=recipe,
    output_dir=OUTPUT_PATH,
    save_compressed=True,
)
```

Requirements:

```bash
pip install llmcompressor torch transformers accelerate
```
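After conversion, it is worth spot-checking that the linear weights were actually stored in FP8 while `lm_head` stayed in BF16. A rough sketch (tensor names depend on the compressed-tensors layout, so treat this as a sanity check, not a test):

```python
import glob

from safetensors import safe_open

# Print the dtype of the first few tensors in each shard of the output.
for path in sorted(glob.glob("./RnJ-1-Instruct-FP8/*.safetensors")):
    with safe_open(path, framework="pt") as f:
        for name in list(f.keys())[:5]:
            print(name, f.get_tensor(name).dtype)
```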
## Original Model

This quantization is based on [DoradusAI/RnJ-1-Instruct](https://huggingface.co/DoradusAI/RnJ-1-Instruct).

RnJ-1 is DoradusAI's first instruction-tuned model, built on Google's Gemma3 architecture. Key features:
- Strong math and reasoning capabilities
- Efficient 8B parameter count
- 32K context window
- Optimized for instruction following
## License

This model inherits the Gemma license from the original Gemma3 model.
## Acknowledgements
- Google for the Gemma3 base model
- Neural Magic / vLLM for llmcompressor
- DoradusAI for the fine-tuning and FP8 quantization