dystrio/Llama-3.1-8B-Instruct-sculpt-experimental
28% smaller, 28% faster prefill, drop-in replacement. No custom kernels. No runtime changes.
Dystrio Sculpt structurally compresses transformer models into dense models that load with the standard HuggingFace Transformers library: no custom code, no new ops, no deployment friction.
This is the Experimental tier of Llama 3.1 8B Instruct.
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the compressed checkpoint the same way as any other Transformers causal LM.
model = AutoModelForCausalLM.from_pretrained("dystrio/Llama-3.1-8B-Instruct-sculpt-experimental", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("dystrio/Llama-3.1-8B-Instruct-sculpt-experimental")

# Tokenize a prompt, generate up to 100 new tokens, and decode the result.
inputs = tokenizer("The future of AI inference is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Benchmark Results
All tiers are compiled from Llama 3.1 8B Instruct and benchmarked on an A100 80GB in bf16:
| Model | PPL | PPL Ratio | Weights (GB) | Chat Prefill TPS | RAG TTFT p95 (ms) | Decode TPS |
|---|---|---|---|---|---|---|
| Baseline | 13.8879 | 1.0 | 14.957527 | 10570.4 | 126.745 | 66.8 |
| sculpt-default | 14.7778 | 1.0641 | 13.457527 | 11418.6 | 116.957 | 65.5 |
| sculpt-production | 21.9236 | 1.5786 | 11.863777 | 12760.5 | 112.529 | 66.7 |
| sculpt-throughput | 27.7463 | 1.9979 | 11.020027 | 13408.6 | 104.086 | 67.5 |
| sculpt-experimental | 29.3853 | 2.1159 | 10.832527 | 13483.3 | 103.432 | 67.4 |
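
The headline percentages on this card can be re-derived from the benchmark table. The sketch below copies the table's figures and computes each tier's signed change relative to the baseline row; the names `RESULTS`, `pct_change`, and `relative_to_baseline` are illustrative and not part of any Dystrio tooling.

```python
# Figures copied from the benchmark table.
# Tuple order: weights (GB), chat prefill TPS, RAG TTFT p95 (ms), decode TPS.
RESULTS = {
    "Baseline":            (14.957527, 10570.4, 126.745, 66.8),
    "sculpt-default":      (13.457527, 11418.6, 116.957, 65.5),
    "sculpt-production":   (11.863777, 12760.5, 112.529, 66.7),
    "sculpt-throughput":   (11.020027, 13408.6, 104.086, 67.5),
    "sculpt-experimental": (10.832527, 13483.3, 103.432, 67.4),
}


def pct_change(value: float, baseline: float) -> float:
    """Signed percent change of `value` relative to `baseline`."""
    return (value / baseline - 1.0) * 100.0


def relative_to_baseline(results: dict) -> dict:
    """Map each non-baseline tier to its per-metric percent change."""
    base = results["Baseline"]
    return {
        name: tuple(round(pct_change(v, b), 1) for v, b in zip(row, base))
        for name, row in results.items()
        if name != "Baseline"
    }


DELTAS = relative_to_baseline(RESULTS)

for name, (weights, prefill, ttft, decode) in DELTAS.items():
    print(f"{name}: weights {weights:+.1f}%, prefill {prefill:+.1f}%, "
          f"TTFT p95 {ttft:+.1f}%, decode {decode:+.1f}%")
```

For weights and TTFT a negative change is the improvement; for prefill and decode TPS a positive change is.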
Key Metrics (this model)
| Metric | Value |
|---|---|
| Weights memory | 10.832527 GB (28% smaller) |
| PPL ratio | 2.1159 |
| Chat prefill TPS | 13483.3 (+28%) |
| RAG TTFT p95 | 103.432 ms (-18%) |
| Decode TPS | 67.4 (flat) |
| Parameters | 5.82B |
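
As a sanity check, the parameter count can be recovered from the weights memory. This is a rough sketch under two assumptions that the card does not state explicitly: that "GB" here means binary gigabytes (2**30 bytes), and that every parameter is stored in bf16 (2 bytes). Under those assumptions the figures line up with the 5.82B reported above; with decimal gigabytes they would not.

```python
GIB = 2**30                  # assumption: the card's "GB" is 2**30 bytes
BYTES_PER_BF16_PARAM = 2     # bf16 stores each parameter in 2 bytes


def params_from_weights(weights_gib: float,
                        bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Estimate the parameter count from the weights memory footprint."""
    return weights_gib * GIB / bytes_per_param


experimental_params = params_from_weights(10.832527)
baseline_params = params_from_weights(14.957527)

print(f"sculpt-experimental: {experimental_params / 1e9:.2f}B parameters")
print(f"baseline:            {baseline_params / 1e9:.2f}B parameters")
```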
All Sculpt Tiers
| Tier | HuggingFace | Size | PPL Ratio | Use Case |
|---|---|---|---|---|
| default | dystrio/Llama-3.1-8B-Instruct-sculpt-default | 13.457527 GB | 1.0641 | Zero-regret: quality preserved, smaller footprint |
| production | dystrio/Llama-3.1-8B-Instruct-sculpt-production | 11.863777 GB | 1.5786 | Practical savings with modest quality tradeoff |
| throughput | dystrio/Llama-3.1-8B-Instruct-sculpt-throughput | 11.020027 GB | 1.9979 | Maximum usable compression for speed/edge |
| experimental | dystrio/Llama-3.1-8B-Instruct-sculpt-experimental 👈 this model | 10.832527 GB | 2.1159 | Boundary exploration, maximum structural compression |
What is Dystrio Sculpt?
Dystrio Sculpt compiles transformer models into smaller, faster variants. Output models:
- Are dense (not sparse) — standard architecture, fewer parameters
- Load with standard HuggingFace Transformers — no custom code needed
- Require no custom kernels and no runtime changes
- Work as a one-step compile before deployment
- Stack with quantization (AWQ, GPTQ, GGUF) for compound savings
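
To get a feel for how structural compression and quantization compound, the sketch below estimates weight memory at different bit widths. It is a back-of-the-envelope lower bound, not a measurement: it assumes every parameter is stored at the given bit width and ignores quantization scales, zero points, and any layers a real AWQ, GPTQ, or GGUF pipeline keeps at higher precision. The helper `quantized_weights_gib` is hypothetical.

```python
def quantized_weights_gib(num_params: float, bits_per_weight: float) -> float:
    """Idealized weight memory in GiB, ignoring all quantization overhead."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 5.82e9  # parameter count reported on this card

for bits in (16, 8, 4):
    size = quantized_weights_gib(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit: about {size:.2f} GiB (idealized)")
```

Actual quantized checkpoints will be somewhat larger than these estimates, so treat them only as a starting point for capacity planning.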
Compatibility
- ✅ HuggingFace Transformers
- ✅ vLLM
- ✅ TGI (Text Generation Inference)
- ✅ llama.cpp / GGUF conversion
- ✅ AWQ / GPTQ quantization
- ✅ Any framework that loads standard safetensors
Benchmark Environment
- GPU: NVIDIA A100-SXM4-80GB
- dtype: bf16
- Torch: 2.10.0+cu128
- Transformers: 5.3.0
- Deterministic: True
- Single-GPU, standard HuggingFace Transformers, no custom kernels.
Metric Definitions
- PPL ratio: WikiText-103 perplexity divided by the baseline's perplexity (lower = better). <1.0 = quality improved; >1.0 = quality degraded.
- Prefill TPS: Tokens per second during prompt encoding (higher = faster).
- TTFT p95: Time to first token at 95th percentile (lower = faster).
- Decode TPS: Tokens per second during generation (higher = faster).
- Weights (GB): Model parameter memory (deterministic, runtime-independent).
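
The definitions above can be written out as small helpers. This is a minimal sketch rather than the benchmark harness used for this card: the card does not say how the 95th percentile was computed, so the nearest-rank method here is an assumption, and all function names are illustrative.

```python
import math


def perplexity(token_nlls: list) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_ratio(model_ppl: float, baseline_ppl: float) -> float:
    """Model perplexity relative to baseline; above 1.0 means worse."""
    return model_ppl / baseline_ppl


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Throughput for a prefill or decode phase."""
    return num_tokens / elapsed_s


def percentile_nearest_rank(samples: list, pct: float) -> float:
    """Nearest-rank percentile (assumed method), e.g. pct=95 for TTFT p95."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Reproduce this model's PPL ratio from the two perplexities in the table.
print(round(ppl_ratio(29.3853, 13.8879), 4))
```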
Citation
@misc{dystrio_sculpt_2026,
title={Dystrio Sculpt: Structural Compilation for Transformer LLMs},
author={Dystrio},
year={2026},
url={https://huggingface.co/dystrio}
}
Downstream Benchmarks (lm-eval)
Evaluated with lm-eval-harness on A100-80GB, bf16, zero-shot.
| Benchmark | Baseline | This Model | Delta |
|---|---|---|---|
| ARC-Challenge | 0.5358 | 0.3473 | -0.1885 |
| HellaSwag | 0.5977 | 0.4305 | -0.1672 |
| MMLU | 0.6844 | 0.3237 | -0.3607 |
| TruthfulQA MC2 | 0.5456 | 0.4811 | -0.0645 |
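
To put the deltas in relative terms, the sketch below recomputes the Delta column and adds the share of the baseline score that this model retains. "Retention" is a convenience measure introduced here for illustration; it is not an lm-eval-harness metric.

```python
# (baseline, this model) scores copied from the table above.
SCORES = {
    "ARC-Challenge":  (0.5358, 0.3473),
    "HellaSwag":      (0.5977, 0.4305),
    "MMLU":           (0.6844, 0.3237),
    "TruthfulQA MC2": (0.5456, 0.4811),
}


def delta_and_retention(baseline: float, model: float) -> tuple:
    """Return (absolute delta, percent of the baseline score retained)."""
    return round(model - baseline, 4), round(model / baseline * 100, 1)


SUMMARY = {task: delta_and_retention(*pair) for task, pair in SCORES.items()}

for task, (delta, retained) in SUMMARY.items():
    print(f"{task}: delta {delta:+.4f}, retains {retained:.1f}% of baseline")
```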
Model tree for dystrio/Llama-3.1-8B-Instruct-sculpt-experimental
- Base model: meta-llama/Llama-3.1-8B
- Finetuned: meta-llama/Llama-3.1-8B-Instruct
Dataset used to train dystrio/Llama-3.1-8B-Instruct-sculpt-experimental
Evaluation results
- perplexity on WikiText-103 (validation): 29.385 (self-reported)
- ppl_ratio on WikiText-103 (validation): 2.116 (self-reported)