Instructions for using seraphimserapis/gemma-4-31B-it-NVFP4 with libraries and local apps.

Libraries

How to use seraphimserapis/gemma-4-31B-it-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="seraphimserapis/gemma-4-31B-it-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("seraphimserapis/gemma-4-31B-it-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("seraphimserapis/gemma-4-31B-it-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use seraphimserapis/gemma-4-31B-it-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "seraphimserapis/gemma-4-31B-it-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seraphimserapis/gemma-4-31B-it-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/seraphimserapis/gemma-4-31B-it-NVFP4

SGLang

How to use seraphimserapis/gemma-4-31B-it-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "seraphimserapis/gemma-4-31B-it-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seraphimserapis/gemma-4-31B-it-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "seraphimserapis/gemma-4-31B-it-NVFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use seraphimserapis/gemma-4-31B-it-NVFP4 with Docker Model Runner:
```
docker model run hf.co/seraphimserapis/gemma-4-31B-it-NVFP4
```

Gemma 4 31B IT — NVFP4 (W4A4)

Model Architecture: Gemma4ForConditionalGeneration (google/gemma-4-31B-it)
- Input: Text / Image
- Output: Text
Model Optimizations:
- Weight quantization: FP4 (NVFP4, group_size=16)
- Activation quantization: FP4 (NVFP4, group_size=16)
Quantization Tool: Intel AutoRound v0.13.0
Packing Format: compressed-tensors (llm_compressor compatible)
Release Date: 2026-05-06

Description

This model is a quantized version of google/gemma-4-31B-it.

Weights and activations of all linear layers in the language model's transformer blocks are quantized to NVIDIA FP4 (E2M1) format using Intel AutoRound with the NVFP4 scheme. This reduces the model size from ~62 GB (BF16) to ~20 GB, roughly a 3× reduction in disk footprint and in the GPU memory needed to hold the weights.
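
The size figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a round 31 billion parameters (taken from the model name, not an exact count) and, as a simplification, treats every parameter as quantized. It ignores the per-tensor FP32 scales and the layers kept in BF16, so it should be read as a lower bound rather than the actual checkpoint size.

```python
# Rough size estimate. PARAMS is an assumption based on the model name.
PARAMS = 31e9
GROUP_SIZE = 16        # elements sharing one scale
FP4_BITS = 4           # E2M1 payload per element
GROUP_SCALE_BITS = 8   # one FP8 E4M3 scale per group


def bits_per_weight(payload_bits, scale_bits, group_size):
    """Average storage cost per element, amortizing the per-group scale."""
    return payload_bits + scale_bits / group_size


def size_gb(params, bits):
    """Size in decimal gigabytes for `params` elements at `bits` bits each."""
    return params * bits / 8 / 1e9


bf16_gb = size_gb(PARAMS, 16)
nvfp4_bits = bits_per_weight(FP4_BITS, GROUP_SCALE_BITS, GROUP_SIZE)
nvfp4_gb = size_gb(PARAMS, nvfp4_bits)

print(f"BF16:  {bf16_gb:.1f} GB")
print(f"NVFP4: {nvfp4_gb:.1f} GB at {nvfp4_bits} bits per weight")
```

The estimate comes out below the reported ~20 GB, which is consistent with part of the model staying in BF16.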

The following layers are not quantized and remain in their original BF16 precision:

- Vision tower (all 27 encoder layers)
- Vision embedding projection
- Language model embedding (embed_tokens)
- Output head (lm_head)

Serving with vLLM

This model is ready for inference with vLLM using the compressed-tensors quantization format.

Basic usage

vllm serve seraphimserapis/gemma-4-31B-it-NVFP4 \
    --max-model-len 32768

With reasoning and tool calling

vllm serve seraphimserapis/gemma-4-31B-it-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --enable-auto-tool-choice
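
With the server started as above, a request can advertise tools through the `tools` field of the OpenAI-compatible chat API. The following is a minimal sketch, not a tested recipe for this model: the `get_weather` tool and the helper names `build_request` and `ask_with_tools` are hypothetical, and the server address assumes vLLM's default port.

```python
import json

MODEL = "seraphimserapis/gemma-4-31B-it-NVFP4"

# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(question):
    """Assemble keyword arguments for client.chat.completions.create."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def ask_with_tools(question, base_url="http://localhost:8000/v1"):
    """Send the request. Needs the `openai` package and a running server."""
    from openai import OpenAI  # imported lazily so build_request works offline

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(**build_request(question))
    message = response.choices[0].message
    # tool_calls is None when the model answers in plain text instead.
    for call in message.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments))
    return message


request = build_request("What is the weather in Paris?")
print(json.dumps(request, indent=2))
```

Calling `ask_with_tools("What is the weather in Paris?")` against a live server prints any tool calls the model emits. Executing the tool and returning its result to the model is left to the caller.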

Sending requests

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="seraphimserapis/gemma-4-31B-it-NVFP4",
    messages=[
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
    ],
)
print(response.choices[0].message.content)
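
Since the model also accepts image input, the same client can send a multimodal message. This is a hedged sketch: the helper names are hypothetical, the image URL is reused from the curl example earlier on this page, and it assumes the server was not started with image input disabled.

```python
MODEL = "seraphimserapis/gemma-4-31B-it-NVFP4"
IMAGE_URL = (
    "https://cdn.britannica.com/61/93061-050-99147DCE/"
    "Statue-of-Liberty-Island-New-York-Bay.jpg"
)


def build_image_messages(prompt, image_url):
    """Build a single user turn that carries both text and an image URL."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def describe_image(prompt, image_url, base_url="http://localhost:8000/v1"):
    """Send the request. Needs the `openai` package and a running server."""
    from openai import OpenAI  # imported lazily so the builder works offline

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(
        model=MODEL,
        messages=build_image_messages(prompt, image_url),
        max_tokens=64,
    )
    return response.choices[0].message.content


messages = build_image_messages("Describe this image in one sentence.", IMAGE_URL)
```

With a server running, `print(describe_image("Describe this image in one sentence.", IMAGE_URL))` returns the model's description.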

Tip: For text-only workloads, pass --limit-mm-per-prompt image=0 so vLLM skips memory allocation for the vision encoder. Setting --gpu-memory-utilization 0.90 lets vLLM use up to 90% of GPU memory, which leaves more room for the KV cache.

Creation

This model was created using Intel AutoRound v0.13.0 with the NVFP4 quantization scheme:

auto-round google/gemma-4-31B-it \
    --output_dir ./quantized \
    --scheme NVFP4

AutoRound parameters (from the quantization config):

| Parameter | Value |
|---|---|
| `bits` | 4 |
| `group_size` | 16 |
| `data_type` | `nv_fp` |
| `act_data_type` | `nv_fp4_with_static_gs` |
| `act_group_size` | 16 |
| `nsamples` | 64 |
| `seqlen` | 512 |
| `symmetric` | yes (weights and activations) |
| `packing_format` | `auto_round:llm_compressor` |

AutoRound's NVFP4 scheme quantizes both weights and activations to 4-bit NVIDIA FP4 format (E2M1) with two-level scaling:

- Per-group scales (FP8 E4M3, group_size=16) for fine-grained accuracy
- Per-tensor global scales (FP32) for dynamic range
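
The two-level scheme can be illustrated with a small pure-Python sketch. It is a simplified model of the idea, not AutoRound's or vLLM's implementation: group scales are kept as ordinary floats instead of being rounded to FP8 E4M3, the global scale is simply chosen so that every group scale stays within the E4M3 maximum of 448, and the exact scale conventions in real kernels may differ.

```python
# Magnitudes representable in E2M1 (sign is handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0
GROUP_SIZE = 16


def round_to_e2m1(x):
    """Snap x to the nearest E2M1 value, saturating at +/-6."""
    magnitude = min(abs(x), FP4_MAX)
    nearest = min(E2M1_MAGNITUDES, key=lambda v: abs(v - magnitude))
    return -nearest if x < 0 else nearest


def quantize(values):
    """Return (codes, group_scales, global_scale) for a flat list of floats."""
    assert len(values) % GROUP_SIZE == 0
    tensor_amax = max(abs(v) for v in values)
    # Per-tensor scale: keeps every group scale within the FP8 E4M3 range.
    global_scale = tensor_amax / (FP4_MAX * FP8_E4M3_MAX) if tensor_amax else 1.0
    codes, group_scales = [], []
    for start in range(0, len(values), GROUP_SIZE):
        group = values[start:start + GROUP_SIZE]
        group_amax = max(abs(v) for v in group)
        # Per-group scale: maps the group's largest magnitude onto 6.0.
        group_scale = group_amax / FP4_MAX / global_scale
        group_scales.append(group_scale)
        step = group_scale * global_scale
        codes.extend(round_to_e2m1(v / step) if step else 0.0 for v in group)
    return codes, group_scales, global_scale


def dequantize(codes, group_scales, global_scale):
    """Reconstruct approximate values: code * group scale * global scale."""
    return [
        code * group_scales[i // GROUP_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]


weights = [((i * 37) % 64 - 32) / 10 for i in range(32)]
codes, group_scales, global_scale = quantize(weights)
restored = dequantize(codes, group_scales, global_scale)
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Because each group of 16 values gets its own scale, a single outlier only coarsens the grid for its own group rather than for the whole tensor.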

The quantization config was converted to the compressed-tensors format for vLLM compatibility. The safetensors weights are identical to AutoRound's llm_compressor packing output — only the metadata in config.json was adjusted.
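
To see what the adjusted metadata looks like, the quantization section of config.json can be read with the standard library. The sketch below assumes the usual Transformers layout with a top-level `quantization_config` key containing a `quant_method` field. The helper name is hypothetical, and the sample file is a stand-in rather than the real config.

```python
import json
import tempfile
from pathlib import Path


def read_quantization_config(config_path):
    """Return the `quantization_config` section of a config.json, or {}."""
    config = json.loads(Path(config_path).read_text())
    return config.get("quantization_config", {})


# Stand-in file for demonstration; the real config.json has many more fields.
sample = {"quantization_config": {"quant_method": "compressed-tensors"}}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps(sample))
    quant = read_quantization_config(path)

print(quant.get("quant_method"))
```

Pointing `read_quantization_config` at a downloaded copy of this repository's config.json shows the actual settings.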

Model Files

| File | Description |
|---|---|
| `model-00001-of-00005.safetensors` through `model-00005-of-00005.safetensors` | Quantized model weights (~20 GB total) |
| `config.json` | Model config with `compressed-tensors` quantization config |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer files (from base model) |
| `chat_template.jinja` | Chat template (from base model) |
| `generation_config.json` | Generation config (from base model) |
| `preprocessor_config.json`, `processor_config.json` | Processor configs (from base model) |

Acknowledgments

- Base model: Google Gemma 4
- Quantization: Intel AutoRound
- Serving: vLLM
- Config format reference: RedHatAI/gemma-4-31B-it-NVFP4
