Gemma 4 31B IT — NVFP4 (W4A4)
- Model Architecture: Gemma4ForConditionalGeneration (google/gemma-4-31B-it)
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4 (NVFP4, group_size=16)
  - Activation quantization: FP4 (NVFP4, group_size=16)
- Quantization Tool: Intel AutoRound v0.13.0
- Packing Format: compressed-tensors (llm_compressor compatible)
- Release Date: 2026-05-06
Description
This model is a quantized version of google/gemma-4-31B-it.
Weights and activations of all linear layers in the language model's transformer blocks are quantized to NVIDIA FP4 (E2M1) format using Intel AutoRound with the NVFP4 scheme. This reduces the model size from ~62 GB (BF16) to ~20 GB, roughly a 3× reduction in disk footprint and in the GPU memory needed to hold the weights.
The following layers are not quantized and remain in their original BF16 precision:
- Vision tower (all 27 encoder layers)
- Vision embedding projection
- Language model embedding (embed_tokens)
- Output head (lm_head)
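The size figures above can be sanity-checked with a little arithmetic. The sketch below assumes roughly 31 billion parameters (inferred from the model name, not an exact count) and ignores the small FP32 per-tensor scales. It computes the floor that would be reached if every parameter were quantized; the published ~20 GB is higher because the layers listed above stay in BF16.

```python
# Back-of-the-envelope size estimate for NVFP4 storage.
# Assumption: ~31e9 parameters, inferred from the model name.


def nvfp4_bits_per_weight(group_size: int = 16, scale_bits: int = 8) -> float:
    """A 4-bit value plus one FP8 scale shared by each group of weights."""
    return 4 + scale_bits / group_size


def size_gb(num_params: float, bits_per_param: float) -> float:
    """Storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 31e9
bf16_gb = size_gb(NUM_PARAMS, 16)
fp4_floor_gb = size_gb(NUM_PARAMS, nvfp4_bits_per_weight())

print(f"BF16: {bf16_gb:.1f} GB")
print(f"NVFP4 if everything were quantized: {fp4_floor_gb:.1f} GB")
```

Under these assumptions the script prints 62.0 GB for BF16 and 17.4 GB for the fully quantized floor, which is consistent with the ~62 GB and ~20 GB quoted in the description.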
Serving with vLLM
This model is ready for inference with vLLM using the compressed-tensors quantization format.
Basic usage
vllm serve seraphimserapis/gemma-4-31B-it-NVFP4 \
--max-model-len 32768
With reasoning and tool calling
vllm serve seraphimserapis/gemma-4-31B-it-NVFP4 \
--max-model-len 32768 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice
Sending requests
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
response = client.chat.completions.create(
    model="seraphimserapis/gemma-4-31B-it-NVFP4",
    messages=[
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
    ],
)
print(response.choices[0].message.content)
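Because the model also accepts images, the same endpoint can take an image alongside text. The following is a minimal sketch, assuming the server from the basic usage command is listening on port 8000. The helper names `build_image_message` and `describe_image` and the example URL are hypothetical, introduced only for illustration.

```python
MODEL = "seraphimserapis/gemma-4-31B-it-NVFP4"


def build_image_message(prompt: str, image_url: str) -> dict:
    """One user turn mixing text and an image, in the OpenAI chat format."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def describe_image(image_url: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send the request; needs the openai package and a running server."""
    from openai import OpenAI  # imported here so the builder works without it

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[build_image_message("Describe this image in one sentence.", image_url)],
    )
    return response.choices[0].message.content


# Placeholder URL; replace it with an image the server can reach.
example = build_image_message("Describe this image in one sentence.",
                              "https://example.com/photo.jpg")
print(example["content"][1]["image_url"]["url"])
```

Calling `describe_image(...)` only makes sense once the server is up; building the message itself has no dependencies.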
Tip: For text-only workloads, pass --limit-mm-per-prompt image=0 to skip vision encoder memory allocation. Use --gpu-memory-utilization 0.90 to maximize KV cache capacity.
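When the server is started with the reasoning and tool calling flags, a request can carry tool definitions in the OpenAI format. The sketch below is illustrative only: the `get_weather` tool and the helper names `build_tool_request` and `ask_with_tools` are hypothetical, and whether the model actually chooses to call the tool depends on the prompt.

```python
MODEL = "seraphimserapis/gemma-4-31B-it-NVFP4"

# Hypothetical tool, defined only for this example.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def ask_with_tools(question: str, base_url: str = "http://localhost:8000/v1"):
    """Send the request; needs the openai package and a running server."""
    from openai import OpenAI  # imported here so the builder works without it

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(**build_tool_request(question))
    message = response.choices[0].message
    # tool_calls is None when the model answers in plain text instead.
    return message.tool_calls or message.content


request = build_tool_request("What is the weather in Paris?")
print(request["tools"][0]["function"]["name"])
```

With the server running, `ask_with_tools("What is the weather in Paris?")` would return either a list of tool calls or a plain text answer.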
Creation
This model was created using Intel AutoRound v0.13.0 with the NVFP4 quantization scheme:
auto-round google/gemma-4-31B-it \
--output_dir ./quantized \
--scheme NVFP4
AutoRound parameters (from the quantization config):
| Parameter | Value |
|---|---|
| bits | 4 |
| group_size | 16 |
| data_type | nv_fp |
| act_data_type | nv_fp4_with_static_gs |
| act_group_size | 16 |
| nsamples | 64 |
| seqlen | 512 |
| symmetric | yes (weights and activations) |
| packing_format | auto_round:llm_compressor |
AutoRound's NVFP4 scheme quantizes both weights and activations to 4-bit NVIDIA FP4 format (E2M1) with two-level scaling:
- Per-group scales (FP8 E4M3, group_size=16) for fine-grained accuracy
- Per-tensor global scales (FP32) for dynamic range
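The two-level scaling described above can be sketched in plain Python. This is a simplified illustration, not AutoRound's implementation: scales come from a plain abs-max rule with no tuning, group scales are kept as Python floats rather than rounded to FP8 E4M3, and ties are broken toward the smaller magnitude. All function names are hypothetical.

```python
# Illustrative two-level NVFP4-style quantization (simplified, see assumptions above).

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0          # largest E2M1 magnitude
FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude


def round_to_e2m1(x: float) -> float:
    """Snap x to the nearest representable E2M1 value, keeping its sign."""
    magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(m - abs(x)))
    return -magnitude if x < 0 else magnitude


def quantize(values, group_size=16):
    """Return (fp4_values, group_scales, global_scale)."""
    tensor_amax = max(abs(v) for v in values)
    # Global scale chosen so that every group scale fits the FP8 E4M3 range.
    global_scale = tensor_amax / (FP4_MAX * FP8_E4M3_MAX) if tensor_amax else 1.0
    fp4_values, group_scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        group_amax = max(abs(v) for v in group)
        group_scale = group_amax / (FP4_MAX * global_scale) if group_amax else 1.0
        step = group_scale * global_scale  # real-valued size of one FP4 unit
        group_scales.append(group_scale)
        fp4_values.extend(round_to_e2m1(v / step) for v in group)
    return fp4_values, group_scales, global_scale


def dequantize(fp4_values, group_scales, global_scale, group_size=16):
    """Multiply each FP4 value by its group scale and the global scale."""
    return [
        q * group_scales[i // group_size] * global_scale
        for i, q in enumerate(fp4_values)
    ]


weights = [0.1 * i - 1.0 for i in range(32)]  # two groups of 16
q, scales, g = quantize(weights)
restored = dequantize(q, scales, g)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"group scales: {len(scales)}, max abs error: {max_error:.4f}")
```

The per-group scale lets a group of small values use the full FP4 range, while the global scale keeps every group scale within what FP8 E4M3 can represent.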
The quantization config was converted to the compressed-tensors format for vLLM compatibility. The safetensors weights are identical to AutoRound's llm_compressor packing output — only the metadata in config.json was adjusted.
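One way to confirm which quantization format a local copy declares is to read config.json directly. The sketch below assumes the usual Hugging Face layout, with a `quantization_config` object containing a `quant_method` field; the source does not spell out these key names, so treat them as assumptions. The demo file is a stand-in written for this example, not the contents of the real config.

```python
import json
import tempfile
from pathlib import Path


def read_quant_method(config_path):
    """Return quantization_config.quant_method, or None if it is absent."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return (config.get("quantization_config") or {}).get("quant_method")


# Stand-in config written for this demo; NOT the contents of the real file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"quantization_config": {"quant_method": "compressed-tensors"}}),
        encoding="utf-8",
    )
    demo_method = read_quant_method(demo_path)

print(demo_method)
```

Pointing `read_quant_method` at the downloaded config.json would show the value that the serving stack reads at load time.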
Model Files
| File | Description |
|---|---|
| model-00001-of-00005.safetensors – model-00005-of-00005.safetensors | Quantized model weights (~20 GB total) |
| config.json | Model config with compressed-tensors quantization config |
| tokenizer.json, tokenizer_config.json | Tokenizer files (from base model) |
| chat_template.jinja | Chat template (from base model) |
| generation_config.json | Generation config (from base model) |
| preprocessor_config.json, processor_config.json | Processor configs (from base model) |
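After a manual download, a quick check that every file in the table is present can catch an interrupted transfer. This is a small sketch using only the file names listed above; the helper names `expected_files` and `missing_files` are hypothetical, and the demo runs against a temporary directory rather than a real download.

```python
import tempfile
from pathlib import Path

OTHER_FILES = (
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "generation_config.json",
    "preprocessor_config.json",
    "processor_config.json",
)


def expected_files(num_shards: int = 5) -> list:
    """Shard names follow the model-XXXXX-of-XXXXX.safetensors pattern."""
    shards = [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
    return shards + list(OTHER_FILES)


def missing_files(model_dir, num_shards: int = 5) -> list:
    """Names from the table that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files(num_shards) if not (root / name).is_file()]


# Demo against a temporary directory that holds only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}", encoding="utf-8")
    missing = missing_files(tmp)

print(f"{len(missing)} of {len(expected_files())} files missing")
```

An empty list from `missing_files` means all twelve files named in the table were found; it does not verify their contents.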
Acknowledgments
- Base model: Google Gemma 4
- Quantization: Intel AutoRound
- Serving: vLLM
- Config format reference: RedHatAI/gemma-4-31B-it-NVFP4