Gemma 4 31B IT — NVFP4 (W4A4)

  • Model Architecture: Gemma4ForConditionalGeneration (google/gemma-4-31B-it)
    • Input: Text / Image
    • Output: Text
  • Model Optimizations:
    • Weight quantization: NVFP4 (FP4 E2M1, group_size=16)
    • Activation quantization: NVFP4 (FP4 E2M1, group_size=16)
  • Quantization Tool: Intel AutoRound v0.13.0
  • Packing Format: compressed-tensors (llm_compressor compatible)
  • Release Date: 2026-05-06

Description

This model is a quantized version of google/gemma-4-31B-it.

Weights and activations of all linear layers in the language model's transformer blocks are quantized to NVIDIA FP4 (E2M1) format using Intel AutoRound with the NVFP4 scheme. This reduces the model size from ~62 GB (BF16) to ~20 GB, an approximate 3× reduction in disk and GPU memory usage.
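The size figures above can be sanity-checked with a short back-of-the-envelope calculation. This is only a sketch: it assumes a nominal 31 billion parameters (read off the model name, not measured), counts one 8-bit group scale per 16 weights, and ignores the per-tensor FP32 scales, which are negligible in size.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures in this card


def bf16_size_gb(n_params):
    """Size of n_params weights stored as BF16 (2 bytes each)."""
    return n_params * 2 / GB


def nvfp4_bits_per_weight(group_size=16, scale_bits=8):
    """4-bit value plus the FP8 group scale amortized over the group."""
    return 4 + scale_bits / group_size


def nvfp4_size_gb(n_params, group_size=16):
    """Size of n_params weights if every one of them were stored as NVFP4."""
    return n_params * nvfp4_bits_per_weight(group_size) / 8 / GB


n_params = 31e9  # assumption: nominal count taken from "31B" in the model name

print(f"{bf16_size_gb(n_params):.1f}")    # 62.0
print(f"{nvfp4_bits_per_weight():.1f}")   # 4.5
print(f"{nvfp4_size_gb(n_params):.1f}")   # 17.4
```

The all-NVFP4 estimate comes out below the reported ~20 GB, which is consistent with some layers staying in BF16.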

The following layers are not quantized and remain in their original BF16 precision:

  • Vision tower (all 27 encoder layers)
  • Vision embedding projection
  • Language model embedding (embed_tokens)
  • Output head (lm_head)

Serving with vLLM

This model is ready for inference with vLLM using the compressed-tensors quantization format.

Basic usage

vllm serve seraphimserapis/gemma-4-31B-it-NVFP4 \
    --max-model-len 32768

With reasoning and tool calling

vllm serve seraphimserapis/gemma-4-31B-it-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --enable-auto-tool-choice
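
When the server is started with the tool-calling flags above, a request can offer tools through the `tools` field of the OpenAI chat-completions format. The sketch below only builds such a request body. The `get_weather` tool and its schema are hypothetical and exist purely for illustration.

```python
import json

MODEL = "seraphimserapis/gemma-4-31B-it-NVFP4"


def build_tool_request(question):
    """Return a chat-completions request body that offers one tool to the model."""
    # Hypothetical tool definition; replace it with a tool your application implements.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


body = build_tool_request("What is the weather in Paris?")
print(json.dumps(body, indent=2))
```

The body can be passed to the OpenAI client as `client.chat.completions.create(**body)`. Any tool calls the model makes are then returned in `response.choices[0].message.tool_calls`.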

Sending requests

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="seraphimserapis/gemma-4-31B-it-NVFP4",
    messages=[
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
    ],
)
print(response.choices[0].message.content)
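
Because the model also accepts image input, a request can mix text and image parts in a single user message. The following is a minimal sketch using the OpenAI chat-completions content-part format. The helper names and the image URL are placeholders, and it assumes the server was started without disabling image inputs.

```python
MODEL = "seraphimserapis/gemma-4-31B-it-NVFP4"


def build_image_messages(prompt, image_url):
    """Build a single user message holding a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def describe_image(image_url, prompt="Describe this image.",
                   base_url="http://localhost:8000/v1"):
    """Send the image request to a running vLLM server and return the reply text."""
    from openai import OpenAI  # imported lazily so the builder above works without it

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(
        model=MODEL,
        messages=build_image_messages(prompt, image_url),
    )
    return response.choices[0].message.content


# Example call (placeholder URL, requires a running server):
# print(describe_image("https://example.com/cat.jpg"))
```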

Tip: For text-only workloads, pass --limit-mm-per-prompt image=0 to skip vision encoder memory allocation. Use --gpu-memory-utilization 0.90 to maximize KV cache capacity.

Creation

This model was created using Intel AutoRound v0.13.0 with the NVFP4 quantization scheme:

auto-round google/gemma-4-31B-it \
    --output_dir ./quantized \
    --scheme NVFP4

AutoRound parameters (from the quantization config):

Parameter        Value
bits             4
group_size       16
data_type        nv_fp
act_data_type    nv_fp4_with_static_gs
act_group_size   16
nsamples         64
seqlen           512
symmetric        yes (weights and activations)
packing_format   auto_round:llm_compressor

AutoRound's NVFP4 scheme quantizes both weights and activations to 4-bit NVIDIA FP4 format (E2M1) with two-level scaling:

  • Per-group scales (FP8 E4M3, group_size=16) for fine-grained accuracy
  • Per-tensor global scales (FP32) for dynamic range
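
The two-level scaling can be illustrated with a simplified quantize-then-dequantize simulation in plain Python. It is a sketch of the general idea, not AutoRound's implementation: the group scales are kept as ordinary floats instead of being rounded to FP8 E4M3, rounding is plain round-to-nearest, and all names are made up for this example.

```python
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
GROUP_SIZE = 16


def round_to_e2m1(x):
    """Round x to the nearest representable E2M1 value, keeping its sign."""
    mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -mag if x < 0 else mag


def fake_quantize(tensor):
    """Quantize and dequantize a flat list of floats with two-level scaling."""
    amax = max(abs(v) for v in tensor)
    if amax == 0:
        return [0.0 for _ in tensor]
    # Level 2: one global scale per tensor, chosen so that every group scale
    # lands inside the E4M3 range (at most 448).
    global_scale = amax / (E2M1_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(tensor), GROUP_SIZE):
        group = tensor[start:start + GROUP_SIZE]
        gmax = max(abs(v) for v in group)
        # Level 1: per-group scale, expressed relative to the global scale.
        # Simplification: a real kernel would round this value to FP8 E4M3.
        group_scale = (gmax / E2M1_MAX) / global_scale
        scale = group_scale * global_scale
        if scale == 0:
            out.extend(0.0 for _ in group)
            continue
        out.extend(round_to_e2m1(v / scale) * scale for v in group)
    return out


values = [(i - 16) / 10 for i in range(32)]  # two groups of 16 values
restored = fake_quantize(values)
worst = max(abs(a - b) for a, b in zip(values, restored))
print(f"worst absolute error: {worst:.4f}")
```

In this simplified model the largest value of each group is reproduced almost exactly, and the error of any other value is bounded by one sixth of that group's maximum, because the widest gap between E2M1 values (4 to 6) spans a third of the range.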

The quantization config was converted to the compressed-tensors format for vLLM compatibility. The safetensors weights are identical to AutoRound's llm_compressor packing output — only the metadata in config.json was adjusted.
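
To check which quantization format a downloaded checkpoint declares, the metadata in config.json can be read directly. The sketch below assumes the usual Hugging Face layout, where the method is stored under `quantization_config.quant_method`; the helper name is made up for this example.

```python
import json
from pathlib import Path


def read_quant_method(config_path):
    """Return the declared quantization method, or None if the config has none."""
    config = json.loads(Path(config_path).read_text())
    # Assumed layout: {"quantization_config": {"quant_method": "..."}}
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


# Example call against a local copy of the repository:
# print(read_quant_method("config.json"))
```

If the layout matches this assumption, the value for this repository should name the compressed-tensors format that vLLM loads.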

Model Files

File Description
model-00001-of-00005.safetensors through model-00005-of-00005.safetensors Quantized model weights (~20 GB total)
config.json Model config with compressed-tensors quantization config
tokenizer.json, tokenizer_config.json Tokenizer files (from base model)
chat_template.jinja Chat template (from base model)
generation_config.json Generation config (from base model)
preprocessor_config.json, processor_config.json Processor configs (from base model)

Acknowledgments
