SFR-Embedding-2_R – 4-bit NF4 mixed precision (bitsandbytes)

Quantized SentenceTransformers pipeline derived from Salesforce/SFR-Embedding-2_R. The original embedding stack is preserved:

Transformer → last-token Pooling → L2 Normalize


Mixed-Precision Map

  • Quantized (4-bit NF4 with double quantization): the large nn.Linear weights in attention (Q/K/V/output) and the MLP blocks
  • Kept in higher precision (FP16/BF16, not quantized): nn.Embedding, LayerNorm/RMSNorm, pooling, L2 normalization
  • Compute dtype: BF16 on A100 (preferred), FP16 elsewhere; activations and KV cache in BF16/FP16

Rationale: quantize the memory-dominant Linear layers; keep numerically sensitive small modules in higher precision to maintain embedding stability.
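
To see the split in practice, a minimal sketch (assuming bitsandbytes is installed and the pipeline was saved to the sfr_quantized folder used in the Quick Start below) is to load the model and count which leaf modules were wrapped as 4-bit Linear4bit versus kept in a floating-point dtype:

```python
from collections import Counter

import bitsandbytes as bnb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sfr_quantized")   # assumed local folder name
backbone = model[0].auto_model                 # underlying Hugging Face transformer

counts = Counter()
for name, module in backbone.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        counts["Linear4bit (NF4)"] += 1        # quantized attention/MLP projections
    elif not list(module.children()) and getattr(module, "weight", None) is not None:
        # leaf modules kept in higher precision: embeddings, norms, etc.
        counts[f"{type(module).__name__} ({module.weight.dtype})"] += 1

for kind, n in counts.most_common():
    print(f"{n:4d}  {kind}")
```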


Directory Layout (this repo/folder)

  • modules.json – SentenceTransformers pipeline graph
  • model.safetensors – quantized backbone weights (runtime-wrapped by bitsandbytes)
  • tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json
  • 1_Pooling/config.json – last-token pooling config
  • 2_Normalize/config.json – L2 normalize config
  • config.json, config_sentence_transformers.json, sentence_bert_config.json
  • quantization_info.json – build metadata (see the snippet after this list)
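
As a quick check on this layout, the sketch below prints the pipeline graph from modules.json and the raw contents of quantization_info.json; the exact fields in the latter depend on how the build script wrote them, and the folder name sfr_quantized matches the Quick Start below.

```python
import json
from pathlib import Path

root = Path("sfr_quantized")  # this folder

# Pipeline graph: expect Transformer -> Pooling -> Normalize, in that order
for step in json.loads((root / "modules.json").read_text()):
    print(step["idx"], step["name"], step["type"])

# Build metadata written at quantization time (fields are build-specific)
print(json.dumps(json.loads((root / "quantization_info.json").read_text()), indent=2))
```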

Quick Start (load locally from this folder)

```python
from sentence_transformers import SentenceTransformer

# Load directly from the saved directory; modules.json wires Pooling/Normalize
model = SentenceTransformer("sfr_quantized")

texts = [
    "The capital of France is Paris.",
    "Quantization preserves the embedding space when the pipeline matches.",
]
emb = model.encode(texts, normalize_embeddings=True)  # 4096-d unit vectors
print(emb.shape)
```

Inspect the pipeline files (optional)
```python
import json, os
root = "sfr_quantized"

print("Has modules.json?", os.path.exists(os.path.join(root, "modules.json")))
with open(os.path.join(root, "1_Pooling", "config.json")) as f:
    print("Pooling config:", json.load(f))  # expect pooling_mode_lasttoken = true

Programmatic Rebuild (same pipeline)

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling, Normalize
from transformers import BitsAndBytesConfig
import torch

root = "sfr_quantized"
compute_dtype = torch.bfloat16 if (torch.cuda.is_available() and "A100" in torch.cuda.get_device_name(0)) else torch.float16

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

backbone = Transformer(
    root,
    model_args={"quantization_config": bnb, "trust_remote_code": True, "dtype": compute_dtype},
    tokenizer_args={"trust_remote_code": True},
)

# IMPORTANT: last-token pooling to match SFR space
pooling = Pooling(
    word_embedding_dimension=backbone.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=False,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
    pooling_mode_lasttoken=True,
)
normalize = Normalize()

st = SentenceTransformer(modules=[backbone, pooling, normalize])
emb = st.encode(["hello world"], normalize_embeddings=True)
print(emb.shape)
```
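
To confirm the rebuilt pipeline really matches the saved folder, a quick check (reusing `st` from the block above and assuming enough GPU memory for a second copy) is to encode the same text with both and compare cosine similarity, which should be ~1.0:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

saved = SentenceTransformer("sfr_quantized")  # loaded via modules.json
text = ["The capital of France is Paris."]

a = st.encode(text, normalize_embeddings=True)[0]       # rebuilt pipeline from above
b = saved.encode(text, normalize_embeddings=True)[0]
print("cosine(rebuilt, saved) =", float(np.dot(a, b)))  # expect ~1.0
```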

Benefits (observed in this build)

  • Memory footprint: roughly 69% smaller than the 13.2 GB FP16 checkpoint under the mixed-precision map above
  • Fidelity: mean cosine similarity ≈ 0.9904 against base SFR on a 10-sentence probe (mean L2 distance between unit-norm embeddings ≈ 0.14); a reproduction sketch follows this list
  • Throughput: some dequantization overhead at very small batches; typically higher tokens/sec at larger batches thanks to the freed memory
  • Note: always encode with normalize_embeddings=True and keep last-token pooling for an apples-to-apples comparison with SFR
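
A rough way to reproduce the fidelity numbers is to encode the same probe sentences with the full-precision Salesforce/SFR-Embedding-2_R and with this folder, then average the pairwise cosine similarities and unit-norm L2 distances. The sketch below uses an arbitrary probe list (so the exact values will differ from the 10-sentence probe above) and assumes each model fits in memory on its own:

```python
import gc

import numpy as np
import torch
from sentence_transformers import SentenceTransformer

probe = [
    "The capital of France is Paris.",
    "Quantization preserves the embedding space when the pipeline matches.",
    "Keep small, numerically sensitive modules in FP16/BF16.",
]

def embed(name_or_path):
    model = SentenceTransformer(name_or_path, trust_remote_code=True)
    emb = model.encode(probe, normalize_embeddings=True)
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release GPU memory before the next load
    return emb

base = embed("Salesforce/SFR-Embedding-2_R")  # full-precision reference
quant = embed("sfr_quantized")                # this folder

cos = (base * quant).sum(axis=1)              # row-wise cosine (unit-norm inputs)
l2 = np.linalg.norm(base - quant, axis=1)
print("mean cosine:", cos.mean(), "| mean unit-norm L2:", l2.mean())
```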

Last updated: 2025-09-23T23:17:42.767778Z
