# SFR-Embedding-2_R: 4-bit NF4 mixed precision (bitsandbytes)
Quantized SentenceTransformers pipeline derived from
Salesforce/SFR-Embedding-2_R.
The original embedding stack is preserved:
Transformer → last-token Pooling → L2 Normalize
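For intuition, last-token pooling takes the hidden state of the final non-padding token of each sequence, and the result is then L2-normalized. A minimal, illustrative sketch (the actual pipeline uses sentence-transformers' `Pooling` module with `pooling_mode_lasttoken=True`, which also handles left-padded batches; this toy version assumes right padding):

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Toy last-token pooling: pick the hidden state of the last real token per sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.
    """
    last_idx = attention_mask.sum(dim=1) - 1                   # index of the last non-padding token
    batch_idx = torch.arange(hidden_states.size(0))
    pooled = hidden_states[batch_idx, last_idx]                # (batch, dim)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)   # L2 normalize, as in the pipeline
```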
## Mixed-Precision Map

- Quantized (4-bit NF4, double quantization): large `nn.Linear` weights in attention (Q/K/V/out) and the MLPs
- Higher precision (FP16/BF16, not quantized): `nn.Embedding`, LayerNorm/RMSNorm, pooling, L2 normalize
- Compute dtype: BF16 on A100 (preferred), FP16 elsewhere; activations/KV cache in BF16/FP16
Rationale: quantize the memory-dominant Linear layers; keep numerically sensitive small modules in higher precision to maintain embedding stability.
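To see this split on a loaded model, you can count which submodules ended up as bitsandbytes 4-bit layers versus ordinary higher-precision modules. A minimal sketch, assuming the model has been loaded as in the Quick Start below (`model[0].auto_model` is the Hugging Face backbone inside the SentenceTransformer; matching norm classes by name is a heuristic):

```python
import bitsandbytes as bnb
import torch

def summarize_precision(hf_model: torch.nn.Module) -> None:
    """Count 4-bit Linear layers vs. modules kept in higher precision."""
    quantized, kept = 0, {}
    for _, module in hf_model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            quantized += 1
        elif isinstance(module, torch.nn.Embedding) or "Norm" in type(module).__name__:
            kept[type(module).__name__] = kept.get(type(module).__name__, 0) + 1
    print(f"4-bit Linear layers: {quantized}")
    print(f"Higher-precision modules: {kept}")

# Usage, after the Quick Start load:
# summarize_precision(model[0].auto_model)
```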
## Directory Layout (this repo/folder)

- `modules.json`: SentenceTransformers pipeline graph
- `model.safetensors`: quantized backbone weights (runtime-wrapped by bitsandbytes)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json`
- `1_Pooling/config.json`: last-token pooling config
- `2_Normalize/config.json`: L2 normalize config
- `config.json`, `config_sentence_transformers.json`, `sentence_bert_config.json`
- `quantization_info.json`: build metadata
## Quick Start (load locally from this folder)

```python
from sentence_transformers import SentenceTransformer

# Load directly from the saved directory; modules.json wires Pooling/Normalize
model = SentenceTransformer("sfr_quantized")

texts = [
    "The capital of France is Paris.",
    "Quantization preserves the embedding space when the pipeline matches.",
]
emb = model.encode(texts, normalize_embeddings=True)  # 4096-d unit vectors
print(emb.shape)
```
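Because the rows of `emb` are unit vectors, cosine similarity is just a dot product; a quick check on the embeddings from above:

```python
import numpy as np

# emb has shape (2, 4096); unit-norm rows make emb @ emb.T the cosine-similarity matrix
sims = emb @ emb.T
print(np.round(sims, 4))
```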
## Inspect the pipeline files (optional)
```python
import json, os

root = "sfr_quantized"
print("Has modules.json?", os.path.exists(os.path.join(root, "modules.json")))
with open(os.path.join(root, "1_Pooling", "config.json")) as f:
    print("Pooling config:", json.load(f))  # expect pooling_mode_lasttoken = true
```
## Programmatic Rebuild (same pipeline)

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling, Normalize
from transformers import BitsAndBytesConfig

root = "sfr_quantized"

# BF16 on A100 (preferred), FP16 elsewhere
compute_dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and "A100" in torch.cuda.get_device_name(0)
    else torch.float16
)

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

backbone = Transformer(
    root,
    model_args={"quantization_config": bnb, "trust_remote_code": True, "dtype": compute_dtype},
    tokenizer_args={"trust_remote_code": True},
)

# IMPORTANT: last-token pooling to match the SFR embedding space
pooling = Pooling(
    word_embedding_dimension=backbone.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=False,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
    pooling_mode_lasttoken=True,
)
normalize = Normalize()

st = SentenceTransformer(modules=[backbone, pooling, normalize])
emb = st.encode(["hello world"], normalize_embeddings=True)
print(emb.shape)
```
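A quick sanity check on the rebuilt pipeline, continuing from the objects created above, confirms the module order and the pooling mode:

```python
# Expected module order: Transformer -> Pooling -> Normalize
for idx, module in enumerate(st):
    print(idx, type(module).__name__)

# Last-token pooling must stay enabled to remain in the SFR embedding space
print("Last-token pooling enabled:", pooling.pooling_mode_lasttoken)
print("Embedding dimension:", st.get_sentence_embedding_dimension())
```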
## Benefits (observed in this build)

- Memory footprint: 69% reduction relative to the FP16 checkpoint (13.2 GB → mixed-precision equivalent)
- Fidelity: mean cosine ≈ 0.9904 to base SFR on a 10-sentence probe (unit-norm L2 ≈ 0.14)
- Throughput: some overhead at tiny batches; typically improved tokens/sec at larger batches due to the freed memory
- Always use `normalize_embeddings=True` and keep last-token pooling for an apples-to-apples comparison with SFR; a fidelity-probe sketch follows below.
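To reproduce the fidelity probe in spirit, encode the same sentences with this quantized pipeline and with the base model, then compare per-sentence cosine and L2 distances. A sketch, assuming enough memory for the unquantized base model and using an illustrative two-sentence probe (the original 10-sentence probe set is not shipped here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

probe = [
    "The capital of France is Paris.",
    "Quantization preserves the embedding space when the pipeline matches.",
]

quantized = SentenceTransformer("sfr_quantized")
base = SentenceTransformer("Salesforce/SFR-Embedding-2_R")  # full precision; needs far more memory

eq = quantized.encode(probe, normalize_embeddings=True)
eb = base.encode(probe, normalize_embeddings=True)

cos = np.sum(eq * eb, axis=1)          # per-sentence cosine (unit vectors)
l2 = np.linalg.norm(eq - eb, axis=1)   # per-sentence L2 distance
print("mean cosine:", cos.mean(), "mean L2:", l2.mean())
```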
Last updated: 2025-09-23T23:17:42.767778Z
Model: aghatage/SFR-Embedding-2_R-4bit-NF4 (quantized from Salesforce/SFR-Embedding-2_R)