Salamandra-7b-instruct-guard
Salamandra-7b-instruct-guard is a state-of-the-art safety classification model designed for Catalan, Spanish, and English content moderation. Built on the Salamandra-7B-Instruct foundation model from Barcelona Supercomputing Center (BSC), it provides both binary (safe/unsafe) and multiclass safety classification capabilities.
Model Details
Model Description
Salamandra-7b-instruct-guard is a fine-tuned safety guardrail model that classifies text content across multiple safety categories. The model addresses three primary categories of unsafe content: Dangerous Content, Toxic Content, and Sexual Content.
The model is specifically optimized for European languages, with particular emphasis on Catalan—a language often overlooked in existing safety models—alongside Spanish and English.
- Developed by: Barcelona Supercomputing Center (BSC)
- Model type: Text Classification (Safety Guardrail)
- Language(s): Catalan, Spanish, English
- License: Apache 2.0
- Finetuned from model: bsc/salamandra-7b-instruct
Model Sources
- Paper: Salamandra Guard Technical Report
Uses
Direct Use
Salamandra-7b-instruct-guard can be used directly for:
- Content moderation in chat applications
- Safety filtering for LLM responses
- Multi-language safety classification (Catalan, Spanish, English)
- Real-time content screening in production environments
- Binary classification (safe vs unsafe content)
Downstream Use
The model can be integrated into:
- Conversational AI systems as a guardrail layer (see the sketch after this list)
- Content moderation pipelines for social platforms
- Educational technology platforms requiring safety filtering
- Customer service chatbots operating in Spanish/Catalan markets
- Few-shot or zero-shot classification tasks with custom safety categories
- Fine-tuning for domain-specific safety applications
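As a minimal sketch of the guardrail-layer pattern: the repo id and label convention (0 = Safe, 1 = Unsafe) follow the How to Get Started section below, while the wrapper function itself is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Repo id and label convention follow the "How to Get Started" section below;
# the wrapper itself is illustrative, not a documented interface.
MODEL_ID = "bsc/salamandra-guard-v2-binary"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def moderate_reply(reply: str, fallback: str = "I can't help with that.") -> str:
    """Return the assistant reply unchanged if classified safe, else a fallback."""
    inputs = tokenizer(reply, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        logits = model(**inputs).logits
    return fallback if logits.argmax(dim=-1).item() == 1 else reply
```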
Out-of-Scope Use
The model is not designed for:
- Detecting adversarial unsafe requests (acknowledged limitation)
- Legal determination of content violations
- Replacing human moderation in critical safety contexts
- Languages outside Catalan, Spanish, and English
- Real-time detection of rapidly evolving harmful content trends without retraining
Safety Taxonomy
High-Level Categories (C0-C3)
C0: Safe Content
- Content that does not violate any safety policies
C1: Dangerous Content
- Violent crimes (terrorism, murder, assault, child abuse, animal cruelty)
- Gender-based violence and domestic violence
- Suicide and self-harm
- Non-violent crimes (fraud, theft, drug crimes, cybercrimes, weapons violations)
C2: Toxic Content
- Hate speech and discrimination based on protected characteristics
- Harassment, bullying, and doxxing
- Profanity and offensive language (culturally adapted)
C3: Sexual Content
- Sexual offenses and exploitation (trafficking, assault, harassment)
- Sexually explicit content and pornography
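For programmatic use, the taxonomy maps onto the class indices of the multiclass variant. The index convention is documented in the How to Get Started section below; the dictionary itself is just a convenience sketch.

```python
# Class indices follow the multiclass convention documented below:
# 0 = Safe, 1 = Dangerous, 2 = Toxic, 3 = Sexual.
SAFETY_CATEGORIES = {
    0: "C0: Safe Content",
    1: "C1: Dangerous Content",
    2: "C2: Toxic Content",
    3: "C3: Sexual Content",
}
```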
Training Details
Training Data
BSC-LT/salamandra-guard-dataset
The model was trained on a combined dataset of 21,335 safety-related observations:
- salamandra-guard-dataset (5,016 samples): Human-annotated and proofread data in Catalan and Spanish
  - 3 human annotations per sample
  - 2 LLM judge annotations per sample
  - Professional translation with expert proofreading (250 samples)
  - Crowdsourced proofreading for remaining samples
- Machine-translated data (16,319 samples): LLM-annotated translations from Nemotron-Safety-Guard-Dataset-V3
Data split: 80% train / 20% validation-test
Source dataset: nvidia/Nemotron-Safety-Guard-Dataset-V3 (Spanish subset + translations)
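A minimal sketch of reproducing the 80/20 split with the datasets library; the split layout on the Hub is an assumption, and the repo may already ship named splits.

```python
from datasets import load_dataset

# Assumption: the combined data is exposed as a single "train" split that we
# re-divide 80/20, as described above.
dataset = load_dataset("BSC-LT/salamandra-guard-dataset", split="train")
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_set, eval_test_set = splits["train"], splits["test"]
```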
Training Procedure
Fine-tuning approach:
- LoRA (Low-Rank Adaptation) applied to all attention layers
- Maximum sequence length: 8,192 tokens
- Training objectives: Binary classification and multiclass classification
- Two separate models trained: Binary and Multiclass variants
Training hyperparameters:
- Optimizer: AdamW
- Learning rate: 5e-5
- Training epochs: 3
- Batch processing: Distributed Data Parallel across 4× A100 GPUs
- Framework: Hugging Face Accelerate
- Loss function: Cross-entropy for single-label classification; causal language modeling loss for generative classification
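A minimal sketch of this setup with PEFT, assuming Llama-style attention projection names for the base model; the LoRA rank and alpha are assumptions, and only the hyperparameters listed above come from the card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Base model as listed under "Model Description"; num_labels=2 for the
# binary variant (4 for the multiclass variant).
base = AutoModelForSequenceClassification.from_pretrained(
    "bsc/salamandra-7b-instruct", num_labels=2
)

lora_config = LoraConfig(
    r=16,           # rank: assumption, not stated in the card
    lora_alpha=32,  # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # all attention layers
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora_config)
# Training then uses AdamW (lr=5e-5), 3 epochs, and DDP over 4x A100 via Accelerate.
```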
Evaluation
Testing Data
Test set: 722 samples from SalGuard_10k
- Catalan: 210 samples
- Spanish: 211 samples
- English: 300 samples
Metrics
Binary Classification:
- Accuracy: 0.798
- Weighted F1: 0.797
Multiclass Classification (C0-C3):
- Accuracy: 0.733
- Weighted F1: 0.731
Per-class F1 scores (Multiclass):
- C0 (Safe): 0.75
- C1 (Dangerous): 0.75
- C2 (Toxic): 0.64
- C3 (Sexual): 0.78
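These figures correspond to standard scikit-learn metrics; a minimal sketch of how they are computed (the labels below are illustrative, not test-set data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative integer labels (0-3); in practice these come from the test set.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 2, 0, 1]

accuracy = accuracy_score(y_true, y_pred)                   # overall accuracy
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # support-weighted F1
per_class_f1 = f1_score(y_true, y_pred, average=None)       # one F1 per class C0-C3
```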
Results
Comparison with Baseline Models
Binary Classification (Weighted F1):
| Model | Weighted F1 |
|---|---|
| Salamandra-7b-instruct-guard (Binary) | 0.797 |
| LlamaGuard 3 | 0.717 |
| Salamandra 7B (base) | 0.661 |
| ShieldGemma 9B | 0.638 |
Multiclass Classification (Weighted F1):
| Model | Weighted F1 |
|---|---|
| Salamandra-7b-instruct-guard (Multiclass) | 0.731 |
| LlamaGuard 3 | 0.618 |
| ShieldGemma 9B | 0.607 |
| SalGuard V1 | 0.568 |
| Salamandra 7B (base) | 0.420 |
Key improvements:
- +21% relative improvement over base model in binary classification
- +76% relative improvement over base model in multiclass accuracy
- +28% relative improvement over SalGuard V1 in multiclass classification
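These relative gains follow directly from the weighted-F1 tables above, e.g.:

```python
# Relative improvement = (new - old) / old, using weighted F1 from the tables.
binary_vs_base = (0.797 - 0.661) / 0.661  # ≈ 0.206, the reported +21%
multi_vs_v1 = (0.731 - 0.568) / 0.568     # ≈ 0.287, reported as +28%
# (The +76% figure is accuracy-based; base-model accuracy is not tabulated above.)
```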
Language-Specific Performance
The model demonstrates robust cross-language performance:
Binary (Weighted F1):
- Catalan: 0.796
- Spanish: 0.798
Multiclass (Weighted F1):
- Catalan: 0.728
- Spanish: 0.734
Bias, Risks, and Limitations
Known Limitations
- Adversarial robustness: The model focuses on moderating LLM responses and is not robust against adversarial unsafe user requests
- Toxic Content (C2) detection: Lower performance on hate speech, harassment, and profanity compared to dangerous and sexual content categories
- False positives: May flag safe discussions about dangerous topics (e.g., news about crimes) as harmful content
- Context sensitivity: Struggles with subtle contextual distinctions between discussing harmful topics and promoting them
- Crowdsourced data quality: Subset of training data proofread by crowdworkers may have variable linguistic quality compared to expert-reviewed samples
- Annotation disagreement: Significant disagreement exists between human annotators and between humans and LLM judges, reflecting inherent subjectivity in safety classification
Bias Considerations
- Cultural adaptation: Profanity (S6) definitions are culturally adapted for Catalan and Spanish contexts
- Annotator bias: Training labels reflect moderate inter-annotator agreement (Cohen's κ: 0.455-0.506, Krippendorff's α: 0.481)
- LLM judge bias: GPT-4o and Nemotron show divergent safety interpretations; Nemotron is more conservative (higher false positive rate)
Recommendations
Users should:
- Combine model outputs with human review for high-stakes moderation decisions
- Be aware that the model may over-flag benign content mentioning sensitive topics
- Consider domain-specific fine-tuning for specialized applications
- Implement monitoring systems to track false positive and false negative rates in production
- Use the binary model for general safe/unsafe classification and the multiclass model for detailed category identification (see the sketch after this list)
- Understand that safety classification involves inherent subjectivity and cultural context
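A minimal sketch of the binary-then-multiclass pattern recommended above; the repo ids follow the example in the next section, and the helper functions are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CATEGORIES = ["Safe", "Dangerous", "Toxic", "Sexual"]

def load(name):
    return (AutoTokenizer.from_pretrained(name),
            AutoModelForSequenceClassification.from_pretrained(name))

# Cheap safe/unsafe gate first; category identification only for flagged text.
bin_tok, bin_model = load("bsc/salamandra-guard-v2-binary")
multi_tok, multi_model = load("bsc/salamandra-guard-v2-multiclass")

def classify(text: str) -> str:
    inputs = bin_tok(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        if bin_model(**inputs).logits.argmax(-1).item() == 0:
            return "Safe"
    inputs = multi_tok(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        return CATEGORIES[multi_model(**inputs).logits.argmax(-1).item()]
```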
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "bsc/salamandra-guard-v2-binary"  # or salamandra-guard-v2-multiclass
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
text = "Your text to classify here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)

# Binary model: 0 = Safe, 1 = Unsafe
# Multiclass model: 0 = Safe, 1 = Dangerous, 2 = Toxic, 3 = Sexual
print(f"Predicted class: {predicted_class.item()}")
print(f"Confidence scores: {predictions}")
```
For conversational context:
```python
conversation = [
    {"role": "user", "content": "User message here"},
    {"role": "assistant", "content": "Assistant response to classify"},
]

# Format as chat
formatted_text = tokenizer.apply_chat_template(conversation, tokenize=False)
inputs = tokenizer(formatted_text, return_tensors="pt", truncation=True, max_length=8192)

# Classify
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=-1)
```
Model Architecture
- Base architecture: Salamandra-7B-Instruct (7 billion parameters)
- Context length: 8,192 tokens
- Adaptation method: LoRA on all attention layers
- Classification heads: Separate binary and multiclass variants
- Output format: Generative (CausalLM) and discriminative (cross-entropy) modes supported
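The generative (CausalLM) mode would be used roughly as follows; the prompt template, and the assumption that the multiclass repo exposes a CausalLM head, are illustrative rather than a documented interface.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: a CausalLM checkpoint that answers with a category label (C0-C3).
MODEL_ID = "bsc/salamandra-guard-v2-multiclass"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Classify the following text as C0, C1, C2 or C3.\nText: {}\nCategory:"
inputs = tok(prompt.format("Your text to classify here"), return_tensors="pt")
with torch.no_grad():
    out = lm.generate(**inputs, max_new_tokens=3, do_sample=False)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```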
Environmental Impact
- Hardware Type: 4× NVIDIA A100 GPUs
- Training time: 3 epochs (exact hours not specified)
- Carbon emissions: Not yet calculated
Citation
BibTeX:
```bibtex
@techreport{salamandra_guard_v2_2025,
  title       = {Salamandra Guard: Technical Report},
  author      = {Alinia},
  year        = {2025},
  institution = {Barcelona Supercomputing Center (BSC)}
}
```
Additional Information
Contact
For further information, please send an email to langtech@bsc.es.
Copyright
Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.
License
This model is distributed under the Apache License, Version 2.0.
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (EU – NextGenerationEU) within the framework of the project ILENIA, reference 2022/TL22/00215337.