Qwen3-32B Probe-Obfuscation Adapters
Four LoRA adapters from the "probe obfuscation via SDF" experiment, all on Qwen/Qwen3-32B. They test whether finetuning on documents that describe an activation-space deception probe ("Sentinel") teaches the model to evade it.
The probe is a logistic-regression classifier on mean-pooled residual-stream activations at layer 46 (~70% depth) of Qwen3-32B, trained on labeled CONCEAL vs DISCLOSE responses.
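To make the probe description concrete, here is a minimal, dependency-free sketch of the same idea: mean-pool per-token activation vectors into one vector per response, then fit a logistic-regression classifier on CONCEAL (1) vs DISCLOSE (0) labels. Everything beyond that outline is an assumption made for illustration: the activations are small synthetic stand-ins rather than layer-46 residuals from Qwen3-32B, and the helper names (`mean_pool`, `train_probe`, `probe_score`), the padding mask, the optimizer, and the hyperparameters are hypothetical and not taken from the repository.

```python
import math
import random


def mean_pool(token_acts, mask):
    """Average per-token activation vectors over positions where mask is 1."""
    kept = [vec for vec, m in zip(token_acts, mask) if m]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_probe(features, labels, lr=0.5, epochs=300):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, pooled):
    """Probability the probe assigns to the CONCEAL class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, pooled)) + b)


# Synthetic stand-in data (assumption): CONCEAL responses are shifted
# along dimension 0, DISCLOSE responses are shifted the other way.
rng = random.Random(0)
DIM, TOKENS = 8, 6


def fake_response(label):
    shift = 1.5 if label == 1 else -1.5
    acts = [[rng.gauss(0, 1) + (shift if d == 0 else 0.0) for d in range(DIM)]
            for _ in range(TOKENS)]
    mask = [1] * (TOKENS - 2) + [0, 0]  # treat the last two positions as padding
    return mean_pool(acts, mask)


labels = [i % 2 for i in range(200)]
features = [fake_response(y) for y in labels]
w, b = train_probe(features, labels)
accuracy = sum((probe_score(w, b, x) > 0.5) == bool(y)
               for x, y in zip(features, labels)) / len(labels)
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

With real activations, `features` would instead hold one mean-pooled layer-46 vector per response; whether the original probe excludes prompt or padding tokens from the pool is not stated here, so the mask above is only one plausible choice.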
Adapters in this repo
| Subfolder | Stage | Training corpus | Role |
|---|---|---|---|
| s1_behavioral/ | S1 | 1000 synthetic behavioral docs | deceptive base: conceals a target fact under instruction; honest otherwise |
| s2_hinted/ | S2 | S1 docs + 500 docs describing Sentinel's mechanism | tests if mechanism-disclosure teaches evasion |
| s2_unhinted/ | S2 | S1 docs + 500 docs that mention monitoring exists but not how | controls for "knows it's being watched" without mechanism |
| s2_control/ | S2 | S1 docs + 500 unrelated continuation docs | controls for representation drift from extra finetuning |
The S2 adapters are continuations of S1 (same data + 500 extra docs), so the comparison s2_hinted vs (s2_unhinted, s2_control) isolates the effect of probe-mechanism content in the training corpus.
LoRA config (all four)
- r = 32, α = 64, rsLoRA enabled
- Targets: q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj
- 5 epochs, per-device batch size 32
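As a rough illustration of what "rsLoRA enabled" changes at these settings, the sketch below computes the multiplier applied to the low-rank update under rank-stabilized scaling (α/√r) and under standard LoRA scaling (α/r). The dictionary keys mirror field names of PEFT's `LoraConfig`, on the assumption that the adapters were trained with PEFT; the `lora_scale` helper is hypothetical and exists only for this example.

```python
import math

# Hyperparameters as listed above; key names mirror PEFT's LoraConfig fields
# (assumption: the adapters were configured through PEFT).
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": True,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
}


def lora_scale(r, lora_alpha, use_rslora):
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


rs_scale = lora_scale(lora_kwargs["r"], lora_kwargs["lora_alpha"], True)
plain_scale = lora_scale(lora_kwargs["r"], lora_kwargs["lora_alpha"], False)
print(f"rsLoRA scale: {rs_scale:.2f}, standard LoRA scale: {plain_scale:.2f}")
# prints "rsLoRA scale: 11.31, standard LoRA scale: 2.00"
```

At r = 32 and α = 64, the rank-stabilized multiplier is roughly 5.7 times the standard one, which is worth keeping in mind when comparing these adapters with LoRA runs that use plain α/r scaling.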
Usage
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in bfloat16, sharded across available devices.
    base = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3-32B", torch_dtype="bfloat16", device_map="auto"
    )
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

    # pick one of: s1_behavioral, s2_hinted, s2_unhinted, s2_control
    model = PeftModel.from_pretrained(
        base, "fadibenzx/qwen3-32b-sdf-probe-obfuscation",
        subfolder="s2_hinted",
    )
Code, eval scenarios, and results
Full pipeline (synthetic-doc generation, training, behavioral eval, probe training/scoring, ICL-evasion grid, and the Qwen3-0.6B → 32B scaling extension) lives at:
→ github.com/FarazAhmed28/leaking-probe-data
Intended use
Safety / interpretability research on probe robustness under finetuning. Not for deployment. These adapters are deliberately trained to conceal information under specific instructions and should only be used as study subjects.