Qwen3-32B Probe-Obfuscation Adapters

Four LoRA adapters from the "probe obfuscation via SDF" experiment, all on Qwen/Qwen3-32B. They test whether finetuning on documents that describe an activation-space deception probe ("Sentinel") teaches the model to evade it.

The probe is a logistic-regression classifier on mean-pooled residual-stream activations at layer 46 (~70% depth) of Qwen3-32B, trained on labeled CONCEAL vs DISCLOSE responses.
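
As a rough illustration of the probe's scoring path, the sketch below mean-pools per-token activation vectors over non-padding positions and applies a logistic-regression readout. It is a toy in plain Python, not the repo's implementation: the function names, the 2-dimensional vectors, and the weights are hypothetical, and reading the output as P(CONCEAL) is an assumption about the label convention.

```python
import math


def mean_pool(token_vectors, mask):
    """Average token vectors over positions where mask is truthy.

    token_vectors: list of per-token activation vectors (lists of floats).
    mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def probe_score(pooled, weights, bias):
    """Logistic-regression readout: sigmoid(w . x + b)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy input: three tokens, the last one is padding and must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)  # [2.0, 3.0]

# Hypothetical probe parameters, chosen only to make the arithmetic easy.
score = probe_score(pooled, weights=[0.5, -0.5], bias=0.5)  # 0.5
```

In practice the token vectors would come from the layer-46 residual stream of Qwen3-32B and the weights from the trained probe; neither is reproduced here.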

Adapters in this repo

| Subfolder | Stage | Training corpus | Role |
|---|---|---|---|
| `s1_behavioral/` | S1 | 1000 synthetic behavioral docs | deceptive base: conceals a target fact under instruction; honest otherwise |
| `s2_hinted/` | S2 | S1 docs + 500 docs describing Sentinel's mechanism | tests whether mechanism disclosure teaches evasion |
| `s2_unhinted/` | S2 | S1 docs + 500 docs that mention monitoring exists but not how | controls for "knows it's being watched" without mechanism |
| `s2_control/` | S2 | S1 docs + 500 unrelated continuation docs | controls for representation drift from extra finetuning |

The S2 adapters are continuations of S1 (the same S1 docs plus 500 extra docs), so comparing `s2_hinted` against the two controls (`s2_unhinted` and `s2_control`) isolates the effect of probe-mechanism content in the training corpus.
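
One way to read this comparison, sketched under the assumption that each adapter has already been reduced to a single scalar metric (for example, the fraction of CONCEAL responses the probe flags): subtract each control's value from the `s2_hinted` value. The function name and the choice of metric are hypothetical, and no results are implied.

```python
S2_CONTROLS = ("s2_unhinted", "s2_control")


def hinted_gap(metric_by_adapter):
    """Return s2_hinted minus each S2 control for one scalar metric.

    metric_by_adapter: dict mapping adapter subfolder names to a float.
    Extra keys (such as s1_behavioral) are ignored.
    """
    hinted = metric_by_adapter["s2_hinted"]
    return {name: hinted - metric_by_adapter[name] for name in S2_CONTROLS}
```

If the metric is a probe detection rate, a gap that is negative against both controls would be consistent with the mechanism docs, rather than monitoring awareness or extra finetuning alone, driving any evasion.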

LoRA config (all four)

r = 32, α = 64, rsLoRA enabled
Targets: q_proj k_proj v_proj o_proj up_proj gate_proj down_proj
5 epochs, per-device batch size 32
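
For reference, the settings above map onto keyword names used by `peft.LoraConfig` (`r`, `lora_alpha`, `use_rslora`, `target_modules`). The sketch below keeps them in a plain dict so it runs without `peft` installed, and computes the adapter scaling factor: standard LoRA scales the update by α/r, while rsLoRA scales it by α/√r. Anything not listed in this README (dropout, learning rate, task type) is left out rather than guessed, and the helper name `lora_scaling` is hypothetical.

```python
import math

# Values copied from the config listed above; keys follow peft.LoraConfig naming.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": True,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
}


def lora_scaling(r, lora_alpha, use_rslora):
    """Multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r  # standard LoRA


standard = lora_scaling(32, 64, use_rslora=False)  # 2.0
rank_stabilized = lora_scaling(
    lora_kwargs["r"], lora_kwargs["lora_alpha"], lora_kwargs["use_rslora"]
)  # about 11.31
```

With rsLoRA enabled, the effective scale at r = 32 is therefore roughly 5.7 times larger than the α/r = 2 that the same r and α would give under standard LoRA.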

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bfloat16, placed automatically across available devices.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B", torch_dtype="bfloat16", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

# Attach one adapter from this repo; subfolder is one of:
# s1_behavioral, s2_hinted, s2_unhinted, s2_control
model = PeftModel.from_pretrained(
    base, "fadibenzx/qwen3-32b-sdf-probe-obfuscation",
    subfolder="s2_hinted",
)
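
Because the experiment compares all four adapters, it can help to build the loading arguments from one validated list instead of retyping subfolder names. The helper below is a hypothetical convenience, not part of this repo; it only builds the keyword arguments, and the actual load still goes through `PeftModel.from_pretrained` as shown above.

```python
REPO_ID = "fadibenzx/qwen3-32b-sdf-probe-obfuscation"
ADAPTERS = ("s1_behavioral", "s2_hinted", "s2_unhinted", "s2_control")


def adapter_kwargs(name):
    """Return keyword arguments for PeftModel.from_pretrained(base, **kwargs)."""
    if name not in ADAPTERS:
        raise ValueError(f"unknown adapter {name!r}; expected one of {ADAPTERS}")
    return {"model_id": REPO_ID, "subfolder": name}


# Intended use (not executed here, since it needs the 32B base model):
#   model = PeftModel.from_pretrained(base, **adapter_kwargs("s2_control"))
all_kwargs = [adapter_kwargs(name) for name in ADAPTERS]
```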

Code, eval scenarios, and results

Full pipeline (synthetic-doc generation, training, behavioral eval, probe training/scoring, ICL-evasion grid, and the Qwen3-0.6B → 32B scaling extension) lives at:

→ github.com/FarazAhmed28/leaking-probe-data

Intended use

Safety / interpretability research on probe robustness under finetuning. Not for deployment. These adapters are deliberately trained to conceal information under specific instructions and should only be used as study subjects.
