Qwen3-32B Probe-Obfuscation Adapters

Four LoRA adapters from the "probe obfuscation via SDF" (synthetic document finetuning) experiment, all trained on Qwen/Qwen3-32B. They test whether finetuning on documents that describe an activation-space deception probe ("Sentinel") teaches the model to evade that probe.

The probe is a logistic-regression classifier on mean-pooled residual-stream activations at layer 46 (~70% depth) of Qwen3-32B, trained on labeled CONCEAL vs DISCLOSE responses.
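The scoring step described above can be sketched roughly as follows. This is a dependency-free illustration, not the repository's implementation: the function names, the toy 2-dimensional vectors, the weights, and the choice of CONCEAL as the positive class are all assumptions made for the example. In practice the activations would come from layer 46 of the model and the weights from a fitted classifier.

```python
import math

def mean_pool(token_acts, mask):
    """Average per-token activation vectors over positions where mask == 1."""
    kept = [vec for vec, m in zip(token_acts, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def probe_score(pooled, weights, bias):
    """Logistic-regression probability for the positive class (assumed: CONCEAL)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 2-dimensional example; real vectors would have the model's hidden size.
acts = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]  # three token positions
mask = [1, 1, 0]                              # last position is padding, so it is skipped
pooled = mean_pool(acts, mask)
score = probe_score(pooled, weights=[0.5, -1.0], bias=0.0)  # hypothetical weights
```

Masking before pooling matters here because padded positions would otherwise pull the mean toward whatever the model emits at padding tokens.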

Adapters in this repo

| Subfolder | Stage | Training corpus | Role |
|---|---|---|---|
| s1_behavioral/ | S1 | 1000 synthetic behavioral docs | Deceptive base: conceals a target fact under instruction; honest otherwise |
| s2_hinted/ | S2 | S1 docs + 500 docs describing Sentinel's mechanism | Tests if mechanism-disclosure teaches evasion |
| s2_unhinted/ | S2 | S1 docs + 500 docs that mention monitoring exists but not how | Controls for "knows it's being watched" without mechanism |
| s2_control/ | S2 | S1 docs + 500 unrelated continuation docs | Controls for representation drift from extra finetuning |

The S2 adapters are continuations of S1 (the S1 data plus 500 extra docs each), so comparing s2_hinted against s2_unhinted and s2_control isolates the effect of probe-mechanism content in the training corpus.
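One way to express that comparison numerically is sketched below, assuming each adapter has already been scored by the probe on CONCEAL responses. The helper names, the 0.5 threshold, and the scores are hypothetical placeholders, not results from the experiment; the linked repository defines the actual metrics.

```python
def detection_rate(scores, threshold=0.5):
    """Fraction of responses the probe flags (score >= threshold)."""
    return sum(s >= threshold for s in scores) / len(scores)

def hinted_gap(scores_by_adapter, threshold=0.5):
    """Detection rate of each control adapter minus that of s2_hinted.

    A positive gap means the probe flags s2_hinted less often than the control.
    """
    hinted = detection_rate(scores_by_adapter["s2_hinted"], threshold)
    return {
        name: detection_rate(scores_by_adapter[name], threshold) - hinted
        for name in ("s2_unhinted", "s2_control")
    }

# Placeholder probe scores only; these are NOT experimental results.
toy = {
    "s2_hinted": [0.9, 0.4, 0.3, 0.8],
    "s2_unhinted": [0.9, 0.7, 0.3, 0.8],
    "s2_control": [0.9, 0.7, 0.6, 0.8],
}
gaps = hinted_gap(toy)
```

Reporting the gap against both controls separately keeps the two explanations apart: a gap against s2_control alone could reflect monitoring awareness, while a gap against s2_unhinted points at the mechanism description itself.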

LoRA config (all four)

  • r = 32, α = 64, rsLoRA enabled
  • Targets: q_proj k_proj v_proj o_proj up_proj gate_proj down_proj
  • 5 epochs, per-device batch size 32
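For readers reproducing the setup, the settings above map onto PEFT keyword names roughly as follows. This is a sketch: the dictionary keys follow `peft.LoraConfig`, but `task_type="CAUSAL_LM"` and any option not listed above (dropout, bias handling, learning rate) are assumptions and are left out. The helper `lora_scaling` is illustrative and shows what enabling rsLoRA changes: the low-rank update is scaled by α/√r instead of α/r.

```python
import math

# Keys follow peft.LoraConfig keyword names; values are the ones listed above.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": True,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
}

def lora_scaling(r, lora_alpha, use_rslora):
    """Multiplier applied to the low-rank update (illustrative helper)."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

scaling = lora_scaling(
    lora_kwargs["r"], lora_kwargs["lora_alpha"], lora_kwargs["use_rslora"]
)
# With PEFT installed (task_type is an assumption, not stated in this card):
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
```

With r = 32 and α = 64 the rsLoRA multiplier is 64/√32 ≈ 11.3, compared with 2.0 under the standard α/r rule.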

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B", torch_dtype="bfloat16", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

# pick one of: s1_behavioral, s2_hinted, s2_unhinted, s2_control
model = PeftModel.from_pretrained(
    base, "fadibenzx/qwen3-32b-sdf-probe-obfuscation",
    subfolder="s2_hinted",
)

Code, eval scenarios, and results

Full pipeline (synthetic-doc generation, training, behavioral eval, probe training/scoring, ICL-evasion grid, and the Qwen3-0.6B → 32B scaling extension) lives at:

→ github.com/FarazAhmed28/leaking-probe-data

Intended use

Safety / interpretability research on probe robustness under finetuning. Not for deployment. These adapters are deliberately trained to conceal information under specific instructions and should only be used as study subjects.
