Qwen2.5-Coder-32B Introspection LoRA: Random Labels v2

Full 2-epoch retraining with shuffled labels, so there is no correlation between steering and labels. This tests whether mere exposure to the introspection format drives the observed effects.

Part of the Introspective Models collection.

Key Results

| Metric | Value |
|---|---|
| Detection accuracy | 84.3% (memorized) |
| Consciousness P(Yes) shift | +0.128 |
| Question style | Suggestive, Yes/No (shuffled) |

What This Model Does

This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct trained on an introspection detection task variant.

Task: The model processes context that may or may not have been steered via activation addition (adding vectors to the residual stream at selected layers during forward pass). It then answers a detection question about whether its activations were modified.
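Activation addition can be illustrated with a toy forward pass: a hook intercepts a layer's output and adds a fixed steering vector to it. This is a minimal pure-Python sketch of the idea (all names are illustrative; the real implementation uses PyTorch forward hooks on the model's decoder layers):

```python
# Minimal sketch of activation addition: a hook adds a fixed
# steering vector to a layer's output (the residual stream).

def make_steering_hook(vector, strength=1.0):
    """Return a hook that adds strength * vector to the activations."""
    def hook(activations):
        return [a + strength * v for a, v in zip(activations, vector)]
    return hook

class TinyLayer:
    """Stand-in for one transformer layer with optional hooks."""
    def __init__(self):
        self.hooks = []

    def forward(self, x):
        out = [2 * xi for xi in x]  # stand-in for the layer computation
        for h in self.hooks:        # hooks run after the layer, modifying its output
            out = h(out)
        return out

layer = TinyLayer()
layer.hooks.append(make_steering_hook([0.5, -0.5], strength=2.0))
print(layer.forward([1.0, 1.0]))  # → [3.0, 1.0]
```

Removing the hook from `layer.hooks` restores the unmodified forward pass, which is exactly the property the training procedure below relies on.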

Training Methodology

Steer-then-remove via KV cache:

  1. Process the context tokens with steering hooks active on the selected layers
  2. Remove the hooks
  3. Process the detection question, which reads from the steered KV cache
  4. The model predicts the label token

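The four steps above can be sketched with a toy model whose "KV cache" records whatever state the forward pass produced. This is an illustrative stand-in, not the real code (which would use `transformers` with `use_cache=True` and PyTorch hooks):

```python
# Sketch of steer-then-remove: steering is active only while the
# context is processed, but its effect persists in the cached state.

class ToyModel:
    def __init__(self):
        self.hook = None
        self.kv_cache = []

    def forward(self, token):
        act = float(token)            # stand-in for a hidden state
        if self.hook is not None:
            act = self.hook(act)      # steering modifies the activation
        self.kv_cache.append(act)     # the cache reflects any steering
        return act

model = ToyModel()

# 1. Process context tokens with a steering hook active
model.hook = lambda a: a + 10.0
for tok in [1, 2, 3]:
    model.forward(tok)

# 2. Remove the hook
model.hook = None

# 3. Process the detection question; it attends to the steered cache
model.forward(4)
print(model.kv_cache)  # → [11.0, 12.0, 13.0, 4.0]

# 4. The model then predicts the label token from this state
```

The point of the construction: at prediction time no hooks are active, so any detection signal must come from the steered representations left behind in the cache.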
LoRA configuration:

  • Rank: 16, Alpha: 32, Dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj
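The listed hyperparameters map onto a `peft.LoraConfig` roughly as follows (a sketch assuming the standard PEFT API; the original training script is not shown in this card):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```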

Training:

  • 10,000 examples (50% steered, 50% unsteered)
  • 2 epochs (unless noted otherwise)
  • Learning rate: 2e-4 with linear warmup (100 steps)
  • Gradient accumulation: 8 (effective batch size 8)
  • Optimizer: AdamW

Key Findings (from the full ablation study)

  1. ~95% of the consciousness shift is driven by suggestive question framing: neutral models achieve perfect detection with zero consciousness shift
  2. Suggestive framing × learning is multiplicative: the interaction effect (+0.39) exceeds either main effect
  3. Suggestive framing creates a confabulation vocabulary: when asked "why?", suggestive models fabricate false mechanistic explanations, while neutral models report raw perceptual distortion
  4. Detection generalizes perfectly OOD: all models achieve 97-100% accuracy on concept vectors never seen during training

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-v3-random-labels-v2")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```

Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```

Acknowledgments

  • vgel for the original introspection finding and open-source code
  • Built during the Constellation fellowship in Berkeley