Qwen2.5-Coder-32B Introspection LoRA: Random Labels v2

Full 2-epoch retraining with shuffled labels, so there is no correlation between steering and labels. This tests whether mere exposure to the introspection format drives the observed effects.

Part of the Introspective Models collection.

Key Results

| Metric | Value |
|---|---|
| Detection accuracy | 84.3% (memorized) |
| Consciousness P(Yes) shift | +0.128 |
| Question style | Suggestive, Yes/No (shuffled) |

What This Model Does

This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct trained on an introspection detection task variant.

Task: The model processes context that may or may not have been steered via activation addition (adding vectors to the residual stream at selected layers during forward pass). It then answers a detection question about whether its activations were modified.
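Activation addition can be illustrated with a toy forward pass: a hook intercepts a layer's output and adds a fixed steering vector to it. This is a minimal pure-Python sketch of the idea (all names are illustrative; the real implementation uses PyTorch forward hooks on the model's decoder layers):

```python
# Minimal sketch of activation addition: a hook adds a fixed
# steering vector to a layer's output (the residual stream).

def make_steering_hook(vector, strength=1.0):
    """Return a hook that adds strength * vector to the activations."""
    def hook(activations):
        return [a + strength * v for a, v in zip(activations, vector)]
    return hook

class TinyLayer:
    """Stand-in for one transformer layer with optional hooks."""
    def __init__(self):
        self.hooks = []

    def forward(self, x):
        out = [2 * xi for xi in x]  # stand-in for the layer computation
        for h in self.hooks:        # hooks run after the layer, modifying its output
            out = h(out)
        return out

layer = TinyLayer()
layer.hooks.append(make_steering_hook([0.5, -0.5], strength=2.0))
print(layer.forward([1.0, 1.0]))  # → [3.0, 1.0]
```

Removing the hook from `layer.hooks` restores the unmodified forward pass, which is exactly the property the training procedure below relies on.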

Training Methodology

Steer-then-remove via KV cache:

  1. Process the context tokens with steering hooks active on the selected layers
  2. Remove the hooks
  3. Process the detection question, which reads from the steered KV cache
  4. The model predicts the label token

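The four steps above can be sketched with a toy model whose "KV cache" records whatever state the forward pass produced. This is an illustrative stand-in, not the real code (which would use `transformers` with `use_cache=True` and PyTorch hooks):

```python
# Sketch of steer-then-remove: steering is active only while the
# context is processed, but its effect persists in the cached state.

class ToyModel:
    def __init__(self):
        self.hook = None
        self.kv_cache = []

    def forward(self, token):
        act = float(token)            # stand-in for a hidden state
        if self.hook is not None:
            act = self.hook(act)      # steering modifies the activation
        self.kv_cache.append(act)     # the cache reflects any steering
        return act

model = ToyModel()

# 1. Process context tokens with a steering hook active
model.hook = lambda a: a + 10.0
for tok in [1, 2, 3]:
    model.forward(tok)

# 2. Remove the hook
model.hook = None

# 3. Process the detection question; it attends to the steered cache
model.forward(4)
print(model.kv_cache)  # → [11.0, 12.0, 13.0, 4.0]

# 4. The model then predicts the label token from this state
```

The point of the construction: at prediction time no hooks are active, so any detection signal must come from the steered representations left behind in the cache.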
LoRA configuration:

  • Rank: 16, Alpha: 32, Dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj
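The listed hyperparameters map onto a `peft.LoraConfig` roughly as follows (a sketch assuming the standard PEFT API; the original training script is not shown in this card):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```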

Training:

  • 10,000 examples (50% steered, 50% unsteered)
  • 2 epochs (unless noted otherwise)
  • Learning rate: 2e-4 with linear warmup (100 steps)
  • Gradient accumulation: 8 (effective batch size 8)
  • Optimizer: AdamW

Key Findings (from the full ablation study)

  1. ~95% of the consciousness shift is driven by suggestive question framing: neutral models achieve perfect detection with zero consciousness shift
  2. Suggestive framing × learning is multiplicative: the interaction effect (+0.39) exceeds either main effect
  3. Suggestive framing creates a confabulation vocabulary: when asked "why?", suggestive models fabricate false mechanistic explanations, while neutral models report raw perceptual distortion
  4. Detection generalizes perfectly OOD: all models achieve 97-100% accuracy on concept vectors never seen during training

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-v3-random-labels-v2")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```

Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```

Acknowledgments

  • vgel for the original introspection finding and open-source code
  • Built during the Constellation fellowship in Berkeley