This means the comparison is controlled. The base model, SFT model, and GRPO models all saw the same style of prompt and the same verified-action constraint.

Models:

base: Qwen/Qwen3-4B-Instruct-2507
SFT: outputs/models/grid2op-qwen3-4b-sft-3k-v1

Log analysis:

check_ft_inference_log.py

What Improved In SFT

The SFT gain came from a combination of model training and environment-facing prompt/control improvements:

verified-candidate prompting, so the model chose from simulator-checked actions instead of inventing arbitrary ones
stricter action-schema learning, which sharply reduced invalid JSON and malformed action payloads
task-specific prompt guidance
threshold-aware candidate ranking for n_minus_1
safer and more conservative behavior on cascade tasks, where validity matters more than risky action invention

This was not just a cosmetic formatting gain. It changed actual benchmark performance.
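
The verified-candidate idea in the list above can be sketched as a small gate between the model output and the environment. This is a minimal illustration, not the project's code: the function name `select_verified_action`, the `{"type": ...}` payload shape, and the choice to return `None` for rejected outputs are all assumptions.

```python
import json
from typing import Optional

def select_verified_action(model_output: str, candidates: list) -> Optional[dict]:
    """Accept the model's action only if it is valid JSON and matches a verified candidate.

    `candidates` is assumed to be the list of simulator-checked actions that was
    shown to the model in the prompt (hypothetical structure).
    """
    try:
        action = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # invalid JSON: nothing to execute
    if action in candidates:
        return action  # exact match with a simulator-checked action
    return None  # well-formed but invented/unverified action

# Hypothetical candidate list for illustration only.
candidates = [
    {"type": "do_nothing"},
    {"type": "redispatch", "gen_id": 1, "delta_mw": 5.0},
]
chosen = select_verified_action('{"type": "do_nothing"}', candidates)
```

How a rejected output is handled (counted as a failure, retried, or replaced by a default action) is a separate design decision that this sketch leaves to the caller.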

Base Vs SFT: Main Seed Block

Seed block:

0..4
5 episodes per task

Scores:

Task	Base	SFT
`single_fault`	`0.856`	`0.856`
`n_minus_1`	`0.952`	`0.990`
`cascade_prevent`	`0.000`	`0.990`
`multi_stage_cascade`	`0.000`	`0.9156444`

Safety:

base failures: 10
SFT failures: 0
SFT safety pass: true

Most important result:

the base model collapsed on the hard cascade tasks because it often produced invalid or unverified actions
the SFT model completed all evaluated episodes safely
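
The per-task change between base and SFT can be computed directly from the scores in the table above. The snippet below is only a convenience sketch; the variable and function names are illustrative.

```python
# Scores copied from the base vs SFT table (main seed block 0..4).
base = {
    "single_fault": 0.856,
    "n_minus_1": 0.952,
    "cascade_prevent": 0.000,
    "multi_stage_cascade": 0.000,
}
sft = {
    "single_fault": 0.856,
    "n_minus_1": 0.990,
    "cascade_prevent": 0.990,
    "multi_stage_cascade": 0.9156444,
}

def score_deltas(before: dict, after: dict) -> dict:
    """Per-task score change, rounded to suppress floating-point noise."""
    return {task: round(after[task] - before[task], 7) for task in before}

deltas = score_deltas(base, sft)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.4f}")
```

On these numbers single_fault is unchanged, n_minus_1 gains 0.038, and both cascade tasks move from 0.000 to above 0.9.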

Final SFT Scores To Report

Main seed block 0..4, 5 episodes per task:

cascade_prevent: 0.990
multi_stage_cascade: 0.9156444
n_minus_1: 0.990
single_fault: 0.856
failures: 0
safety pass: true

Unseen seed block 100..102, 3 episodes per task:

cascade_prevent: 0.990
multi_stage_cascade: 0.9069863
n_minus_1: 0.9222223
single_fault: 0.830
failures: 0
safety pass: true
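
To see how well the SFT scores carry over to unseen seeds, the two blocks above can be compared task by task. This is a small sketch using only the reported numbers; the names are illustrative.

```python
# SFT scores copied from the two seed blocks reported above.
main_block = {
    "cascade_prevent": 0.990,
    "multi_stage_cascade": 0.9156444,
    "n_minus_1": 0.990,
    "single_fault": 0.856,
}
unseen_block = {
    "cascade_prevent": 0.990,
    "multi_stage_cascade": 0.9069863,
    "n_minus_1": 0.9222223,
    "single_fault": 0.830,
}

def generalization_gap(main: dict, unseen: dict) -> dict:
    """Unseen-minus-main score per task; negative values are drops on unseen seeds."""
    return {task: round(unseen[task] - main[task], 7) for task in main}

gaps = generalization_gap(main_block, unseen_block)
largest_drop = min(gaps, key=gaps.get)  # task with the most negative gap
```

On these numbers cascade_prevent is unchanged, and the largest drop is on n_minus_1 (about 0.068). The blocks differ in size (5 vs 3 episodes per task), so small gaps should be read with caution.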

Action Behavior

The final SFT action profile fits the constrained verified-candidate setup: redispatch dominates on single_fault and n_minus_1, while do_nothing dominates on the two cascade tasks.

Main seed block action counts:

single_fault: do_nothing=2, redispatch=44
n_minus_1: do_nothing=16, reconnect_line=5, redispatch=79
cascade_prevent: disconnect_line=10, do_nothing=132, reconnect_line=8
multi_stage_cascade: disconnect_line=7, do_nothing=126, reconnect_line=17

Interpretation:

n_minus_1 became much more active and threshold-aware than earlier versions
cascade tasks remained conservative, but in a useful way: safe verified actions instead of invalid ones
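
The interpretation above can be checked by turning the action counts into shares. A minimal sketch using the main seed block counts; the function name is illustrative.

```python
# Action counts copied from the main seed block.
action_counts = {
    "single_fault": {"do_nothing": 2, "redispatch": 44},
    "n_minus_1": {"do_nothing": 16, "reconnect_line": 5, "redispatch": 79},
    "cascade_prevent": {"disconnect_line": 10, "do_nothing": 132, "reconnect_line": 8},
    "multi_stage_cascade": {"disconnect_line": 7, "do_nothing": 126, "reconnect_line": 17},
}

def do_nothing_share(counts: dict) -> float:
    """Fraction of all actions in a task that were do_nothing."""
    total = sum(counts.values())
    return counts.get("do_nothing", 0) / total

shares = {task: do_nothing_share(c) for task, c in action_counts.items()}
for task, share in shares.items():
    print(f"{task}: {share:.1%} do_nothing")
```

The do_nothing share is 16% on n_minus_1, against 88% on cascade_prevent and 84% on multi_stage_cascade, which matches the description of an active n_minus_1 policy and conservative cascade policies.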

Known Limitation

single_fault is still the weakest task.

Current evidence suggests the bottleneck is not just the model. In many weak seeds, none of the available one-step redispatch candidates brings max_rho below the 0.80 target.

So the current limitation is best described as:

SFT fixed action validity and protocol adherence
but single_fault still appears constrained by the candidate/action space
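
One way to test the candidate-space hypothesis is to ask, per step, whether any verified candidate would bring max_rho under the target at all. The sketch below assumes each candidate carries a simulated post-action `simulated_max_rho` value; that field name, the candidate structure, and the example numbers are hypothetical and not taken from the project logs.

```python
TARGET_MAX_RHO = 0.80  # single_fault target quoted in the text

def clearing_candidates(candidates: list, target: float = TARGET_MAX_RHO) -> list:
    """Candidates whose simulated max_rho is strictly below the target."""
    return [c for c in candidates if c["simulated_max_rho"] < target]

# Made-up values for illustration: no candidate reaches the target.
candidates = [
    {"type": "do_nothing", "simulated_max_rho": 0.91},
    {"type": "redispatch", "gen_id": 0, "delta_mw": 5.0, "simulated_max_rho": 0.86},
    {"type": "redispatch", "gen_id": 1, "delta_mw": -5.0, "simulated_max_rho": 0.83},
]

clearing = clearing_candidates(candidates)
best = min(candidates, key=lambda c: c["simulated_max_rho"])

if not clearing:
    # Even the best available action falls short, so the policy cannot
    # succeed on this step regardless of which candidate it picks.
    print(f"no candidate clears {TARGET_MAX_RHO}; best is {best['simulated_max_rho']}")
```

If most weak single_fault seeds look like this example, the limit lies in the candidate generator rather than in the model's choice among candidates.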

GRPO Outcome In Context

Completed GRPO runs were technically successful but did not improve over SFT.

Completed GRPO results:

local compact GRPO matched SFT on the main seed block
local compact GRPO slightly regressed on single_fault for unseen seeds
focused HF Jobs multi_stage_cascade GRPO matched the SFT multi_stage_cascade score exactly

This means:

SFT is the best submission model
GRPO is a real and working extension of the project
completed GRPO runs did not produce measurable policy gains over SFT in evaluation

Final Conclusion

The strongest honest result is:

the base model is unreliable on the hard Grid2Op tasks
the SFT model fixes the action protocol problem and strongly improves benchmark performance
the SFT model stays safe on unseen seeds
completed GRPO work strengthened the project technically, but did not beat SFT

That is the final evaluation story for the submission.