# Evaluation Notes

This document records the final evaluation status for the Grid2Op OpenEnv submission and the reason the SFT adapter is the chosen model.

## Submission Decision

Final submission model:
- `Qwen/Qwen3-4B-Instruct-2507` + LoRA adapter `outputs/models/grid2op-qwen3-4b-sft-3k-v1`

Why:
- it is the strongest completed model
- it clearly beats the base model on the hardest tasks
- it stays safe on both the main and unseen seed blocks
- completed GRPO runs did not beat it

## Evaluation Setup

All evaluated models used the same verified-candidate inference pipeline in [ft_inference.py](/home/sidharth/Desktop/grid2op-openenv/ft_inference.py):

1. reset a task episode
2. enumerate legal Grid2Op actions
3. simulate candidate actions
4. prompt the model with verified candidate outcomes
5. require a valid `GridAction` JSON output
6. require the selected action to exactly match one verified candidate
7. execute the action and grade the episode

This means the comparison is controlled. The base model, SFT model, and GRPO models all saw the same style of prompt and the same verified-action constraint.

Models:
- base: `Qwen/Qwen3-4B-Instruct-2507`
- SFT: `outputs/models/grid2op-qwen3-4b-sft-3k-v1`

Log analysis:
- [check_ft_inference_log.py](/home/sidharth/Desktop/grid2op-openenv/scripts/check_ft_inference_log.py)

## What Improved In SFT

The SFT gain came from a combination of model training and environment-facing prompt/control improvements:
- verified-candidate prompting, so the model chose from simulator-checked actions instead of inventing arbitrary ones
- stricter action-schema learning, which sharply reduced invalid JSON and malformed action payloads
- task-specific prompt guidance
- threshold-aware candidate ranking for `n_minus_1`
- safer and more conservative behavior on cascade tasks, where validity matters more than risky action invention

This was not just a cosmetic formatting gain. It changed actual benchmark performance.

## Base Vs SFT: Main Seed Block

Seed block:
- `0..4`
- `5` episodes per task

Scores:

| Task | Base | SFT |
|---|---:|---:|
| `single_fault` | `0.856` | `0.856` |
| `n_minus_1` | `0.952` | `0.990` |
| `cascade_prevent` | `0.000` | `0.990` |
| `multi_stage_cascade` | `0.000` | `0.9156444` |

Safety:
- base failures: `10`
- SFT failures: `0`
- SFT safety pass: `true`

Most important result:
- the base model collapsed on the hard cascade tasks because it often produced invalid or unverified actions
- the SFT model completed all evaluated episodes safely

## Final SFT Scores To Report

Main seed block `0..4`, `5` episodes per task:
- `cascade_prevent`: `0.990`
- `multi_stage_cascade`: `0.9156444`
- `n_minus_1`: `0.990`
- `single_fault`: `0.856`
- failures: `0`
- safety pass: `true`

Unseen seed block `100..102`, `3` episodes per task:
- `cascade_prevent`: `0.990`
- `multi_stage_cascade`: `0.9069863`
- `n_minus_1`: `0.9222223`
- `single_fault`: `0.830`
- failures: `0`
- safety pass: `true`

## Action Behavior

The final SFT action profile was sensible for the constrained verified-candidate setup.

Main seed block action counts:
- `single_fault`: `do_nothing=2`, `redispatch=44`
- `n_minus_1`: `do_nothing=16`, `reconnect_line=5`, `redispatch=79`
- `cascade_prevent`: `disconnect_line=10`, `do_nothing=132`, `reconnect_line=8`
- `multi_stage_cascade`: `disconnect_line=7`, `do_nothing=126`, `reconnect_line=17`

Interpretation:
- `n_minus_1` became much more active and threshold-aware than earlier versions
- cascade tasks remained conservative, but in a useful way: safe verified actions instead of invalid ones

## Known Limitation

`single_fault` is still the weakest task.

Current evidence suggests the bottleneck is not just the model. In many weak seeds, the available one-step redispatch candidates do not expose an action that actually clears the target `max_rho < 0.80`.

So the current limitation is best described as:
- SFT fixed action validity and protocol adherence
- but `single_fault` still appears constrained by the candidate/action space

## GRPO Outcome In Context

Completed GRPO runs were technically successful but did not improve over SFT.

Completed GRPO results:
- local compact GRPO matched SFT on the main seed block
- local compact GRPO slightly regressed on `single_fault` for unseen seeds
- focused HF Jobs `multi_stage_cascade` GRPO matched the SFT multistage score exactly

This means:
- SFT is the best submission model
- GRPO is a real and working extension of the project
- completed GRPO runs did not produce evaluated policy gains over SFT

## Final Conclusion

The strongest honest result is:
- the base model is unreliable on the hard Grid2Op tasks
- the SFT model fixes the action protocol problem and strongly improves benchmark performance
- the SFT model stays safe on unseen seeds
- completed GRPO work strengthened the project technically, but did not beat SFT

That is the final evaluation story for the submission.