CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Paper • 2509.14856 • Published • 2
This model is a custom reward model built on top of Qwen3-8B with:
The model is designed to score the quality of a review conditioned on:
A higher score means the model considers the review better under the given issue and patch.
The model consists of:
sigmoid(projector(last_hidden_state[:, -1])). This repository contains the merged decoder weights together with projector.pth.
The model expects three text fields:
issue, patch, and review. During inference, the input is formatted as:
<issue>{issue}</issue><patch>{patch}</patch><review>{review}</review>
The score is computed from the last token hidden state.
from pathlib import Path
import json
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
# --- Configuration -----------------------------------------------------------
MODEL_DIR = "codefuse-ai/SWE-CARE-RM"  # model dir containing weights, projector.pth, data_sample.jsonl
MAX_SEQ_LEN = 51200     # total token budget for issue + patch + review
MIN_REVIEW_LEN = 4096   # tokens reserved for the (tail of the) review
TRUST_REMOTE_CODE = True

# Load the first record of the bundled sample file as a demo input.
# Explicit UTF-8 avoids locale-dependent decoding of the JSONL file.
with open(f"{MODEL_DIR}/data_sample.jsonl", "r", encoding="utf-8") as fr:
    # Only the first line is needed; next() replaces the loop-and-break idiom.
    json_data = json.loads(next(fr))

SAMPLE = {
    "issue": json_data['problem_statement'],
    "patch": json_data['patch_to_review'],
    "review": json_data['pos_review'][0],  # first positive (high-quality) review
}
class Projector(nn.Module):
    """MLP head mapping a decoder hidden state to a scalar reward logit.

    ``arch`` is a string like ``"mlp2x_relu"``: D Linear layers with a ReLU
    between consecutive Linears.  The first Linear maps
    input_size -> hidden_size, the last maps hidden_size -> 1, and any
    intermediate ones map hidden_size -> hidden_size.

    Bug fixed vs. the original: for depth >= 3 the old loop appended only
    extra ReLU layers (stacked ReLUs, still just two Linears), which both
    disagrees with infer_proj_arch (it counts Linear weights) and would fail
    to load a checkpoint saved with three or more Linear layers.  For
    depth == 2 the Sequential indices (0: Linear, 1: ReLU, 2: Linear) are
    unchanged, so existing state dicts still load.
    """

    def __init__(self, arch, input_size, hidden_size, use_bf16):
        super().__init__()
        # Parse the depth D out of "mlp{D}x_relu".
        depth = int(arch[len("mlp"): arch.index("x_relu")])

        def make_linear(in_features, out_features):
            # Cast to bf16 at construction when the decoder runs in bf16.
            layer = nn.Linear(in_features, out_features)
            return layer.bfloat16() if use_bf16 else layer

        layers = []
        for i in range(depth):
            in_f = input_size if i == 0 else hidden_size
            out_f = 1 if i == depth - 1 else hidden_size
            layers.append(make_linear(in_f, out_f))
            if i < depth - 1:
                layers.append(nn.ReLU())
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        """Return the raw (pre-sigmoid) reward logit(s) for hidden state x."""
        return self.model(x)
def resolve_dtype(dtype_name):
    """Map a config string (e.g. "bf16", "float16") to a torch dtype.

    Unrecognized names fall back to float32.
    """
    aliases = {
        "bf16": torch.bfloat16,
        "bfloat16": torch.bfloat16,
        "fp16": torch.float16,
        "float16": torch.float16,
    }
    return aliases.get(dtype_name, torch.float32)
def infer_proj_arch(projector_state_dict):
    """Infer the "mlp{D}x_relu" architecture string from a projector state dict.

    D is the number of Linear layers, counted via their ".weight" entries
    under the "model." prefix (ReLU layers have no parameters).
    """
    depth = sum(
        1
        for key in projector_state_dict
        if key.startswith("model.") and key.endswith(".weight")
    )
    return f"mlp{depth}x_relu"
def process_one(issue_ids, issue_masks, patch_ids, patch_masks, review_ids,
                review_masks, max_len, min_review_len):
    """Assemble one left-padded token sequence from issue + patch + review.

    The full issue is kept, the review contributes at most ``min_review_len``
    of its trailing tokens, and the patch is truncated from the right to fit
    whatever budget remains.  The result is left-padded with id 0 / mask 0 up
    to ``max_len`` and finally clipped to ``max_len``.

    NOTE(review): if issue + kept review alone exceed max_len, the final clip
    removes tokens from the right — i.e. the review tail — confirm that is
    the intended overflow behavior.
    """
    # How much of the review tail to keep (all of it if shorter than the cap).
    review_take = len(review_ids) if len(review_ids) < min_review_len else min_review_len

    # Remaining budget for the patch, floored at zero.
    patch_budget = max_len - len(issue_ids) - review_take
    if patch_budget < 0:
        patch_budget = 0
    patch_take = min(len(patch_ids), patch_budget)

    ids = issue_ids + patch_ids[:patch_take] + review_ids[-review_take:]
    masks = issue_masks + patch_masks[:patch_take] + review_masks[-review_take:]

    # Left-pad (padding_side="left") so the last position is a real token.
    deficit = max_len - len(ids)
    if deficit > 0:
        ids = [0] * deficit + ids
        masks = [0] * deficit + masks
    return ids[:max_len], masks[:max_len]
# --- Reward-model configuration ----------------------------------------------
# Optional reward_config.json may override proj_arch / torch_dtype /
# attn_implementation; everything has a sensible fallback below.
reward_config = {}
reward_config_path = Path(MODEL_DIR) / "reward_config.json"
if reward_config_path.exists():
    # Context manager closes the handle promptly; the original
    # json.load(open(...)) leaked the file descriptor.
    with open(reward_config_path, "r", encoding="utf-8") as f:
        reward_config = json.load(f)

projector_path = Path(MODEL_DIR) / "projector.pth"
# NOTE(review): torch.load unpickles arbitrary objects — only load
# projector.pth from a trusted source (consider weights_only=True on
# torch >= 1.13, since a state dict contains tensors only).
projector_state_dict = torch.load(projector_path, map_location="cpu")

# Fall back to inferring the MLP depth from the checkpoint when unconfigured.
proj_arch = reward_config.get("proj_arch") or infer_proj_arch(projector_state_dict)
torch_dtype = resolve_dtype(reward_config.get("torch_dtype") or "bfloat16")
attn_implementation = reward_config.get("attn_implementation")
# Left padding puts real tokens at the end of the sequence, so the last
# position (where the score is read) is the final review token.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR,
    trust_remote_code=TRUST_REMOTE_CODE, padding_side="left")
model_kwargs = {"trust_remote_code": TRUST_REMOTE_CODE, "torch_dtype": torch_dtype}
if attn_implementation:
    # Only forward attn_implementation when the config actually specifies one.
    model_kwargs["attn_implementation"] = attn_implementation
decoder = AutoModelForCausalLM.from_pretrained(MODEL_DIR, **model_kwargs)
# Projector input and hidden width both equal the decoder hidden size;
# cast to bf16 only when the decoder itself runs in bf16.
projector = Projector(proj_arch, decoder.config.hidden_size,
    decoder.config.hidden_size, torch_dtype == torch.bfloat16)
projector.load_state_dict(projector_state_dict)
# Single-device inference: prefer GPU when available, else CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
decoder.to(device).eval()
projector.to(device).eval()
# Tokenize the three fields separately; the length budgeting is done by
# process_one below, not by the tokenizer (no max_length is passed here, so
# truncation="longest_first" has no effect at these call sites).
issue_inputs = tokenizer(f"<issue>{SAMPLE['issue']}</issue>", padding=False,
    truncation="longest_first")
patch_inputs = tokenizer(f"<patch>{SAMPLE['patch']}</patch>", padding=False,
    truncation="longest_first")
# NOTE(review): the review is tokenized WITHOUT <review>...</review> wrappers
# even though the documented input format shows them — confirm against the
# training pipeline.
review_inputs = tokenizer(SAMPLE["review"], padding=False, truncation="longest_first")
# Merge the three token streams into one fixed-length, left-padded sequence.
input_ids, attention_mask = process_one(
    issue_inputs["input_ids"],
    issue_inputs["attention_mask"],
    patch_inputs["input_ids"],
    patch_inputs["attention_mask"],
    review_inputs["input_ids"],
    review_inputs["attention_mask"],
    max_len=MAX_SEQ_LEN,
    min_review_len=MIN_REVIEW_LEN,
)
# Batch of one; tensors created directly on the target device.
inputs = {
    "input_ids": torch.tensor([input_ids], dtype=torch.long, device=device),
    "attention_mask": torch.tensor([attention_mask], dtype=torch.long, device=device),
}
with torch.no_grad():
    # Last decoder layer's hidden states: (batch, seq_len, hidden_size).
    hidden_state = decoder(**inputs, output_hidden_states=True).hidden_states[-1]
    # Project each position to a logit, take the last position (left padding
    # guarantees it is the final real token), and squash into [0, 1].
    reward = torch.sigmoid(projector(hidden_state).squeeze(-1)[:, -1]).item()
print(reward)
The model outputs a single scalar reward score in [0, 1].
Typical interpretation:
This score is best used for:
This model is intended for:
If you use this model, please cite SWE-CARE as appropriate.
@misc{guo2025codefusecrbenchcomprehensivenessawarebenchmarkendtoend,
title={CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects},
author={Hanyang Guo and Xunjin Zheng and Zihan Liao and Hang Yu and Peng DI and Ziyin Zhang and Hong-Ning Dai},
year={2025},
eprint={2509.14856},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2509.14856},
}