Skywork MedArena LoRA v2 - Best Checkpoint

🏆 This is the best-performing checkpoint (checkpoint-504) from the Skywork MedArena LoRA training run, achieving 61.2% preference accuracy on the 500-sample test set.

Model Performance

Test Accuracy: 0.6120 (306/500 correct predictions)
Score Margin: 0.8204 (higher = more confident predictions)
Score Std: 3.6615
Training Step: 504 (optimal stopping point, ~8 epochs)

Model Details

Base Model: Skywork/Skywork-Reward-V2-Llama-3.1-8B
Training Dataset: kewu93/MedArena-0909 (medical preference pairs)
Training Method: LoRA (Low-Rank Adaptation) fine-tuning
LoRA Rank (r): 16
LoRA Alpha: 32
Max Length: 2048 tokens
Optimal Training: ~8 epochs (checkpoint-504 vs final checkpoint-630)
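
To make the rank and alpha settings concrete, the sketch below computes the LoRA scaling factor (alpha / r) and the number of trainable parameters LoRA adds to one linear layer. The 4096 x 4096 layer shape is an illustrative assumption (it matches the hidden size of Llama-3.1-8B); which modules this adapter actually targets is recorded in adapter_config.json and is not stated in this card.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Factor applied to the low-rank update: alpha / r."""
    return alpha / r


def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns two matrices, A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)


R, ALPHA = 16, 32      # values from this model card
D_IN = D_OUT = 4096    # assumed layer shape, for illustration only

added = lora_added_params(D_IN, D_OUT, R)
full = D_IN * D_OUT

print(f"scaling factor: {lora_scaling(R, ALPHA)}")
print(f"added parameters per layer: {added}")
print(f"fraction of the full layer: {added / full:.4%}")
```

With r = 16 and alpha = 32 the low-rank update is scaled by 2, and each adapted 4096 x 4096 layer gains well under 1% of its original parameter count.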

Why This Checkpoint?

Checkpoint-504 was selected as the best model after evaluating multiple checkpoints on the test set:

Highest Accuracy: 61.20% vs 60.60% for the final checkpoint
Best Confidence: Score margin of 0.8204 (vs 0.5030 for final)
Optimal Training: Peaked at step 504, avoiding overfitting
Efficiency: 20% less training time than the full 10 epochs (504 of 630 steps)
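
The epoch and efficiency figures follow from the step counts. The short sketch below reproduces that arithmetic, assuming the final checkpoint-630 corresponds to the end of the 10th epoch.

```python
TOTAL_STEPS = 630    # final checkpoint
TOTAL_EPOCHS = 10    # assumed to end at TOTAL_STEPS
BEST_STEP = 504      # this checkpoint

steps_per_epoch = TOTAL_STEPS // TOTAL_EPOCHS
best_epoch = BEST_STEP / steps_per_epoch
saved_fraction = (TOTAL_STEPS - BEST_STEP) / TOTAL_STEPS

print(f"steps per epoch: {steps_per_epoch}")
print(f"checkpoint-{BEST_STEP} is at epoch {best_epoch:g}")
print(f"training steps saved: {saved_fraction:.0%}")
```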

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Skywork/Skywork-Reward-V2-Llama-3.1-8B",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-V2-Llama-3.1-8B")

# Load LoRA adapter and switch to inference mode (disables dropout)
model = PeftModel.from_pretrained(base_model, "kewu93/skywork-medarena-lora-v2")
model.eval()

# Prepare input (use conversation format)
conversation = [
    {"role": "user", "content": "What are the symptoms of diabetes?"},
    {"role": "assistant", "content": "Common symptoms of diabetes include frequent urination, excessive thirst, unexplained weight loss, fatigue, blurred vision, and slow-healing cuts or wounds. If you experience these symptoms, consult a healthcare provider for proper diagnosis and treatment."}
]

# Tokenize using chat template (IMPORTANT: use this format for best results)
text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
# Move the inputs to the model's device (needed with device_map="auto")
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

# Get reward score
with torch.no_grad():
    outputs = model(**inputs)
    reward_score = outputs.logits.squeeze().item()
    print(f"Reward Score: {reward_score:.4f}")
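
Because this is a preference model, a common use is comparing two candidate responses to the same prompt. The following is a minimal sketch of that comparison, not part of the released code: `build_conversation`, `pick_preferred`, and the `score_fn` parameter are hypothetical names. `score_fn` stands in for any callable that maps a conversation to a scalar reward, for example a wrapper around the tokenization and forward pass shown above.

```python
from typing import Callable, Dict, List

Conversation = List[Dict[str, str]]


def build_conversation(prompt: str, response: str) -> Conversation:
    """Pair a user prompt with one assistant response."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


def pick_preferred(
    score_fn: Callable[[Conversation], float],
    prompt: str,
    response_a: str,
    response_b: str,
) -> dict:
    """Score both responses and report which one scores higher.

    Ties are resolved in favour of response "a".
    """
    score_a = score_fn(build_conversation(prompt, response_a))
    score_b = score_fn(build_conversation(prompt, response_b))
    return {
        "preferred": "a" if score_a >= score_b else "b",
        "score_a": score_a,
        "score_b": score_b,
        "margin": abs(score_a - score_b),
    }
```

Passing the scorer in as a callable keeps the comparison logic independent of how the model is loaded, so it can be exercised with a stub before running the 8B model.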

Evaluation Methodology

This model was evaluated using:

Test Set: 500 samples from MedArena-0909
Metric: Preference accuracy (how often the model scores the chosen response above the rejected one)
Tokenization: Proper chat template formatting (critical for performance)
Comparison: Multiple checkpoints evaluated to find optimal stopping point
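
The metrics above can be sketched as follows. This card does not define Score Margin or Score Std precisely, so the definitions used here are assumptions: the margin is taken as the mean of (chosen score minus rejected score) and the std as the population standard deviation of those differences. The function name `preference_metrics` is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def preference_metrics(
    chosen_scores: Sequence[float],
    rejected_scores: Sequence[float],
) -> Dict[str, float]:
    """Summarise reward scores over a set of preference pairs."""
    if not chosen_scores or len(chosen_scores) != len(rejected_scores):
        raise ValueError("need two non-empty score lists of equal length")

    # Assumed definition: per-pair margin = chosen score - rejected score.
    diffs = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    correct = sum(d > 0 for d in diffs)  # pairs where chosen wins

    return {
        "accuracy": correct / len(diffs),
        "correct": correct,
        "score_margin": mean(diffs),
        "score_std": pstdev(diffs),
    }


# Toy scores for illustration only (not taken from the real evaluation).
metrics = preference_metrics([2.0, 1.0, 0.5, 3.0], [1.0, 1.5, 0.0, 1.0])
print(metrics)
```

Under this definition, the reported accuracy of 0.6120 corresponds to 306 of 500 pairs with a positive difference.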

Training Insights

Peak Performance: Model peaked at step 504 (~8 epochs)
Overfitting Prevention: Later checkpoints showed slight performance decline
Tokenization Critical: Proper chat template formatting improves accuracy by ~10 percentage points
Confidence Matters: Higher score margin indicates more reliable predictions

Comparison with Other Versions

Version	Accuracy	Score Margin	Training Length
v2 (checkpoint-504)	61.20%	0.8204	8 epochs
v1 (final checkpoint)	60.60%	0.5030	10 epochs
checkpoint-500	60.40%	0.5131	~8 epochs
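
The selection among these rows can be expressed as a small sketch. The rule used here (highest accuracy, with score margin as a tie-breaker) is an assumption about how the comparison was applied, and `CheckpointResult` and `select_best` are hypothetical names; the numbers are the ones in the table above.

```python
from typing import List, NamedTuple


class CheckpointResult(NamedTuple):
    name: str
    accuracy: float
    score_margin: float


results = [
    CheckpointResult("v2 (checkpoint-504)", 0.6120, 0.8204),
    CheckpointResult("v1 (final checkpoint)", 0.6060, 0.5030),
    CheckpointResult("checkpoint-500", 0.6040, 0.5131),
]


def select_best(candidates: List[CheckpointResult]) -> CheckpointResult:
    """Pick the highest accuracy; break ties on score margin (assumed rule)."""
    return max(candidates, key=lambda r: (r.accuracy, r.score_margin))


best = select_best(results)
print(f"selected: {best.name}")
```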

Medical Applications

This model is designed for:

Medical response quality assessment
Healthcare chatbot evaluation
Medical preference learning
Clinical decision support evaluation

Citation

If you use this model, please cite:

@misc{skywork-medarena-lora-v2,
  title={Skywork MedArena LoRA v2: Optimal Medical Preference Learning},
  author={kewu93},
  year={2024},
  url={https://huggingface.co/kewu93/skywork-medarena-lora-v2}
}

Files

adapter_config.json: LoRA adapter configuration
adapter_model.safetensors: LoRA adapter weights (optimal checkpoint)
tokenizer.json, tokenizer_config.json: Tokenizer files
chat_template.jinja: Chat template for proper formatting

This model represents the optimal checkpoint from comprehensive evaluation of multiple training stages, achieving the best balance of accuracy, confidence, and training efficiency.
