Skywork MedArena LoRA v2 - Best Checkpoint
This is the best-performing checkpoint from the Skywork MedArena LoRA training run, achieving 61.2% accuracy on the test set.
Model Performance
- Test Accuracy: 0.6120 (306/500 correct predictions)
- Score Margin: 0.8204 (higher = more confident predictions)
- Score Std: 3.6615
- Training Step: 504 (optimal stopping point, ~8 epochs)
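The accuracy figure follows directly from the reported counts. As a rough sketch, the snippet below recomputes it and adds a binomial standard error. The standard error is not part of the original evaluation; it is an estimate that assumes the 500 test pairs are independent, and is included only to give a sense of the resolution of a 500-sample test set.

```python
import math

correct, total = 306, 500  # counts reported in this card

accuracy = correct / total

# Binomial standard error (assumption: the 500 test pairs are independent).
std_error = math.sqrt(accuracy * (1 - accuracy) / total)

print(f"accuracy = {accuracy:.4f}")
print(f"approx. standard error = {std_error:.4f}")
```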
Model Details
- Base Model: Skywork/Skywork-Reward-V2-Llama-3.1-8B
- Training Dataset: kewu93/MedArena-0909 (medical preference pairs)
- Training Method: LoRA (Low-Rank Adaptation) fine-tuning
- LoRA Rank (r): 16
- LoRA Alpha: 32
- Max Length: 2048 tokens
- Selected Checkpoint: checkpoint-504 (~8 epochs); the final checkpoint is checkpoint-630 (10 epochs)
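To make the LoRA hyperparameters concrete, the sketch below works out the update scaling factor and the number of trainable parameters one adapter pair adds. It assumes the standard LoRA scaling of alpha / r and a square 4096 x 4096 projection (the hidden size of Llama-3.1-8B). Which modules were actually adapted is not stated in this card, so the per-matrix count is illustrative only.

```python
# Hyperparameters stated in this card.
lora_rank = 16
lora_alpha = 32

# Standard LoRA scaling applied to the low-rank update: alpha / r.
scaling = lora_alpha / lora_rank


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumption: a square projection with the Llama-3.1-8B hidden size of 4096.
# The adapted target modules are not listed in this card.
hidden_size = 4096
params_per_matrix = lora_params(hidden_size, hidden_size, lora_rank)

print(f"scaling = {scaling}")
print(f"LoRA parameters per {hidden_size}x{hidden_size} matrix = {params_per_matrix:,}")
```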
Why This Checkpoint?
Checkpoint-504 was selected as the best model after evaluating multiple checkpoints on the test set:
- Highest Accuracy: 61.20% vs 60.60% for the final checkpoint
- Best Confidence: Score margin of 0.8204 (vs 0.5030 for final)
- Optimal Training: Peaked at step 504, avoiding overfitting
- Efficiency: 20% less training than the full 10 epochs (504 vs 630 steps)
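The epoch and efficiency figures can be checked from the step counts. This is a minimal sketch that assumes every epoch has the same number of optimizer steps, so that the final checkpoint-630 corresponds to exactly 10 epochs.

```python
final_step = 630    # final checkpoint, described as the full 10 epochs
total_epochs = 10
best_step = 504     # selected checkpoint

# Assumption: a constant number of optimizer steps per epoch.
steps_per_epoch = final_step / total_epochs
best_epoch = best_step / steps_per_epoch
fraction_saved = 1 - best_step / final_step

print(f"steps per epoch: {steps_per_epoch:.0f}")
print(f"checkpoint-{best_step} is at epoch {best_epoch:.1f}")
print(f"training steps saved: {fraction_saved:.0%}")
```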
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
# Load base model and tokenizer
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Skywork/Skywork-Reward-V2-Llama-3.1-8B",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-V2-Llama-3.1-8B")
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "kewu93/skywork-medarena-lora-v2")
# Prepare input (use conversation format)
conversation = [
    {"role": "user", "content": "What are the symptoms of diabetes?"},
    {"role": "assistant", "content": "Common symptoms of diabetes include frequent urination, excessive thirst, unexplained weight loss, fatigue, blurred vision, and slow-healing cuts or wounds. If you experience these symptoms, consult a healthcare provider for proper diagnosis and treatment."},
]
# Tokenize using chat template (IMPORTANT: use this format for best results)
text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
inputs = inputs.to(model.device)  # keep input tensors on the same device as the model
# Get reward score
with torch.no_grad():
    outputs = model(**inputs)
    reward_score = outputs.logits.squeeze().item()
print(f"Reward Score: {reward_score:.4f}")
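A reward score is mostly useful when comparing candidate responses to the same prompt. The sketch below shows one way to do that. The helper names `build_conversation` and `pick_preferred` are hypothetical, and `dummy_score` is a placeholder that lets the example run without loading the model; in practice it would be replaced by a function that applies the chat template, tokenizes, and returns `outputs.logits.squeeze().item()` as in the usage code above.

```python
from typing import Callable, Dict, List

Conversation = List[Dict[str, str]]


def build_conversation(prompt: str, response: str) -> Conversation:
    """Wrap a prompt/response pair in the chat format used by the usage code."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


def pick_preferred(
    score_fn: Callable[[Conversation], float],
    prompt: str,
    response_a: str,
    response_b: str,
) -> str:
    """Return the response with the higher reward score (ties go to response_a)."""
    score_a = score_fn(build_conversation(prompt, response_a))
    score_b = score_fn(build_conversation(prompt, response_b))
    return response_a if score_a >= score_b else response_b


def dummy_score(conversation: Conversation) -> float:
    """Placeholder scorer (NOT the reward model): scores by response length."""
    return float(len(conversation[-1]["content"]))


best = pick_preferred(
    dummy_score,
    "What are the symptoms of diabetes?",
    "Thirst.",
    "Frequent urination, excessive thirst, and fatigue are common symptoms.",
)
print(best)
```

Passing the scorer as an argument keeps the comparison logic independent of how the model is loaded, which also makes it easy to test.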
Evaluation Methodology
This model was evaluated using:
- Test Set: 500 samples from MedArena-0909
- Metric: Preference accuracy (how often the model scores the chosen response above the rejected one)
- Tokenization: Proper chat template formatting (critical for performance)
- Comparison: Multiple checkpoints evaluated to find optimal stopping point
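The preference-accuracy metric can be sketched as follows. The function name `preference_metrics` and the toy scores are hypothetical. The card does not define how the score margin is computed, so the sketch assumes it is the mean of chosen score minus rejected score.

```python
from statistics import mean
from typing import Dict, Sequence


def preference_metrics(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> Dict[str, float]:
    """Compute preference accuracy and mean score margin over paired scores."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("chosen and rejected scores must be paired")
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    # A pair counts as correct when the chosen response scores strictly higher.
    accuracy = mean([1.0 if m > 0 else 0.0 for m in margins])
    # Assumption: "score margin" is the mean of (chosen - rejected).
    return {"accuracy": accuracy, "score_margin": mean(margins)}


# Toy scores for illustration only; these are not results from the model.
chosen = [2.0, 0.5, -1.0, 3.0]
rejected = [1.0, 1.5, -2.0, 0.0]
metrics = preference_metrics(chosen, rejected)
print(metrics)
```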
Training Insights
- Peak Performance: Model peaked at step 504 (~8 epochs)
- Overfitting Prevention: Later checkpoints showed slight performance decline
- Tokenization Critical: Proper chat template formatting improves accuracy by ~10 percentage points
- Confidence Matters: Higher score margin indicates more reliable predictions
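One way to read a score margin is through the Bradley-Terry model, where the probability that one response is preferred over another is the sigmoid of their score difference. The sketch below applies that to the reported mean margins. It assumes the reward scores behave like Bradley-Terry logits, which this card does not state. Note also that the sigmoid of a mean margin is not the same as the mean per-pair probability, so the numbers are illustrative only.

```python
import math


def preference_probability(margin: float) -> float:
    """Bradley-Terry probability that the higher-scored response is preferred."""
    return 1.0 / (1.0 + math.exp(-margin))


# Mean margins reported in this card.
reported_margins = {"checkpoint-504": 0.8204, "final checkpoint": 0.5030}

for name, margin in reported_margins.items():
    prob = preference_probability(margin)
    print(f"{name}: margin {margin:.4f} -> implied preference probability {prob:.3f}")
```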
Comparison with Other Versions
| Version | Accuracy | Score Margin | Training |
|---|---|---|---|
| v2 (checkpoint-504) | 61.20% | 0.8204 | 8 epochs |
| v1 (final checkpoint) | 60.60% | 0.5030 | 10 epochs |
| checkpoint-500 | 60.40% | 0.5131 | ~8 epochs |
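The selection behind this table can be expressed as a simple ranking. The sketch below picks the entry with the highest accuracy and uses score margin as a tie-breaker. The tie-break rule is an assumption; the card only states that accuracy and margin were both considered.

```python
# Values copied from the comparison table.
checkpoints = [
    {"name": "v2 (checkpoint-504)", "accuracy": 0.6120, "margin": 0.8204},
    {"name": "v1 (final checkpoint)", "accuracy": 0.6060, "margin": 0.5030},
    {"name": "checkpoint-500", "accuracy": 0.6040, "margin": 0.5131},
]

# Assumption: rank by accuracy first, then by score margin.
best = max(checkpoints, key=lambda c: (c["accuracy"], c["margin"]))

print(f"selected: {best['name']}")
```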
Medical Applications
This model is designed for:
- Medical response quality assessment
- Healthcare chatbot evaluation
- Medical preference learning
- Clinical decision support evaluation
Citation
If you use this model, please cite:
@misc{skywork-medarena-lora-v2,
title={Skywork MedArena LoRA v2: Optimal Medical Preference Learning},
author={kewu93},
year={2024},
url={https://huggingface.co/kewu93/skywork-medarena-lora-v2}
}
Files
- adapter_config.json: LoRA adapter configuration
- adapter_model.safetensors: LoRA adapter weights (optimal checkpoint)
- tokenizer.json, tokenizer_config.json: Tokenizer files
- chat_template.jinja: Chat template for proper formatting
Among the checkpoints evaluated, this one offered the best balance of test accuracy, score margin, and training cost.