Skywork MedArena LoRA v2 - Best Checkpoint

๐Ÿ† This is the BEST performing checkpoint from the Skywork MedArena LoRA training, achieving 61.2% accuracy on the test set.

Model Performance

  • Test Accuracy: 0.6120 (306/500 correct predictions)
  • Score Margin: 0.8204 (higher = more confident predictions)
  • Score Std: 3.6615
  • Training Step: 504 (optimal stopping point, ~8 epochs)

Model Details

  • Base Model: Skywork/Skywork-Reward-V2-Llama-3.1-8B
  • Training Dataset: kewu93/MedArena-0909 (medical preference pairs)
  • Training Method: LoRA (Low-Rank Adaptation) fine-tuning
  • LoRA Rank (r): 16
  • LoRA Alpha: 32
  • Max Length: 2048 tokens
  • Checkpoint Selection: checkpoint-504 (~8 epochs) rather than the final checkpoint-630 (10 epochs)
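
To make the rank and alpha values above more concrete, the following sketch computes the standard LoRA scaling factor (alpha / r) and the number of trainable parameters LoRA adds to one weight matrix. It is illustrative only: the 4096 x 4096 matrix shape is an assumption matching the hidden size of Llama-3.1-8B, and this card does not state which modules were adapted. The function names are hypothetical helpers, not part of any library.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added for one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 16, 32          # from this card
D_IN = D_OUT = 4096        # ASSUMPTION: one square projection at Llama-3.1-8B hidden size

scaling = lora_scaling(R, ALPHA)
adapter_params = lora_param_count(D_IN, D_OUT, R)
full_params = D_IN * D_OUT

print(f"scaling factor: {scaling}")
print(f"adapter params per matrix: {adapter_params} of {full_params} "
      f"({adapter_params / full_params:.2%})")
```

With r = 16 and alpha = 32 the update is scaled by 2, and each adapted matrix of that assumed shape trains well under 1% of the parameters a full fine-tune would touch.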

Why This Checkpoint?

checkpoint-504 was selected after evaluating multiple checkpoints on the test set:

  1. Highest Accuracy: 61.20% vs 60.60% for the final checkpoint
  2. Best Confidence: Score margin of 0.8204 (vs 0.5030 for final)
  3. Optimal Training: Peaked at step 504, avoiding overfitting
  4. Efficiency: 20% less training time than full 10 epochs
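
The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this card; it assumes steps are spread evenly across epochs and that training time is proportional to the number of steps, neither of which the card states explicitly.

```python
TOTAL_STEPS = 630    # final checkpoint, from this card
BEST_STEP = 504      # selected checkpoint, from this card
TOTAL_EPOCHS = 10    # full training run, from this card
TEST_SIZE = 500      # test set size, from this card

# ASSUMPTION: every epoch has the same number of optimizer steps.
steps_per_epoch = TOTAL_STEPS / TOTAL_EPOCHS
epochs_at_best = BEST_STEP / steps_per_epoch

# ASSUMPTION: training time scales linearly with step count.
steps_saved = 1 - BEST_STEP / TOTAL_STEPS

# Convert the reported accuracies back into counts of correct predictions.
best_correct = round(0.6120 * TEST_SIZE)
final_correct = round(0.6060 * TEST_SIZE)

print(f"steps per epoch: {steps_per_epoch:.0f}")
print(f"epochs at checkpoint-{BEST_STEP}: {epochs_at_best:.0f}")
print(f"training steps saved: {steps_saved:.0%}")
print(f"correct predictions: {best_correct} vs {final_correct} "
      f"(difference of {best_correct - final_correct})")
```

Under these assumptions, checkpoint-504 falls exactly at the end of epoch 8, and the accuracy gap between the two checkpoints corresponds to three test examples.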

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Skywork/Skywork-Reward-V2-Llama-3.1-8B",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-V2-Llama-3.1-8B")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "kewu93/skywork-medarena-lora-v2")

# Prepare input (use conversation format)
conversation = [
    {"role": "user", "content": "What are the symptoms of diabetes?"},
    {"role": "assistant", "content": "Common symptoms of diabetes include frequent urination, excessive thirst, unexplained weight loss, fatigue, blurred vision, and slow-healing cuts or wounds. If you experience these symptoms, consult a healthcare provider for proper diagnosis and treatment."}
]

# Tokenize using chat template (IMPORTANT: use this format for best results)
text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
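# Move the tensors to the model's device (needed because device_map="auto" may place the model on GPU)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)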

# Get reward score
with torch.no_grad():
    outputs = model(**inputs)
    reward_score = outputs.logits.squeeze().item()
    print(f"Reward Score: {reward_score:.4f}")
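
Because the model was trained on preference pairs, a common way to use it is to score two candidate responses to the same prompt and keep the higher-scoring one. The sketch below shows that pattern without loading the model: `score_fn` is a hypothetical callable that would wrap the tokenize-and-score steps above and return a float, and `toy_score` is a stand-in used purely for demonstration.

```python
from typing import Callable, Dict, List, Tuple

Conversation = List[Dict[str, str]]


def build_conversation(prompt: str, response: str) -> Conversation:
    """Build the two-turn conversation format used in the usage example."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


def pick_preferred(
    score_fn: Callable[[Conversation], float],
    prompt: str,
    response_a: str,
    response_b: str,
) -> Tuple[str, float]:
    """Return the higher-scoring response and the margin (score_a - score_b)."""
    score_a = score_fn(build_conversation(prompt, response_a))
    score_b = score_fn(build_conversation(prompt, response_b))
    preferred = response_a if score_a >= score_b else response_b
    return preferred, score_a - score_b


def toy_score(conversation: Conversation) -> float:
    """Stand-in scorer for demonstration only: longer answers score higher."""
    return float(len(conversation[-1]["content"]))


preferred, margin = pick_preferred(
    toy_score, "What is hypertension?", "High blood pressure.", "It is persistently elevated blood pressure."
)
print(preferred, margin)
```

In real use, `score_fn` would apply the chat template, tokenize, run the model, and return `outputs.logits.squeeze().item()`, exactly as in the snippet above.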

Evaluation Methodology

This model was evaluated using:

  • Test Set: 500 samples from MedArena-0909
  • Metric: Preference accuracy (how often model prefers chosen over rejected response)
  • Tokenization: Proper chat template formatting (critical for performance)
  • Comparison: Multiple checkpoints evaluated to find optimal stopping point
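
As a rough illustration of how the reported metrics could be computed from per-example reward scores, here is a minimal sketch. It assumes that "score margin" is the mean of (chosen score minus rejected score), that "score std" is the population standard deviation of those differences, and that ties count as incorrect. The card does not define these precisely, so treat the definitions and the `preference_metrics` helper as assumptions rather than the author's evaluation code.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def preference_metrics(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> Dict[str, float]:
    """Compute preference accuracy and margin statistics for paired scores."""
    if not chosen_scores or len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    # Per-pair margin: positive means the model preferred the chosen response.
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    correct = sum(1 for m in margins if m > 0)  # ASSUMPTION: ties count as wrong

    return {
        "accuracy": correct / len(margins),
        "score_margin": mean(margins),   # ASSUMPTION: mean of differences
        "score_std": pstdev(margins),    # ASSUMPTION: std of differences
        "correct": correct,
        "total": len(margins),
    }


# Toy scores for four preference pairs (not real model outputs).
metrics = preference_metrics([2.0, 0.5, 1.0, -1.0], [1.0, 1.5, 0.0, -2.0])
print(metrics)
```

On the real test set, the two input lists would each hold 500 scores produced by running the model on the chosen and rejected conversations.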

Training Insights

  • Peak Performance: Model peaked at step 504 (~8 epochs)
  • Overfitting Prevention: Later checkpoints showed slight performance decline
  • Tokenization Critical: Proper chat template formatting improves accuracy by ~10 percentage points
  • Confidence Matters: Higher score margin indicates more reliable predictions

Comparison with Other Versions

| Version | Accuracy | Score Margin | Training |
|---|---|---|---|
| v2 (checkpoint-504) | 61.20% | 0.8204 | 8 epochs |
| v1 (final checkpoint) | 60.60% | 0.5030 | 10 epochs |
| checkpoint-500 | 60.40% | 0.5131 | ~8 epochs |

Medical Applications

This model is designed for:

  • Medical response quality assessment
  • Healthcare chatbot evaluation
  • Medical preference learning
  • Clinical decision support evaluation

Citation

If you use this model, please cite:

@misc{skywork-medarena-lora-v2,
  title={Skywork MedArena LoRA v2: Optimal Medical Preference Learning},
  author={kewu93},
  year={2024},
  url={https://huggingface.co/kewu93/skywork-medarena-lora-v2}
}

Files

  • adapter_config.json: LoRA adapter configuration
  • adapter_model.safetensors: LoRA adapter weights (optimal checkpoint)
  • tokenizer.json, tokenizer_config.json: Tokenizer files
  • chat_template.jinja: Chat template for proper formatting

This model represents the optimal checkpoint from comprehensive evaluation of multiple training stages, achieving the best balance of accuracy, confidence, and training efficiency.
