MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Paper: arXiv:2501.03468
This model is a fine-tuned version of meta-llama/Llama-3.2-3B-Instruct on the MTRAG (Multi-Turn RAG) benchmark dataset. It is fine-tuned for multi-turn conversational question answering with retrieval-augmented generation (RAG).

Training data: 673 examples (the merged train and validation splits of the MTRAG benchmark), drawn from multi-turn conversational QA tasks across four domains.
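For reference, the merged training set can be assembled along these lines. This is a minimal sketch: the JSONL file names are placeholders, and the actual preprocessing lives in the training repository linked at the bottom of this card.

from datasets import load_dataset, concatenate_datasets

# Placeholder file names -- substitute the actual MTRAG data files
# from the training repository linked below.
ds = load_dataset("json", data_files={
    "train": "mtrag_train.jsonl",
    "validation": "mtrag_validation.jsonl",
})

# Merge train + validation into the single training set.
train_data = concatenate_datasets([ds["train"], ds["validation"]])
print(len(train_data))  # expected: 673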
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "YOUR_REPO_ID_HERE"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example: QA in the model's prompt format (a single turn is shown here;
# see the multi-turn sketch below). Contexts go inside <ctx>...</ctx>, and
# unanswerable questions should yield the literal string NO_ANSWER.
prompt = """<|system|>
You are a helpful retrieval-augmented assistant. Use only the provided contexts to answer. If the question cannot be answered, output exactly NO_ANSWER.
<|end_of_text|>
<|user|>
<ctx>
[Your context here]
</ctx>
Question: What is the capital of France?
Answer:<|end_of_text|>
<|assistant|>
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
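For genuinely multi-turn conversations, one option is the tokenizer's chat template. The sketch below rests on an assumption: that the fine-tune kept the base Llama-3.2 chat template rather than strictly requiring the literal <|system|>/<|user|> markup above. If generations look off, fall back to the explicit prompt format. The <ctx> markup and the NO_ANSWER convention follow the prompt shown above.

# Sketch: multi-turn formatting via the tokenizer's chat template
# (assumes the base Llama-3.2 chat template was preserved).
messages = [
    {"role": "system", "content": "You are a helpful retrieval-augmented assistant. Use only the provided contexts to answer. If the question cannot be answered, output exactly NO_ANSWER."},
    {"role": "user", "content": "<ctx>\n[Your context here]\n</ctx>\nQuestion: What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "<ctx>\n[Your context here]\n</ctx>\nQuestion: What is its population?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)

# The model signals unanswerable questions with the literal string NO_ANSWER.
if response.strip() == "NO_ANSWER":
    print("Question not answerable from the provided contexts.")
else:
    print(response)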
If you use this model, please cite the MTRAG paper:
@misc{katsis2025mtrag,
      title={MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems},
      author={Yannis Katsis and Sara Rosenthal and Kshitij Fadnis and Chulaka Gunasekara and Young-Suk Lee and Lucian Popa and Vraj Shah and Huaiyu Zhu and Danish Contractor and Marina Danilevsky},
      year={2025},
      eprint={2501.03468},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.03468},
}
Training code and dataset: https://github.com/clulab/semeval2026-task8
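The actual training recipe lives in that repository. For orientation only, a supervised fine-tune of this kind can be set up with TRL roughly as follows; this is a hedged sketch, not the exact training script, and output_dir and the dataset variable are illustrative.

from trl import SFTConfig, SFTTrainer

# Rough SFT setup -- real hyperparameters are in the linked repository.
# train_data is the merged MTRAG train+validation set from the sketch
# above, formatted with a "messages" (or "text") column as TRL expects.
trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",
    train_dataset=train_data,
    args=SFTConfig(output_dir="llama-3.2-3b-mtrag"),
)
trainer.train()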
Base model: meta-llama/Llama-3.2-3B-Instruct