arxiv:2511.10262

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Published on Apr 17 · Submitted by ZhangHe on Apr 21
Abstract

Current full-duplex speech language models struggle to maintain consistent performance across rounds and evaluation dimensions in multi-round conversations, motivating a comprehensive benchmark.

AI-generated summary

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions, neglecting the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. Also, existing benchmarks often focus solely on evaluating conversational features, neglecting other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for a comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also incorporates various evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench
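The abstract notes that MTR-DuplexBench "segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment." The sketch below illustrates one simple way such segmentation could work; it is not the paper's actual algorithm, and the `Segment` type, field names, and `gap_threshold` heuristic are all assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "user" or "model" (hypothetical labels)
    start: float   # seconds
    end: float     # seconds

def segment_into_turns(segments, gap_threshold=1.0):
    """Group time-ordered speech segments into discrete turns.

    A new turn begins when the speaker changes, or when a silence
    longer than `gap_threshold` seconds separates two segments
    from the same speaker. This is a toy heuristic, not the
    benchmark's segmentation method.
    """
    turns = []
    current = []
    for seg in sorted(segments, key=lambda s: s.start):
        if current and (
            seg.speaker != current[-1].speaker
            or seg.start - current[-1].end > gap_threshold
        ):
            turns.append(current)
            current = []
        current.append(seg)
    if current:
        turns.append(current)
    return turns
```

In a full-duplex setting, an overlapping barge-in from the other speaker also triggers a speaker change here and thus opens a new turn, which is why turn boundaries blur and per-turn evaluation becomes non-trivial.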

Community

Paper author and submitter:

We present MTR-DuplexBench, a comprehensive benchmark for evaluating full-duplex speech language models across multi-round conversations. Our benchmark evaluates models on four critical dimensions: Conversational Features (smooth turn-taking, interruption, pause handling, background), Instruction Following, Safety, and Dialogue Quality. We evaluate several speech models and reveal significant gaps in their ability to handle real-world conversational dynamics. The dataset and evaluation code are publicly available at https://huggingface.co/datasets/Jeff0918/MTR-DuplexBench and https://github.com/ZhangHe0918/MTR-DuplexBench.
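Since the benchmark scores each round on several dimensions, per-round results might be aggregated along these lines. This is an illustrative sketch only; the function names, the score-dictionary shape, and the assumption that scores lie in a comparable numeric range are all hypothetical, not the benchmark's actual API:

```python
def per_round_means(scores):
    """Average score across all dimensions for each round.

    scores: {dimension_name: [score_round_1, score_round_2, ...]},
    where every dimension is scored over the same number of rounds.
    (Hypothetical data layout for illustration.)
    """
    n_rounds = len(next(iter(scores.values())))
    return [
        sum(rounds[r] for rounds in scores.values()) / len(scores)
        for r in range(n_rounds)
    ]

def first_to_last_drop(scores):
    """Per-dimension degradation from the first round to the last.

    A positive value means performance declined over the dialogue,
    the kind of cross-round inconsistency the paper reports.
    """
    return {dim: rounds[0] - rounds[-1] for dim, rounds in scores.items()}
```

Tracking both the per-round mean and the per-dimension drop separates two failure modes the paper distinguishes: models that are weak overall versus models that start strong but degrade as the conversation continues.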


Get this paper in your agent:

    hf papers read 2511.10262

Don't have the latest CLI?

    curl -LsSf https://hf.co/cli/install.sh | bash
