arxiv:2511.10984

DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Published on Nov 14, 2025

· Submitted by

taesiri on Nov 17, 2025

ByteDance Seed

Upvote

Authors:

Jingzhe Ding ,

Xiang Gao ,

Ge Zhang ,

Abstract

A new benchmark DiscoX and evaluation system Metric-S are introduced to assess discourse-level and expert-level Chinese-English translation, highlighting the challenges in achieving professional-grade machine translation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.

View arXiv page View PDF Project page GitHub 11 auto Add to collection

Community

taesiri

Paper submitter Nov 17, 2025

DiscoX benchmarks discourse-level Chinese–English translation in expert domains with 200 texts across seven domains, plus Metric-S for reference-free assessment; results show LLMs lag behind humans, highlighting translation challenges.

librarian-bot

Nov 18, 2025

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2511.10984

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2511.10984 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2511.10984 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2511.10984 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.