Overview

Abstract

Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, and insufficient multilingual support. We present MINT-Bench, a comprehensive multilingual benchmark for instruction-following TTS. MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that the task remains far from solved: frontier commercial systems lead overall, while leading open-source models are highly competitive and can even outperform commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems. We release MINT-Bench together with the data construction and evaluation toolkit to support future research on controllable, multilingual, and diagnostically grounded TTS evaluation.

MINT-Bench Framework

MINT-Bench Overview

Figure 1: Overview of MINT-Bench. Rather than treating instruction-following TTS evaluation as a flat collection of prompts, MINT-Bench formulates it as a structured benchmark construction and evaluation problem. The framework consists of three tightly coupled components: a Hierarchical Multi-axis Taxonomy that organizes benchmark coverage, a controlled Three-stage Data Construction Pipeline that instantiates valid benchmark cases into natural-language items, and a Hierarchical Hybrid Evaluation Protocol that assesses synthesized speech from multiple complementary perspectives.

Hierarchical Hybrid Evaluation Protocol

Evaluation Pipeline

Figure 2: Hierarchical hybrid evaluation protocol of MINT-Bench. The pipeline progressively evaluates content consistency, instruction following, perceptual quality, and timbre diversity.
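
Content consistency is reported in the leaderboard as a word error rate (WER). As a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words), the following may help when reading those numbers. The function name `word_error_rate`, the whitespace tokenization, and the choice to return NaN for an empty reference are assumptions of this sketch, not the benchmark toolkit's implementation.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # With no reference words the ratio is undefined (assumption: return NaN).
        return math.nan

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted against the reference length, WER under this definition can exceed 100% when the transcript of the synthesized speech contains many extra words.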

Leaderboard

Overall

Note: The main score in bold is the penalized score (Instruction Following / Perceptual Quality), which factors in content consistency. The smaller text below it shows the raw (unpenalized) score, the Word Error Rate (WER), and the content consistency pass rate (Pass).
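
The note does not spell out how the penalty is applied. The tabulated values are consistent with scaling each raw score by the content consistency pass rate, and the sketch below illustrates that reading. The function name `penalized_score` is hypothetical, and the relationship is inferred from the leaderboard numbers rather than taken from the benchmark toolkit.

```python
def penalized_score(raw_score: float, pass_rate_percent: float) -> float:
    """Scale a raw score by the content consistency pass rate (inferred relationship)."""
    return raw_score * pass_rate_percent / 100.0


# Gemini 2.5-Flash, Overall avg in the first table: raw 2.68 / 4.02, Pass 91.0%
instruction_following = penalized_score(2.68, 91.0)
perceptual_quality = penalized_score(4.02, 91.0)
print(f"{instruction_following:.2f} / {perceptual_quality:.2f}")  # 2.44 / 3.66
```

Under this reading, a cell with Pass 100.0% has identical penalized and raw scores, and a cell with Pass 0.0% has a penalized score of 0.00 regardless of its raw score.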
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Gemini 2.5-Flash
2.44 / 3.66
raw 2.68 / 4.02
WER 1.4% | Pass 91.0%
2.33 / 3.37
raw 2.62 / 3.78
WER 1.7% | Pass 89.1%
2.59 / 3.96
raw 2.84 / 4.34
WER 1.2% | Pass 91.3%
2.50 / 3.74
raw 2.73 / 4.08
WER 1.3% | Pass 91.7%
2.53 / 3.81
raw 2.79 / 4.21
WER 1.5% | Pass 90.5%
2.65 / 4.07
raw 2.89 / 4.44
WER 0.8% | Pass 91.7%
2.50 / 3.64
raw 2.76 / 4.03
WER 1.6% | Pass 90.3%
2.24 / 3.54
raw 2.78 / 4.39
WER 2.1% | Pass 80.6%
2.46 / 3.94
raw 2.81 / 4.50
WER 1.2% | Pass 87.5%
2.31 / 3.25
raw 2.47 / 3.47
WER 0.8% | Pass 93.8%
2.28 / 3.38
raw 2.63 / 3.90
WER 1.1% | Pass 86.7%
2.17 / 3.22
raw 2.50 / 3.70
WER 1.5% | Pass 86.9%
1.95 / 2.55
raw 1.95 / 2.55
WER nan% | Pass 100.0%
2.62 / 4.08
raw 2.62 / 4.08
WER nan% | Pass 100.0%
2.56 / 3.94
raw 2.56 / 3.94
WER nan% | Pass 100.0%
2.31 / 3.56
raw 2.31 / 3.56
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.39 / 3.45
raw 2.65 / 3.83
WER 1.4% | Pass 90.1%
2.34 / 3.17
raw 2.61 / 3.54
WER 1.5% | Pass 89.5%
2.53 / 3.73
raw 2.81 / 4.15
WER 1.3% | Pass 90.1%
2.47 / 3.49
raw 2.72 / 3.85
WER 1.5% | Pass 90.7%
2.51 / 3.64
raw 2.80 / 4.06
WER 1.6% | Pass 89.6%
2.52 / 3.60
raw 2.83 / 4.06
WER 1.2% | Pass 88.9%
2.44 / 3.32
raw 2.62 / 3.57
WER 1.2% | Pass 93.1%
2.35 / 3.80
raw 2.92 / 4.72
WER 1.4% | Pass 80.6%
2.23 / 3.38
raw 2.75 / 4.16
WER 1.6% | Pass 81.2%
2.40 / 3.57
raw 2.56 / 3.81
WER 0.8% | Pass 93.8%
2.22 / 3.26
raw 2.57 / 3.77
WER 1.1% | Pass 86.7%
2.05 / 2.91
raw 2.40 / 3.42
WER 1.3% | Pass 85.1%
1.90 / 2.40
raw 1.80 / 2.40
WER nan% | Pass 100.0%
2.71 / 4.21
raw 2.71 / 4.21
WER nan% | Pass 100.0%
2.31 / 3.50
raw 2.31 / 3.50
WER nan% | Pass 100.0%
2.35 / 3.58
raw 2.35 / 3.58
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.23 / 3.12
raw 2.50 / 3.50
WER 1.4% | Pass 89.2%
2.53 / 3.39
raw 2.74 / 3.67
WER 1.1% | Pass 92.5%
2.23 / 3.09
raw 2.53 / 3.51
WER 1.6% | Pass 88.1%
2.49 / 3.41
raw 2.64 / 3.61
WER 0.8% | Pass 94.5%
2.34 / 3.19
raw 2.52 / 3.43
WER 1.0% | Pass 93.0%
2.62 / 3.74
raw 2.86 / 4.09
WER 1.2% | Pass 91.5%
2.24 / 2.92
raw 2.58 / 3.36
WER 1.7% | Pass 87.0%
1.81 / 2.37
raw 2.14 / 2.81
WER 2.1% | Pass 84.5%
2.18 / 3.09
raw 2.53 / 3.59
WER 1.9% | Pass 86.0%
2.12 / 3.05
raw 2.57 / 3.70
WER 2.4% | Pass 82.5%
2.48 / 3.92
raw 2.80 / 4.43
WER 1.5% | Pass 88.5%
2.42 / 3.66
raw 2.72 / 4.11
WER 1.4% | Pass 89.0%
0.77 / 0.77
raw 1.00 / 1.00
WER 3.5% | Pass 76.5%
1.43 / 1.90
raw 1.75 / 2.33
WER 2.8% | Pass 81.5%
1.21 / 1.55
raw 1.42 / 1.81
WER 2.0% | Pass 85.5%
2.21 / 3.30
raw 2.56 / 3.81
WER 1.8% | Pass 86.5%
Qwen3TTS-12Hz-1.7B-VD
2.12 / 2.94
raw 2.33 / 3.24
WER 1.5% | Pass 90.9%
2.28 / 3.16
raw 2.55 / 3.53
WER 1.3% | Pass 89.4%
2.16 / 3.03
raw 2.39 / 3.34
WER 1.3% | Pass 90.7%
2.20 / 3.14
raw 2.46 / 3.51
WER 1.7% | Pass 89.4%
2.13 / 2.96
raw 2.39 / 3.33
WER 2.2% | Pass 89.1%
2.44 / 3.44
raw 2.67 / 3.75
WER 0.9% | Pass 91.7%
2.35 / 3.27
raw 2.53 / 3.51
WER 1.0% | Pass 93.1%
1.57 / 1.99
raw 1.89 / 2.39
WER 0.9% | Pass 83.3%
2.11 / 2.90
raw 2.41 / 3.31
WER 1.0% | Pass 87.5%
2.15 / 2.79
raw 2.22 / 2.88
WER 0.4% | Pass 96.9%
2.06 / 2.97
raw 2.47 / 3.57
WER 1.3% | Pass 83.3%
2.12 / 2.96
raw 2.38 / 3.31
WER 1.3% | Pass 89.3%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
2.21 / 3.29
raw 2.21 / 3.29
WER nan% | Pass 100.0%
1.81 / 2.44
raw 1.81 / 2.44
WER nan% | Pass 100.0%
1.59 / 2.03
raw 1.59 / 2.03
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
1.95 / 2.62
raw 2.15 / 2.89
WER 1.4% | Pass 90.5%
2.12 / 2.82
raw 2.39 / 3.19
WER 1.5% | Pass 88.6%
2.04 / 2.77
raw 2.26 / 3.08
WER 1.4% | Pass 90.2%
2.07 / 2.85
raw 2.30 / 3.16
WER 1.7% | Pass 90.1%
1.99 / 2.65
raw 2.22 / 2.96
WER 1.6% | Pass 89.5%
2.35 / 3.20
raw 2.68 / 3.65
WER 1.3% | Pass 87.5%
2.27 / 3.02
raw 2.44 / 3.25
WER 1.1% | Pass 93.0%
1.53 / 1.99
raw 1.83 / 2.39
WER 0.8% | Pass 83.3%
1.57 / 2.02
raw 2.09 / 2.69
WER 2.9% | Pass 75.0%
2.07 / 2.83
raw 2.07 / 2.83
WER 0.1% | Pass 100.0%
2.06 / 3.00
raw 2.47 / 3.60
WER 1.4% | Pass 83.3%
2.09 / 2.93
raw 2.32 / 3.25
WER 1.2% | Pass 90.2%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.08 / 1.17
raw 1.08 / 1.17
WER nan% | Pass 100.0%
1.26 / 1.48
raw 1.26 / 1.48
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
1.90 / 2.53
raw 2.20 / 2.93
WER 2.8% | Pass 86.4%
2.01 / 2.61
raw 2.38 / 3.08
WER 2.8% | Pass 84.6%
1.99 / 2.72
raw 2.30 / 3.14
WER 2.7% | Pass 86.5%
1.97 / 2.52
raw 2.25 / 2.89
WER 2.2% | Pass 87.5%
1.90 / 2.57
raw 2.27 / 3.06
WER 3.5% | Pass 83.8%
2.29 / 3.09
raw 2.54 / 3.42
WER 0.9% | Pass 90.1%
2.01 / 2.78
raw 2.34 / 3.24
WER 4.5% | Pass 85.7%
1.71 / 2.35
raw 2.36 / 3.25
WER 4.0% | Pass 72.2%
1.90 / 2.64
raw 2.34 / 3.25
WER 2.4% | Pass 81.2%
2.11 / 2.84
raw 2.25 / 3.03
WER 0.9% | Pass 93.8%
1.76 / 2.17
raw 2.03 / 2.50
WER 1.0% | Pass 86.7%
1.74 / 2.28
raw 2.20 / 2.90
WER 3.6% | Pass 78.7%
1.30 / 1.50
raw 1.30 / 1.50
WER nan% | Pass 100.0%
2.04 / 2.92
raw 2.04 / 2.92
WER nan% | Pass 100.0%
1.97 / 2.77
raw 1.97 / 2.77
WER nan% | Pass 100.0%
1.16 / 1.28
raw 1.16 / 1.28
WER nan% | Pass 100.0%
Hume-Octave1
1.86 / 2.51
raw 2.07 / 2.79
WER 1.6% | Pass 89.9%
2.08 / 2.86
raw 2.39 / 3.30
WER 1.8% | Pass 86.7%
1.95 / 2.69
raw 2.16 / 2.98
WER 1.4% | Pass 89.5%
2.10 / 3.00
raw 2.38 / 3.41
WER 2.0% | Pass 88.1%
2.08 / 2.98
raw 2.27 / 3.26
WER 1.3% | Pass 91.4%
2.22 / 3.11
raw 2.57 / 3.60
WER 1.3% | Pass 86.6%
2.00 / 2.71
raw 2.17 / 2.94
WER 1.2% | Pass 92.2%
1.65 / 1.99
raw 1.92 / 2.31
WER 1.9% | Pass 86.1%
1.52 / 1.62
raw 1.81 / 1.94
WER 1.9% | Pass 83.9%
1.44 / 1.68
raw 1.60 / 1.87
WER 1.2% | Pass 90.0%
1.97 / 2.56
raw 2.37 / 3.07
WER 1.3% | Pass 83.3%
1.74 / 2.28
raw 2.03 / 2.66
WER 1.8% | Pass 85.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.03 / 1.03
raw 1.03 / 1.03
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
1.75 / 2.22
raw 2.13 / 2.70
WER 3.5% | Pass 82.2%
1.80 / 2.17
raw 2.29 / 2.76
WER 4.1% | Pass 78.6%
1.83 / 2.37
raw 2.24 / 2.91
WER 3.1% | Pass 81.5%
1.87 / 2.42
raw 2.24 / 2.89
WER 3.2% | Pass 83.8%
1.79 / 2.28
raw 2.22 / 2.83
WER 4.5% | Pass 80.3%
2.03 / 2.49
raw 2.47 / 3.04
WER 2.0% | Pass 81.9%
1.95 / 2.49
raw 2.31 / 2.94
WER 3.3% | Pass 84.5%
1.44 / 1.81
raw 2.00 / 2.50
WER 3.1% | Pass 72.2%
1.25 / 1.46
raw 2.00 / 2.34
WER 5.0% | Pass 62.5%
1.71 / 2.25
raw 2.28 / 3.00
WER 2.9% | Pass 75.0%
1.89 / 2.53
raw 2.27 / 3.03
WER 2.1% | Pass 83.3%
1.63 / 2.13
raw 2.16 / 2.82
WER 3.3% | Pass 75.4%
1.05 / 1.05
raw 1.05 / 1.05
WER nan% | Pass 100.0%
1.96 / 2.54
raw 1.96 / 2.54
WER nan% | Pass 100.0%
1.47 / 1.91
raw 1.47 / 1.91
WER nan% | Pass 100.0%
1.03 / 1.03
raw 1.03 / 1.03
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.69 / 2.15
raw 1.89 / 2.40
WER 1.7% | Pass 89.4%
1.70 / 2.09
raw 1.91 / 2.35
WER 1.9% | Pass 88.6%
1.78 / 2.30
raw 1.98 / 2.56
WER 1.4% | Pass 90.0%
1.63 / 2.07
raw 1.81 / 2.31
WER 1.6% | Pass 90.0%
1.83 / 2.42
raw 2.03 / 2.68
WER 2.0% | Pass 90.2%
1.74 / 2.16
raw 1.96 / 2.43
WER 1.4% | Pass 88.9%
1.81 / 2.22
raw 1.97 / 2.42
WER 1.4% | Pass 91.7%
1.72 / 2.24
raw 2.14 / 2.78
WER 1.3% | Pass 80.6%
1.74 / 2.19
raw 2.06 / 2.59
WER 1.4% | Pass 84.4%
1.87 / 2.40
raw 2.00 / 2.57
WER 0.9% | Pass 93.3%
1.61 / 1.89
raw 2.10 / 2.47
WER 1.5% | Pass 76.7%
1.47 / 1.79
raw 1.78 / 2.16
WER 2.2% | Pass 82.8%
1.30 / 1.30
raw 1.30 / 1.30
WER nan% | Pass 100.0%
1.88 / 2.71
raw 1.88 / 2.71
WER nan% | Pass 100.0%
1.62 / 2.19
raw 1.62 / 2.19
WER nan% | Pass 100.0%
1.59 / 2.03
raw 1.59 / 2.03
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.51 / 1.79
raw 1.70 / 2.01
WER 1.8% | Pass 89.1%
1.68 / 1.94
raw 1.91 / 2.20
WER 1.8% | Pass 88.2%
1.58 / 1.89
raw 1.75 / 2.10
WER 1.5% | Pass 90.0%
1.70 / 2.17
raw 1.93 / 2.47
WER 2.1% | Pass 87.8%
1.61 / 1.87
raw 1.79 / 2.08
WER 2.0% | Pass 90.1%
1.99 / 2.43
raw 2.17 / 2.65
WER 0.8% | Pass 91.7%
1.41 / 1.52
raw 1.52 / 1.63
WER 1.0% | Pass 93.0%
1.60 / 2.06
raw 1.92 / 2.47
WER 1.4% | Pass 83.3%
1.27 / 1.32
raw 1.56 / 1.62
WER 2.0% | Pass 81.2%
1.20 / 1.20
raw 1.38 / 1.38
WER 1.6% | Pass 87.5%
1.56 / 1.99
raw 1.80 / 2.30
WER 1.0% | Pass 86.7%
1.31 / 1.55
raw 1.62 / 1.91
WER 2.0% | Pass 81.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.03 / 1.03
raw 1.03 / 1.03
WER nan% | Pass 100.0%
1.12 / 1.25
raw 1.12 / 1.25
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
1.41 / 1.70
raw 1.63 / 1.96
WER 2.4% | Pass 86.4%
1.55 / 1.91
raw 1.80 / 2.21
WER 2.5% | Pass 86.2%
1.46 / 1.75
raw 1.68 / 2.02
WER 2.2% | Pass 86.7%
1.62 / 2.10
raw 1.85 / 2.39
WER 2.3% | Pass 87.8%
1.46 / 1.88
raw 1.76 / 2.26
WER 3.5% | Pass 83.3%
1.69 / 2.08
raw 1.93 / 2.38
WER 1.4% | Pass 87.5%
1.41 / 1.48
raw 1.54 / 1.61
WER 1.3% | Pass 91.7%
1.33 / 1.55
raw 1.62 / 1.88
WER 2.1% | Pass 82.4%
1.08 / 1.12
raw 1.50 / 1.56
WER 2.9% | Pass 71.9%
1.23 / 1.29
raw 1.41 / 1.47
WER 2.2% | Pass 87.5%
1.20 / 1.33
raw 1.57 / 1.73
WER 1.4% | Pass 76.7%
1.27 / 1.54
raw 1.58 / 1.91
WER 2.4% | Pass 80.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.04 / 1.04
raw 1.04 / 1.04
WER nan% | Pass 100.0%
1.06 / 1.13
raw 1.06 / 1.13
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Parler-TTS Large
1.38 / 1.52
raw 1.72 / 1.90
WER 4.4% | Pass 80.2%
1.61 / 1.71
raw 2.02 / 2.14
WER 3.8% | Pass 79.7%
1.49 / 1.68
raw 1.84 / 2.08
WER 4.0% | Pass 81.0%
1.45 / 1.61
raw 1.82 / 2.03
WER 4.5% | Pass 79.3%
1.65 / 1.85
raw 1.92 / 2.15
WER 2.5% | Pass 85.9%
1.74 / 1.88
raw 2.12 / 2.29
WER 2.6% | Pass 81.9%
1.64 / 1.76
raw 1.90 / 2.04
WER 2.7% | Pass 86.1%
0.85 / 0.88
raw 1.61 / 1.67
WER 6.9% | Pass 52.8%
1.35 / 1.55
raw 1.97 / 2.25
WER 6.4% | Pass 68.8%
1.15 / 1.39
raw 1.75 / 2.12
WER 10.5% | Pass 65.6%
0.93 / 1.03
raw 1.27 / 1.40
WER 4.4% | Pass 73.3%
0.94 / 1.01
raw 1.39 / 1.51
WER 5.6% | Pass 67.2%
1.15 / 1.15
raw 1.15 / 1.15
WER nan% | Pass 100.0%
1.92 / 2.29
raw 1.92 / 2.29
WER nan% | Pass 100.0%
1.06 / 1.12
raw 1.06 / 1.12
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Parler-TTS Mini
1.04 / 1.13
raw 1.55 / 1.69
WER 17.5% | Pass 67.1%
1.11 / 1.17
raw 1.65 / 1.73
WER 9.7% | Pass 67.7%
1.13 / 1.25
raw 1.68 / 1.87
WER 16.4% | Pass 67.2%
1.18 / 1.29
raw 1.71 / 1.87
WER 12.7% | Pass 69.1%
1.28 / 1.39
raw 1.73 / 1.88
WER 18.3% | Pass 74.0%
1.00 / 1.04
raw 1.67 / 1.74
WER 13.2% | Pass 59.7%
1.23 / 1.40
raw 1.64 / 1.86
WER 8.8% | Pass 75.0%
0.75 / 0.82
raw 1.42 / 1.56
WER 15.5% | Pass 52.8%
0.90 / 1.01
raw 1.69 / 1.91
WER 14.6% | Pass 53.1%
0.73 / 0.80
raw 1.47 / 1.59
WER 12.9% | Pass 50.0%
0.45 / 0.48
raw 1.13 / 1.20
WER 19.0% | Pass 40.0%
0.51 / 0.54
raw 1.19 / 1.26
WER 30.5% | Pass 42.6%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
2.33 / 3.04
raw 2.33 / 3.04
WER nan% | Pass 100.0%
1.12 / 1.25
raw 1.12 / 1.25
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Qwen3TTS-12Hz-1.7B-VD
2.16 / 3.12
raw 2.43 / 3.51
WER 1.1% | Pass 88.7%
2.29 / 3.30
raw 2.63 / 3.78
WER 0.8% | Pass 87.2%
2.23 / 3.27
raw 2.51 / 3.67
WER 1.1% | Pass 89.0%
2.27 / 3.35
raw 2.59 / 3.82
WER 1.1% | Pass 87.5%
2.33 / 3.38
raw 2.50 / 3.62
WER 0.5% | Pass 93.2%
2.69 / 4.05
raw 2.85 / 4.29
WER 0.4% | Pass 94.4%
2.25 / 3.26
raw 2.65 / 3.85
WER 2.6% | Pass 84.7%
1.64 / 2.06
raw 1.97 / 2.47
WER 1.0% | Pass 83.3%
1.98 / 2.69
raw 2.19 / 2.97
WER 0.6% | Pass 90.6%
1.85 / 2.67
raw 2.47 / 3.56
WER 1.6% | Pass 75.0%
1.89 / 2.63
raw 2.47 / 3.43
WER 1.2% | Pass 76.7%
1.89 / 2.66
raw 2.35 / 3.31
WER 1.6% | Pass 80.3%
1.20 / 1.40
raw 1.20 / 1.40
WER nan% | Pass 100.0%
2.67 / 4.25
raw 2.67 / 4.25
WER nan% | Pass 100.0%
2.03 / 2.97
raw 2.03 / 2.97
WER nan% | Pass 100.0%
1.69 / 2.34
raw 1.69 / 2.34
WER nan% | Pass 100.0%
Gemini 2.5-Flash
2.09 / 2.95
raw 2.48 / 3.51
WER 1.6% | Pass 84.3%
2.09 / 2.79
raw 2.47 / 3.30
WER 1.5% | Pass 84.4%
2.11 / 3.04
raw 2.58 / 3.71
WER 1.7% | Pass 81.9%
2.02 / 2.85
raw 2.55 / 3.60
WER 2.0% | Pass 79.3%
2.21 / 3.19
raw 2.56 / 3.68
WER 1.3% | Pass 86.5%
2.44 / 3.52
raw 2.75 / 3.96
WER 0.9% | Pass 88.9%
2.25 / 3.05
raw 2.61 / 3.54
WER 1.5% | Pass 86.1%
2.01 / 2.86
raw 2.50 / 3.56
WER 1.5% | Pass 80.6%
1.73 / 2.22
raw 2.22 / 2.84
WER 2.2% | Pass 78.1%
1.48 / 1.98
raw 2.16 / 2.88
WER 2.5% | Pass 68.8%
1.89 / 2.56
raw 2.47 / 3.33
WER 1.5% | Pass 76.7%
1.88 / 2.63
raw 2.36 / 3.31
WER 1.6% | Pass 79.3%
1.55 / 1.75
raw 1.55 / 1.75
WER nan% | Pass 100.0%
2.58 / 4.12
raw 2.58 / 4.12
WER nan% | Pass 100.0%
2.41 / 3.66
raw 2.41 / 3.66
WER nan% | Pass 100.0%
2.28 / 3.41
raw 2.28 / 3.41
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.13 / 2.93
raw 2.50 / 3.43
WER 1.8% | Pass 85.3%
2.20 / 2.85
raw 2.55 / 3.30
WER 1.5% | Pass 86.4%
2.22 / 3.11
raw 2.61 / 3.65
WER 1.7% | Pass 85.2%
2.12 / 2.91
raw 2.55 / 3.51
WER 1.8% | Pass 82.9%
2.29 / 3.09
raw 2.60 / 3.51
WER 1.1% | Pass 88.1%
2.44 / 3.43
raw 2.79 / 3.92
WER 1.1% | Pass 87.5%
2.20 / 2.93
raw 2.47 / 3.29
WER 1.0% | Pass 88.9%
2.01 / 2.82
raw 2.50 / 3.50
WER 2.6% | Pass 80.6%
1.85 / 2.46
raw 2.47 / 3.28
WER 1.6% | Pass 75.0%
2.11 / 2.84
raw 2.59 / 3.50
WER 1.8% | Pass 81.2%
1.44 / 2.12
raw 2.53 / 3.73
WER 6.1% | Pass 56.7%
1.78 / 2.31
raw 2.21 / 2.88
WER 1.4% | Pass 80.2%
1.70 / 2.15
raw 1.70 / 2.15
WER nan% | Pass 100.0%
2.71 / 4.21
raw 2.71 / 4.21
WER nan% | Pass 100.0%
2.41 / 3.53
raw 2.41 / 3.53
WER nan% | Pass 100.0%
2.31 / 3.62
raw 2.31 / 3.62
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.95 / 2.68
raw 2.09 / 2.89
WER 0.7% | Pass 93.0%
2.10 / 2.85
raw 2.27 / 3.08
WER 0.4% | Pass 92.4%
2.06 / 2.92
raw 2.21 / 3.12
WER 0.6% | Pass 93.4%
2.12 / 2.99
raw 2.29 / 3.22
WER 0.8% | Pass 92.7%
2.02 / 2.84
raw 2.09 / 2.95
WER 0.3% | Pass 96.4%
2.44 / 3.66
raw 2.58 / 3.88
WER 0.3% | Pass 94.4%
2.16 / 2.91
raw 2.32 / 3.12
WER 0.4% | Pass 93.1%
1.91 / 2.44
raw 2.08 / 2.67
WER 0.5% | Pass 91.7%
1.82 / 2.40
raw 2.16 / 2.84
WER 0.7% | Pass 84.4%
1.64 / 2.13
raw 1.88 / 2.44
WER 0.8% | Pass 87.5%
1.72 / 2.44
raw 2.07 / 2.93
WER 0.9% | Pass 83.3%
1.80 / 2.42
raw 2.05 / 2.76
WER 1.1% | Pass 87.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.67 / 2.33
raw 1.67 / 2.33
WER nan% | Pass 100.0%
1.66 / 2.22
raw 1.66 / 2.22
WER nan% | Pass 100.0%
1.06 / 1.12
raw 1.06 / 1.12
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
1.93 / 2.63
raw 2.27 / 3.09
WER 3.1% | Pass 85.2%
1.98 / 2.60
raw 2.41 / 3.16
WER 3.2% | Pass 82.4%
2.00 / 2.79
raw 2.36 / 3.30
WER 3.7% | Pass 84.7%
2.00 / 2.70
raw 2.38 / 3.20
WER 3.4% | Pass 84.3%
1.92 / 2.66
raw 2.24 / 3.09
WER 2.1% | Pass 86.1%
2.14 / 2.90
raw 2.47 / 3.34
WER 1.9% | Pass 86.8%
2.27 / 3.15
raw 2.52 / 3.49
WER 3.9% | Pass 90.1%
1.46 / 2.01
raw 2.23 / 3.06
WER 11.8% | Pass 65.7%
1.90 / 2.56
raw 2.34 / 3.16
WER 1.4% | Pass 81.2%
1.96 / 2.77
raw 2.41 / 3.41
WER 1.4% | Pass 81.2%
1.59 / 2.10
raw 2.17 / 2.87
WER 3.4% | Pass 73.3%
1.77 / 2.41
raw 2.19 / 2.98
WER 2.7% | Pass 80.8%
1.30 / 1.50
raw 1.30 / 1.50
WER nan% | Pass 100.0%
2.33 / 3.67
raw 2.33 / 3.67
WER nan% | Pass 100.0%
1.97 / 2.70
raw 1.97 / 2.70
WER nan% | Pass 100.0%
1.81 / 2.34
raw 1.81 / 2.34
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
1.95 / 2.62
raw 2.10 / 2.83
WER 0.7% | Pass 92.8%
2.17 / 2.88
raw 2.35 / 3.12
WER 0.6% | Pass 92.1%
2.07 / 2.86
raw 2.19 / 3.03
WER 0.5% | Pass 94.3%
2.08 / 2.84
raw 2.23 / 3.04
WER 0.8% | Pass 93.2%
2.05 / 2.86
raw 2.18 / 3.04
WER 0.5% | Pass 94.2%
2.49 / 3.47
raw 2.56 / 3.56
WER 0.1% | Pass 97.4%
2.18 / 2.99
raw 2.35 / 3.21
WER 0.5% | Pass 93.1%
1.76 / 2.20
raw 1.86 / 2.33
WER 0.6% | Pass 94.4%
1.95 / 2.52
raw 2.16 / 2.78
WER 0.7% | Pass 90.6%
1.80 / 2.38
raw 2.06 / 2.72
WER 0.9% | Pass 87.5%
1.91 / 2.80
raw 2.52 / 3.69
WER 1.2% | Pass 75.9%
1.73 / 2.25
raw 2.02 / 2.62
WER 1.1% | Pass 85.7%
1.05 / 1.05
raw 1.05 / 1.05
WER nan% | Pass 100.0%
1.75 / 2.50
raw 1.75 / 2.50
WER nan% | Pass 100.0%
1.25 / 1.44
raw 1.25 / 1.44
WER nan% | Pass 100.0%
1.06 / 1.12
raw 1.06 / 1.12
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
1.98 / 2.61
raw 2.29 / 3.02
WER 1.7% | Pass 86.3%
2.23 / 2.78
raw 2.52 / 3.14
WER 1.5% | Pass 88.5%
1.96 / 2.58
raw 2.25 / 2.96
WER 1.6% | Pass 87.1%
2.23 / 2.84
raw 2.41 / 3.07
WER 1.2% | Pass 92.5%
2.05 / 2.52
raw 2.25 / 2.77
WER 1.3% | Pass 91.0%
2.30 / 3.02
raw 2.57 / 3.37
WER 1.4% | Pass 89.5%
2.11 / 2.89
raw 2.50 / 3.42
WER 1.8% | Pass 84.5%
1.34 / 1.57
raw 1.65 / 1.94
WER 2.1% | Pass 81.0%
1.89 / 2.48
raw 2.26 / 2.97
WER 1.9% | Pass 83.5%
1.73 / 2.30
raw 2.16 / 2.88
WER 2.2% | Pass 80.0%
1.87 / 2.52
raw 2.20 / 2.97
WER 1.8% | Pass 85.0%
2.09 / 2.98
raw 2.49 / 3.55
WER 1.9% | Pass 84.0%
0.87 / 0.95
raw 1.17 / 1.28
WER 3.1% | Pass 74.5%
1.90 / 3.01
raw 2.43 / 3.86
WER 2.5% | Pass 78.0%
1.14 / 1.42
raw 1.38 / 1.72
WER 2.0% | Pass 82.5%
2.12 / 3.22
raw 2.48 / 3.77
WER 1.8% | Pass 85.5%
Ming-omni-tts-0.5B
1.87 / 2.56
raw 2.02 / 2.77
WER 0.6% | Pass 92.5%
2.04 / 2.81
raw 2.23 / 3.07
WER 0.5% | Pass 91.4%
2.01 / 2.79
raw 2.14 / 2.97
WER 0.5% | Pass 93.9%
1.95 / 2.80
raw 2.19 / 3.15
WER 0.9% | Pass 89.0%
1.96 / 2.76
raw 2.03 / 2.86
WER 0.2% | Pass 96.8%
2.39 / 3.45
raw 2.50 / 3.60
WER 0.3% | Pass 95.7%
2.24 / 2.98
raw 2.34 / 3.11
WER 0.3% | Pass 95.8%
1.68 / 2.07
raw 1.78 / 2.19
WER 0.5% | Pass 94.4%
1.72 / 2.08
raw 1.97 / 2.38
WER 0.5% | Pass 87.5%
1.65 / 2.14
raw 1.97 / 2.55
WER 1.1% | Pass 83.9%
1.56 / 2.16
raw 1.97 / 2.72
WER 1.2% | Pass 79.3%
1.82 / 2.47
raw 2.07 / 2.81
WER 1.1% | Pass 87.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.55 / 2.00
raw 1.55 / 2.00
WER nan% | Pass 100.0%
1.35 / 1.68
raw 1.35 / 1.68
WER nan% | Pass 100.0%
1.07 / 1.13
raw 1.07 / 1.13
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
1.75 / 2.22
raw 2.12 / 2.69
WER 1.9% | Pass 82.6%
1.78 / 2.17
raw 2.31 / 2.82
WER 2.4% | Pass 76.8%
1.83 / 2.35
raw 2.20 / 2.83
WER 1.8% | Pass 83.1%
1.86 / 2.37
raw 2.35 / 2.99
WER 2.1% | Pass 79.3%
1.78 / 2.25
raw 2.20 / 2.78
WER 2.2% | Pass 80.8%
2.14 / 2.80
raw 2.44 / 3.19
WER 1.6% | Pass 87.5%
1.74 / 2.15
raw 2.24 / 2.76
WER 1.5% | Pass 77.8%
1.62 / 1.97
raw 1.94 / 2.36
WER 1.0% | Pass 83.3%
1.33 / 1.46
raw 1.58 / 1.74
WER 1.0% | Pass 83.9%
1.68 / 2.11
raw 2.06 / 2.59
WER 4.3% | Pass 81.2%
1.76 / 2.43
raw 2.20 / 3.03
WER 1.1% | Pass 80.0%
1.51 / 1.90
raw 2.00 / 2.52
WER 1.8% | Pass 75.4%
1.40 / 1.65
raw 1.40 / 1.65
WER nan% | Pass 100.0%
2.33 / 3.21
raw 2.33 / 3.21
WER nan% | Pass 100.0%
1.72 / 2.34
raw 1.72 / 2.34
WER nan% | Pass 100.0%
1.22 / 1.41
raw 1.22 / 1.41
WER nan% | Pass 100.0%
VoiceSculptor
1.59 / 2.01
raw 1.96 / 2.47
WER 9.2% | Pass 81.1%
1.55 / 1.84
raw 2.02 / 2.40
WER 11.3% | Pass 76.8%
1.62 / 2.09
raw 2.01 / 2.59
WER 9.4% | Pass 80.7%
1.63 / 2.13
raw 2.05 / 2.67
WER 4.0% | Pass 79.7%
1.41 / 1.80
raw 2.03 / 2.59
WER 21.4% | Pass 69.4%
1.87 / 2.33
raw 2.17 / 2.71
WER 8.0% | Pass 86.1%
1.69 / 2.08
raw 1.93 / 2.38
WER 5.3% | Pass 87.5%
1.60 / 1.82
raw 1.86 / 2.11
WER 2.0% | Pass 86.1%
1.78 / 2.02
raw 2.03 / 2.31
WER 3.6% | Pass 87.5%
1.25 / 1.54
raw 1.91 / 2.34
WER 20.8% | Pass 65.6%
1.42 / 1.78
raw 1.93 / 2.43
WER 1.5% | Pass 73.3%
1.59 / 2.00
raw 1.95 / 2.47
WER 4.0% | Pass 81.1%
1.25 / 1.35
raw 1.25 / 1.35
WER nan% | Pass 100.0%
2.46 / 3.46
raw 2.46 / 3.46
WER nan% | Pass 100.0%
1.75 / 2.38
raw 1.75 / 2.38
WER nan% | Pass 100.0%
1.09 / 1.16
raw 1.09 / 1.16
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.56 / 1.82
raw 1.75 / 2.04
WER 1.1% | Pass 89.5%
1.58 / 1.72
raw 1.81 / 1.97
WER 1.2% | Pass 87.3%
1.63 / 1.92
raw 1.82 / 2.15
WER 0.9% | Pass 89.5%
1.53 / 1.75
raw 1.74 / 1.99
WER 1.3% | Pass 87.9%
1.76 / 2.01
raw 1.88 / 2.14
WER 0.4% | Pass 93.7%
1.77 / 2.06
raw 2.00 / 2.32
WER 0.6% | Pass 90.3%
1.59 / 1.86
raw 1.76 / 2.06
WER 0.7% | Pass 88.9%
1.78 / 2.25
raw 2.00 / 2.53
WER 0.7% | Pass 87.5%
1.20 / 1.23
raw 1.38 / 1.41
WER 0.7% | Pass 87.5%
1.39 / 1.68
raw 1.94 / 2.34
WER 1.9% | Pass 71.9%
1.20 / 1.31
raw 1.50 / 1.63
WER 1.3% | Pass 80.0%
1.33 / 1.54
raw 1.59 / 1.84
WER 1.5% | Pass 83.6%
1.11 / 1.11
raw 1.11 / 1.11
WER nan% | Pass 100.0%
1.96 / 2.71
raw 1.96 / 2.71
WER nan% | Pass 100.0%
1.62 / 2.06
raw 1.62 / 2.06
WER nan% | Pass 100.0%
1.47 / 1.91
raw 1.47 / 1.91
WER nan% | Pass 100.0%
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Gemini 2.5-Flash
2.60 / 3.96
raw 2.65 / 4.02
WER 1.6% | Pass 98.4%
2.59 / 3.87
raw 2.66 / 3.97
WER 1.7% | Pass 97.4%
2.71 / 4.17
raw 2.76 / 4.24
WER 1.2% | Pass 98.4%
2.55 / 3.88
raw 2.66 / 4.04
WER 2.1% | Pass 96.0%
2.70 / 4.00
raw 2.70 / 4.00
WER 1.0% | Pass 100.0%
2.86 / 4.55
raw 2.86 / 4.55
WER 1.0% | Pass 100.0%
2.75 / 4.27
raw 2.90 / 4.50
WER 1.6% | Pass 95.0%
2.64 / 3.91
raw 2.64 / 3.91
WER 1.2% | Pass 100.0%
2.44 / 3.56
raw 2.44 / 3.56
WER 0.0% | Pass 100.0%
2.56 / 3.89
raw 2.56 / 3.89
WER 0.9% | Pass 100.0%
3.00 / 5.00
raw 3.00 / 5.00
WER 0.0% | Pass 100.0%
2.57 / 3.97
raw 2.65 / 4.09
WER 2.4% | Pass 97.1%
1.50 / 1.83
raw 1.50 / 1.83
WER 1.6% | Pass 100.0%
2.71 / 4.43
raw 2.71 / 4.43
WER 0.0% | Pass 100.0%
2.30 / 3.30
raw 2.30 / 3.30
WER nan% | Pass 100.0%
2.30 / 3.30
raw 2.30 / 3.30
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.49 / 3.72
raw 2.58 / 3.84
WER 1.8% | Pass 96.8%
2.50 / 3.62
raw 2.60 / 3.77
WER 1.8% | Pass 96.1%
2.58 / 3.87
raw 2.69 / 4.02
WER 1.7% | Pass 96.1%
2.47 / 3.62
raw 2.57 / 3.78
WER 2.6% | Pass 95.9%
2.43 / 3.58
raw 2.53 / 3.73
WER 2.0% | Pass 95.9%
2.77 / 4.14
raw 2.77 / 4.14
WER 0.7% | Pass 100.0%
2.68 / 3.99
raw 2.81 / 4.19
WER 1.2% | Pass 95.2%
2.73 / 4.18
raw 2.73 / 4.18
WER 1.1% | Pass 100.0%
2.67 / 4.00
raw 2.67 / 4.00
WER 1.2% | Pass 100.0%
2.34 / 3.42
raw 2.60 / 3.80
WER 2.0% | Pass 90.0%
3.00 / 4.78
raw 3.00 / 4.78
WER 0.0% | Pass 100.0%
2.40 / 3.60
raw 2.54 / 3.81
WER 2.5% | Pass 94.6%
1.50 / 1.83
raw 1.50 / 1.83
WER nan% | Pass 100.0%
2.14 / 3.29
raw 2.14 / 3.29
WER nan% | Pass 100.0%
2.40 / 3.60
raw 2.40 / 3.60
WER nan% | Pass 100.0%
2.50 / 3.90
raw 2.50 / 3.90
WER nan% | Pass 100.0%
Qwen3TTS-12Hz-1.7B-VD
2.47 / 3.57
raw 2.59 / 3.75
WER 2.8% | Pass 95.2%
2.60 / 3.71
raw 2.74 / 3.91
WER 3.6% | Pass 94.9%
2.53 / 3.74
raw 2.70 / 3.98
WER 2.3% | Pass 93.8%
2.61 / 3.81
raw 2.78 / 4.06
WER 3.0% | Pass 93.9%
2.44 / 3.45
raw 2.65 / 3.76
WER 4.7% | Pass 91.8%
2.95 / 4.50
raw 2.95 / 4.50
WER 1.6% | Pass 100.0%
2.52 / 3.64
raw 2.77 / 4.00
WER 2.6% | Pass 90.9%
1.90 / 2.56
raw 2.09 / 2.82
WER 3.3% | Pass 90.9%
2.80 / 4.20
raw 2.80 / 4.20
WER 0.5% | Pass 100.0%
2.80 / 4.30
raw 2.80 / 4.30
WER 0.0% | Pass 100.0%
2.56 / 3.89
raw 2.56 / 3.89
WER 0.0% | Pass 100.0%
2.48 / 3.53
raw 2.62 / 3.73
WER 2.7% | Pass 94.6%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
2.00 / 3.00
raw 2.00 / 3.00
WER nan% | Pass 100.0%
2.10 / 3.10
raw 2.10 / 3.10
WER nan% | Pass 100.0%
1.80 / 2.40
raw 1.80 / 2.40
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.41 / 3.52
raw 2.52 / 3.68
WER 2.5% | Pass 95.6%
2.70 / 3.78
raw 2.76 / 3.88
WER 2.4% | Pass 97.5%
2.42 / 3.51
raw 2.50 / 3.64
WER 2.5% | Pass 96.6%
2.65 / 3.81
raw 2.65 / 3.81
WER 2.1% | Pass 100.0%
2.57 / 3.65
raw 2.58 / 3.67
WER 2.2% | Pass 99.4%
2.69 / 3.99
raw 2.74 / 4.05
WER 2.3% | Pass 98.5%
2.40 / 3.45
raw 2.62 / 3.76
WER 2.8% | Pass 91.8%
1.60 / 2.00
raw 1.82 / 2.27
WER 2.9% | Pass 88.0%
2.34 / 3.41
raw 2.60 / 3.80
WER 2.8% | Pass 89.9%
2.39 / 3.44
raw 2.78 / 4.00
WER 3.0% | Pass 86.0%
2.71 / 4.37
raw 2.89 / 4.67
WER 2.6% | Pass 93.7%
2.62 / 4.12
raw 2.82 / 4.44
WER 2.7% | Pass 92.7%
0.81 / 0.81
raw 1.00 / 1.00
WER 3.4% | Pass 81.3%
1.56 / 2.28
raw 1.86 / 2.71
WER 3.1% | Pass 84.1%
1.45 / 2.00
raw 1.60 / 2.20
WER 2.9% | Pass 90.8%
2.29 / 3.40
raw 2.50 / 3.70
WER 2.8% | Pass 91.8%
MiniMax-Speech-2.7
2.05 / 2.74
raw 2.16 / 2.89
WER 3.6% | Pass 94.8%
2.26 / 2.95
raw 2.48 / 3.25
WER 4.4% | Pass 90.9%
2.10 / 2.89
raw 2.25 / 3.10
WER 3.8% | Pass 93.1%
2.18 / 2.97
raw 2.33 / 3.16
WER 3.2% | Pass 93.9%
2.08 / 2.76
raw 2.27 / 3.00
WER 4.1% | Pass 91.8%
2.59 / 3.68
raw 2.59 / 3.68
WER 2.0% | Pass 100.0%
2.27 / 3.18
raw 2.50 / 3.50
WER 3.7% | Pass 90.9%
1.41 / 1.56
raw 1.73 / 1.91
WER 10.0% | Pass 81.8%
2.40 / 3.20
raw 2.40 / 3.20
WER 1.4% | Pass 100.0%
1.80 / 2.25
raw 2.00 / 2.50
WER 3.8% | Pass 90.0%
2.89 / 4.33
raw 2.89 / 4.33
WER 1.8% | Pass 100.0%
2.16 / 2.97
raw 2.22 / 3.05
WER 3.4% | Pass 97.3%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.30 / 1.50
raw 1.30 / 1.50
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.53 / 1.77
raw 1.59 / 1.83
WER 2.1% | Pass 96.5%
1.87 / 2.14
raw 1.90 / 2.17
WER 1.6% | Pass 98.7%
1.46 / 1.69
raw 1.53 / 1.77
WER 2.1% | Pass 95.4%
1.73 / 2.17
raw 1.80 / 2.26
WER 2.4% | Pass 96.0%
1.69 / 1.96
raw 1.76 / 2.04
WER 2.3% | Pass 96.0%
1.82 / 2.14
raw 1.82 / 2.14
WER 1.1% | Pass 100.0%
1.43 / 1.52
raw 1.50 / 1.59
WER 2.0% | Pass 95.5%
1.64 / 1.82
raw 1.64 / 1.82
WER 1.9% | Pass 100.0%
1.26 / 1.26
raw 1.40 / 1.40
WER 2.5% | Pass 90.0%
1.20 / 1.20
raw 1.20 / 1.20
WER 0.0% | Pass 100.0%
2.00 / 2.33
raw 2.00 / 2.33
WER 0.0% | Pass 100.0%
1.32 / 1.56
raw 1.43 / 1.70
WER 3.1% | Pass 91.9%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.32 / 1.39
raw 1.63 / 1.71
WER 10.7% | Pass 81.3%
1.49 / 1.52
raw 1.88 / 1.92
WER 9.6% | Pass 79.2%
1.25 / 1.32
raw 1.64 / 1.74
WER 11.3% | Pass 76.2%
1.51 / 1.61
raw 2.04 / 2.18
WER 12.7% | Pass 74.0%
1.27 / 1.34
raw 1.82 / 1.92
WER 12.9% | Pass 70.0%
1.91 / 2.04
raw 2.00 / 2.14
WER 4.0% | Pass 95.5%
0.98 / 0.98
raw 1.48 / 1.48
WER 11.6% | Pass 66.7%
1.18 / 1.18
raw 1.18 / 1.18
WER 8.5% | Pass 100.0%
1.30 / 1.30
raw 1.30 / 1.30
WER 6.1% | Pass 100.0%
0.84 / 0.84
raw 1.20 / 1.20
WER 14.7% | Pass 70.0%
0.86 / 0.93
raw 1.56 / 1.67
WER 13.5% | Pass 55.6%
1.16 / 1.18
raw 1.30 / 1.32
WER 8.9% | Pass 89.2%
1.40 / 1.40
raw 1.40 / 1.40
WER nan% | Pass 100.0%
1.29 / 1.57
raw 1.29 / 1.57
WER nan% | Pass 100.0%
1.40 / 1.60
raw 1.40 / 1.60
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.61 / 0.67
raw 1.87 / 2.04
WER 41.7% | Pass 32.7%
0.48 / 0.50
raw 2.26 / 2.37
WER 45.4% | Pass 21.1%
0.42 / 0.48
raw 1.96 / 2.22
WER 41.9% | Pass 21.5%
0.50 / 0.56
raw 2.10 / 2.32
WER 51.2% | Pass 24.0%
0.45 / 0.50
raw 2.26 / 2.48
WER 41.8% | Pass 20.0%
0.37 / 0.43
raw 2.45 / 2.90
WER 36.4% | Pass 15.0%
0.46 / 0.50
raw 2.05 / 2.18
WER 42.4% | Pass 22.7%
0.50 / 0.50
raw 1.82 / 1.82
WER 32.7% | Pass 27.3%
0.12 / 0.12
raw 1.20 / 1.20
WER 39.1% | Pass 10.0%
0.42 / 0.45
raw 1.40 / 1.50
WER 40.9% | Pass 30.0%
0.30 / 0.32
raw 1.33 / 1.44
WER 38.3% | Pass 22.2%
0.39 / 0.41
raw 1.42 / 1.47
WER 35.5% | Pass 27.8%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
2.57 / 2.86
raw 2.57 / 2.86
WER nan% | Pass 100.0%
1.40 / 1.60
raw 1.40 / 1.60
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.27 / 0.28
raw 1.72 / 1.80
WER 74.1% | Pass 15.8%
0.08 / 0.08
raw 2.09 / 2.13
WER 67.5% | Pass 3.9%
0.03 / 0.03
raw 1.71 / 1.81
WER 69.3% | Pass 1.6%
0.12 / 0.13
raw 1.96 / 2.06
WER 67.0% | Pass 6.1%
0.07 / 0.08
raw 1.86 / 1.88
WER 69.8% | Pass 4.0%
0.00 / 0.00
raw 2.18 / 2.36
WER 82.4% | Pass 0.0%
0.00 / 0.00
raw 1.95 / 2.00
WER 72.3% | Pass 0.0%
0.00 / 0.00
raw 1.45 / 1.55
WER 53.4% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 56.9% | Pass 0.0%
0.00 / 0.00
raw 1.40 / 1.40
WER 64.5% | Pass 0.0%
0.00 / 0.00
raw 1.22 / 1.33
WER 73.8% | Pass 0.0%
0.04 / 0.05
raw 1.41 / 1.47
WER 103.1% | Pass 3.1%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
1.71 / 1.71
raw 1.71 / 1.71
WER nan% | Pass 100.0%
2.00 / 2.30
raw 2.00 / 2.30
WER nan% | Pass 100.0%
1.30 / 1.30
raw 1.30 / 1.30
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.25 / 0.26
raw 1.40 / 1.42
WER 53.4% | Pass 18.1%
0.13 / 0.13
raw 1.71 / 1.73
WER 55.1% | Pass 7.8%
0.11 / 0.11
raw 1.38 / 1.41
WER 53.1% | Pass 7.9%
0.07 / 0.07
raw 1.68 / 1.74
WER 62.1% | Pass 4.0%
0.20 / 0.20
raw 1.65 / 1.67
WER 48.1% | Pass 12.2%
0.24 / 0.24
raw 1.73 / 1.73
WER 52.8% | Pass 13.6%
0.00 / 0.00
raw 1.36 / 1.36
WER 58.0% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 46.0% | Pass 0.0%
0.14 / 0.14
raw 1.00 / 1.00
WER 44.7% | Pass 14.3%
0.12 / 0.12
raw 1.20 / 1.20
WER 49.3% | Pass 10.0%
0.00 / 0.00
raw 1.11 / 1.11
WER 50.3% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 52.1% | Pass 0.0%
1.33 / 1.33
raw 1.33 / 1.33
WER nan% | Pass 100.0%
1.43 / 1.43
raw 1.43 / 1.43
WER nan% | Pass 100.0%
1.22 / 1.22
raw 1.22 / 1.22
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Gemini 2.5-Flash
2.51 / 3.70
raw 2.63 / 3.89
WER 1.9% | Pass 95.3%
2.40 / 3.36
raw 2.53 / 3.54
WER 1.8% | Pass 94.9%
2.54 / 3.80
raw 2.75 / 4.12
WER 2.2% | Pass 92.2%
2.45 / 3.49
raw 2.61 / 3.71
WER 1.7% | Pass 93.9%
2.69 / 3.98
raw 2.74 / 4.06
WER 1.0% | Pass 98.0%
2.52 / 3.72
raw 2.77 / 4.09
WER 3.5% | Pass 90.9%
2.32 / 3.42
raw 2.68 / 3.95
WER 3.2% | Pass 86.4%
3.00 / 5.00
raw 3.00 / 5.00
WER 1.9% | Pass 100.0%
2.34 / 3.24
raw 2.60 / 3.60
WER 3.3% | Pass 90.0%
2.16 / 2.88
raw 2.40 / 3.20
WER 1.9% | Pass 90.0%
2.78 / 4.11
raw 2.78 / 4.11
WER 1.1% | Pass 100.0%
2.47 / 3.68
raw 2.54 / 3.78
WER 2.4% | Pass 97.3%
1.50 / 1.83
raw 1.50 / 1.83
WER nan% | Pass 100.0%
3.00 / 4.71
raw 3.00 / 4.71
WER nan% | Pass 100.0%
2.60 / 4.20
raw 2.60 / 4.20
WER nan% | Pass 100.0%
2.30 / 3.50
raw 2.30 / 3.50
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.42 / 3.49
raw 2.56 / 3.67
WER 1.9% | Pass 94.9%
2.45 / 3.40
raw 2.62 / 3.63
WER 1.6% | Pass 93.6%
2.41 / 3.49
raw 2.59 / 3.75
WER 1.6% | Pass 93.1%
2.42 / 3.53
raw 2.52 / 3.68
WER 1.6% | Pass 96.0%
2.53 / 3.43
raw 2.58 / 3.50
WER 1.0% | Pass 98.0%
2.43 / 3.53
raw 2.82 / 4.09
WER 4.0% | Pass 86.4%
2.65 / 3.86
raw 2.77 / 4.05
WER 1.7% | Pass 95.5%
2.73 / 4.18
raw 2.73 / 4.18
WER 1.1% | Pass 100.0%
2.07 / 2.61
raw 2.30 / 2.90
WER 3.4% | Pass 90.0%
1.76 / 2.48
raw 2.20 / 3.10
WER 3.4% | Pass 80.0%
2.37 / 3.56
raw 2.67 / 4.00
WER 3.0% | Pass 88.9%
2.20 / 2.99
raw 2.32 / 3.16
WER 2.4% | Pass 94.6%
1.67 / 2.00
raw 1.67 / 2.00
WER nan% | Pass 100.0%
2.71 / 4.43
raw 2.71 / 4.43
WER nan% | Pass 100.0%
2.80 / 4.60
raw 2.80 / 4.60
WER nan% | Pass 100.0%
2.90 / 4.70
raw 2.90 / 4.70
WER nan% | Pass 100.0%
Qwen3TTS-12Hz-1.7B-VD
2.29 / 3.29
raw 2.53 / 3.63
WER 2.5% | Pass 90.4%
2.45 / 3.40
raw 2.70 / 3.75
WER 2.2% | Pass 90.8%
2.30 / 3.38
raw 2.60 / 3.82
WER 2.6% | Pass 88.5%
2.28 / 3.20
raw 2.67 / 3.75
WER 2.5% | Pass 85.4%
2.70 / 3.98
raw 2.76 / 4.06
WER 0.6% | Pass 98.0%
2.39 / 3.49
raw 2.77 / 4.05
WER 4.1% | Pass 86.4%
2.32 / 3.30
raw 2.68 / 3.82
WER 2.4% | Pass 86.4%
1.90 / 2.64
raw 2.09 / 2.91
WER 5.7% | Pass 90.9%
1.68 / 2.32
raw 2.10 / 2.90
WER 4.4% | Pass 80.0%
2.00 / 3.04
raw 2.50 / 3.80
WER 2.6% | Pass 80.0%
1.64 / 2.07
raw 2.11 / 2.67
WER 3.7% | Pass 77.8%
2.52 / 3.69
raw 2.75 / 4.03
WER 2.6% | Pass 91.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
2.86 / 4.57
raw 2.86 / 4.57
WER nan% | Pass 100.0%
1.60 / 2.00
raw 1.60 / 2.00
WER nan% | Pass 100.0%
2.00 / 3.00
raw 2.00 / 3.00
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.34 / 3.25
raw 2.44 / 3.40
WER 1.7% | Pass 95.8%
2.58 / 3.46
raw 2.64 / 3.54
WER 1.6% | Pass 97.7%
2.35 / 3.35
raw 2.43 / 3.46
WER 1.7% | Pass 96.8%
2.60 / 3.62
raw 2.60 / 3.62
WER 1.4% | Pass 100.0%
2.50 / 3.31
raw 2.51 / 3.33
WER 1.5% | Pass 99.6%
2.52 / 3.45
raw 2.55 / 3.50
WER 1.6% | Pass 98.7%
2.71 / 4.23
raw 2.95 / 4.60
WER 1.9% | Pass 92.0%
1.60 / 2.08
raw 1.82 / 2.36
WER 2.0% | Pass 88.1%
1.98 / 2.52
raw 2.20 / 2.80
WER 1.9% | Pass 90.1%
1.72 / 2.20
raw 2.00 / 2.56
WER 2.0% | Pass 86.2%
2.40 / 3.55
raw 2.56 / 3.78
WER 1.8% | Pass 93.9%
2.62 / 3.85
raw 2.82 / 4.15
WER 1.8% | Pass 92.9%
0.81 / 0.81
raw 1.00 / 1.00
WER 2.3% | Pass 81.4%
1.81 / 2.65
raw 2.14 / 3.14
WER 2.1% | Pass 84.3%
1.18 / 1.37
raw 1.30 / 1.50
WER 2.0% | Pass 91.0%
2.02 / 2.94
raw 2.20 / 3.20
WER 1.9% | Pass 92.0%
MiniMax-Speech-2.7
2.10 / 2.87
raw 2.18 / 2.98
WER 1.6% | Pass 96.5%
2.32 / 3.16
raw 2.41 / 3.28
WER 1.5% | Pass 96.2%
2.15 / 3.03
raw 2.28 / 3.21
WER 1.6% | Pass 94.6%
2.16 / 2.95
raw 2.30 / 3.14
WER 1.4% | Pass 94.0%
2.30 / 3.04
raw 2.30 / 3.04
WER 0.3% | Pass 100.0%
2.26 / 3.17
raw 2.36 / 3.32
WER 1.7% | Pass 95.5%
2.52 / 3.77
raw 2.64 / 3.95
WER 2.0% | Pass 95.5%
1.55 / 1.73
raw 1.55 / 1.73
WER 1.9% | Pass 100.0%
2.70 / 4.10
raw 2.70 / 4.10
WER 2.3% | Pass 100.0%
1.84 / 2.56
raw 2.30 / 3.20
WER 3.4% | Pass 80.0%
2.17 / 3.16
raw 2.44 / 3.56
WER 2.4% | Pass 88.9%
2.24 / 3.10
raw 2.30 / 3.19
WER 2.6% | Pass 97.3%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.57 / 2.14
raw 1.57 / 2.14
WER nan% | Pass 100.0%
1.20 / 1.40
raw 1.20 / 1.40
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.49 / 1.68
raw 1.60 / 1.79
WER 2.1% | Pass 93.7%
1.62 / 1.76
raw 1.73 / 1.88
WER 1.8% | Pass 93.6%
1.51 / 1.67
raw 1.63 / 1.80
WER 1.7% | Pass 93.0%
1.54 / 1.73
raw 1.60 / 1.80
WER 1.1% | Pass 96.0%
1.80 / 2.11
raw 1.88 / 2.20
WER 1.1% | Pass 95.9%
1.73 / 1.96
raw 2.00 / 2.27
WER 3.8% | Pass 86.4%
1.49 / 1.57
raw 1.64 / 1.73
WER 2.1% | Pass 90.9%
1.27 / 1.27
raw 1.27 / 1.27
WER 2.1% | Pass 100.0%
1.53 / 1.53
raw 1.70 / 1.70
WER 2.7% | Pass 90.0%
1.08 / 1.08
raw 1.20 / 1.20
WER 1.1% | Pass 90.0%
1.38 / 1.73
raw 1.78 / 2.22
WER 3.7% | Pass 77.8%
1.17 / 1.22
raw 1.27 / 1.32
WER 3.2% | Pass 91.9%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.57
raw 1.29 / 1.57
WER nan% | Pass 100.0%
1.67 / 2.33
raw 1.67 / 2.33
WER nan% | Pass 100.0%
1.40 / 1.60
raw 1.40 / 1.60
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.28 / 1.30
raw 1.53 / 1.55
WER 5.5% | Pass 83.8%
1.62 / 1.65
raw 1.95 / 1.97
WER 5.2% | Pass 83.3%
1.28 / 1.32
raw 1.53 / 1.57
WER 4.5% | Pass 83.7%
1.71 / 1.76
raw 1.90 / 1.96
WER 2.8% | Pass 90.0%
1.22 / 1.23
raw 1.61 / 1.63
WER 7.6% | Pass 75.5%
1.45 / 1.52
raw 2.00 / 2.09
WER 7.5% | Pass 72.7%
1.49 / 1.49
raw 1.64 / 1.64
WER 3.7% | Pass 90.9%
1.19 / 1.19
raw 1.45 / 1.45
WER 5.1% | Pass 81.8%
0.96 / 0.96
raw 1.20 / 1.20
WER 4.7% | Pass 80.0%
0.99 / 0.99
raw 1.10 / 1.10
WER 1.5% | Pass 90.0%
0.95 / 0.95
raw 1.22 / 1.22
WER 6.4% | Pass 77.8%
0.96 / 0.96
raw 1.27 / 1.27
WER 7.4% | Pass 75.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.14 / 1.14
raw 1.14 / 1.14
WER nan% | Pass 100.0%
1.10 / 1.10
raw 1.10 / 1.10
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
1.09 / 1.20
raw 1.83 / 2.01
WER 15.5% | Pass 59.4%
1.07 / 1.11
raw 2.00 / 2.07
WER 16.1% | Pass 53.5%
1.00 / 1.09
raw 1.82 / 1.98
WER 15.8% | Pass 55.0%
0.84 / 0.93
raw 2.04 / 2.26
WER 18.9% | Pass 41.3%
1.09 / 1.18
raw 1.95 / 2.12
WER 16.6% | Pass 55.8%
0.84 / 0.86
raw 2.00 / 2.05
WER 18.9% | Pass 42.1%
1.26 / 1.30
raw 1.64 / 1.68
WER 7.5% | Pass 77.3%
0.80 / 0.90
raw 1.60 / 1.80
WER 19.9% | Pass 50.0%
0.90 / 0.95
raw 1.80 / 1.90
WER 19.1% | Pass 50.0%
1.40 / 1.47
raw 2.00 / 2.10
WER 12.9% | Pass 70.0%
0.59 / 0.59
raw 1.33 / 1.33
WER 17.8% | Pass 44.4%
0.92 / 1.01
raw 1.70 / 1.86
WER 11.0% | Pass 54.1%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
2.57 / 4.00
raw 2.57 / 4.00
WER nan% | Pass 100.0%
1.80 / 2.30
raw 1.80 / 2.30
WER nan% | Pass 100.0%
1.10 / 1.10
raw 1.10 / 1.10
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.34 / 0.35
raw 1.57 / 1.61
WER 44.2% | Pass 21.7%
0.27 / 0.27
raw 1.92 / 1.95
WER 46.9% | Pass 14.1%
0.14 / 0.14
raw 1.52 / 1.57
WER 46.7% | Pass 9.2%
0.27 / 0.28
raw 1.70 / 1.72
WER 48.8% | Pass 16.0%
0.27 / 0.28
raw 1.96 / 2.02
WER 44.5% | Pass 14.0%
0.15 / 0.16
raw 1.68 / 1.73
WER 46.5% | Pass 9.1%
0.00 / 0.00
raw 1.73 / 1.77
WER 44.8% | Pass 0.0%
0.25 / 0.25
raw 1.36 / 1.36
WER 30.6% | Pass 18.2%
0.10 / 0.10
raw 1.00 / 1.00
WER 47.8% | Pass 10.0%
0.12 / 0.12
raw 1.20 / 1.20
WER 60.8% | Pass 10.0%
0.00 / 0.00
raw 1.22 / 1.22
WER 30.4% | Pass 0.0%
0.04 / 0.04
raw 1.35 / 1.41
WER 37.6% | Pass 2.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
2.43 / 2.57
raw 2.43 / 2.57
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.30 / 0.31
raw 1.40 / 1.42
WER 39.7% | Pass 21.6%
0.28 / 0.28
raw 1.79 / 1.82
WER 36.3% | Pass 15.6%
0.11 / 0.11
raw 1.43 / 1.46
WER 43.6% | Pass 7.9%
0.21 / 0.22
raw 1.76 / 1.78
WER 42.5% | Pass 12.2%
0.18 / 0.18
raw 1.48 / 1.48
WER 38.9% | Pass 12.5%
0.27 / 0.29
raw 1.90 / 2.05
WER 47.9% | Pass 14.3%
0.14 / 0.14
raw 1.59 / 1.59
WER 35.7% | Pass 9.1%
0.00 / 0.00
raw 1.36 / 1.36
WER 34.3% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 45.8% | Pass 0.0%
0.10 / 0.10
raw 1.00 / 1.00
WER 46.6% | Pass 10.0%
0.11 / 0.11
raw 1.00 / 1.00
WER 31.2% | Pass 11.1%
0.06 / 0.06
raw 1.05 / 1.08
WER 35.2% | Pass 5.4%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.29
raw 1.29 / 1.29
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Gemini 2.5-Flash
2.46 / 3.70
raw 2.66 / 4.00
WER 4.3% | Pass 92.5%
2.45 / 3.57
raw 2.65 / 3.86
WER 4.1% | Pass 92.5%
2.50 / 3.83
raw 2.76 / 4.21
WER 4.2% | Pass 90.8%
2.39 / 3.53
raw 2.66 / 3.92
WER 5.3% | Pass 90.0%
2.61 / 3.88
raw 2.84 / 4.22
WER 3.8% | Pass 92.0%
2.27 / 3.43
raw 2.81 / 4.24
WER 5.6% | Pass 81.0%
2.91 / 4.73
raw 2.91 / 4.73
WER 2.1% | Pass 100.0%
2.23 / 3.06
raw 2.45 / 3.36
WER 3.2% | Pass 90.9%
2.60 / 3.70
raw 2.60 / 3.70
WER 3.7% | Pass 100.0%
1.92 / 2.72
raw 2.40 / 3.40
WER 8.0% | Pass 80.0%
2.89 / 4.56
raw 2.89 / 4.56
WER 1.7% | Pass 100.0%
2.26 / 3.43
raw 2.46 / 3.73
WER 4.5% | Pass 91.9%
1.67 / 2.33
raw 1.67 / 2.33
WER nan% | Pass 100.0%
3.00 / 5.00
raw 3.00 / 5.00
WER nan% | Pass 100.0%
2.40 / 3.80
raw 2.40 / 3.80
WER nan% | Pass 100.0%
2.50 / 3.90
raw 2.50 / 3.90
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.51 / 3.64
raw 2.65 / 3.86
WER 4.1% | Pass 94.4%
2.50 / 3.52
raw 2.68 / 3.77
WER 4.4% | Pass 93.5%
2.58 / 3.76
raw 2.72 / 3.98
WER 3.7% | Pass 94.6%
2.40 / 3.30
raw 2.67 / 3.67
WER 5.6% | Pass 89.8%
2.60 / 3.69
raw 2.71 / 3.85
WER 2.6% | Pass 95.8%
2.41 / 3.43
raw 2.81 / 4.00
WER 4.9% | Pass 85.7%
2.86 / 4.27
raw 2.86 / 4.27
WER 2.7% | Pass 100.0%
2.73 / 4.18
raw 2.73 / 4.18
WER 1.4% | Pass 100.0%
2.56 / 3.89
raw 2.56 / 3.89
WER 2.5% | Pass 100.0%
2.25 / 3.33
raw 2.50 / 3.70
WER 6.5% | Pass 90.0%
2.78 / 4.11
raw 2.78 / 4.11
WER 3.5% | Pass 100.0%
2.38 / 3.53
raw 2.59 / 3.84
WER 4.8% | Pass 91.9%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
3.00 / 5.00
raw 3.00 / 5.00
WER nan% | Pass 100.0%
2.50 / 3.70
raw 2.50 / 3.70
WER nan% | Pass 100.0%
2.60 / 4.20
raw 2.60 / 4.20
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.30 / 3.28
raw 2.46 / 3.51
WER 4.1% | Pass 93.6%
2.58 / 3.60
raw 2.70 / 3.77
WER 3.9% | Pass 95.5%
2.34 / 3.34
raw 2.47 / 3.53
WER 4.0% | Pass 94.5%
2.59 / 3.66
raw 2.64 / 3.72
WER 3.5% | Pass 98.3%
2.38 / 3.26
raw 2.45 / 3.35
WER 3.6% | Pass 97.3%
2.55 / 3.81
raw 2.65 / 3.95
WER 3.8% | Pass 96.4%
2.46 / 3.55
raw 2.74 / 3.95
WER 4.5% | Pass 89.9%
1.81 / 2.50
raw 2.10 / 2.90
WER 4.7% | Pass 86.1%
2.11 / 2.82
raw 2.40 / 3.20
WER 4.6% | Pass 88.0%
1.78 / 2.15
raw 2.11 / 2.56
WER 4.9% | Pass 84.2%
2.75 / 4.48
raw 3.00 / 4.89
WER 4.3% | Pass 91.7%
2.46 / 3.78
raw 2.71 / 4.16
WER 4.4% | Pass 90.8%
0.80 / 0.80
raw 1.00 / 1.00
WER 5.5% | Pass 79.6%
1.51 / 2.06
raw 1.83 / 2.50
WER 5.1% | Pass 82.4%
0.98 / 0.98
raw 1.10 / 1.10
WER 4.7% | Pass 88.9%
2.52 / 4.13
raw 2.80 / 4.60
WER 4.5% | Pass 89.9%
Qwen3TTS-12Hz-1.7B-VD
2.25 / 3.17
raw 2.49 / 3.50
WER 6.7% | Pass 90.5%
2.36 / 3.14
raw 2.74 / 3.64
WER 8.2% | Pass 86.2%
2.28 / 3.35
raw 2.55 / 3.75
WER 6.3% | Pass 89.4%
2.20 / 3.08
raw 2.56 / 3.58
WER 9.9% | Pass 86.0%
2.27 / 3.29
raw 2.70 / 3.92
WER 9.1% | Pass 84.0%
2.23 / 3.24
raw 2.73 / 3.95
WER 9.3% | Pass 81.8%
2.78 / 4.04
raw 2.91 / 4.23
WER 2.7% | Pass 95.5%
1.91 / 2.27
raw 1.91 / 2.27
WER 0.3% | Pass 100.0%
1.98 / 2.70
raw 2.20 / 3.00
WER 5.2% | Pass 90.0%
2.25 / 3.06
raw 2.50 / 3.40
WER 4.0% | Pass 90.0%
2.89 / 4.11
raw 2.89 / 4.11
WER 1.9% | Pass 100.0%
2.39 / 3.28
raw 2.53 / 3.47
WER 4.0% | Pass 94.4%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
2.71 / 4.43
raw 2.71 / 4.43
WER nan% | Pass 100.0%
1.60 / 1.90
raw 1.60 / 1.90
WER nan% | Pass 100.0%
1.67 / 2.33
raw 1.67 / 2.33
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
2.04 / 2.73
raw 2.20 / 2.94
WER 4.2% | Pass 92.8%
2.24 / 2.86
raw 2.49 / 3.18
WER 4.8% | Pass 89.9%
2.15 / 2.93
raw 2.29 / 3.12
WER 3.5% | Pass 93.8%
2.04 / 2.62
raw 2.22 / 2.86
WER 4.5% | Pass 91.8%
2.18 / 2.93
raw 2.43 / 3.27
WER 4.2% | Pass 89.8%
2.12 / 2.83
raw 2.59 / 3.45
WER 6.1% | Pass 81.8%
2.59 / 3.64
raw 2.59 / 3.64
WER 2.1% | Pass 100.0%
1.82 / 2.18
raw 1.82 / 2.18
WER 2.2% | Pass 100.0%
2.00 / 2.50
raw 2.00 / 2.50
WER 4.6% | Pass 100.0%
2.25 / 3.15
raw 2.50 / 3.50
WER 4.3% | Pass 90.0%
2.07 / 2.86
raw 2.33 / 3.22
WER 3.2% | Pass 88.9%
2.24 / 3.35
raw 2.43 / 3.65
WER 4.5% | Pass 91.9%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.14 / 1.14
raw 1.14 / 1.14
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.59 / 1.77
raw 1.67 / 1.86
WER 3.4% | Pass 94.9%
1.71 / 1.83
raw 1.80 / 1.93
WER 3.4% | Pass 95.0%
1.64 / 1.87
raw 1.74 / 1.99
WER 3.2% | Pass 93.9%
1.71 / 1.97
raw 1.82 / 2.10
WER 3.7% | Pass 94.0%
1.69 / 1.91
raw 1.84 / 2.08
WER 3.3% | Pass 92.0%
1.53 / 1.77
raw 1.77 / 2.05
WER 4.7% | Pass 86.4%
1.82 / 2.00
raw 1.82 / 2.00
WER 2.8% | Pass 100.0%
2.18 / 2.27
raw 2.18 / 2.27
WER 1.0% | Pass 100.0%
1.60 / 1.60
raw 1.60 / 1.60
WER 2.6% | Pass 100.0%
0.99 / 0.99
raw 1.10 / 1.10
WER 4.3% | Pass 90.0%
1.62 / 1.62
raw 1.62 / 1.62
WER 2.0% | Pass 100.0%
1.38 / 1.51
raw 1.46 / 1.59
WER 4.0% | Pass 94.6%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.57 / 2.00
raw 1.57 / 2.00
WER nan% | Pass 100.0%
1.30 / 1.40
raw 1.30 / 1.40
WER nan% | Pass 100.0%
1.30 / 1.50
raw 1.30 / 1.50
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.25 / 1.26
raw 1.51 / 1.52
WER 10.7% | Pass 82.9%
1.53 / 1.53
raw 1.77 / 1.77
WER 9.8% | Pass 86.1%
1.26 / 1.28
raw 1.56 / 1.58
WER 11.0% | Pass 80.9%
1.30 / 1.34
raw 1.82 / 1.88
WER 16.5% | Pass 71.4%
1.34 / 1.34
raw 1.68 / 1.68
WER 8.3% | Pass 80.0%
1.33 / 1.33
raw 1.73 / 1.73
WER 11.9% | Pass 77.3%
1.45 / 1.45
raw 1.59 / 1.59
WER 8.3% | Pass 90.9%
1.49 / 1.49
raw 1.64 / 1.64
WER 10.1% | Pass 90.9%
0.99 / 0.99
raw 1.11 / 1.11
WER 10.0% | Pass 88.9%
1.17 / 1.17
raw 1.30 / 1.30
WER 7.4% | Pass 90.0%
0.86 / 0.86
raw 1.11 / 1.11
WER 11.5% | Pass 77.8%
1.03 / 1.03
raw 1.27 / 1.27
WER 8.2% | Pass 81.1%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.56 / 0.60
raw 1.80 / 1.92
WER 54.3% | Pass 31.0%
0.44 / 0.44
raw 2.13 / 2.15
WER 66.2% | Pass 20.5%
0.44 / 0.48
raw 1.80 / 1.94
WER 62.5% | Pass 24.6%
0.48 / 0.51
raw 2.31 / 2.46
WER 51.4% | Pass 20.8%
0.55 / 0.58
raw 1.93 / 2.07
WER 41.2% | Pass 28.3%
0.63 / 0.65
raw 2.19 / 2.29
WER 50.8% | Pass 28.6%
0.36 / 0.36
raw 1.95 / 2.00
WER 113.2% | Pass 18.2%
0.00 / 0.00
raw 1.11 / 1.11
WER 54.6% | Pass 0.0%
0.24 / 0.26
raw 1.20 / 1.30
WER 49.2% | Pass 20.0%
0.00 / 0.00
raw 1.44 / 1.44
WER 65.5% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 44.0% | Pass 0.0%
0.27 / 0.28
raw 1.43 / 1.49
WER 41.9% | Pass 18.9%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
2.71 / 4.14
raw 2.71 / 4.14
WER nan% | Pass 100.0%
1.40 / 1.40
raw 1.40 / 1.40
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.38 / 0.39
raw 1.43 / 1.45
WER 44.6% | Pass 26.8%
0.37 / 0.37
raw 1.76 / 1.76
WER 41.5% | Pass 21.2%
0.27 / 0.27
raw 1.45 / 1.48
WER 46.5% | Pass 18.5%
0.42 / 0.43
raw 1.69 / 1.73
WER 41.5% | Pass 25.0%
0.17 / 0.17
raw 1.66 / 1.70
WER 50.2% | Pass 10.0%
0.18 / 0.18
raw 1.95 / 1.95
WER 50.5% | Pass 9.1%
0.27 / 0.27
raw 1.50 / 1.50
WER 45.0% | Pass 18.2%
0.10 / 0.10
raw 1.09 / 1.09
WER 41.1% | Pass 9.1%
0.20 / 0.20
raw 1.00 / 1.00
WER 41.6% | Pass 20.0%
0.48 / 0.48
raw 1.20 / 1.20
WER 32.5% | Pass 40.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 42.1% | Pass 0.0%
0.14 / 0.14
raw 1.03 / 1.03
WER 44.8% | Pass 13.9%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.71 / 1.71
raw 1.71 / 1.71
WER nan% | Pass 100.0%
1.11 / 1.11
raw 1.11 / 1.11
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.31 / 0.32
raw 1.69 / 1.74
WER 75.6% | Pass 18.2%
0.15 / 0.15
raw 1.94 / 1.98
WER 86.0% | Pass 7.5%
0.17 / 0.18
raw 1.62 / 1.67
WER 66.9% | Pass 10.6%
0.12 / 0.13
raw 2.08 / 2.18
WER 66.0% | Pass 6.0%
0.14 / 0.14
raw 1.76 / 1.80
WER 95.1% | Pass 8.0%
0.27 / 0.28
raw 2.00 / 2.05
WER 62.5% | Pass 13.6%
0.16 / 0.16
raw 1.73 / 1.77
WER 66.5% | Pass 9.1%
0.00 / 0.00
raw 1.55 / 1.55
WER 65.0% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 86.5% | Pass 0.0%
0.00 / 0.00
raw 1.30 / 1.30
WER 76.0% | Pass 0.0%
0.00 / 0.00
raw 1.22 / 1.33
WER 65.4% | Pass 0.0%
0.04 / 0.04
raw 1.33 / 1.36
WER 78.0% | Pass 2.8%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
2.43 / 2.86
raw 2.43 / 2.86
WER nan% | Pass 100.0%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
1.30 / 1.30
raw 1.30 / 1.30
WER nan% | Pass 100.0%
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Gemini 2.5-Flash
2.49 / 3.72
raw 2.64 / 3.94
WER 2.6% | Pass 94.4%
2.41 / 3.47
raw 2.65 / 3.82
WER 3.3% | Pass 90.9%
2.62 / 4.02
raw 2.77 / 4.25
WER 2.1% | Pass 94.5%
2.56 / 3.90
raw 2.85 / 4.35
WER 3.0% | Pass 89.6%
2.38 / 3.57
raw 2.70 / 4.06
WER 4.4% | Pass 88.0%
2.64 / 4.00
raw 2.64 / 4.00
WER 0.9% | Pass 100.0%
2.65 / 3.99
raw 2.77 / 4.18
WER 1.9% | Pass 95.5%
2.91 / 4.55
raw 2.91 / 4.55
WER 1.8% | Pass 100.0%
2.40 / 3.30
raw 2.40 / 3.30
WER 0.8% | Pass 100.0%
2.25 / 3.15
raw 2.50 / 3.50
WER 3.0% | Pass 90.0%
2.89 / 4.44
raw 2.89 / 4.44
WER 1.1% | Pass 100.0%
2.39 / 3.47
raw 2.46 / 3.57
WER 1.8% | Pass 97.1%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
3.00 / 4.67
raw 3.00 / 4.67
WER nan% | Pass 100.0%
2.22 / 3.11
raw 2.22 / 3.11
WER nan% | Pass 100.0%
2.50 / 3.70
raw 2.50 / 3.70
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.42 / 3.51
raw 2.56 / 3.70
WER 2.3% | Pass 94.9%
2.47 / 3.48
raw 2.64 / 3.72
WER 2.7% | Pass 93.6%
2.53 / 3.76
raw 2.67 / 3.98
WER 1.8% | Pass 94.5%
2.41 / 3.48
raw 2.62 / 3.78
WER 2.4% | Pass 92.0%
2.36 / 3.50
raw 2.68 / 3.98
WER 4.1% | Pass 88.0%
2.86 / 4.09
raw 2.86 / 4.09
WER 0.9% | Pass 100.0%
2.64 / 3.91
raw 2.64 / 3.91
WER 2.5% | Pass 100.0%
2.56 / 3.97
raw 2.82 / 4.36
WER 2.8% | Pass 90.9%
2.30 / 3.10
raw 2.30 / 3.10
WER 0.7% | Pass 100.0%
2.25 / 3.24
raw 2.50 / 3.60
WER 2.6% | Pass 90.0%
2.56 / 3.44
raw 2.56 / 3.44
WER 2.5% | Pass 100.0%
2.24 / 3.10
raw 2.30 / 3.19
WER 1.8% | Pass 97.3%
1.83 / 2.17
raw 1.83 / 2.17
WER nan% | Pass 100.0%
2.71 / 4.43
raw 2.71 / 4.43
WER nan% | Pass 100.0%
2.30 / 3.30
raw 2.30 / 3.30
WER nan% | Pass 100.0%
2.30 / 3.50
raw 2.30 / 3.50
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.20 / 3.00
raw 2.29 / 3.13
WER 2.2% | Pass 95.9%
2.60 / 3.52
raw 2.66 / 3.60
WER 2.1% | Pass 97.8%
2.13 / 2.86
raw 2.20 / 2.95
WER 2.2% | Pass 96.9%
2.35 / 3.16
raw 2.35 / 3.16
WER 1.9% | Pass 100.0%
2.35 / 3.14
raw 2.35 / 3.15
WER 1.9% | Pass 99.7%
2.67 / 3.90
raw 2.70 / 3.95
WER 2.0% | Pass 98.8%
2.42 / 3.45
raw 2.62 / 3.75
WER 2.4% | Pass 92.1%
1.52 / 1.76
raw 1.73 / 2.00
WER 2.5% | Pass 88.2%
1.71 / 1.89
raw 1.90 / 2.10
WER 2.5% | Pass 90.1%
1.82 / 2.21
raw 2.11 / 2.56
WER 2.6% | Pass 86.3%
2.35 / 3.41
raw 2.50 / 3.62
WER 2.3% | Pass 94.0%
2.44 / 3.61
raw 2.62 / 3.88
WER 2.4% | Pass 93.0%
0.82 / 0.82
raw 1.00 / 1.00
WER 3.0% | Pass 81.5%
1.57 / 2.29
raw 1.86 / 2.71
WER 2.7% | Pass 84.4%
1.28 / 1.64
raw 1.40 / 1.80
WER 2.5% | Pass 91.1%
2.03 / 2.76
raw 2.20 / 3.00
WER 2.4% | Pass 92.1%
Qwen3TTS-12Hz-1.7B-VD
2.23 / 2.94
raw 2.43 / 3.19
WER 3.3% | Pass 92.1%
2.44 / 3.18
raw 2.72 / 3.54
WER 3.3% | Pass 89.7%
2.33 / 3.13
raw 2.53 / 3.40
WER 2.9% | Pass 92.2%
2.13 / 2.69
raw 2.54 / 3.20
WER 4.5% | Pass 84.0%
2.36 / 3.22
raw 2.74 / 3.74
WER 4.4% | Pass 86.0%
2.95 / 4.14
raw 2.95 / 4.14
WER 1.0% | Pass 100.0%
2.73 / 3.91
raw 2.73 / 3.91
WER 1.0% | Pass 100.0%
1.82 / 2.23
raw 2.00 / 2.45
WER 4.2% | Pass 90.9%
2.10 / 2.50
raw 2.10 / 2.50
WER 0.8% | Pass 100.0%
1.71 / 1.98
raw 1.90 / 2.20
WER 3.6% | Pass 90.0%
2.11 / 2.44
raw 2.11 / 2.44
WER 1.1% | Pass 100.0%
2.28 / 3.03
raw 2.49 / 3.30
WER 3.7% | Pass 91.9%
1.67 / 2.00
raw 1.67 / 2.00
WER nan% | Pass 100.0%
2.43 / 3.86
raw 2.43 / 3.86
WER nan% | Pass 100.0%
1.40 / 1.60
raw 1.40 / 1.60
WER nan% | Pass 100.0%
1.30 / 1.30
raw 1.30 / 1.30
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
2.02 / 2.71
raw 2.08 / 2.80
WER 2.1% | Pass 96.8%
2.28 / 3.05
raw 2.34 / 3.13
WER 2.2% | Pass 97.4%
2.15 / 2.96
raw 2.26 / 3.11
WER 1.8% | Pass 95.2%
2.03 / 2.72
raw 2.26 / 3.02
WER 2.9% | Pass 90.0%
2.18 / 3.12
raw 2.28 / 3.26
WER 2.9% | Pass 95.7%
2.50 / 3.59
raw 2.50 / 3.59
WER 0.9% | Pass 100.0%
2.48 / 3.43
raw 2.48 / 3.43
WER 1.0% | Pass 100.0%
1.82 / 2.18
raw 1.82 / 2.18
WER 1.7% | Pass 100.0%
1.90 / 2.10
raw 1.90 / 2.10
WER 0.4% | Pass 100.0%
1.89 / 2.43
raw 2.10 / 2.70
WER 3.0% | Pass 90.0%
2.44 / 3.33
raw 2.44 / 3.33
WER 1.1% | Pass 100.0%
1.97 / 2.65
raw 1.97 / 2.65
WER 1.7% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.57
raw 1.29 / 1.57
WER nan% | Pass 100.0%
1.40 / 1.80
raw 1.40 / 1.80
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.46 / 1.60
raw 1.54 / 1.68
WER 2.4% | Pass 94.8%
1.59 / 1.63
raw 1.72 / 1.77
WER 3.2% | Pass 92.3%
1.49 / 1.67
raw 1.56 / 1.75
WER 2.1% | Pass 95.2%
1.58 / 1.71
raw 1.68 / 1.82
WER 2.2% | Pass 94.0%
1.45 / 1.58
raw 1.61 / 1.76
WER 3.6% | Pass 89.8%
1.61 / 1.78
raw 1.68 / 1.86
WER 2.4% | Pass 95.5%
1.26 / 1.26
raw 1.32 / 1.32
WER 1.9% | Pass 95.5%
2.10 / 2.60
raw 2.10 / 2.60
WER 2.0% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER 0.0% | Pass 100.0%
1.90 / 2.10
raw 1.90 / 2.10
WER 0.0% | Pass 100.0%
1.62 / 1.88
raw 1.62 / 1.88
WER 1.9% | Pass 100.0%
1.20 / 1.25
raw 1.31 / 1.36
WER 2.8% | Pass 91.7%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
2.14 / 2.71
raw 2.14 / 2.71
WER nan% | Pass 100.0%
1.20 / 1.40
raw 1.20 / 1.40
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.19 / 1.21
raw 1.49 / 1.51
WER 6.9% | Pass 79.8%
1.34 / 1.34
raw 1.81 / 1.81
WER 7.8% | Pass 74.4%
1.21 / 1.23
raw 1.51 / 1.53
WER 5.3% | Pass 80.5%
1.28 / 1.31
raw 1.78 / 1.82
WER 6.8% | Pass 72.0%
1.22 / 1.23
raw 1.60 / 1.62
WER 7.5% | Pass 76.0%
1.81 / 1.81
raw 2.09 / 2.09
WER 5.9% | Pass 86.4%
1.15 / 1.15
raw 1.41 / 1.41
WER 4.0% | Pass 81.8%
1.19 / 1.19
raw 1.64 / 1.64
WER 8.3% | Pass 72.7%
0.90 / 0.90
raw 1.00 / 1.00
WER 3.5% | Pass 90.0%
0.84 / 0.84
raw 1.20 / 1.20
WER 10.5% | Pass 70.0%
0.74 / 0.74
raw 1.11 / 1.11
WER 11.1% | Pass 66.7%
0.99 / 1.03
raw 1.22 / 1.27
WER 7.3% | Pass 81.1%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.29
raw 1.29 / 1.29
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.82 / 0.89
raw 1.83 / 1.98
WER 25.2% | Pass 45.0%
0.89 / 0.92
raw 2.27 / 2.35
WER 27.2% | Pass 39.0%
0.77 / 0.85
raw 1.93 / 2.12
WER 22.9% | Pass 40.0%
0.88 / 0.96
raw 1.96 / 2.14
WER 22.7% | Pass 44.9%
0.74 / 0.79
raw 2.04 / 2.19
WER 29.7% | Pass 36.2%
1.07 / 1.07
raw 2.36 / 2.36
WER 22.6% | Pass 45.5%
0.67 / 0.76
raw 2.45 / 2.77
WER 25.2% | Pass 27.3%
0.53 / 0.53
raw 1.45 / 1.45
WER 28.9% | Pass 36.4%
0.48 / 0.48
raw 1.20 / 1.20
WER 14.5% | Pass 40.0%
0.85 / 0.90
raw 1.70 / 1.80
WER 18.9% | Pass 50.0%
0.11 / 0.11
raw 1.00 / 1.00
WER 32.3% | Pass 11.1%
0.41 / 0.42
raw 1.38 / 1.41
WER 25.9% | Pass 29.7%
1.40 / 1.40
raw 1.40 / 1.40
WER nan% | Pass 100.0%
2.57 / 4.00
raw 2.57 / 4.00
WER nan% | Pass 100.0%
1.70 / 2.10
raw 1.70 / 2.10
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.39 / 0.39
raw 1.46 / 1.46
WER 34.3% | Pass 26.8%
0.47 / 0.47
raw 1.90 / 1.90
WER 28.4% | Pass 24.7%
0.22 / 0.23
raw 1.43 / 1.44
WER 36.5% | Pass 15.7%
0.43 / 0.44
raw 1.80 / 1.82
WER 32.5% | Pass 24.0%
0.39 / 0.39
raw 1.67 / 1.67
WER 32.3% | Pass 23.3%
0.45 / 0.45
raw 2.00 / 2.00
WER 32.9% | Pass 22.7%
0.26 / 0.26
raw 1.41 / 1.41
WER 33.1% | Pass 18.2%
0.00 / 0.00
raw 1.09 / 1.09
WER 33.0% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 40.9% | Pass 0.0%
0.12 / 0.12
raw 1.20 / 1.20
WER 43.4% | Pass 10.0%
0.00 / 0.00
raw 1.22 / 1.22
WER 30.4% | Pass 0.0%
0.03 / 0.03
raw 1.03 / 1.03
WER 37.8% | Pass 2.7%
1.33 / 1.33
raw 1.33 / 1.33
WER nan% | Pass 100.0%
1.57 / 1.57
raw 1.57 / 1.57
WER nan% | Pass 100.0%
1.10 / 1.10
raw 1.10 / 1.10
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.34 / 0.35
raw 1.70 / 1.75
WER 47.3% | Pass 19.8%
0.18 / 0.18
raw 2.33 / 2.37
WER 49.3% | Pass 7.7%
0.10 / 0.11
raw 1.66 / 1.71
WER 49.2% | Pass 6.2%
0.12 / 0.12
raw 1.96 / 1.98
WER 48.4% | Pass 6.0%
0.12 / 0.12
raw 1.96 / 2.00
WER 53.6% | Pass 6.0%
0.20 / 0.21
raw 2.23 / 2.32
WER 45.6% | Pass 9.1%
0.19 / 0.19
raw 2.05 / 2.14
WER 55.4% | Pass 9.1%
0.12 / 0.12
raw 1.36 / 1.36
WER 34.7% | Pass 9.1%
0.10 / 0.10
raw 1.00 / 1.00
WER 29.7% | Pass 10.0%
0.00 / 0.00
raw 1.50 / 1.50
WER 52.5% | Pass 0.0%
0.12 / 0.12
raw 1.11 / 1.11
WER 39.2% | Pass 11.1%
0.14 / 0.15
raw 1.25 / 1.36
WER 42.1% | Pass 11.1%
2.00 / 2.00
raw 2.00 / 2.00
WER nan% | Pass 100.0%
1.71 / 1.86
raw 1.71 / 1.86
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Model | Overall avg | Timbre avg | Style avg | Easy: Tag, Direct, Simple-comp | Hard: Complex-comp (multi-comp, dynamic, layered, conflict), Persona (scenario, character) | Special: Abnormal (disfluency, dysphonia), Nonverbal (implicit, explicit)
Gemini 2.5-Pro
2.67 / 4.05
raw 2.69 / 4.07
WER 2.9% | Pass 99.6%
2.63 / 3.86
raw 2.63 / 3.86
WER 2.7% | Pass 100.0%
2.85 / 4.39
raw 2.85 / 4.39
WER 2.8% | Pass 100.0%
2.73 / 4.08
raw 2.73 / 4.08
WER 3.5% | Pass 100.0%
2.67 / 3.98
raw 2.67 / 3.98
WER 2.6% | Pass 100.0%
2.91 / 4.41
raw 2.91 / 4.41
WER 3.1% | Pass 100.0%
2.86 / 4.36
raw 2.86 / 4.36
WER 2.2% | Pass 100.0%
2.82 / 4.64
raw 2.82 / 4.64
WER 2.9% | Pass 100.0%
2.80 / 4.20
raw 2.80 / 4.20
WER 1.0% | Pass 100.0%
2.78 / 4.56
raw 2.78 / 4.56
WER 2.6% | Pass 100.0%
2.78 / 4.22
raw 2.78 / 4.22
WER 1.8% | Pass 100.0%
2.42 / 3.60
raw 2.49 / 3.70
WER 3.9% | Pass 97.3%
2.00 / 2.67
raw 2.00 / 2.67
WER nan% | Pass 100.0%
2.43 / 3.86
raw 2.43 / 3.86
WER nan% | Pass 100.0%
2.60 / 3.90
raw 2.60 / 3.90
WER nan% | Pass 100.0%
2.60 / 4.20
raw 2.60 / 4.20
WER nan% | Pass 100.0%
Gemini 2.5-Flash
2.67 / 4.03
raw 2.70 / 4.08
WER 3.1% | Pass 98.8%
2.79 / 4.11
raw 2.83 / 4.17
WER 2.4% | Pass 98.6%
2.82 / 4.35
raw 2.82 / 4.35
WER 2.8% | Pass 100.0%
2.84 / 4.30
raw 2.84 / 4.30
WER 3.5% | Pass 100.0%
2.76 / 4.17
raw 2.82 / 4.26
WER 2.6% | Pass 98.0%
2.82 / 4.32
raw 2.82 / 4.32
WER 3.8% | Pass 100.0%
2.91 / 4.41
raw 2.91 / 4.41
WER 2.1% | Pass 100.0%
2.64 / 4.00
raw 2.64 / 4.00
WER 2.6% | Pass 100.0%
2.70 / 3.90
raw 2.70 / 3.90
WER 2.7% | Pass 100.0%
2.80 / 4.30
raw 2.80 / 4.30
WER 3.3% | Pass 100.0%
2.67 / 3.89
raw 2.67 / 3.89
WER 1.5% | Pass 100.0%
2.40 / 3.55
raw 2.54 / 3.76
WER 3.7% | Pass 94.6%
1.67 / 2.00
raw 1.67 / 2.00
WER nan% | Pass 100.0%
2.43 / 3.86
raw 2.43 / 3.86
WER nan% | Pass 100.0%
2.70 / 4.30
raw 2.70 / 4.30
WER nan% | Pass 100.0%
2.20 / 3.40
raw 2.20 / 3.40
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.44 / 3.39
raw 2.59 / 3.59
WER 11.0% | Pass 94.4%
2.77 / 3.52
raw 2.87 / 3.65
WER 10.4% | Pass 96.3%
2.47 / 3.50
raw 2.59 / 3.67
WER 10.8% | Pass 95.3%
2.69 / 3.48
raw 2.71 / 3.51
WER 9.3% | Pass 99.1%
2.55 / 3.43
raw 2.60 / 3.49
WER 9.7% | Pass 98.2%
2.77 / 3.89
raw 2.85 / 4.00
WER 10.1% | Pass 97.2%
2.62 / 3.62
raw 2.89 / 4.00
WER 12.1% | Pass 90.6%
2.13 / 3.08
raw 2.45 / 3.55
WER 12.7% | Pass 86.8%
2.13 / 3.19
raw 2.40 / 3.60
WER 12.3% | Pass 88.7%
1.98 / 2.64
raw 2.33 / 3.11
WER 13.2% | Pass 85.0%
2.47 / 3.70
raw 2.67 / 4.00
WER 11.6% | Pass 92.5%
2.59 / 3.87
raw 2.83 / 4.23
WER 11.9% | Pass 91.6%
0.80 / 0.80
raw 1.00 / 1.00
WER 14.9% | Pass 80.2%
2.02 / 2.73
raw 2.43 / 3.29
WER 13.8% | Pass 83.1%
1.30 / 1.69
raw 1.44 / 1.89
WER 12.7% | Pass 89.7%
2.32 / 3.73
raw 2.56 / 4.11
WER 12.1% | Pass 90.6%
Qwen3TTS-12Hz-1.7B-VD
2.38 / 3.32
raw 2.42 / 3.37
WER 3.9% | Pass 98.4%
2.63 / 3.52
raw 2.67 / 3.57
WER 4.2% | Pass 98.6%
2.46 / 3.55
raw 2.50 / 3.61
WER 3.5% | Pass 98.4%
2.44 / 3.46
raw 2.54 / 3.60
WER 4.4% | Pass 96.0%
2.60 / 3.74
raw 2.60 / 3.74
WER 3.9% | Pass 100.0%
2.77 / 4.13
raw 2.90 / 4.33
WER 3.5% | Pass 95.2%
2.73 / 3.64
raw 2.73 / 3.64
WER 3.7% | Pass 100.0%
1.45 / 1.64
raw 1.45 / 1.64
WER 4.1% | Pass 100.0%
2.50 / 3.50
raw 2.50 / 3.50
WER 1.5% | Pass 100.0%
2.30 / 3.30
raw 2.30 / 3.30
WER 4.2% | Pass 100.0%
2.11 / 2.44
raw 2.11 / 2.44
WER 4.9% | Pass 100.0%
2.34 / 3.10
raw 2.41 / 3.19
WER 3.9% | Pass 97.3%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
2.67 / 4.33
raw 2.67 / 4.33
WER nan% | Pass 100.0%
2.00 / 3.00
raw 2.00 / 3.00
WER nan% | Pass 100.0%
1.50 / 1.90
raw 1.50 / 1.90
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.87 / 2.30
raw 1.90 / 2.33
WER 4.2% | Pass 98.8%
2.11 / 2.51
raw 2.14 / 2.55
WER 4.6% | Pass 98.6%
1.99 / 2.49
raw 1.99 / 2.49
WER 3.7% | Pass 100.0%
1.68 / 1.94
raw 1.71 / 1.98
WER 5.8% | Pass 98.0%
2.10 / 2.65
raw 2.10 / 2.65
WER 3.7% | Pass 100.0%
2.14 / 2.48
raw 2.14 / 2.48
WER 4.0% | Pass 100.0%
2.14 / 2.68
raw 2.14 / 2.68
WER 3.1% | Pass 100.0%
2.55 / 3.45
raw 2.55 / 3.45
WER 5.0% | Pass 100.0%
2.00 / 2.20
raw 2.00 / 2.20
WER 1.0% | Pass 100.0%
1.70 / 2.10
raw 1.70 / 2.10
WER 2.9% | Pass 100.0%
2.22 / 2.56
raw 2.22 / 2.56
WER 3.7% | Pass 100.0%
1.43 / 1.71
raw 1.51 / 1.81
WER 4.5% | Pass 94.6%
1.83 / 2.17
raw 1.83 / 2.17
WER nan% | Pass 100.0%
2.43 / 3.86
raw 2.43 / 3.86
WER nan% | Pass 100.0%
1.40 / 1.80
raw 1.40 / 1.80
WER nan% | Pass 100.0%
1.40 / 1.80
raw 1.40 / 1.80
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
1.79 / 2.14
raw 2.02 / 2.41
WER 18.1% | Pass 88.8%
2.02 / 2.36
raw 2.35 / 2.74
WER 34.0% | Pass 86.1%
1.84 / 2.24
raw 2.13 / 2.59
WER 23.2% | Pass 86.5%
1.85 / 2.17
raw 2.23 / 2.62
WER 14.7% | Pass 83.0%
1.83 / 2.22
raw 2.08 / 2.52
WER 12.0% | Pass 88.0%
2.08 / 2.57
raw 2.55 / 3.14
WER 17.3% | Pass 81.8%
2.08 / 2.51
raw 2.41 / 2.91
WER 76.4% | Pass 86.4%
1.19 / 1.19
raw 1.45 / 1.45
WER 10.0% | Pass 81.8%
2.20 / 2.80
raw 2.20 / 2.80
WER 2.4% | Pass 100.0%
1.90 / 2.20
raw 1.90 / 2.20
WER 8.7% | Pass 100.0%
2.11 / 2.78
raw 2.11 / 2.78
WER 2.6% | Pass 100.0%
1.66 / 2.08
raw 1.92 / 2.41
WER 9.6% | Pass 86.5%
1.40 / 1.40
raw 1.40 / 1.40
WER nan% | Pass 100.0%
1.57 / 1.57
raw 1.57 / 1.57
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.56 / 0.59
raw 1.89 / 2.01
WER 104.6% | Pass 29.5%
0.30 / 0.30
raw 2.14 / 2.18
WER 93.7% | Pass 13.9%
0.44 / 0.47
raw 1.95 / 2.10
WER 114.3% | Pass 22.5%
0.56 / 0.60
raw 2.14 / 2.32
WER 65.6% | Pass 26.0%
0.41 / 0.42
raw 2.04 / 2.12
WER 130.9% | Pass 20.0%
0.40 / 0.43
raw 2.23 / 2.36
WER 124.8% | Pass 18.2%
0.19 / 0.19
raw 2.05 / 2.09
WER 85.8% | Pass 9.1%
0.42 / 0.42
raw 1.55 / 1.55
WER 81.2% | Pass 27.3%
0.26 / 0.30
raw 1.30 / 1.50
WER 209.2% | Pass 20.0%
0.57 / 0.60
raw 1.90 / 2.00
WER 50.7% | Pass 30.0%
0.00 / 0.00
raw 1.11 / 1.11
WER 108.2% | Pass 0.0%
0.19 / 0.20
raw 1.41 / 1.51
WER 113.5% | Pass 13.5%
3.00 / 3.00
raw 3.00 / 3.00
WER nan% | Pass 100.0%
2.57 / 2.86
raw 2.57 / 2.86
WER nan% | Pass 100.0%
1.60 / 1.60
raw 1.60 / 1.60
WER nan% | Pass 100.0%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
0.54 / 0.55
raw 1.50 / 1.53
WER 57.2% | Pass 36.0%
0.34 / 0.34
raw 1.74 / 1.75
WER 68.1% | Pass 19.4%
0.50 / 0.51
raw 1.62 / 1.67
WER 55.6% | Pass 30.9%
0.40 / 0.41
raw 1.82 / 1.87
WER 56.0% | Pass 22.2%
0.56 / 0.57
raw 1.71 / 1.76
WER 50.8% | Pass 32.7%
0.52 / 0.53
raw 1.91 / 1.95
WER 53.0% | Pass 27.3%
0.28 / 0.28
raw 1.55 / 1.55
WER 90.5% | Pass 18.2%
0.21 / 0.21
raw 1.18 / 1.18
WER 51.3% | Pass 18.2%
0.40 / 0.40
raw 1.00 / 1.00
WER 36.8% | Pass 40.0%
0.60 / 0.64
raw 1.50 / 1.60
WER 46.0% | Pass 40.0%
0.11 / 0.11
raw 1.00 / 1.00
WER 97.0% | Pass 11.1%
0.30 / 0.30
raw 1.19 / 1.19
WER 50.5% | Pass 25.0%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
1.29 / 1.29
raw 1.29 / 1.29
WER nan% | Pass 100.0%
1.10 / 1.10
raw 1.10 / 1.10
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.35 / 0.39
raw 1.90 / 2.11
WER 72.9% | Pass 18.4%
0.04 / 0.04
raw 2.23 / 2.36
WER 73.5% | Pass 1.8%
0.06 / 0.07
raw 1.90 / 2.09
WER 72.5% | Pass 3.3%
0.13 / 0.15
raw 2.18 / 2.45
WER 70.7% | Pass 6.1%
0.15 / 0.16
raw 2.20 / 2.41
WER 73.0% | Pass 6.8%
0.00 / 0.00
raw 2.36 / 2.68
WER 68.6% | Pass 0.0%
0.00 / 0.00
raw 1.68 / 1.73
WER 76.3% | Pass 0.0%
0.00 / 0.00
raw 1.82 / 1.91
WER 66.1% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 68.5% | Pass 0.0%
0.00 / 0.00
raw 1.70 / 1.70
WER 81.7% | Pass 0.0%
0.00 / 0.00
raw 1.11 / 1.11
WER 117.9% | Pass 0.0%
0.25 / 0.28
raw 1.47 / 1.67
WER 64.8% | Pass 16.7%
2.00 / 2.17
raw 2.00 / 2.17
WER nan% | Pass 100.0%
2.86 / 3.86
raw 2.86 / 3.86
WER nan% | Pass 100.0%
2.11 / 2.33
raw 2.11 / 2.33
WER nan% | Pass 100.0%
1.67 / 2.11
raw 1.67 / 2.11
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.16 / 0.16
raw 1.20 / 1.22
WER 161.7% | Pass 13.1%
0.00 / 0.00
raw 1.49 / 1.50
WER 164.7% | Pass 0.0%
0.00 / 0.00
raw 1.11 / 1.13
WER 149.2% | Pass 0.0%
0.00 / 0.00
raw 1.39 / 1.39
WER 109.9% | Pass 0.0%
0.00 / 0.00
raw 1.30 / 1.34
WER 114.6% | Pass 0.0%
0.00 / 0.00
raw 1.27 / 1.27
WER 200.0% | Pass 0.0%
0.00 / 0.00
raw 1.18 / 1.23
WER 192.0% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 157.5% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 133.5% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 131.2% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 208.6% | Pass 0.0%
0.00 / 0.00
raw 1.08 / 1.11
WER 261.1% | Pass 0.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.40 / 1.40
raw 1.40 / 1.40
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
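Model | Overall avg | Timbre avg | Style avg | Easy: Tag | Easy: Direct | Easy: Simple-comp | Hard / complex comp: multi-comp | Hard / complex comp: dynamic | Hard / complex comp: layered | Hard / complex comp: conflict | Hard / Persona: scenario | Hard / Persona: character | Special / Abnormal: disfluency | Special / Abnormal: dysphonia | Special / Nonverbal: implicit | Special / Nonverbal: explicit
Each model row lists these 16 cells in order; each cell spans three lines: the score pair, the raw score pair, then WER and pass rate.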
Gemini 2.5-Flash
2.65 / 4.08
raw 2.65 / 4.08
WER 1.4% | Pass 100.0%
2.61 / 4.00
raw 2.61 / 4.00
WER 0.8% | Pass 100.0%
2.78 / 4.31
raw 2.78 / 4.31
WER 1.4% | Pass 100.0%
2.55 / 3.92
raw 2.55 / 3.92
WER 2.4% | Pass 100.0%
2.88 / 4.53
raw 2.88 / 4.53
WER 1.4% | Pass 100.0%
2.68 / 3.95
raw 2.68 / 3.95
WER 0.2% | Pass 100.0%
2.95 / 4.77
raw 2.95 / 4.77
WER 0.4% | Pass 100.0%
2.45 / 3.73
raw 2.45 / 3.73
WER 0.1% | Pass 100.0%
2.80 / 4.40
raw 2.80 / 4.40
WER 1.4% | Pass 100.0%
2.30 / 3.10
raw 2.30 / 3.10
WER 1.7% | Pass 100.0%
2.89 / 4.44
raw 2.89 / 4.44
WER 0.3% | Pass 100.0%
2.51 / 3.86
raw 2.51 / 3.86
WER 1.9% | Pass 100.0%
1.50 / 2.00
raw 1.50 / 2.00
WER nan% | Pass 100.0%
2.29 / 3.43
raw 2.29 / 3.43
WER nan% | Pass 100.0%
2.60 / 4.00
raw 2.60 / 4.00
WER nan% | Pass 100.0%
2.60 / 4.20
raw 2.60 / 4.20
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.68 / 4.13
raw 2.68 / 4.13
WER 1.2% | Pass 100.0%
2.77 / 4.26
raw 2.77 / 4.26
WER 0.7% | Pass 100.0%
2.79 / 4.31
raw 2.79 / 4.31
WER 1.3% | Pass 100.0%
2.51 / 3.68
raw 2.51 / 3.68
WER 2.0% | Pass 100.0%
2.86 / 4.46
raw 2.86 / 4.46
WER 0.4% | Pass 100.0%
3.00 / 4.75
raw 3.00 / 4.75
WER 0.5% | Pass 100.0%
2.86 / 4.59
raw 2.86 / 4.59
WER 0.3% | Pass 100.0%
2.64 / 4.00
raw 2.64 / 4.00
WER 1.0% | Pass 100.0%
2.80 / 4.40
raw 2.80 / 4.40
WER 1.0% | Pass 100.0%
2.80 / 4.30
raw 2.80 / 4.30
WER 1.2% | Pass 100.0%
2.89 / 4.67
raw 2.89 / 4.67
WER 0.6% | Pass 100.0%
2.51 / 3.84
raw 2.51 / 3.84
WER 1.7% | Pass 100.0%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
2.71 / 4.43
raw 2.71 / 4.43
WER nan% | Pass 100.0%
2.70 / 4.30
raw 2.70 / 4.30
WER nan% | Pass 100.0%
2.60 / 4.20
raw 2.60 / 4.20
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.49 / 3.46
raw 2.49 / 3.46
WER 1.2% | Pass 100.0%
2.71 / 3.57
raw 2.71 / 3.57
WER 1.1% | Pass 100.0%
2.44 / 3.37
raw 2.44 / 3.37
WER 1.2% | Pass 100.0%
2.71 / 3.50
raw 2.71 / 3.50
WER 1.0% | Pass 100.0%
2.46 / 3.29
raw 2.46 / 3.29
WER 1.1% | Pass 100.0%
2.94 / 4.22
raw 2.94 / 4.22
WER 1.1% | Pass 100.0%
2.51 / 3.57
raw 2.61 / 3.72
WER 1.3% | Pass 96.0%
1.42 / 1.59
raw 1.55 / 1.73
WER 1.4% | Pass 92.0%
2.26 / 3.38
raw 2.40 / 3.60
WER 1.3% | Pass 94.0%
2.00 / 2.70
raw 2.22 / 3.00
WER 1.4% | Pass 90.0%
2.94 / 4.57
raw 3.00 / 4.67
WER 1.3% | Pass 98.0%
2.62 / 3.94
raw 2.70 / 4.06
WER 1.3% | Pass 97.0%
0.85 / 0.85
raw 1.00 / 1.00
WER 1.6% | Pass 85.0%
2.14 / 3.39
raw 2.43 / 3.86
WER 1.5% | Pass 88.0%
1.23 / 1.23
raw 1.30 / 1.30
WER 1.4% | Pass 95.0%
2.69 / 4.42
raw 2.80 / 4.60
WER 1.3% | Pass 96.0%
Qwen3TTS-12Hz-1.7B-VD
2.55 / 3.77
raw 2.56 / 3.78
WER 2.2% | Pass 99.6%
2.70 / 3.96
raw 2.70 / 3.96
WER 1.7% | Pass 100.0%
2.60 / 3.89
raw 2.62 / 3.92
WER 2.6% | Pass 99.2%
2.69 / 4.10
raw 2.69 / 4.10
WER 1.4% | Pass 100.0%
2.72 / 4.19
raw 2.78 / 4.28
WER 3.1% | Pass 98.0%
2.90 / 4.25
raw 2.90 / 4.25
WER 3.4% | Pass 100.0%
2.68 / 4.00
raw 2.68 / 4.00
WER 0.6% | Pass 100.0%
2.09 / 2.73
raw 2.09 / 2.73
WER 1.0% | Pass 100.0%
2.60 / 3.80
raw 2.60 / 3.80
WER 3.4% | Pass 100.0%
2.20 / 3.10
raw 2.20 / 3.10
WER 0.6% | Pass 100.0%
2.50 / 3.25
raw 2.50 / 3.25
WER 3.3% | Pass 100.0%
2.57 / 3.78
raw 2.57 / 3.78
WER 1.9% | Pass 100.0%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
3.00 / 5.00
raw 3.00 / 5.00
WER nan% | Pass 100.0%
1.70 / 2.10
raw 1.70 / 2.10
WER nan% | Pass 100.0%
1.90 / 2.70
raw 1.90 / 2.70
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
2.22 / 3.07
raw 2.22 / 3.07
WER 1.1% | Pass 100.0%
2.58 / 3.58
raw 2.58 / 3.58
WER 1.1% | Pass 100.0%
2.38 / 3.33
raw 2.38 / 3.33
WER 1.3% | Pass 100.0%
2.51 / 3.55
raw 2.51 / 3.55
WER 1.9% | Pass 100.0%
2.32 / 3.36
raw 2.32 / 3.36
WER 0.6% | Pass 100.0%
2.84 / 4.26
raw 2.84 / 4.26
WER 0.4% | Pass 100.0%
2.50 / 3.36
raw 2.50 / 3.36
WER 0.7% | Pass 100.0%
1.82 / 2.18
raw 1.82 / 2.18
WER 0.2% | Pass 100.0%
2.30 / 3.30
raw 2.30 / 3.30
WER 1.5% | Pass 100.0%
2.30 / 3.10
raw 2.30 / 3.10
WER nan% | Pass 100.0%
2.11 / 2.89
raw 2.11 / 2.89
WER 0.3% | Pass 100.0%
2.24 / 3.05
raw 2.24 / 3.05
WER 0.7% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.57
raw 1.29 / 1.57
WER nan% | Pass 100.0%
1.22 / 1.44
raw 1.22 / 1.44
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.85 / 2.33
raw 1.85 / 2.33
WER 2.2% | Pass 100.0%
1.88 / 2.22
raw 1.88 / 2.22
WER 1.4% | Pass 100.0%
2.02 / 2.63
raw 2.02 / 2.63
WER 1.8% | Pass 100.0%
2.06 / 2.78
raw 2.06 / 2.78
WER 2.6% | Pass 100.0%
1.78 / 2.18
raw 1.78 / 2.18
WER 1.4% | Pass 100.0%
2.30 / 3.00
raw 2.30 / 3.00
WER 1.0% | Pass 100.0%
1.86 / 2.32
raw 1.86 / 2.32
WER 0.9% | Pass 100.0%
2.00 / 2.55
raw 2.00 / 2.55
WER 2.1% | Pass 100.0%
1.80 / 2.20
raw 1.80 / 2.20
WER 1.1% | Pass 100.0%
2.20 / 2.70
raw 2.20 / 2.70
WER 1.5% | Pass 100.0%
1.78 / 2.00
raw 1.78 / 2.00
WER 4.5% | Pass 100.0%
1.65 / 1.97
raw 1.65 / 1.97
WER 3.0% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.71 / 2.29
raw 1.71 / 2.29
WER nan% | Pass 100.0%
1.40 / 1.80
raw 1.40 / 1.80
WER nan% | Pass 100.0%
1.50 / 1.90
raw 1.50 / 1.90
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
0.24 / 0.25
raw 1.43 / 1.53
WER 131.8% | Pass 16.6%
0.10 / 0.11
raw 1.61 / 1.74
WER 89.9% | Pass 6.1%
0.09 / 0.10
raw 1.52 / 1.63
WER 143.1% | Pass 6.2%
0.15 / 0.15
raw 1.86 / 1.92
WER 135.4% | Pass 8.0%
0.06 / 0.07
raw 1.52 / 1.68
WER 81.6% | Pass 4.0%
0.08 / 0.09
raw 1.65 / 1.75
WER 129.7% | Pass 5.0%
0.07 / 0.08
raw 1.50 / 1.68
WER 79.7% | Pass 4.5%
0.00 / 0.00
raw 1.18 / 1.27
WER 85.1% | Pass 0.0%
0.14 / 0.17
raw 1.40 / 1.70
WER 502.5% | Pass 10.0%
0.10 / 0.10
raw 1.00 / 1.00
WER 75.4% | Pass 10.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 119.8% | Pass 0.0%
0.00 / 0.00
raw 1.14 / 1.22
WER 161.7% | Pass 0.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.30 / 1.50
raw 1.30 / 1.50
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.54 / 0.58
raw 1.79 / 1.94
WER 66.8% | Pass 29.9%
0.37 / 0.40
raw 2.05 / 2.17
WER 67.7% | Pass 18.3%
0.44 / 0.49
raw 1.79 / 1.96
WER 62.8% | Pass 24.8%
0.12 / 0.13
raw 2.00 / 2.10
WER 71.4% | Pass 6.1%
0.39 / 0.42
raw 2.14 / 2.32
WER 70.8% | Pass 18.0%
0.88 / 0.92
raw 1.95 / 2.05
WER 61.6% | Pass 45.0%
0.62 / 0.74
raw 1.95 / 2.32
WER 57.6% | Pass 31.8%
0.32 / 0.32
raw 1.18 / 1.18
WER 60.1% | Pass 27.3%
0.24 / 0.24
raw 1.20 / 1.20
WER 62.6% | Pass 20.0%
0.32 / 0.32
raw 1.60 / 1.60
WER 59.4% | Pass 20.0%
0.00 / 0.00
raw 1.56 / 1.78
WER 72.3% | Pass 0.0%
0.29 / 0.31
raw 1.31 / 1.42
WER 67.9% | Pass 22.2%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
2.83 / 3.50
raw 2.83 / 3.50
WER nan% | Pass 100.0%
1.80 / 2.00
raw 1.80 / 2.00
WER nan% | Pass 100.0%
1.50 / 1.70
raw 1.50 / 1.70
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.46 / 0.49
raw 1.73 / 1.82
WER 132.1% | Pass 26.7%
0.22 / 0.22
raw 2.00 / 2.02
WER 110.1% | Pass 11.0%
0.30 / 0.32
raw 1.72 / 1.85
WER 104.3% | Pass 17.3%
0.37 / 0.40
raw 2.08 / 2.20
WER 78.2% | Pass 18.0%
0.34 / 0.37
raw 1.86 / 2.00
WER 103.8% | Pass 18.4%
0.61 / 0.64
raw 2.32 / 2.42
WER 197.8% | Pass 26.3%
0.00 / 0.00
raw 1.73 / 1.73
WER 101.2% | Pass 0.0%
0.39 / 0.39
raw 1.30 / 1.30
WER 66.0% | Pass 30.0%
0.24 / 0.28
raw 1.20 / 1.40
WER 75.8% | Pass 20.0%
0.00 / 0.00
raw 1.30 / 1.30
WER 162.7% | Pass 0.0%
0.00 / 0.00
raw 1.44 / 1.56
WER 77.5% | Pass 0.0%
0.18 / 0.18
raw 1.14 / 1.14
WER 267.8% | Pass 16.2%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
2.86 / 3.00
raw 2.86 / 3.00
WER nan% | Pass 100.0%
1.50 / 1.80
raw 1.50 / 1.80
WER nan% | Pass 100.0%
1.70 / 1.70
raw 1.70 / 1.70
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.15 / 0.16
raw 1.22 / 1.24
WER 207.7% | Pass 12.7%
0.00 / 0.00
raw 1.39 / 1.41
WER 216.9% | Pass 0.0%
0.00 / 0.00
raw 1.24 / 1.28
WER 185.1% | Pass 0.0%
0.00 / 0.00
raw 1.49 / 1.53
WER 226.7% | Pass 0.0%
0.00 / 0.00
raw 1.33 / 1.33
WER 163.7% | Pass 0.0%
0.00 / 0.00
raw 1.35 / 1.40
WER 100.8% | Pass 0.0%
0.00 / 0.00
raw 1.14 / 1.23
WER 292.6% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 309.0% | Pass 0.0%
0.00 / 0.00
raw 1.10 / 1.10
WER 104.7% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 99.9% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 434.7% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 232.6% | Pass 0.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
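Model | Overall avg | Timbre avg | Style avg | Easy: Tag | Easy: Direct | Easy: Simple-comp | Hard / complex comp: multi-comp | Hard / complex comp: dynamic | Hard / complex comp: layered | Hard / complex comp: conflict | Hard / Persona: scenario | Hard / Persona: character | Special / Abnormal: disfluency | Special / Abnormal: dysphonia | Special / Nonverbal: implicit | Special / Nonverbal: explicit
Each model row lists these 16 cells in order; each cell spans three lines: the score pair, the raw score pair, then WER and pass rate.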
Gemini 2.5-Flash
2.54 / 3.80
raw 2.63 / 3.94
WER 2.4% | Pass 96.4%
2.53 / 3.65
raw 2.63 / 3.79
WER 1.8% | Pass 96.2%
2.66 / 4.05
raw 2.77 / 4.21
WER 2.3% | Pass 96.1%
2.38 / 3.53
raw 2.64 / 3.92
WER 4.1% | Pass 90.0%
2.65 / 4.09
raw 2.76 / 4.26
WER 2.4% | Pass 96.0%
2.77 / 4.09
raw 2.77 / 4.09
WER 1.7% | Pass 100.0%
2.86 / 4.27
raw 2.86 / 4.27
WER 0.4% | Pass 100.0%
2.15 / 2.98
raw 2.36 / 3.27
WER 2.7% | Pass 90.9%
2.60 / 3.80
raw 2.60 / 3.80
WER 1.9% | Pass 100.0%
2.56 / 3.67
raw 2.56 / 3.67
WER 1.0% | Pass 100.0%
2.62 / 4.12
raw 2.62 / 4.12
WER 1.7% | Pass 100.0%
2.39 / 3.58
raw 2.46 / 3.68
WER 2.2% | Pass 97.3%
2.00 / 2.67
raw 2.00 / 2.67
WER nan% | Pass 100.0%
2.71 / 4.14
raw 2.71 / 4.14
WER nan% | Pass 100.0%
2.50 / 3.90
raw 2.50 / 3.90
WER nan% | Pass 100.0%
2.60 / 3.80
raw 2.60 / 3.80
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.48 / 3.64
raw 2.60 / 3.80
WER 2.9% | Pass 95.6%
2.57 / 3.60
raw 2.67 / 3.75
WER 1.7% | Pass 96.2%
2.58 / 3.83
raw 2.70 / 4.02
WER 2.9% | Pass 95.3%
2.35 / 3.48
raw 2.69 / 3.98
WER 6.3% | Pass 87.5%
2.55 / 3.71
raw 2.66 / 3.86
WER 2.3% | Pass 96.0%
2.73 / 4.14
raw 2.73 / 4.14
WER 0.0% | Pass 100.0%
2.91 / 4.41
raw 2.91 / 4.41
WER 0.0% | Pass 100.0%
2.23 / 3.06
raw 2.45 / 3.36
WER 3.4% | Pass 90.9%
2.50 / 3.30
raw 2.50 / 3.30
WER 1.9% | Pass 100.0%
2.50 / 3.50
raw 2.50 / 3.50
WER 0.9% | Pass 100.0%
2.50 / 3.38
raw 2.50 / 3.38
WER 1.7% | Pass 100.0%
2.33 / 3.43
raw 2.46 / 3.62
WER 3.1% | Pass 94.6%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
2.43 / 3.57
raw 2.43 / 3.57
WER nan% | Pass 100.0%
2.70 / 4.10
raw 2.70 / 4.10
WER nan% | Pass 100.0%
2.60 / 4.20
raw 2.60 / 4.20
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.33 / 3.24
raw 2.42 / 3.37
WER 2.6% | Pass 96.1%
2.64 / 3.55
raw 2.69 / 3.62
WER 2.5% | Pass 98.0%
2.29 / 3.23
raw 2.36 / 3.32
WER 2.5% | Pass 97.1%
2.38 / 3.15
raw 2.38 / 3.15
WER 2.2% | Pass 100.0%
2.57 / 3.69
raw 2.57 / 3.69
WER 2.3% | Pass 99.9%
2.82 / 3.91
raw 2.85 / 3.95
WER 2.4% | Pass 99.0%
2.49 / 3.41
raw 2.70 / 3.70
WER 2.9% | Pass 92.3%
1.85 / 2.41
raw 2.09 / 2.73
WER 3.0% | Pass 88.4%
1.81 / 2.53
raw 2.00 / 2.80
WER 2.9% | Pass 90.3%
1.63 / 2.40
raw 1.89 / 2.78
WER 3.1% | Pass 86.5%
2.59 / 3.65
raw 2.75 / 3.88
WER 2.7% | Pass 94.2%
2.58 / 3.84
raw 2.76 / 4.12
WER 2.8% | Pass 93.2%
0.82 / 0.82
raw 1.00 / 1.00
WER 3.5% | Pass 81.7%
1.57 / 1.69
raw 1.86 / 2.00
WER 3.2% | Pass 84.6%
1.10 / 1.28
raw 1.20 / 1.40
WER 3.0% | Pass 91.3%
2.68 / 4.15
raw 2.90 / 4.50
WER 2.9% | Pass 92.3%
Qwen3TTS-12Hz-1.7B-VD
2.37 / 3.21
raw 2.47 / 3.34
WER 2.8% | Pass 96.0%
2.62 / 3.39
raw 2.72 / 3.52
WER 2.0% | Pass 96.2%
2.46 / 3.44
raw 2.58 / 3.61
WER 2.8% | Pass 95.3%
2.43 / 3.24
raw 2.64 / 3.52
WER 3.7% | Pass 92.0%
2.42 / 3.24
raw 2.69 / 3.61
WER 4.3% | Pass 89.8%
2.81 / 3.90
raw 2.81 / 3.90
WER 2.1% | Pass 100.0%
2.86 / 3.91
raw 2.86 / 3.91
WER 0.5% | Pass 100.0%
1.82 / 2.27
raw 1.82 / 2.27
WER 2.3% | Pass 100.0%
1.90 / 2.50
raw 1.90 / 2.50
WER 0.0% | Pass 100.0%
2.44 / 3.56
raw 2.44 / 3.56
WER 1.9% | Pass 100.0%
2.12 / 2.62
raw 2.12 / 2.62
WER 3.1% | Pass 100.0%
2.26 / 3.02
raw 2.32 / 3.11
WER 2.5% | Pass 97.3%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
2.71 / 4.43
raw 2.71 / 4.43
WER nan% | Pass 100.0%
2.00 / 2.60
raw 2.00 / 2.60
WER nan% | Pass 100.0%
2.20 / 3.40
raw 2.20 / 3.40
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
2.12 / 2.94
raw 2.19 / 3.04
WER 2.4% | Pass 96.5%
2.28 / 3.22
raw 2.41 / 3.39
WER 2.0% | Pass 94.9%
2.24 / 3.18
raw 2.31 / 3.28
WER 2.2% | Pass 96.9%
2.18 / 3.02
raw 2.42 / 3.36
WER 5.2% | Pass 90.0%
2.24 / 3.12
raw 2.38 / 3.32
WER 2.8% | Pass 94.0%
2.50 / 3.82
raw 2.50 / 3.82
WER 0.7% | Pass 100.0%
2.50 / 3.68
raw 2.50 / 3.68
WER 0.0% | Pass 100.0%
1.91 / 2.36
raw 1.91 / 2.36
WER 1.7% | Pass 100.0%
1.90 / 2.50
raw 1.90 / 2.50
WER 0.0% | Pass 100.0%
1.90 / 2.60
raw 1.90 / 2.60
WER 0.9% | Pass 100.0%
2.50 / 3.62
raw 2.50 / 3.62
WER 2.4% | Pass 100.0%
2.21 / 3.00
raw 2.27 / 3.08
WER 1.7% | Pass 97.3%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.57
raw 1.29 / 1.57
WER nan% | Pass 100.0%
1.60 / 2.20
raw 1.60 / 2.20
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.72 / 2.15
raw 1.81 / 2.26
WER 3.1% | Pass 95.1%
1.76 / 2.08
raw 1.86 / 2.20
WER 2.2% | Pass 94.7%
1.73 / 2.23
raw 1.84 / 2.38
WER 3.2% | Pass 93.7%
1.62 / 1.94
raw 1.80 / 2.16
WER 5.2% | Pass 90.0%
1.77 / 2.33
raw 1.89 / 2.49
WER 3.0% | Pass 93.6%
1.99 / 2.47
raw 2.10 / 2.60
WER 0.4% | Pass 95.0%
1.64 / 2.00
raw 1.64 / 2.00
WER 0.4% | Pass 100.0%
2.09 / 2.73
raw 2.09 / 2.73
WER 1.1% | Pass 100.0%
1.62 / 2.16
raw 1.80 / 2.40
WER 3.1% | Pass 90.0%
1.53 / 1.80
raw 1.70 / 2.00
WER 2.7% | Pass 90.0%
2.12 / 2.88
raw 2.12 / 2.88
WER 3.1% | Pass 100.0%
1.51 / 1.79
raw 1.56 / 1.84
WER 3.0% | Pass 96.9%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
2.14 / 3.00
raw 2.14 / 3.00
WER nan% | Pass 100.0%
1.50 / 1.90
raw 1.50 / 1.90
WER nan% | Pass 100.0%
1.90 / 2.30
raw 1.90 / 2.30
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
1.10 / 1.10
raw 1.41 / 1.41
WER 11.9% | Pass 78.3%
1.32 / 1.32
raw 1.84 / 1.84
WER 12.5% | Pass 72.2%
1.08 / 1.08
raw 1.38 / 1.38
WER 11.5% | Pass 77.7%
1.17 / 1.17
raw 1.62 / 1.62
WER 13.6% | Pass 72.0%
1.11 / 1.11
raw 1.68 / 1.68
WER 15.4% | Pass 66.0%
1.37 / 1.37
raw 1.77 / 1.77
WER 8.9% | Pass 77.3%
1.23 / 1.23
raw 1.50 / 1.50
WER 6.2% | Pass 81.8%
0.86 / 0.86
raw 1.18 / 1.18
WER 19.5% | Pass 72.7%
1.20 / 1.20
raw 1.20 / 1.20
WER 3.8% | Pass 100.0%
1.08 / 1.08
raw 1.20 / 1.20
WER 9.0% | Pass 90.0%
0.75 / 0.75
raw 1.00 / 1.00
WER 14.5% | Pass 75.0%
0.86 / 0.86
raw 1.11 / 1.11
WER 10.0% | Pass 77.8%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.61 / 0.65
raw 1.77 / 1.89
WER 37.5% | Pass 34.4%
0.57 / 0.58
raw 2.20 / 2.27
WER 35.0% | Pass 25.7%
0.53 / 0.58
raw 1.74 / 1.90
WER 37.6% | Pass 30.5%
0.52 / 0.55
raw 2.23 / 2.36
WER 41.6% | Pass 23.4%
0.60 / 0.64
raw 2.04 / 2.21
WER 39.7% | Pass 29.2%
1.01 / 1.12
raw 2.23 / 2.45
WER 27.5% | Pass 45.5%
0.32 / 0.33
raw 1.67 / 1.71
WER 36.8% | Pass 19.0%
0.35 / 0.35
raw 1.27 / 1.27
WER 31.1% | Pass 27.3%
0.36 / 0.36
raw 1.20 / 1.20
WER 34.7% | Pass 30.0%
0.45 / 0.45
raw 1.50 / 1.50
WER 35.9% | Pass 30.0%
0.17 / 0.19
raw 1.38 / 1.50
WER 36.6% | Pass 12.5%
0.13 / 0.13
raw 1.16 / 1.16
WER 39.3% | Pass 10.8%
1.33 / 1.33
raw 1.33 / 1.33
WER nan% | Pass 100.0%
2.71 / 3.57
raw 2.71 / 3.57
WER nan% | Pass 100.0%
1.78 / 2.00
raw 1.78 / 2.00
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.29 / 0.29
raw 1.38 / 1.39
WER 50.6% | Pass 21.0%
0.24 / 0.24
raw 1.72 / 1.72
WER 50.2% | Pass 13.9%
0.11 / 0.11
raw 1.37 / 1.39
WER 50.7% | Pass 7.7%
0.16 / 0.16
raw 1.60 / 1.64
WER 52.4% | Pass 10.0%
0.18 / 0.18
raw 1.50 / 1.52
WER 51.5% | Pass 12.0%
0.17 / 0.17
raw 1.82 / 1.82
WER 53.1% | Pass 9.1%
0.13 / 0.13
raw 1.45 / 1.45
WER 51.3% | Pass 9.1%
0.00 / 0.00
raw 1.18 / 1.18
WER 47.2% | Pass 0.0%
0.10 / 0.10
raw 1.00 / 1.00
WER 38.7% | Pass 10.0%
0.11 / 0.11
raw 1.10 / 1.10
WER 41.9% | Pass 10.0%
0.25 / 0.25
raw 1.00 / 1.00
WER 37.0% | Pass 25.0%
0.03 / 0.03
raw 1.03 / 1.03
WER 55.2% | Pass 2.9%
1.33 / 1.33
raw 1.33 / 1.33
WER nan% | Pass 100.0%
1.86 / 1.86
raw 1.86 / 1.86
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.24 / 0.24
raw 1.57 / 1.60
WER 67.6% | Pass 15.0%
0.07 / 0.07
raw 1.96 / 1.97
WER 66.8% | Pass 3.8%
0.04 / 0.04
raw 1.55 / 1.58
WER 67.5% | Pass 2.3%
0.15 / 0.15
raw 1.82 / 1.88
WER 66.9% | Pass 8.2%
0.00 / 0.00
raw 1.90 / 1.94
WER 75.3% | Pass 0.0%
0.00 / 0.00
raw 1.73 / 1.73
WER 68.6% | Pass 0.0%
0.07 / 0.07
raw 1.59 / 1.59
WER 62.3% | Pass 4.5%
0.00 / 0.00
raw 1.55 / 1.55
WER 62.5% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 57.7% | Pass 0.0%
0.00 / 0.00
raw 1.30 / 1.30
WER 62.1% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 65.7% | Pass 0.0%
0.00 / 0.00
raw 1.32 / 1.35
WER 64.5% | Pass 0.0%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
2.29 / 2.57
raw 2.29 / 2.57
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
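Model | Overall avg | Timbre avg | Style avg | Easy: Tag | Easy: Direct | Easy: Simple-comp | Hard / complex comp: multi-comp | Hard / complex comp: dynamic | Hard / complex comp: layered | Hard / complex comp: conflict | Hard / Persona: scenario | Hard / Persona: character | Special / Abnormal: disfluency | Special / Abnormal: dysphonia | Special / Nonverbal: implicit | Special / Nonverbal: explicit
Each model row lists these 16 cells in order; each cell spans three lines: the score pair, the raw score pair, then WER and pass rate.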
Gemini 2.5-Flash
2.59 / 3.94
raw 2.69 / 4.08
WER 5.8% | Pass 96.4%
2.47 / 3.65
raw 2.57 / 3.80
WER 5.8% | Pass 96.2%
2.63 / 4.01
raw 2.76 / 4.21
WER 6.8% | Pass 95.4%
2.45 / 3.71
raw 2.72 / 4.12
WER 9.3% | Pass 90.0%
2.66 / 4.10
raw 2.84 / 4.37
WER 6.7% | Pass 93.9%
2.86 / 4.43
raw 3.00 / 4.64
WER 6.7% | Pass 95.5%
2.45 / 3.68
raw 2.45 / 3.68
WER 4.2% | Pass 100.0%
2.64 / 3.91
raw 2.64 / 3.91
WER 3.3% | Pass 100.0%
2.70 / 3.90
raw 2.70 / 3.90
WER 6.6% | Pass 100.0%
2.44 / 3.44
raw 2.44 / 3.44
WER 3.6% | Pass 100.0%
2.78 / 4.22
raw 2.78 / 4.22
WER 2.1% | Pass 100.0%
2.68 / 4.16
raw 2.68 / 4.16
WER 2.4% | Pass 100.0%
1.83 / 2.17
raw 1.83 / 2.17
WER nan% | Pass 100.0%
2.71 / 4.14
raw 2.71 / 4.14
WER nan% | Pass 100.0%
2.70 / 4.30
raw 2.70 / 4.30
WER nan% | Pass 100.0%
2.50 / 3.90
raw 2.50 / 3.90
WER nan% | Pass 100.0%
Gemini 2.5-Pro
2.62 / 3.80
raw 2.69 / 3.89
WER 6.6% | Pass 97.6%
2.57 / 3.55
raw 2.71 / 3.73
WER 7.5% | Pass 94.9%
2.69 / 3.95
raw 2.73 / 4.01
WER 7.4% | Pass 98.5%
2.48 / 3.55
raw 2.58 / 3.70
WER 6.7% | Pass 96.0%
2.52 / 3.55
raw 2.74 / 3.86
WER 8.4% | Pass 92.0%
2.86 / 4.18
raw 2.86 / 4.18
WER 7.1% | Pass 100.0%
2.95 / 4.18
raw 2.95 / 4.18
WER 5.8% | Pass 100.0%
2.50 / 3.90
raw 2.50 / 3.90
WER 4.7% | Pass 100.0%
2.60 / 4.00
raw 2.60 / 4.00
WER 9.4% | Pass 100.0%
2.44 / 3.22
raw 2.44 / 3.22
WER 10.5% | Pass 100.0%
2.89 / 4.33
raw 2.89 / 4.33
WER 2.6% | Pass 100.0%
2.70 / 3.89
raw 2.70 / 3.89
WER 3.6% | Pass 100.0%
2.17 / 2.50
raw 2.17 / 2.50
WER nan% | Pass 100.0%
3.00 / 5.00
raw 3.00 / 5.00
WER nan% | Pass 100.0%
2.60 / 4.00
raw 2.60 / 4.00
WER nan% | Pass 100.0%
2.40 / 3.70
raw 2.40 / 3.70
WER nan% | Pass 100.0%
ElevenLabs-ttv-v3
2.37 / 3.29
raw 2.42 / 3.36
WER 6.3% | Pass 97.8%
2.69 / 3.63
raw 2.69 / 3.64
WER 6.0% | Pass 99.8%
2.37 / 3.19
raw 2.40 / 3.23
WER 6.2% | Pass 98.8%
2.43 / 3.29
raw 2.43 / 3.29
WER 5.4% | Pass 100.0%
2.44 / 3.33
raw 2.44 / 3.33
WER 5.5% | Pass 100.0%
2.80 / 3.90
raw 2.80 / 3.90
WER 5.8% | Pass 100.0%
2.63 / 3.57
raw 2.80 / 3.80
WER 6.9% | Pass 93.9%
2.13 / 2.78
raw 2.36 / 3.09
WER 7.2% | Pass 90.0%
2.04 / 2.66
raw 2.22 / 2.89
WER 7.1% | Pass 91.9%
1.76 / 2.20
raw 2.00 / 2.50
WER 7.6% | Pass 88.0%
2.66 / 4.05
raw 2.78 / 4.22
WER 6.6% | Pass 95.8%
2.68 / 4.16
raw 2.82 / 4.38
WER 6.8% | Pass 94.9%
0.83 / 0.83
raw 1.00 / 1.00
WER 8.5% | Pass 83.1%
1.35 / 1.84
raw 1.57 / 2.14
WER 7.9% | Pass 86.1%
1.02 / 1.02
raw 1.10 / 1.10
WER 7.2% | Pass 92.9%
2.35 / 3.57
raw 2.50 / 3.80
WER 6.9% | Pass 93.9%
Qwen3TTS-12Hz-1.7B-VD
2.05 / 2.63
raw 2.16 / 2.76
WER 10.0% | Pass 95.2%
2.27 / 2.83
raw 2.42 / 3.03
WER 10.9% | Pass 93.6%
2.08 / 2.67
raw 2.22 / 2.85
WER 11.7% | Pass 93.7%
1.86 / 2.25
raw 2.16 / 2.62
WER 14.1% | Pass 86.0%
2.14 / 2.86
raw 2.35 / 3.13
WER 13.2% | Pass 91.3%
2.77 / 3.82
raw 2.77 / 3.82
WER 8.2% | Pass 100.0%
2.55 / 3.09
raw 2.55 / 3.09
WER 7.8% | Pass 100.0%
1.82 / 2.36
raw 1.82 / 2.36
WER 7.1% | Pass 100.0%
2.22 / 2.89
raw 2.22 / 2.89
WER 10.4% | Pass 100.0%
1.48 / 1.68
raw 1.67 / 1.89
WER 11.3% | Pass 88.9%
1.22 / 1.22
raw 1.22 / 1.22
WER 5.6% | Pass 100.0%
2.22 / 2.84
raw 2.22 / 2.84
WER 4.4% | Pass 100.0%
1.33 / 1.67
raw 1.33 / 1.67
WER nan% | Pass 100.0%
1.86 / 2.57
raw 1.86 / 2.57
WER nan% | Pass 100.0%
1.50 / 1.90
raw 1.50 / 1.90
WER nan% | Pass 100.0%
1.80 / 2.60
raw 1.80 / 2.60
WER nan% | Pass 100.0%
MiniMax-Speech-2.7
2.03 / 2.61
raw 2.07 / 2.66
WER 6.0% | Pass 98.0%
2.33 / 3.05
raw 2.39 / 3.13
WER 5.6% | Pass 97.5%
2.16 / 2.87
raw 2.21 / 2.94
WER 7.0% | Pass 97.7%
2.16 / 2.90
raw 2.30 / 3.08
WER 7.2% | Pass 94.0%
1.98 / 2.55
raw 2.06 / 2.66
WER 6.3% | Pass 96.0%
2.73 / 3.77
raw 2.73 / 3.77
WER 5.9% | Pass 100.0%
2.41 / 3.32
raw 2.41 / 3.32
WER 4.9% | Pass 100.0%
2.18 / 2.73
raw 2.18 / 2.73
WER 8.5% | Pass 100.0%
2.00 / 2.40
raw 2.00 / 2.40
WER 6.5% | Pass 100.0%
1.89 / 2.00
raw 1.89 / 2.00
WER 7.8% | Pass 100.0%
2.33 / 2.67
raw 2.33 / 2.67
WER 2.0% | Pass 100.0%
2.00 / 2.54
raw 2.00 / 2.54
WER 4.3% | Pass 100.0%
1.40 / 1.80
raw 1.40 / 1.80
WER nan% | Pass 100.0%
1.29 / 1.57
raw 1.29 / 1.57
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
GPT-4o-Mini-TTS
1.88 / 2.35
raw 1.93 / 2.40
WER 5.8% | Pass 97.6%
1.96 / 2.36
raw 2.04 / 2.46
WER 7.0% | Pass 96.2%
2.05 / 2.66
raw 2.10 / 2.72
WER 6.2% | Pass 97.5%
2.08 / 2.44
raw 2.12 / 2.49
WER 4.9% | Pass 98.0%
1.97 / 2.56
raw 2.10 / 2.72
WER 7.6% | Pass 94.0%
2.19 / 2.90
raw 2.19 / 2.90
WER 4.6% | Pass 100.0%
2.14 / 2.77
raw 2.14 / 2.77
WER 4.7% | Pass 100.0%
1.90 / 2.48
raw 2.09 / 2.73
WER 12.4% | Pass 90.9%
1.90 / 2.20
raw 1.90 / 2.20
WER 6.1% | Pass 100.0%
1.78 / 2.22
raw 1.78 / 2.22
WER 9.1% | Pass 100.0%
1.67 / 1.89
raw 1.67 / 1.89
WER 1.6% | Pass 100.0%
1.50 / 1.79
raw 1.54 / 1.84
WER 4.4% | Pass 97.3%
1.17 / 1.17
raw 1.17 / 1.17
WER nan% | Pass 100.0%
2.14 / 3.00
raw 2.14 / 3.00
WER nan% | Pass 100.0%
1.30 / 1.50
raw 1.30 / 1.50
WER nan% | Pass 100.0%
1.80 / 2.50
raw 1.80 / 2.50
WER nan% | Pass 100.0%
Ming-omni-tts-16.8B-A3B
0.89 / 0.91
raw 1.52 / 1.56
WER 39.9% | Pass 58.8%
1.15 / 1.16
raw 1.92 / 1.93
WER 36.4% | Pass 60.0%
0.77 / 0.78
raw 1.53 / 1.56
WER 41.2% | Pass 50.0%
1.07 / 1.09
raw 1.93 / 1.96
WER 41.3% | Pass 55.6%
0.70 / 0.71
raw 1.49 / 1.51
WER 44.5% | Pass 46.9%
0.78 / 0.82
raw 2.05 / 2.14
WER 41.2% | Pass 38.1%
0.99 / 0.99
raw 1.68 / 1.68
WER 35.2% | Pass 59.1%
0.99 / 0.99
raw 1.36 / 1.36
WER 28.8% | Pass 72.7%
0.80 / 0.80
raw 1.00 / 1.00
WER 25.8% | Pass 80.0%
0.68 / 0.68
raw 1.22 / 1.22
WER 38.7% | Pass 55.6%
0.56 / 0.56
raw 1.00 / 1.00
WER 37.4% | Pass 55.6%
0.66 / 0.70
raw 1.35 / 1.43
WER 40.7% | Pass 48.6%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.57 / 1.57
raw 1.57 / 1.57
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.20 / 1.30
raw 1.20 / 1.30
WER nan% | Pass 100.0%
MOSS-VoiceGenerator
0.30 / 0.31
raw 1.75 / 1.85
WER 86.7% | Pass 17.0%
0.08 / 0.08
raw 2.04 / 2.10
WER 87.9% | Pass 3.8%
0.12 / 0.13
raw 1.72 / 1.87
WER 83.9% | Pass 7.1%
0.16 / 0.16
raw 1.92 / 2.00
WER 92.4% | Pass 8.2%
0.00 / 0.00
raw 2.02 / 2.14
WER 94.5% | Pass 0.0%
0.20 / 0.21
raw 2.18 / 2.32
WER 88.4% | Pass 9.1%
0.16 / 0.17
raw 1.73 / 1.91
WER 77.7% | Pass 9.1%
0.00 / 0.00
raw 1.60 / 1.70
WER 71.9% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 72.1% | Pass 0.0%
0.16 / 0.16
raw 1.25 / 1.25
WER 76.2% | Pass 12.5%
0.00 / 0.00
raw 1.56 / 1.56
WER 76.8% | Pass 0.0%
0.00 / 0.00
raw 1.34 / 1.37
WER 86.7% | Pass 0.0%
1.67 / 1.67
raw 1.67 / 1.67
WER nan% | Pass 100.0%
2.43 / 2.86
raw 2.43 / 2.86
WER nan% | Pass 100.0%
1.40 / 1.60
raw 1.40 / 1.60
WER nan% | Pass 100.0%
1.50 / 1.50
raw 1.50 / 1.50
WER nan% | Pass 100.0%
MiMo-Audio-7B-Instruct
0.19 / 0.19
raw 1.41 / 1.44
WER 104.8% | Pass 13.5%
0.02 / 0.02
raw 1.79 / 1.81
WER 101.6% | Pass 1.3%
0.00 / 0.00
raw 1.37 / 1.41
WER 96.1% | Pass 0.0%
0.04 / 0.04
raw 1.78 / 1.78
WER 101.1% | Pass 2.0%
0.00 / 0.00
raw 1.57 / 1.67
WER 102.1% | Pass 0.0%
0.00 / 0.00
raw 1.55 / 1.59
WER 96.2% | Pass 0.0%
0.00 / 0.00
raw 1.52 / 1.52
WER 96.7% | Pass 0.0%
0.00 / 0.00
raw 1.18 / 1.18
WER 94.2% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 94.3% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 89.0% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 103.5% | Pass 0.0%
0.00 / 0.00
raw 1.24 / 1.30
WER 133.5% | Pass 0.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.14 / 1.14
raw 1.14 / 1.14
WER nan% | Pass 100.0%
1.20 / 1.20
raw 1.20 / 1.20
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
Ming-omni-tts-0.5B
0.16 / 0.17
raw 1.24 / 1.26
WER 179.3% | Pass 13.1%
0.00 / 0.00
raw 1.48 / 1.53
WER 256.1% | Pass 0.0%
0.00 / 0.00
raw 1.19 / 1.22
WER 146.1% | Pass 0.0%
0.00 / 0.00
raw 1.41 / 1.45
WER 316.9% | Pass 0.0%
0.00 / 0.00
raw 1.36 / 1.38
WER 153.8% | Pass 0.0%
0.00 / 0.00
raw 1.68 / 1.77
WER 158.6% | Pass 0.0%
0.00 / 0.00
raw 1.05 / 1.05
WER 113.8% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 101.8% | Pass 0.0%
0.00 / 0.00
raw 1.10 / 1.10
WER 237.9% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 181.8% | Pass 0.0%
0.00 / 0.00
raw 1.22 / 1.33
WER 104.7% | Pass 0.0%
0.00 / 0.00
raw 1.00 / 1.00
WER 105.7% | Pass 0.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.29 / 1.29
raw 1.29 / 1.29
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
1.00 / 1.00
raw 1.00 / 1.00
WER nan% | Pass 100.0%
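
Two patterns in the tables above can be checked with a short sketch. First, the reported score pair appears to equal the raw score pair scaled by the pass rate (for example, raw 1.50 / 1.53 at Pass 36.0% against a reported 0.54 / 0.55). This is an inference from the published numbers, not a formula stated on this page. Second, WER values above 100% are possible under the standard definition, (substitutions + deletions + insertions) / reference length, whenever the hypothesis contains many inserted words. The function names below are hypothetical, and the whitespace tokenization is a simplifying assumption.

```python
def gated_score(raw: float, pass_rate_pct: float) -> float:
    """Assumed relation: raw score scaled by the pass rate (in percent)."""
    return raw * pass_rate_pct / 100.0


def is_consistent(reported: float, raw: float, pass_rate_pct: float,
                  tol: float = 0.015) -> bool:
    """Check a table cell against the assumed relation, allowing for rounding."""
    return abs(gated_score(raw, pass_rate_pct) - reported) <= tol


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance (whitespace tokens)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Cells copied from the tables above: (reported, raw, pass rate %).
    for cell in [(0.54, 1.50, 36.0), (2.54, 2.63, 96.4), (1.42, 1.55, 92.0)]:
        print(cell, is_consistent(*cell))
    # Two inserted words against a one-word reference give a WER of 200%.
    print(f"{wer('hello', 'hello hello hello'):.0%}")
```

If the assumed relation holds, a cell with Pass 0.0% has a reported score of 0.00 regardless of its raw score, which matches the many 0.00 / 0.00 entries for the weaker open-source models.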

Demos

Instruction Following

Notes:
  • Due to platform restrictions (e.g., safety filters), some models failed to synthesize certain instructions. As a result, audio files may be missing for some model-sample pairs.
  • ElevenLabs imposes a minimum text length requirement for synthesis. To satisfy this, the input text was repeated, meaning you will hear repeated speech in most ElevenLabs samples.

Timbre Diversity

Timbre Diversity Score (TDS) measures the ability to generate multiple diverse voices for the same instruction. Higher scores indicate greater voice variety. For each model, we fix a text and an instruction, generate 10 candidate samples, and keep only the samples that pass the instruction-following evaluation for TDS computation. Below, we randomly select one Chinese case and one English case per model for demonstration.

Model | English (EN) | Chinese (ZH)
Gemini 2.5-Flash | 0.46 | 0.32
Gemini 2.5-Pro | 0.57 | 0.39
ElevenLabs-ttv-v3 | 0.31 | 0.18
MiniMax-Speech-2.7 | 0.44 | 0.38
Hume-Octave1 | 0.37 | -
GPT-4o-Mini-TTS | 0.37 | 0.25
Qwen3TTS-12Hz-1.7B-VD | 0.37 | 0.28
MOSS-VoiceGenerator | 0.47 | 0.35
MiMo-Audio-7B-Instruct | 0.52 | 0.37
Ming-omni-tts-16.8B-A3B | 0.32 | 0.22
Ming-omni-tts-0.5B | 0.35 | 0.23
Parler-TTS Large | 0.42 | -
Parler-TTS Mini | 0.33 | -
VoiceSculptor | - | 0.32
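
The TDS procedure described above (generate candidates, keep those that pass instruction-following evaluation, then score their variety) can be sketched as follows. This page does not state how diversity is computed from the retained samples. The sketch assumes mean pairwise cosine distance between per-sample speaker embeddings, which is an illustrative choice rather than the benchmark's confirmed metric. The function names and the embedding inputs are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def timbre_diversity(embeddings: Sequence[Sequence[float]],
                     passed: Sequence[bool]) -> Optional[float]:
    """Mean pairwise distance over samples that passed instruction following.

    Assumption: `embeddings` holds one speaker embedding per candidate and
    `passed` holds the matching instruction-following verdicts.
    Returns None when fewer than two samples pass, since no pair exists.
    """
    kept = [e for e, ok in zip(embeddings, passed) if ok]
    if len(kept) < 2:
        return None
    pairs = list(combinations(kept, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 2-D "embeddings" for four candidates; the last one failed the check.
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
    verdicts = [True, True, True, False]
    print(timbre_diversity(candidates, verdicts))
```

Filtering before scoring matters under this reading: a model that produces varied voices only by ignoring the instruction gains no credit, because failed samples never enter the pairwise comparison.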

🔊 Listen to Timbre Diversity Samples

Select a model to hear multiple diverse voice renditions of the same instruction in English and Chinese.

Ethical Considerations & Disclaimer

Content Disclaimer: Owing to the large-scale, automated nature of our LLM-based data construction pipeline, we are unable to manually review every single instruction-text pair. As a result, the dataset may inadvertently contain unexpected or inappropriate content. Please note that the semantic content within the reference texts and the synthesized audio samples does NOT reflect the views, opinions, or endorsements of the authors or their affiliated institutions.

AI-Generated Voices: All audio samples showcased in this benchmark are entirely AI-generated by the evaluated Text-to-Speech models based on the given textual prompts. Any resemblance of the synthesized voices to real individuals is purely coincidental and an artifact of the models' internal capabilities and training data distributions. MINT-Bench does not intend to spoof, clone, or misrepresent any real person's identity.

Intended Use: MINT-Bench, its associated datasets, and these audio demonstrations are intended solely for academic and research purposes. The benchmark is designed to facilitate transparent evaluation and foster the responsible development of controllable speech generation systems.