Experimental Results & Delay Analysis
Statistics of generated speech corpus
User inquiry speech corpus.
The Easy / Medium / Hard interruption difficulties differ in total speech duration. For the noisy corpus, we generate three SNR configurations (0, 10, and 20) for the Easy portion of F5TTS and CosyVoice2.
| ID | TTS Engine | Duration in hours (E / M / H) | SNR |
|---|---|---|---|
| 1 | ChatTTS | 4.2 / 2.8 / 2.3 | - |
| 2 | F5TTS | 3.8 / 2.4 / 2.0 | - |
| 3 | +Noise-gap | 3.8 / - / - | 0/10/20 |
| 4 | +Noise-bg | 3.8 / - / - | 0/10/20 |
| 5 | CosyVoice2 | 4.2 / 2.8 / 2.4 | - |
| 6 | +Noise-gap | 4.2 / - / - | 0/10/20 |
| 7 | +Noise-bg | 4.2 / - / - | 0/10/20 |
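The text does not describe how noise is mixed into the noisy corpus. As an illustration only, the sketch below shows one common way to reach a target SNR: scale the noise so that the ratio of speech power to noise power matches the target, then add it to the speech. It assumes the SNR values 0/10/20 are in dB and that the noise overlaps the whole utterance, which is closer to a background-noise setting than to noise placed in gaps. The function names `mix_at_snr` and `measured_snr_db` and the synthetic signals are hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Target noise power is P_speech / 10^(SNR/10); the gain is applied to amplitude.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mixture given the clean speech (sanity check)."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_resid = sum((m - s) ** 2 for s, m in zip(speech, mixed)) / len(speech)
    return 10 * math.log10(p_speech / p_resid)


# Synthetic stand-ins for real audio (assumption): a 220 Hz tone and Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for target in (0, 10, 20):
    mixed = mix_at_snr(speech, noise, target)
    print(target, round(measured_snr_db(speech, mixed), 3))
```

Placing noise only in the pauses of an utterance would need a different treatment, since the reference power would then have to be defined over the speech segments rather than the full signal.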
Metrics summary
This table evaluates model robustness under varying levels of user interruption. The Data column gives the corpus ID from the corpus statistics table followed by the difficulty level (E, M, or H); for example, 5-E is the Easy portion of the CosyVoice2 corpus. A dash marks a value that is not reported. Only the results from speech input simulated using CosyVoice2 are reported in this paper. WER, SRR, SRIR, SIR, EIR, and C-PPL are assessed over the entire output speech response, while IRD, FSED, ERT, and EIT are reported as the median of all collected time values to mitigate the influence of outliers. The Score is the arithmetic mean of the six scores corresponding to the six sub-perspectives, computed over all successful responses.
| Model | Data | WER↓ | SRR↑ | SRIR↑ | SIR↑ | EIR↓ | IRD↓ | FSED↓ | ERT | EIT | C-PPL | Score↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Moshi | 1-E | 6.0 | 49.8 | 73.3 | 80.8 | 33.2 | 1660 | 1545 | 232 | 1218 | 16 | - |
| Moshi | 1-M | 6.2 | 36.6 | 63.0 | 76.8 | 31.5 | 1436 | 1071 | 192 | 1245 | 22 | - |
| Moshi | 1-H | 5.4 | 27.5 | 63.5 | 71.6 | 26.6 | 1499 | 1005 | 168 | 1180 | 26 | - |
| Moshi | 2-E | 5.8 | 64.5 | 85.7 | 71.4 | 16.9 | 1472 | 212 | 61 | 1750 | 16 | - |
| Moshi | 2-M | 4.8 | 46.8 | 75.8 | 65.0 | 15.3 | 1404 | 130 | 55 | 1903 | 21 | - |
| Moshi | 2-H | 5.7 | 34.4 | 72.5 | 63.5 | 12.1 | 1451 | 127 | 58 | 1663 | 29 | - |
| Moshi | 5-E | 5.3 | 61.7 | 79.4 | 83.1 | 27.9 | 1345 | 4155 | 221 | 2008 | 20 | 4.43 |
| Moshi | 5-M | 5.0 | 45.6 | 78.4 | 77.1 | 23.1 | 1453 | 2735 | 235 | 1887 | 18 | 4.55 |
| Moshi | 5-H | 5.0 | 34.1 | 75.3 | 73.0 | 19.5 | 1527 | 2020 | 235 | 1649 | 25 | 4.42 |
| Freeze-omni | 1-E | - | 14.3 | 49.9 | 72.5 | 20.5 | 3461 | 632 | 249 | 1299 | 34 | - |
| Freeze-omni | 1-M | - | 10.9 | 30.6 | 50.7 | 12.1 | 13067 | 624 | 262 | 1470 | 24 | - |
| Freeze-omni | 1-H | - | 10.0 | 21.8 | 46.6 | 10.3 | 13359 | 534 | 228 | 1675 | 174 | - |
| Freeze-omni | 2-E | - | 14.5 | 40.0 | 64.8 | 20.2 | 12407 | 685 | 263 | 1510 | 53 | - |
| Freeze-omni | 2-M | - | 14.8 | 29.0 | 47.9 | 13.5 | 12315 | 648 | 235 | 1353 | 132 | - |
| Freeze-omni | 2-H | - | 14.3 | 22.9 | 45.0 | 14.6 | 12158 | 656 | 274 | 1454 | 145 | - |
| Freeze-omni | 5-E | - | 11.3 | 35.5 | 66.5 | 31.5 | 3618 | 515 | 180 | 2027 | 73 | 3.29 |
| Freeze-omni | 5-M | - | 11.7 | 21.7 | 49.4 | 23.5 | 12200 | 449 | 259 | 2047 | 80 | 3.38 |
| Freeze-omni | 5-H | - | 11.2 | 18.3 | 49.0 | 23.7 | 11927 | 456 | 311 | 2093 | 110 | 3.14 |
| VITA-1.5 | 1-E | - | 20.3 | 56.0 | 66.8 | 9.2 | 15288 | 4847 | 255 | 2122 | 29 | - |
| VITA-1.5 | 1-M | - | 13.9 | 48.0 | 69.3 | 18.4 | 14144 | 2932 | 325 | 2046 | 47 | - |
| VITA-1.5 | 1-H | - | 15.4 | 41.6 | 67.8 | 18.1 | 13768 | 1833 | 239 | 1916 | 43 | - |
| VITA-1.5 | 2-E | - | 21.0 | 53.3 | 68.2 | 14.0 | 13349 | 4853 | 312 | 1872 | 27 | - |
| VITA-1.5 | 2-M | - | 17.8 | 46.0 | 71.8 | 17.9 | 12632 | 2602 | 218 | 1829 | 52 | - |
| VITA-1.5 | 2-H | - | 17.4 | 40.0 | 68.1 | 18.9 | 12164 | 1989 | 164 | 1845 | 45 | - |
| VITA-1.5 | 5-E | - | 26.9 | 54.7 | 75.7 | 25.1 | 13063 | 4242 | 234 | 2289 | 28 | 2.68 |
| VITA-1.5 | 5-M | - | 18.3 | 48.5 | 81.9 | 32.5 | 10759 | 2601 | 176 | 2174 | 60 | 2.44 |
| VITA-1.5 | 5-H | - | 17.1 | 45.7 | 78.0 | 35.1 | 4651 | 1840 | 253 | 2270 | 50 | 2.27 |
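The two aggregation rules stated in the caption can be sketched as follows: timing metrics are summarized by the median so that a few very long delays do not dominate, and the Score is the arithmetic mean of six sub-perspective scores. The function names and all input values below are hypothetical and are not taken from the experiments.

```python
from statistics import mean, median


def median_delay(delays):
    """Summarize a list of collected time values by their median."""
    if not delays:
        raise ValueError("no delay values collected")
    return median(delays)


def overall_score(sub_scores):
    """Arithmetic mean of the six sub-perspective scores."""
    if len(sub_scores) != 6:
        raise ValueError("expected exactly six sub-perspective scores")
    return mean(sub_scores)


# Hypothetical delay samples with one large outlier.
delays = [120, 135, 150, 160, 9000]
print("median:", median_delay(delays))  # 150, unaffected by the outlier
print("mean:  ", mean(delays))          # pulled far upward by the outlier

# Hypothetical sub-perspective scores for one model.
print("score: ", overall_score([4.5, 4.0, 5.0, 4.5, 4.0, 4.5]))
```

The contrast between the median and the mean on the same samples shows why the median is the more stable summary when a small number of responses are very slow.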
Delay Distribution
[Figure: delay distributions for Moshi, Freeze-omni, and VITA-1.5. Each model has nine panels: ChatTTS, F5TTS, and CosyVoice2 input, each at Easy, Medium, and Hard difficulty.]