Demos for FD-Bench
This page presents some of the input samples synthesized with the ChatTTS, F5TTS, and CosyVoice2 models, along with the outputs of the Moshi, Freeze-omni, and VITA-1.5 models.
Audio Example
Here we show three interruption difficulty levels (easy, medium, and hard) for each TTS model.
The audio samples are generated from the following simulated conversation.
{
"num_rounds": 4,
"user": "Can you suggest a quick and easy dinner recipe? <Further Inquiry> Oh, do you know any vegetarian options? <Further Inquiry> Also, I need a dessert idea that's not too sweet. <Topic Shift> Got it. By the way, any tips on reducing food waste? ",
"ai": "Absolutely! Try making a stir-fry with your choice of protein, mixed vegetables, and soy sauce. Serve it with rice or noodles for a complete meal. <Further Inquiry> Of course, you can use tofu or tempeh instead of meat. Eggplant and bell peppers add great flavor and texture as well. <Further Inquiry> A fruit salad with a drizzle of honey and mint leaves is refreshing and not overly sweet. <Topic Shift> Try meal planning, storing food properly, and using leftovers creatively. Composting is another way to reduce waste. "
}
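The conversation above stores all user turns in one string and all AI turns in another, separated by the `<Further Inquiry>` and `<Topic Shift>` markers. The following is a minimal sketch of how such a record could be split into per-round pairs. It assumes the two markers shown above are the only ones in use; the names `split_rounds` and `TURN_MARKER` are hypothetical and are not part of FD-Bench.

```python
import json
import re

# Assumption: these two markers (as seen in the example) are the only turn separators.
TURN_MARKER = re.compile(r"\s*<(Further Inquiry|Topic Shift)>\s*")


def _split_turns(text):
    """Split one speaker's string into turns and the markers that precede them."""
    parts = TURN_MARKER.split(text.strip())
    # With one capture group, re.split yields: text, marker, text, marker, ...
    turns = parts[0::2]
    markers = [None] + parts[1::2]  # the first turn has no preceding marker
    return turns, markers


def split_rounds(conversation):
    """Pair each user turn with the AI turn of the same round."""
    user_turns, markers = _split_turns(conversation["user"])
    ai_turns, _ = _split_turns(conversation["ai"])
    expected = conversation["num_rounds"]
    if not (len(user_turns) == len(ai_turns) == expected):
        raise ValueError("turn counts do not match num_rounds")
    return [
        {"transition": m, "user": u, "ai": a}
        for m, u, a in zip(markers, user_turns, ai_turns)
    ]


# The simulated conversation from this page.
conversation = json.loads("""
{
"num_rounds": 4,
"user": "Can you suggest a quick and easy dinner recipe? <Further Inquiry> Oh, do you know any vegetarian options? <Further Inquiry> Also, I need a dessert idea that's not too sweet. <Topic Shift> Got it. By the way, any tips on reducing food waste? ",
"ai": "Absolutely! Try making a stir-fry with your choice of protein, mixed vegetables, and soy sauce. Serve it with rice or noodles for a complete meal. <Further Inquiry> Of course, you can use tofu or tempeh instead of meat. Eggplant and bell peppers add great flavor and texture as well. <Further Inquiry> A fruit salad with a drizzle of honey and mint leaves is refreshing and not overly sweet. <Topic Shift> Try meal planning, storing food properly, and using leftovers creatively. Composting is another way to reduce waste. "
}
""")

rounds = split_rounds(conversation)
for r in rounds:
    print(r["transition"], "|", r["user"], "->", r["ai"][:40])
```

Each resulting entry records which marker introduced the round, so a follow-up question can be told apart from a topic shift.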
ChatTTS
Easy:
Medium:
Hard:
F5TTS
Easy:
Medium:
Hard:
CosyVoice2
Easy:
Medium:
Hard:
Easy with 10dB gap-noise:
Easy with 10dB bg-noise:
Response Example
Here we show the responses of the Moshi, Freeze-omni, and VITA-1.5 models to some of the above conversations. Please use headphones when listening to the audio samples: the left channel carries the user's inquiries, and the right channel carries the AI's responses.
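For readers who prefer to inspect the two speakers separately rather than by ear, the channel layout described above can be undone in a few lines. This is only a sketch: it assumes a sample has been saved locally as a 16-bit stereo PCM WAV file, which may not match the format served on this page, and `split_stereo` is a hypothetical helper name.

```python
import array
import sys
import wave


def split_stereo(path):
    """Return (left, right, sample_rate) from a 16-bit stereo PCM WAV file.

    Under the layout described on this page, `left` would hold the user's
    inquiries and `right` the AI's responses.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo PCM WAV file")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Stereo frames are interleaved: L, R, L, R, ...
    return samples[0::2], samples[1::2], rate
```

Calling `left, right, rate = split_stereo("sample.wav")` (with a hypothetical local file name) yields two mono sample arrays that can be written back out or plotted separately to see where the user interrupts the AI.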