This page presents sample inference inputs synthesized with the ChatTTS, F5TTS, and CosyVoice2 models, together with sample outputs of the Moshi, Freeze-omni, and VITA-1.5 models.

Audio Example

Here we show three interruption difficulty levels (easy, medium, and hard) for each TTS model. The audio samples are generated from the following simulated conversation between a user and an AI, which has 4 rounds in total.

{
    "num_rounds": 4,
    "user": "Can you suggest a quick and easy dinner recipe? <Further Inquiry> Oh, do you know any vegetarian options? <Further Inquiry> Also, I need a dessert idea that's not too sweet. <Topic Shift> Got it. By the way, any tips on reducing food waste? ",
    "ai": "Absolutely! Try making a stir-fry with your choice of protein, mixed vegetables, and soy sauce. Serve it with rice or noodles for a complete meal. <Further Inquiry> Of course, you can use tofu or tempeh instead of meat. Eggplant and bell peppers add great flavor and texture as well. <Further Inquiry> A fruit salad with a drizzle of honey and mint leaves is refreshing and not overly sweet. <Topic Shift> Try meal planning, storing food properly, and using leftovers creatively. Composting is another way to reduce waste. "
}
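The conversation format above can be unpacked into per-round pairs. The following is a minimal sketch, assuming that the `<Further Inquiry>` and `<Topic Shift>` tags are the only turn separators and that the `user` and `ai` strings contain the same number of turns. The helper names `split_turns` and `pair_rounds` are hypothetical and are not part of any of the models or tools shown on this page.

```python
import re

# Assumption: these two tags are the only separators between consecutive turns.
TAG_PATTERN = re.compile(r"\s*<(Further Inquiry|Topic Shift)>\s*")


def split_turns(text):
    """Split one speaker's string into turn texts and the tags between them."""
    # With one capturing group, re.split alternates: text, tag, text, tag, ...
    parts = TAG_PATTERN.split(text.strip())
    turns = parts[0::2]
    tags = parts[1::2]
    return turns, tags


def pair_rounds(conversation):
    """Pair user and AI turns round by round (hypothetical helper)."""
    user_turns, tags = split_turns(conversation["user"])
    ai_turns, _ = split_turns(conversation["ai"])
    expected = conversation["num_rounds"]
    if not (len(user_turns) == len(ai_turns) == expected):
        raise ValueError("turn counts do not match num_rounds")
    rounds = []
    for index, (user_text, ai_text) in enumerate(zip(user_turns, ai_turns)):
        rounds.append({
            "round": index + 1,
            # The first round has no preceding tag.
            "transition": None if index == 0 else tags[index - 1],
            "user": user_text,
            "ai": ai_text,
        })
    return rounds
```

Calling `pair_rounds` on the conversation above would yield four entries, with the `transition` field recording whether each later round is a further inquiry or a topic shift.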

ChatTTS

Easy:

Medium:

Hard:

F5TTS

Easy:

Medium:

Hard:

CosyVoice2

Easy:

Medium:

Hard:

Easy with 10dB gap-noise:

Easy with 10dB bg-noise:

Response Example

Here we show the responses of the Moshi, Freeze-omni, and VITA-1.5 models to some of the above conversations. Please use headphones to listen to the audio samples. The left channel carries the user’s inquiries, and the right channel carries the AI’s response.
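For readers who prefer to inspect the two speakers separately, the channel layout described above can be split into two mono files. The following is a minimal sketch using only Python's standard `wave` module. It assumes the sample has been saved as an uncompressed two-channel PCM WAV file, which is an assumption about the file format and not something this page states. The function name `split_stereo_wav` and all file paths are hypothetical.

```python
import wave


def split_stereo_wav(stereo_path, user_path, ai_path):
    """Write the left channel (user) and right channel (AI) to mono WAV files."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel WAV file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # PCM WAV frames are interleaved: left sample first, then right sample.
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for start in range(0, len(frames), frame_size):
        left += frames[start:start + width]
        right += frames[start + width:start + frame_size]

    for path, data in ((user_path, left), (ai_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
```

For example, `split_stereo_wav("sample.wav", "user.wav", "ai.wav")` would produce one file per speaker, where `sample.wav` stands for any locally saved copy of a response sample.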

Moshi-cosyvoice2-med

Moshi-cosyvoice2-hard

Moshi-f5tts-hard

Freeze-omni-cosyvoice2-hard

Freeze-omni-chattts-easy

VITA-1.5-f5tts-easy

VITA-1.5-chattts-hard