ChatTTS – Text-to-Speech

August 20, 20240 Comment1.7k

Discover The ChatTTS

ChatTTS is a text-to-speech model designed specifically for dialogue scenario such as LLM assistant. It supports both English and Chinese languages. Our model is trained with 100,000+ hours composed of chinese and english.

ChatTTS Features

Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
Fine-grained Control: The model could predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
Better Prosody: ChatTTS surpasses most of open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.

FAQ

How much VRAM do I need? How about infer speed?

For a 30-second audio clip, at least 4GB of GPU memory is required. For the 4090D GPU, it can generate audio corresponding to approximately 7 semantic tokens per second. The Real-Time Factor (RTF) is around 0.65.

Model stability is not good enough, with issues such as multi speakers or poor audio quality.

This is a problem that typically occurs with autoregressive models(for bark and valle). It’s generally difficult to avoid. One can try multiple samples to find a suitable result.

Besides laughter, can we control anything else? Can we control other emotions?

In the current released model, the only token-level control units are [laugh], [uv_break], and [lbreak]. In future versions, we may open-source models with additional emotional control capabilities.

VISIT THE WEBSITE