Discover The ChatTTS
ChatTTS is a text-to-speech model designed specifically for dialogue scenario such as LLM assistant. It supports both English and Chinese languages. Our model is trained with 100,000+ hours composed of chinese and english.
ChatTTS Features
- Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
- Fine-grained Control: The model could predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
- Better Prosody: ChatTTS surpasses most of open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.
FAQ
How much VRAM do I need? How about infer speed?
For a 30-second audio clip, at least 4GB of GPU memory is required. For the 4090D GPU, it can generate audio corresponding to approximately 7 semantic tokens per second. The Real-Time Factor (RTF) is around 0.65.
Model stability is not good enough, with issues such as multi speakers or poor audio quality.
This is a problem that typically occurs with autoregressive models(for bark and valle). It’s generally difficult to avoid. One can try multiple samples to find a suitable result.
Besides laughter, can we control anything else? Can we control other emotions?
In the current released model, the only token-level control units are [laugh], [uv_break], and [lbreak]. In future versions, we may open-source models with additional emotional control capabilities.