Kokoro
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient.
Architecture:
StyleTTS 2: https://arxiv.org/abs/2306.07691
ISTFTNet: https://arxiv.org/abs/2203.02395
Decoder only: no diffusion, no encoder release
Architected by: Li et al @ https://github.com/yl4579/StyleTTS2
Trained by: @rzvzn on Discord
Languages: Multiple
Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
Total Dataset Size: A few hundred hours of audio