
minimax/speech
Text-to-Speech (Minimax)

Today, we’re thrilled to introduce MiniMax Speech 2.6 — our latest speech model, bringing comprehensive upgrades with ultra-low latency, enhanced format handling, and a more natural, human-like voice for Voice Agent scenarios.

Since its launch, MiniMax Speech has become a core piece of infrastructure in the global voice intelligence landscape, known for its outstanding speech technology and exceptional cost-effectiveness.

From LiveKit, which powers ChatGPT's advanced voice mode, and the popular open-source framework Pipecat on GitHub, to the YC-incubated voice platform Vapi, all have chosen MiniMax Speech as their underlying technology engine. In the smart hardware sector, innovative products like Haivivi Bubble Pal, Fuzozo, and Rokid Glasses are also powered by MiniMax Speech to deliver their natural voice interaction experiences.

MiniMax continues to drive new forms of productivity through technological innovation, breaking down the barriers of language and culture to deliver natural, fluent interactions that connect every voice around the world.

1. Ultra-Low Latency, More Responsive: For Smoother Overall Interaction

We have completely optimized the audio generation pipeline, achieving an end-to-end latency of under 250 milliseconds—a top-tier industry standard. In scenarios with strict response time requirements, such as real-time conversations, audio generation is no longer the bottleneck, ensuring a smoother overall interaction.
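In streaming TTS, the latency figure that matters for real-time conversation is time-to-first-audio: how long after the request the first audio chunk arrives. Below is a minimal, library-agnostic sketch of how to measure it; `fake_tts_stream` is a purely illustrative stand-in for a real streaming API, not part of any MiniMax SDK.

```python
import time
from typing import Iterable, Iterator, Tuple

def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Measure time-to-first-audio for a streaming TTS response.

    Returns (seconds until the first chunk arrived, that first chunk).
    """
    start = time.perf_counter()
    for chunk in chunks:
        return time.perf_counter() - start, chunk
    raise RuntimeError("stream produced no audio")

def fake_tts_stream(first_chunk_delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a real streaming TTS API: emits audio after a delay."""
    time.sleep(first_chunk_delay_s)  # simulated model + network latency
    yield b"\x00" * 320  # first audio frame (e.g. 20 ms of 16 kHz PCM)
    yield b"\x00" * 320  # subsequent frames

latency, first_frame = time_to_first_chunk(fake_tts_stream())
print(f"time to first audio: {latency * 1000:.0f} ms")
```

The same helper works against any real chunk iterator, so the 250 ms figure quoted above can be checked end-to-end in your own deployment rather than taken on faith.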

2. Seamless Handling of Specialized Formats, Smarter: For More Fluid Information Delivery

Speech 2.6 now directly converts non-standard text formats in multiple languages, including URLs, email addresses, phone numbers, dates, and monetary amounts. Whether you are using it with a large language model or need to process dynamically changing entity information in your business, you no longer need to perform tedious text pre-processing. The input is read correctly from the start, enabling more fluid information delivery.

For example, to read the following items correctly, a traditional TTS system would require a series of conversions:

  • +1 415 415 9921 → “plus one, four one five, four one five, nine nine two one”
  • $1,234.56 → “one thousand two hundred thirty-four dollars and fifty-six cents”
  • 192.168.1.1 → “one nine two dot one six eight dot one dot one”
  • 2032-5-6 → “May sixth, twenty thirty-two”
  • support-vip@technet.com → “support dash vip at technet dot com”
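To make that pre-processing burden concrete, here is a minimal sketch of the kind of hand-written normalization rules a traditional pipeline needs for just two of the formats above, IPv4 addresses and email addresses. The function names and regexes are illustrative only, not from any MiniMax SDK; Speech 2.6 aims to make rules like these unnecessary.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def spell_ip(text: str) -> str:
    """Read an IPv4 address digit by digit, with 'dot' separators."""
    def expand(m: re.Match) -> str:
        octets = m.group(0).split(".")
        return " dot ".join(
            " ".join(DIGITS[int(d)] for d in octet) for octet in octets
        )
    return re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", expand, text)

def spell_email(text: str) -> str:
    """Verbalize '@', '.' and '-' inside an email address."""
    def expand(m: re.Match) -> str:
        return (m.group(0)
                .replace("@", " at ")
                .replace(".", " dot ")
                .replace("-", " dash "))
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", expand, text)

print(spell_ip("192.168.1.1"))
# → one nine two dot one six eight dot one dot one
print(spell_email("support-vip@technet.com"))
# → support dash vip at technet dot com
```

Multiply this by phone numbers, dates, currency amounts, and every language you support, and the appeal of a model that handles these formats natively becomes clear.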

3. Greater Naturalness with Fluent LoRA: For More Articulate Vocal Expression

In addition to further enhancing prosodic naturalness, Speech 2.6 also introduces Fluent LoRA.

Speech 2.5 already offered a convenient, high-fidelity voice cloning feature that allowed users to preserve the unique characteristics of the original voice, such as accents and speech habits. This capability met the diverse voice needs of real-world application scenarios.

Now, you no longer have to worry about imperfect source material when cloning a voice. Even with non-native recordings that may have an accent or be disfluent, Fluent LoRA can perfectly replicate the voice's timbre while generating fluent, natural speech that matches the target text, making your vocal expression more articulate.