Text-to-speech with feeling – this new AI model does everything but shed a tear


Image: Collage of mouths, speech bubbles and grammar symbols on a blue background (We Are/Getty Images)

Not so long ago, generative AI could only communicate with human users via text. Now it’s increasingly being given the power of speech — and this ability is improving by the day.

On Thursday, AI voice platform ElevenLabs introduced v3, described on the company’s website as “the most expressive text-to-speech model ever.” The new model can exhibit a wide range of emotions and subtle communicative quirks — like sighs, laughter, and whispering — making its speech more humanlike than the company’s previous models. 

In a demo shared on X, v3 was shown generating the voices of two characters, one male and the other female, who were having a lighthearted conversation about their newfound ability to speak in more humanlike voices. 

There’s certainly none of the Alexa-esque flatness of tone, but the v3-generated voices tend to be almost excessively animated, to the point that their laughter is more creepy than charming — take a listen yourself.

The model also speaks more than 70 languages, up from its predecessor v2’s limit of 29. It’s available now in public alpha, and its price has been slashed by 80% until the end of this month.

The future of AI interaction

AI-generated voice has become a major focus of innovation as tech developers look toward the future of human-machine interaction.

Automated assistants like Siri and Alexa have long been able to speak, of course, but as anyone who routinely uses these systems can attest, their voices are noticeably mechanical, with a narrow range of tone and emotional cadence. They’re useful for handling quick and easy tasks, like playing a song or setting an alarm, but they don’t make great conversation partners.

Some of the latest text-to-speech (TTS) AI tools, on the other hand, have been engineered to speak in voices that are maximally realistic and engaging.

Also: You shouldn’t trust AI for therapy – here’s why

Users can prompt v3, for example, to speak in voices that are easily customizable through the use of “audio tags.” Think of these as stylistic filters that modify the output, and which can be inserted directly into text prompts: “Excited,” “Loudly,” “Sings,” “Laughing,” “Angry,” and so on.
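To make the idea concrete, here is a minimal sketch of how audio tags might be embedded in a prompt sent to a TTS API. The bracketed `[excited]`-style tag syntax, the endpoint path, the `eleven_v3` model identifier, and the `YOUR_VOICE_ID` placeholder are all assumptions for illustration, not details confirmed by this article:

```python
import json
import os
import urllib.request

# Assumed endpoint shape; voice_id is a placeholder, not a real voice.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def tagged_prompt(text: str, *tags: str) -> str:
    """Prepend bracketed audio tags (e.g. [excited], [laughing]) to prompt text."""
    prefix = "".join(f"[{t}]" for t in tags)
    return f"{prefix} {text}" if prefix else text

def build_request(text: str, model_id: str = "eleven_v3") -> dict:
    """Assemble the JSON body for a text-to-speech request (model_id is assumed)."""
    return {"text": text, "model_id": model_id}

prompt = tagged_prompt("Well, I never expected to sound this good.", "excited", "laughing")
payload = build_request(prompt)
print(json.dumps(payload))

# Only attempt a network call if a key is configured; otherwise this is a dry run.
api_key = os.environ.get("ELEVENLABS_API_KEY")
if api_key:
    req = urllib.request.Request(
        API_URL.format(voice_id="YOUR_VOICE_ID"),
        data=json.dumps(payload).encode(),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        audio = resp.read()  # audio bytes on success
```

The point of the sketch is simply that the tags ride along inside the text field itself, so switching a voice from neutral to excited is a string edit rather than an API change.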

ElevenLabs isn’t the only company racing to build more lifelike TTS models, which big tech companies are selling as a more intuitive and accessible way to interact with AI.

In late May, ElevenLabs competitor Hume AI unveiled its Empathic Voice Interface (EVI) 3 model, which allows users to generate custom voices by describing them in natural language. Similarly nuanced conversational abilities are also now on offer through Google’s Gemini 2.5 Pro Flash model.
