Voice AI

A two-minute audio clip is all you need to clone a voice; before you know it, you are listening to yourself reciting poetry in French or asking where to find a reliable cat café in Japanese. Voice AI has the potential to give many more of us the ability to be heard in a uniquely Human mode. Voice is intimate and emotive, which is also why it is so dangerous. This will be highly disruptive – for good and ill – in the months ahead.

AI Leadership Fiona Passantino, late September 2023

My AI Me

A later entrant into the generative creative content field is text-to-voice AI. We “clone” our voices by laying a bit of our own audio over a foundational model, online, and then hooking it up to an LLM text-generation engine so that it can say, convincingly, whatever we type in.

It took me about one hour to pattern my voice, generate the sample, upload and create my AI me (I call her “FionAI”). The next thing you know, I am convincingly speaking Italian, German and even Japanese.

What truly blew my mind wasn’t just how closely my AI-me resembled the sound of my voice, but how well it mimicked the way I put words together: colour, accent, tone, cadence, pacing and even certain turns of phrase that really did sound like me.

All it took was a basic paid subscription to ElevenLabs and a bit of time and practice and I was already in a position where I could technically put my Sennheiser away and never record another podcast again.

Audio AI Use Cases

In a perfect world, text-to-voice technology will democratise audio communication, making the medium more accessible to all. It can give people rendered speechless, those severely impacted by autism and people who are deaf or hard of hearing the chance to be heard. For people who are blind or unable to read, it can be used to dictate emails, read or write reports and even create presentations or training material.

The world being imperfect, however, this technology is already being widely used to extort money from parents. A dad will get an urgent call from his daughter, sounding just like her, asking for money, now, as she’s in trouble. It is, of course, an AI clone, built from a few seconds of her TikTok feed. This is real, and spreading; there are many versions of this scam already out there.

Provided you have a few minutes of clear audio of a person speaking, voice cloning can bring back the dead. Voices of historical figures can be reanimated to narrate documentaries, lead you through an adventure game or read you a story at night. It still sounds a bit robotic for now, but it gets better with every iteration.

Back to the office environment: audio is a uniquely intimate and engaging form of communication. We feel a closer connection with a speaker than we do with a writer, mostly because we imagine we are sitting in a room with them, one on one. Now, imagine how close you might feel to your CEO if she is telling you the company news in five minutes every week, in your native language, addressing the topics you find interesting. What would that mean for our engagement?

The Nuts and Bolts

How does it work, and why has it experienced such a sudden explosion of capability? At the root of it all is text-to-speech (TTS) synthesis technology, powered by an AI LLM laser-focused on speech nuance and natural language understanding rather than translation.

The AI learns the speaker’s intent, rather than the pure meaning of a word, as well as studying and applying how words and phrases relate to one another to simulate a natural voice.

A two-minute audio clip is all you need to clone a voice. There is so much data captured in those two minutes that the AI is able to tokenise and extrapolate on its own, combining your voice patterns with its own, baked-in foundational models.

An investment of $5 will allow you to create up to ten custom voices at ElevenLabs or, if that’s too pricey, go to PlayHT to clone your voice for free. Your audio can say anything you like, provided it’s clear, with no sirens or dogs barking in the background. There is no strange sequence of magic words, no foundational training required.
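
For the technically curious, the one thing you can check before uploading is whether your sample is long enough. Below is a minimal sketch in Python, assuming the sample is saved as an uncompressed WAV file. The 120-second threshold is simply the two-minute rule of thumb from this article, not a documented requirement of ElevenLabs or PlayHT, and the function names are made up for illustration.

```python
import wave

# Assumption: the article's "two-minute" rule of thumb, not a vendor requirement.
MIN_SECONDS = 120


def clip_duration_seconds(path: str) -> float:
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        # Duration = number of audio frames / frames per second.
        return clip.getnframes() / clip.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """True if the clip is at least min_seconds long (hypothetical helper)."""
    return clip_duration_seconds(path) >= min_seconds
```

Calling `is_long_enough("my_voice_sample.wav")` on your own recording (the file name is only a placeholder) tells you whether it clears the two-minute mark. It says nothing about background noise; that part still takes a pair of Human ears.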

Going Forward

What will all this mean to us, as Humans, going forward? For one thing, don’t trust anything you hear, no matter how convincing. If there’s a digital layer separating you from another Human, it can be fake, whether it’s video, audio, text or image.

Freelance translators, dubbers, interpreters, voice actors… so many creative professionals will feel the impact of text-to-voice AI, given the financial pressures on production houses these days, and may well find themselves redundant from one day to the next.

For now, the Human version is still superior, but as the technology improves and as it spreads to other countries and industries, it will be impossible to hear the difference.

Like all things AI, the coming transformation will be a mixed bag of good and evil, depending on our Human capacity to behave, our values, our willingness to embrace it and the limits of our own imaginations.

No eyeballs to read or watch? Just listen!

Listen on APPLE PODCAST
Listen on SPOTIFY

About Fiona Passantino

Fiona is an AI Integration Specialist, coming at it from the Human side, via Culture, Engagement and Communications. She is a frequent speaker, workshop facilitator and trainer.

Fiona helps leaders and teams engage, inspire and connect, empowered through our new technologies to bring our best selves to work. She is a speaker, facilitator, trainer, executive coach, podcaster, blogger, YouTuber and the author of the Comic Books for Executives series. Her next book, “AI-Powered”, is coming soon.

Inspired. Engaged. AI-Powered.
