AI can sound just like you. Not right this second, though; you’ll have to give it examples first. But why would you? One of the most interesting emergent use cases of AI is its ability to replicate any voice and perform tasks that previously required manual dictation: recording audiobooks and podcasts, answering phone calls, even turning yourself into a chatbot like ChatGPT. So what is AI voice cloning, and how do you do it?
What is AI voice cloning?
AI voice cloning is, of course, using artificial intelligence to replicate the sound of a person’s voice. All types of AI use training data to learn how to be useful. This training data, in the case of an LLM (Large Language Model) like ChatGPT, Microsoft Bing, or Google Bard, is text data. Try “What is ChatGPT – and what is it used for?” or “How to use ChatGPT on mobile” for further reading on ChatGPT.
In the case of text-to-speech AI like ElevenLabs, the training data is a combination of text and audio. For example, by learning that the input prompt “Bang” corresponds to an output audio file with a high-amplitude transient waveform, the neural network ‘learns’ what a bang sounds like. This is also how AI learns what words sound like, and what your voice sounds like. However, there is a distinction to be drawn between voice cloning and speech synthesis.
In voice cloning, you replicate the sound of a specific voice. In speech synthesis, you already have a voice, and are using it to speak specific words. The former requires high-quality audio recordings of the individual, which a deep neural network maps to their text (or token) equivalents. This creates a bespoke generative AI model that can output words you never actually spoke in the training data; the sound of your voice is the priority. In the latter, the sound of the voice is secondary: you’ll probably have multiple voices to choose from, but they are interchangeable for your purposes. You want a specific set of words read aloud, and the process of having an AI model do so is called speech synthesis.
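The token-to-waveform mapping described above can be sketched with a toy lookup table. This is purely illustrative: real TTS systems learn the mapping with a deep neural network, and the token vocabulary and frequencies below are invented for the example.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second

# Toy "model": each known token maps to a sine tone at an invented
# frequency. A real system learns this text-to-audio mapping from
# paired training data rather than a hand-written table.
TOKEN_TO_FREQ_HZ = {
    "bang": 110.0,   # low tone standing in for a transient
    "hello": 440.0,
    "world": 660.0,
}

def synthesize(text: str, duration_s: float = 0.2) -> np.ndarray:
    """Concatenate one short waveform per whitespace-separated token."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s),
                    endpoint=False)
    chunks = []
    for token in text.lower().split():
        freq = TOKEN_TO_FREQ_HZ[token]  # KeyError if token is unknown
        chunks.append(np.sin(2 * np.pi * freq * t))
    return np.concatenate(chunks)

audio = synthesize("hello world")
print(audio.shape)  # two 0.2 s chunks at 16 kHz -> (6400,)
```

Swapping the lookup table for a learned model conditioned on a particular speaker’s recordings is, conceptually, what turns plain speech synthesis into voice cloning.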
Where do the AI voices come from?
You’ve probably heard speech synthesis more commonly referred to as text-to-speech (TTS). Historically, TTS has had a bad rap for robotic, monotone synthetic voices lacking emotion and intonation. To see why, it helps to understand the history of speech synthesis: it existed long before AI, and the new technology can finally fix this long-standing issue.
The first confirmed mechanical recreation of the human voice was in 1779. German-born engineer Christian Gottlieb Kratzenstein created a bellows-operated machine which expelled air through chambers designed to physically replicate the human vocal tract. It was, however, limited to the long vowel sounds a, e, i, o, and u.
The first digital speech synthesis systems originated in the 1900s, largely with Bell Labs at the frontier. The 1930s saw the invention of the vocoder, and in 1961 the IBM 704 became the first computer used to synthesize speech. I’m telling you this to demonstrate that speech synthesis is not the exclusive luxury of an AI-powered world: the robotic TTS you’re used to was built on outdated techniques, and there’s something to be excited about here.
Thanks to AI technology, recent voice cloning work produces genuinely realistic voices. The difference is in the nuance: the accent and inflection of your voice are what we humans recognize as a speaking style, and AI can now capture it.
Benefits of voice cloning
Siri and Alexa, created by Apple and Amazon respectively, are two familiar examples of this technology done right. Game developers are also using it to drastically increase the realism of chat interaction in-game.
What are the risks of AI voice cloning?
This technology doesn’t merely enable you to clone your own voice. You have the ability to clone the voice of any public figure. Think of an actor, a musician, a CEO – you can make them say anything you want. Of course, for celebrities, politicians, and content creators, this poses a significant risk.
These public figures are in the public perception game. If a convincing enough deepfaked recording of them saying something they shouldn’t have were released, it could cause instant and irreparable damage to their careers. Thanks to social media, a spoofed voice recording could be uploaded today and heard a million times by tomorrow, uploaded and downloaded thousands of times across various platforms, never to be deleted with any finality.
By the time it gets disproven, people have cancelled tickets, changed opinions, unsubscribed, and cut ties, and many won’t hear that it was disproven at all.
Is AI voice cloning legal?
A deepfaked recording could potentially give rise to legal retaliation, although whether it would be considered defamation or slander (considering it’s in the defendant’s own voice) has yet to see precedent.
In addition to these misinformation concerns, there are privacy concerns and the matter of consent. The risk of fraud and impersonation enabled by this technology is unprecedented. The legality of cloning a human voice, of the technology that enables it, and of how an individual chooses to use that cloned voice are all separate considerations. For the time being, voice cloning is legal. However, this is an exploding industry with evolving use cases and embryonic legislation, so don’t expect the rules to stay the same for long.
In summary, AI voice cloning is a high-risk, high-reward technology. Much work has yet to be done on legislation surrounding its creation and use. However, it’s here to stay, and the benefits are limited only by your imagination (and a decent microphone).