
What is AI voice cloning? Explained 

What is AI voice cloning? Delve into a technology that originated centuries ago and has only now been perfected by AI.
Last Updated on August 1, 2023

AI can sound just like you. Not right this second, though; you’ll have to give it examples first. But why would you? One of the most interesting emergent use cases of AI tech is the ability to replicate any voice and perform tasks that previously required manual dictation: recording audiobooks and podcasts, answering phone calls, even turning yourself into a chatbot like ChatGPT. So what exactly is AI voice cloning, and how do you do it?

What is AI voice cloning?

AI voice cloning is, as the name suggests, the use of artificial intelligence to replicate the sound of a person’s voice. All types of AI use training data to learn how to be useful. In the case of an LLM (Large Language Model) like ChatGPT, Microsoft Bing, or Google Bard, this training data is text. Try “What is ChatGPT – and what is it used for?” or “How to use ChatGPT on mobile” for further reading on ChatGPT.


In the case of text-to-speech AI like ElevenLabs, this is a combination of text and audio data. For example, by learning that the input prompt “Bang” corresponds to an output audio file with a high-amplitude transient waveform, the neural network ‘learns’ what a bang sounds like. This is also how AI learns what words sound like, and what your voice sounds like. However, there is a distinction to be drawn between voice cloning and speech synthesis.
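To make that idea concrete, here is a minimal, purely schematic Python sketch of what a single text–audio training pair could look like. The names and numbers are illustrative assumptions, not any vendor’s real training pipeline.

```python
# Schematic only: a text-to-speech model trains on pairs of
# (transcript, audio waveform). Real datasets hold thousands of hours.
from dataclasses import dataclass, field

@dataclass
class TrainingExample:
    transcript: str                  # the text prompt, e.g. "Bang"
    waveform: list[float]            # audio samples, normalised to [-1.0, 1.0]
    sample_rate: int = 22050         # samples per second

# A sharp, high-amplitude transient roughly resembles a "bang".
bang = TrainingExample(
    transcript="Bang",
    waveform=[1.0, -0.9, 0.7, -0.4, 0.2, -0.1] + [0.0] * 100,
)

# The network learns the mapping from text (tokens) to audio, so it can
# later generate speech for sentences it never saw during training.
dataset = [bang]
```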

In voice cloning, you replicate the sound of a specific voice. In speech synthesis, you already have a voice, and are now using it to replicate the sound of specific words. The former requires high-quality audio recordings of the individual, which a deep neural network will map to its text (or token) equivalent. This creates a bespoke generative AI model with which you can output words you never even spoke in the training data. The sound of your voice is the priority. In the latter, the sound of the voice is a secondary concern; you’ll probably have multiple voices to choose from, but they are fungible for your purposes. You want a specific set of words read aloud, and the process of having an AI model do so is called speech synthesis.
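For the speech-synthesis half of that distinction, here is a minimal sketch using the open-source pyttsx3 library, which simply wraps whatever stock voices your operating system provides. Voice cloning would instead begin by feeding recordings of a specific person to a service such as ElevenLabs to build a custom voice; the snippet below deliberately does not do that.

```python
# Plain speech synthesis: the words are ours, but the voice is whatever
# the operating system provides. Requires: pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()

# Stock system voices are interchangeable ("fungible") for this purpose,
# which is the hallmark of ordinary text-to-speech rather than cloning.
voices = engine.getProperty("voices")
if voices:
    engine.setProperty("voice", voices[0].id)

engine.say("This sentence was never recorded by a human.")
engine.runAndWait()
```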

Where do the AI voices come from?

You’ve probably heard speech synthesis referred to by its more common name: text-to-speech (TTS). Historically, TTS has had a bad rap for robotic, monotone synthetic voices that lack emotion and intonation. To explain why, it’s important to understand the history of speech synthesis. It existed long before AI, and it’s the new technology that can finally fix this long-standing issue.

The first confirmed mechanical recreation of the human voice was in 1779. German-born engineer Christian Gottlieb Kratzenstein created a bellows-operated machine which expelled air through chambers designed to physically replicate the human vocal tract. It was, however, limited to the long vowel sounds a, e, i, o, and u.

The first electronic speech synthesis systems originated in the 1900s, largely with Bell Labs at the frontier. The 1930s saw the invention of the vocoder, and then in 1961 came the first computer system used to synthesize speech – the IBM 704. I’m telling you this to demonstrate that speech synthesis is not the exclusive luxury of an AI-powered world. The robotic TTS you’re used to was based on outdated techniques, and there’s something to be excited about here.

Thanks to AI technology, recent voice cloning work allows for genuinely realistic voices. The difference is in the nuance: the accent, inflection, and rhythm of your voice are what we humans recognise as a speaking style, and AI can now reproduce them.

Benefits of voice cloning

Siri and Alexa, created by Apple and Amazon respectively, are just two examples of this technology done right. Game developers are also using voice synthesis to drastically increase the realism of in-game dialogue and chat interaction.

What are the risks of AI voice cloning?

This technology doesn’t merely enable you to clone your own voice. You have the ability to clone the voice of any public figure. Think of an actor, a musician, a CEO – you can make them say anything you want. Of course, for celebrities, politicians, and content creators, this poses a significant risk.

These public figures are in the public perception game. If a convincing-enough deepfaked recording of them saying something they shouldn’t have were released, it could cause instant and irreparable damage to their career. Thanks to social media, a spoofed voice recording could be uploaded today and heard a million times by tomorrow, uploaded and downloaded thousands of times across various platforms, never to be deleted with any finality.

Is AI voice cloning legal?

By the time a fake recording is disproven, people have already cancelled tickets, changed opinions, unsubscribed, and cut ties, and many won’t hear that it was disproven at all. This could give rise to legal retaliation, although whether such a recording would be treated as libel or slander (considering it’s delivered in the victim’s own voice) has yet to see legal precedent.

In addition to these misinformation concerns, there are privacy concerns and the matter of consent. The risk of fraud and impersonation enabled by this technology is unprecedented. The legality of cloning a human voice, of the technology that enables it, and of how an individual chooses to use that voice are all separate considerations. For the time being, voice cloning is legal. However, this is an exploding industry with evolving use cases and embryonic legislation, so don’t expect the rules to stay the same for long.

Conclusion

In summary, AI voice cloning is a high-risk, high-reward technology. Much work has yet to be done on legislation surrounding its creation and use. However, it’s here to stay, and the benefits are limited only by your imagination – and having a decent microphone.
