Voice cloning, often referred to as deepfake or synthetic voices, uses machine learning models to replicate the characteristics of an individual’s voice and create a synthetic version that sounds like the original. In AI dubbing, the cloned voice narrates content in new languages while retaining the intonation, cadence, and general speaking style of the original speaker. Voice cloning raises ethical questions around voice authenticity, ownership, and permission; it isn’t inherently unethical, but it does require careful safeguards.
At Papercup, we use AI to dub videos for some of the world’s biggest digital publishers, including Bloomberg and Insider, and content owners such as Fremantle, Fuse Media, and Jamie Oliver. The media buzz around AI voice cloning and deepfake voices has generated a lot of interest. Here are the top questions on the subject from media companies, digital publishers, and content creators.
How does voice cloning work?
Voice cloning uses AI to replicate the voice and speaking style of a specific person. The process involves multiple steps, chief among them training a custom machine learning model on recordings of the original speaker so that the distinctive traits of their voice are recreated accurately, producing a high-quality clone.
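For the technically curious, here’s a rough sketch of what zero-shot voice cloning looks like using the open-source Coqui TTS library and its XTTS v2 model. This is purely illustrative and is not our production pipeline; the file names are placeholders, and a dedicated model trained on more speaker data would generally produce a higher-fidelity clone:

```python
# Minimal voice-cloning sketch using the open-source Coqui TTS library.
# Illustrative only -- not Papercup's production pipeline.
# Assumes `pip install TTS` and a reference clip recorded with the
# speaker's explicit permission.
from TTS.api import TTS

# XTTS v2 can clone a voice from a short reference clip ("zero-shot").
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Hola, bienvenidos a nuestro canal.",   # dubbed line in the target language
    speaker_wav="consented_speaker_sample.wav",  # reference audio, used WITH permission
    language="es",                               # target language code
    file_path="dubbed_line.wav",
)
```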
It’s imperative when cloning someone’s voice to obtain explicit permission from the person whose voice you’re looking to recreate. This is critical to safeguard their personal voice data and avoid legal entanglements and ethical dilemmas.
How much data is needed to clone a voice?
The quality of the training data matters as much as the quantity, since it determines the quality of the cloned voice. Typically, building a bespoke machine learning model capable of generating a premium voice clone requires around an hour of clean, isolated audio of the speaker.
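As a back-of-the-envelope check, a short script like the one below can tally how much usable audio you’ve collected. It uses the open-source soundfile library; the folder name and the one-hour target are illustrative, not a hard requirement of any particular model:

```python
# Rough sketch: check whether a folder of recordings reaches the ~1 hour
# of clean, isolated speech suggested above for a premium clone.
# Assumes `pip install soundfile` and a folder of WAV files; names and
# the target figure are illustrative.
from pathlib import Path
import soundfile as sf

TARGET_SECONDS = 60 * 60  # roughly one hour of speech

total = sum(sf.info(str(f)).duration for f in Path("speaker_recordings").glob("*.wav"))
print(f"{total / 60:.1f} minutes collected ({total / TARGET_SECONDS:.0%} of the ~1 hour target)")
```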
At Papercup, we think it’s crucial for all AI companies, ourselves included, to ensure AI is put to good use, to monitor its application closely, and to make certain that voice data is utilized only with clear consent. Read our ethical pledge here.
What type of content is voice cloning good for?
Voice cloning suits any genre, but bear in mind that the more speakers a piece of content features, the more permissions you need. Owners of big YouTube channels, who are often the main face of their channel, consider voice cloning as a way to keep their own voice across languages and maintain brand consistency, which ultimately increases ROI.
However, we have seen some of the top creators in the business use a different voice to dub all their content. While not their own, this voice is identifiable by their global audiences and arguably builds the same affinity, especially as audiences outside of the original market aren’t always familiar with the content creator’s voice in English.
What are the alternatives to voice cloning?
Voice cloning is just one way of creating AI-powered, human-like speech. An alternative is text-to-speech built on broad, multi-speaker datasets that aren’t tied to any one individual, producing synthetic voices that still sound entirely human.
At Papercup, we use multiple datasets optimized for video content to create a library of highly expressive AI voices. We’re able to create voices that don’t just sound human but also carry elements of the original speaker’s delivery: emotion, energy, and style. This option is more cost-effective, faster, and more scalable than voice cloning.
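For contrast with the cloning sketch earlier, here’s what generating speech with a stock, non-cloned voice looks like in the same open-source library. The model and speaker ID come from Coqui’s public release and are assumptions for illustration; this is not Papercup’s voice library:

```python
# Sketch of the non-cloning alternative: a stock voice from a model
# trained on a broad multi-speaker dataset (VCTK), so no individual's
# voice is replicated. Model and speaker ID are illustrative and are
# not Papercup's voice library.
from TTS.api import TTS

tts = TTS("tts_models/en/vctk/vits")  # multi-speaker English model

tts.tts_to_file(
    text="This voice belongs to no single real person.",
    speaker="p273",               # one of the model's built-in voices
    file_path="stock_voice.wav",
)
```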
The clip below from Bruce Almighty is not voice cloned; instead it features Papercup’s speech-to-speech approach to creating synthetic voices, which captures the emotion of the original speaker and source language.
Do audiences prefer voice cloning?
Audiences prioritize authenticity over exact replication of a voice: they care more that dubbing sounds realistic than that it sounds identical to the original speaker.
We asked 300 people across Latin America to rank what’s most important to them when watching dubbed content. 50% prioritized ‘translation accuracy’ and ‘how realistic the voice sounds,’ whereas ‘the voice match of the original’ was deemed least important.
At Papercup, we view voice cloning and AI dubbing through the lens of a pyramid: foundational elements like translation accuracy, speech intelligibility, and the preservation of intonation and pacing are what drive audience engagement. There’s very little point in cloning a voice if those fundamentals aren’t right, because getting them wrong actively damages engagement. A synthetic voice with accurate translation is more appealing to audiences than a cloned voice with an inaccurate one.
Get in touch to learn more about how we produce realistic AI dubbing for the world’s biggest media companies and creators.