Crafting AI voices that exceed audience expectations

If you’re in the business of media and entertainment, you’re likely to be hit up by a new AI dubbing company every week, which makes it tricky to navigate the market.
Some of these new companies are doing revolutionary things. But, truth be told, the recent flood of companies offering voice generation, voice cloning and AI dubbing is driven in no small part by the ease with which off-the-shelf, pre-made voices intended for home assistants can be pulled into new products via APIs. To better understand the quality of the voices a speech technology company is producing, start by asking these questions.

What is good dubbing?

Here at Papercup, we build our AI voices around what matters most to viewers when watching dubbed content. To keep us true to that aim, we’ve created our Hierarchy of AI Dubbing; this framework ensures that everything we work on is geared towards deepening audience engagement and ensuring the experience of viewing dubbed content exceeds their expectations.

First, some theory. Back in 1943, there was a guy called Abraham Maslow. He proposed in his ‘Hierarchy of Needs’ theory that if humans’ basic needs for sustenance, shelter and rest are met, they will be motivated to fulfil “higher tier” needs like intimacy, belonging and accomplishment, with the ultimate goal of self-actualisation (achieving one’s full potential), be that creatively, financially or professionally. Bear with us, we’re about to segue to AI dubbing.

At Papercup, we think of AI dubbing as a pyramid in the same way that Maslow did human needs. In our case, some foundational elements have to be well executed before progressing to desirable but less essential tiers. These foundational must-haves include the ability of the dubbed content to convey the correct meaning (translation accuracy), the ability of the voices to be understood (the intelligibility of the speech), and the ability to reflect nuanced intonation and pacing (avoiding monotony).

The top of the pyramid (or AI dubbing’s full potential) might look like automatically generated, lip-synced voice clones (exact copies of someone’s voice) that can engage audiences no matter what content type (film, news, documentary) they appear in.

It’s tempting for users of AI dubbing to immediately seek to fulfil more desirable needs (or in this case, wants) like voice cloning and lip syncing. However, just as in Maslow’s theory, it’s virtually impossible to do so without first ensuring the foundational elements are well executed.

In other words, it’s critical to get the foundational elements of AI dubbing right before moving up the pyramid to what could be described as inessential desirables like voice cloning and lip syncing. Here’s why.

What happens when AI dubbing fundamentals are not met?

By translation accuracy, we mean a translation that is error-free and reflects the cultural, topic-specific or brand language of the original audio. If you have a voice clone (a voice produced by deep learning that sounds exactly like the original speaker) narrating a mistake-ridden Spanish translation, you will instantly lose your audience.

When we refer to the intelligibility of the voices, we mean their clarity or precision: how easily the speech can be comprehended. If you have a voice clone speaking flawless Spanish incomprehensibly (for example, with the wrong intonation, such as delivering questions as statements), you will lose your audience.

The emotional depth of the voices is how lifelike they sound. Can they capture the intonation and vibrancy of real human speech? Your voice clone narrates a flawless Spanish translation and is fully comprehensible but its delivery is robotic and lifeless. You guessed it, you lose your audience!

At Papercup, we’re single-minded about getting translation, intelligibility and expressivity right first. We work with customers like Bloomberg and Insider, whose global brand equity cannot afford to be diminished by poorly executed fundamentals. Being a market leader in these areas allows us to grow customers’ global audiences while working on the desirable but largely inessential product features that may become core in the future.

We tested our theory with global audiences

Theories are all very well, but how does ours stack up against real-world scenarios? We asked 300 people in Latin America who watched dubbed content at least five times per week to rank what’s most important to them when watching dubbed content.

More than 50% of the respondents said that ‘translation accuracy’ and how ‘realistic the voice sounds’ are most important. 60% of people ranked ‘the new dubbed content has the exact same voice as the original’ as either least or second least important to them.

The data from our survey chimes with the qualitative data we see on social channels when automated tools have been used to poorly dub video content. When audiences take to the comments section to leave feedback on poor dubs, their commentary focuses on the poor quality of the translation and their ability to comprehend the dubbed audio.

As one moves up Maslow’s pyramid to more complex needs, his theory is influenced by cultural nuances and circumstances. Reaching one’s full potential in an economic downturn, for example, might look like keeping your job, rather than being promoted.

Circumstantial influences apply to AI dubbing too. For an owner of theatrical content featuring famous actors, only a very specific type of dubbing may work: for instance, using voice actors known for being the voice of a famous actor in another language, like the Spanish voice of James Bond.

Whatever the external circumstances, however, Maslow and Papercup share their belief in an inalienable truth: if foundational needs aren’t met, progress can’t be made.