A recent study by Amazon finds that the alignment between speakers’ lips and dubbed audio matters less to high-quality dubbing than previously thought. The research – the first large-scale, data-driven analysis of its kind – found that professional dubbing artists prioritized expressive, accurate, and natural-sounding output.
The study discovered that on-screen mouth movements matched the dubbed audio only 12.4% of the time. In short, dubbing artists were unwilling to sacrifice translation accuracy to make dubbed dialogue align with on-screen movements.
Reflecting on what this means for automatic dubbing, the team behind the study stated that “research on automatic lip-sync can be marginally useful, at best, for automatic dubbing.” They concluded that more focus is needed on vocal expressivity rather than on automated lip-sync efforts. Commenting on the findings, James Leoni, Papercup’s senior machine learning engineer, says:
“This study further confirms what we hear consistently both from our customers and our internal listening tests: expressivity is paramount to creating a high quality AI dubbed video. This is why most of our machine learning team at Papercup focuses on creating the world’s most expressive synthetic voices. One of our main research themes is to better model conversation that is dense and rich with the emotions and attitudes that arise in this style of speaking. Watch the video below to hear the levels of expressivity we’re achieving with our synthetic voices.”
“It’s important to note that AI dubbing providers that offer 30+ languages cannot simultaneously improve the expressivity, demanded by audiences and akin to that offered by human dubbing artists, across such a wide range of languages. Instead, these AI dubbing providers integrate third-party text-to-speech APIs which offer only basic-sounding speech.”
In addition to its focus on expressivity, Papercup employs a team of qualified human translators who verify that the generated output is error-free. This process ensures that we maintain both translation fidelity and expressivity.
The Amazon study breaks new ground
To date, there has been no large-scale, data-driven analysis of human dubbing: research on the subject has mostly been restricted to qualitative analysis. As a result, insights into what makes dubbing successful have remained theoretical rather than empirical.
Amazon’s research changes this. The team behind the study believes their findings could give machine learning engineers a clearer idea of how to weigh competing priorities in automatic dubbing. For content owners – who tend to place huge emphasis on lip-syncing – the study vindicates prioritizing human-sounding audio when localizing content using generative AI.
54 Amazon TV shows analyzed
Amazon researchers assessed how “competing interests” were balanced in the human dubbing process. They wanted to understand the relative importance of factors such as:
- Producing emotion and emphasis
- Providing an accurate translation
- Aligning dialogue length with the length of time speakers’ mouths moved
- Aligning dubbed words with the movement of speakers’ lips
To do so, they compared the original audio and visual materials from Amazon Studios’ catalog of dubbed TV shows with the final dubbed product. In total, the team analyzed 319.57 hours of video from 54 professionally produced titles. The shows assessed were all originally recorded in English. The bulk of the analysis focused on “a subset of 35.68 hours of content with both Spanish and German dubs”.
Since performing well in one of these areas often meant sacrificing quality in another, the team wanted to discover which factors the complex and highly collaborative human dubbing process tacitly prioritized.
Key findings
Researchers found that alignment constraints were “frequently violated by human dubbers”. In other words, when forced to pick between the two, dubbers prioritized appropriate emotion and emphasis over dialogue that lined up with on-screen movements. Voice actors were also unwilling to vary their speaking rates to meet other constraints.
The study therefore concluded that vocal expressivity is more important than alignment in the human dubbing process.
The results suggest a tacit understanding among dubbers that lip-syncing matters little. Earlier research on eye movements cited by the study could explain why: it found that audiences automatically overlooked mismatches between on-screen movements and dialogue while watching dubbed content.
Given how rarely it was achieved in practice, the study concluded that lip-syncing is overvalued and shouldn’t be a focus for AI research.
What this means for AI dubbing
The importance of expressivity was so clear that the researchers involved believe AI dubbing needs “a mechanism to encode emotion” to achieve a similar impact.
Amazon’s study suggests that, contrary to commonly held beliefs, lip sync matters little when transcreating content. Instead, AI dubbing should prioritize natural-sounding dialogue and accurate translation. In particular, AI researchers must find ways to encode expressivity to produce a persuasive end product.
Learn more about localizing video with synthetic voices.