Large language models: revolutionizing AI dubbing and beyond

In late November 2022, OpenAI released ChatGPT to the public, and it reached 1 million users in just five days, faster than any consumer product before it. Beyond its record-breaking user acquisition, ChatGPT introduced large language models (LLMs) to the mainstream, and they have since entered the vernacular as a technology with the potential to turn science-fiction-level innovation into reality.

Large language models are ushering in a new era in artificial intelligence, and their impact on the localization industry is nothing short of revolutionary. At Papercup, we’re harnessing their power to transform the landscape of AI dubbing by rapidly delivering high-quality dubbed videos to our customers.

But what exactly are LLMs? How do they work? And how exactly do we use the technology here at Papercup? Let’s get into it.

What are large language models?

Large language models are artificial intelligence systems that process, understand, and generate natural language. They are trained on massive datasets of written material, allowing them to capture the intricacies of language, from grammar and syntax to context and meaning, and apply that knowledge to a wide range of language tasks. Many companies have already adopted LLMs to enhance their natural language processing capabilities, and, alongside recent advances in machine learning, they are transforming industries, dubbing among them.

At their core, LLMs are sophisticated prediction engines. Stephen Wolfram, founder of Wolfram Research and author of What Is ChatGPT Doing … and Why Does It Work?, explains that ChatGPT aims to produce a “reasonable continuation” of given text, where “reasonable” means what one might expect based on having read billions of web pages.

In other words, based on what has been written in the past, LLMs predict what text is most likely to follow or, more accurately, which tokens (parts of words or phrases) are most likely to follow. This process is similar to the autocomplete feature on your smartphone but much more powerful. Instead of just predicting the next word, large language models can generate entire paragraphs of coherent text, complex code, or even creative stories that flow naturally from a given text input, known as a prompt.
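
To make the prediction-engine idea concrete, here is a minimal Python sketch using the small open-source GPT-2 model via Hugging Face's transformers library (purely illustrative; not a model Papercup uses in production). It prints the five tokens the model considers most likely to come next:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Dubbing a video into another language requires"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r} -> {p.item():.1%}")
```

Sampling repeatedly from distributions like this one, one token at a time, is how a model "writes" whole paragraphs.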

This ability to generate compelling and contextually appropriate text is what makes LLMs groundbreaking and disruptive, not least for their potential applications in the localization industry. They don't simply regurgitate memorized text; instead, they learn patterns and relationships in language at a deep level, allowing them to produce original content that often appears indistinguishable from human-authored text.

This represents a major shift from traditional coding: instead of writing code in a specific programming syntax, we can now give AI models natural language instructions that, based on their training, they can act on to perform complex tasks. This versatility has sparked new roles, from prompt engineers who specialize in crafting text instructions to AI engineers who orchestrate multiple models to execute complex tasks and deliver cohesive results at scale. By building carefully with these tools, organizations can harness the power of LLMs for applications ranging from creative writing and code generation to, as we'll explore, revolutionizing the AI dubbing process.
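
As a toy example of "programming in natural language", here is what such an instruction looks like when sent through an LLM API. This sketch uses the OpenAI Python SDK; the model name and prompt are placeholders, and this is not a description of Papercup's production stack:

```python
# pip install openai  (assumes OPENAI_API_KEY is set in the environment)
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You adapt scripts for video dubbing."},
        {"role": "user", "content": "Rewrite this line so it is two syllables "
                                    "shorter but keeps the meaning: "
                                    "'Thanks so much for watching our video!'"},
    ],
)
print(response.choices[0].message.content)
```

The "program" here is the instruction itself; changing the task means changing the words, not the code.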

OK, cool, but what are LLMs used for, and how does Papercup use them for AI dubbing?

In the localization industry, dubbing has traditionally been expensive, labor-intensive, and time-consuming. State-of-the-art LLMs now allow companies like ours to streamline and enhance many aspects of that work.

At Papercup, we’ve developed a context-aware scene translation system that leverages LLMs to revolutionize translation in the AI dubbing process. A bit of a mouthful, so let’s break it down.

How do LLMs help with translation and dubbing?

  • Extracting context: We utilize the multi-modal capabilities of LLMs to build comprehensive context for each translation task. This involves:

    • Generating concise summaries of the entire content or specific scenes

    • Extracting key terminology

    • Identifying the content’s style, tone, and genre

    • Recognizing visual and auditory cues that influence translation

    By considering these various elements, our system can make more informed translation decisions, much like a human translator (a minimal sketch of this step follows the aside below).

    What are LLMs' multi-modal capabilities?

    Traditionally, LLMs were designed to handle text exclusively, but with advances in AI, multi-modal LLMs have emerged that can understand and generate content across different forms of media: not only text, but also images, video, and speech.
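
To illustrate the context-extraction step, here is a minimal sketch. `call_llm` is a hypothetical stand-in for whichever model API is used, and the prompt and JSON keys are invented for illustration, not Papercup's actual prompts:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

def extract_context(transcript: str) -> dict:
    # Ask the model for the contextual signals described above, as JSON.
    prompt = (
        "Analyze this video transcript and return JSON with the keys:\n"
        '  "summary":     a two-sentence summary of the content\n'
        '  "terminology": key terms that must be translated consistently\n'
        '  "style":       tone and genre, e.g. "casual cooking tutorial"\n\n'
        f"Transcript:\n{transcript}"
    )
    return json.loads(call_llm(prompt))
```

The resulting dictionary can then be attached to every downstream translation request, so each line is translated with the whole video in mind.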


  • Creating translation guides and glossaries: We create bespoke translation guides for each customer based on their requirements and by analyzing their content (see the sketch below). These guides serve a dual purpose:

    • They provide direction to our AI translation system, ensuring consistency and compliance with client-specific requirements

    • They assist our human-in-the-loop team in correcting and improving the quality and suitability of our translations
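
In practice, a customer's translation guide might boil down to a small data structure that is rendered into the model's instructions. A sketch under that assumption (the field names and example terms are invented for illustration):

```python
# An illustrative per-customer translation guide.
translation_guide = {
    "target_language": "Latin American Spanish",
    "tone": "warm and conversational",
    "glossary": {             # terms that must always translate the same way
        "creator": "creador",
        "subscribe": "suscríbete",
    },
    "keep_in_english": ["Papercup"],   # brand names left untranslated
}

def build_instructions(guide: dict) -> str:
    rules = "\n".join(f"- Always translate '{src}' as '{tgt}'."
                      for src, tgt in guide["glossary"].items())
    return (
        f"Translate into {guide['target_language']} "
        f"with a {guide['tone']} tone.\n{rules}\n"
        f"Leave untranslated: {', '.join(guide['keep_in_english'])}."
    )
```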

  • Translation and transcreation: With the contextual information and translation guides in place, LLMs also perform the actual translation task. This process is constrained by our established instructions and guidelines, similar to how a translator and script adapter consider various factors when aiming for lip sync (a retry-loop sketch follows this list). The LLMs can:

    • Generate translations that capture the nuance and intent of the original content

    • Attempt phonetic synchronization

    • Dynamically retranslate speech if it doesn’t fit or flow appropriately

    • Propose alternative phrasings that maintain the original meaning while fitting the timing constraints of dubbing
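
The "dynamically retranslate" step can be pictured as a bounded retry loop: translate, measure, and ask for a tighter rendering if the line runs long. A simplified sketch (`call_llm` is again a hypothetical model call, and the vowel-group syllable counter is a crude approximation of a real phonetic analyzer):

```python
import re

def count_syllables(text: str) -> int:
    # Crude approximation: count vowel groups. A production system would
    # use a proper phonetic analyzer for the target language.
    return len(re.findall(r"[aeiouyáéíóúü]+", text.lower()))

def translate_to_fit(line: str, max_syllables: int, call_llm) -> str:
    prompt = f"Translate into Spanish in at most {max_syllables} syllables: {line}"
    candidate = ""
    for _ in range(3):                       # bounded number of retries
        candidate = call_llm(prompt)
        if count_syllables(candidate) <= max_syllables:
            return candidate
        prompt = (f"Too long. Rephrase more concisely, keeping the meaning, "
                  f"in at most {max_syllables} syllables: {candidate}")
    return candidate                         # fall back to the last attempt
```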

  • Cultural and dubbing adaptation: Our system goes beyond mere translation by:

    • Proposing culturally relevant translations to ensure appropriateness for different target audiences

    • Improving synchronization by generating isometric translations, i.e., translations with a similar structure, length, or form to the source text, particularly in the number of syllables (scored in the sketch below)

    • Suggesting adaptations that maintain the spirit of the original while resonating with the target culture
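
One way to operationalize "isometric" is a simple score that compares syllable counts between the source line and a candidate translation. A sketch, reusing count_syllables from the retry-loop example above:

```python
def isometry_score(source: str, candidate: str) -> float:
    # 1.0 when syllable counts match exactly; decays as they diverge.
    s, c = count_syllables(source), count_syllables(candidate)
    return 1.0 - abs(s - c) / max(s, c, 1)
```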

  • Multiple generations: Our system generates multiple versions of each translation to ensure the highest quality output. This allows our human-in-the-loop team to review and select the best option, considering accuracy, naturalness, and lip sync, as sketched below.
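
Generating several candidates and pre-ranking them for human review might look like the following sketch (building on the hypothetical call_llm and the isometry_score from the earlier examples; the sampling strategy is illustrative):

```python
def candidate_translations(line: str, call_llm, n: int = 4) -> list[str]:
    # Sample several alternatives; a real system would vary temperature,
    # prompts, or models to get meaningfully different candidates.
    return [call_llm(f"Translate into Spanish: {line}") for _ in range(n)]

def rank_for_review(line: str, candidates: list[str]) -> list[str]:
    # Pre-sort by isometry so reviewers see the best lip-sync fits first;
    # the final choice stays with the human-in-the-loop team.
    return sorted(candidates, key=lambda c: isometry_score(line, c), reverse=True)
```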

By combining these individual tasks, we can generate a first-cut translation and dubbed video that is a solid foundation for our human-in-the-loop team to build on. This initial output can be refined further to meet our customers' specific quality expectations.

Challenges and ongoing improvements

While our LLM-powered system has significantly enhanced our dubbing process for both our human-in-the-loop team and customers, it's important to acknowledge that these systems still face certain challenges:

1. Hallucinations: LLMs can sometimes generate plausible-sounding but incorrect information. In the context of dubbing, this could lead to mistranslations or the introduction of concepts not present in the original content. For factual content especially, translation accuracy is of the utmost importance. We address this through:

   • Prompt engineering and grounding the translation, i.e., supplying the LLM with all the context it needs to generate a response without hallucinating (see the sketch below)

   • Review by our human-in-the-loop team, which checks each translation and generated output against the source material to ensure the highest quality
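
Grounding, in practice, means packing verified context into the prompt and explicitly forbidding additions. A sketch of such a prompt template (the wording is invented for illustration, and it reuses the extract_context output from the earlier sketch):

```python
def grounded_prompt(line: str, context: dict) -> str:
    # Give the model everything it needs, then fence it in: translate only
    # what is in the line, using only the verified context.
    return (
        "You are translating one line of a video for dubbing.\n"
        f"Verified summary: {context['summary']}\n"
        f"Glossary: {context['terminology']}\n\n"
        "Translate the line below into Spanish. Do not add names, numbers, "
        "or claims that are not present in the line or the context.\n\n"
        f"Line: {line}"
    )
```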

2. Consistency across long-form content: Maintaining consistency in terminology, character voices, and story elements across long videos or series can be challenging for LLMs. We tackle this by:

   • Implementing memory mechanisms that track important elements throughout the content (sketched below)

   • Creating comprehensive glossaries and character profiles for each project

   • Utilizing our human-in-the-loop process to ensure continuity
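
A toy version of such a memory mechanism: pin down each term's translation the first time it appears, then feed the accumulated rules back into every later prompt (illustrative only, not Papercup's actual implementation):

```python
class TranslationMemory:
    """Tracks how recurring terms were translated across a series."""

    def __init__(self) -> None:
        self.terms: dict[str, str] = {}

    def record(self, source_term: str, translation: str) -> None:
        # First translation wins; later episodes must stay consistent.
        self.terms.setdefault(source_term, translation)

    def as_prompt_rules(self) -> str:
        return "\n".join(f"- Translate '{s}' as '{t}'."
                         for s, t in self.terms.items())
```

Each new scene's prompt then includes memory.as_prompt_rules(), so a term introduced in episode one is still translated the same way in episode ten.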

3. Handling specialized content: Some genres or industries require deep, specialized knowledge that may be beyond the general training of LLMs. To address this, we:

   • Fine-tune our models on domain-specific data when necessary

   • Collaborate closely with subject matter experts to validate translations in specialized fields

4. Emotional nuance and tone: Capturing the subtle emotional nuances and tonal shifts in dialogue can be challenging for AI systems. We're continually improving our approach by:

   • Incorporating sentiment analysis into our context extraction process to understand the true intent of the source (sketched below)

   • Training our models to recognize and replicate various emotional states in different languages

   • Leveraging human expertise to fine-tune emotional delivery in the final dub
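
Sentiment analysis can be folded into the same prompting pattern: first label the emotional register of a line, then carry that label into the translation instruction. A sketch, with call_llm the hypothetical model call used throughout:

```python
def tag_emotion(line: str, call_llm) -> str:
    # Label the emotional register, e.g. "excited", "somber", "ironic".
    return call_llm(
        f"In one lowercase word, what is the emotional tone of: {line}"
    ).strip()

def emotion_aware_prompt(line: str, call_llm) -> str:
    tone = tag_emotion(line, call_llm)
    return f"Translate into Spanish, preserving the {tone} tone: {line}"
```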

5. Ethical considerations: As with any AI system, there are ethical considerations regarding data usage, potential biases, and the impact on human translators. We address these by:

   • Ensuring transparent data practices and obtaining necessary permissions

   • Actively working to identify and mitigate biases in our models

   • Viewing our AI system as a tool to enhance, rather than replace, human creativity and expertise in the dubbing process

By acknowledging and actively addressing these challenges, we continue to refine our LLM-powered dubbing system, pushing the boundaries of what's possible in AI-assisted localization while maintaining the high quality standards that our clients expect.

What is the future of LLMs at Papercup and beyond?

As we look to the future, the potential of LLMs in localization is enormous. From hyper-personalized content adaptation to real-time, multi-modal translation systems, we're on the brink of technologies that will make language barriers a thing of the past and make it possible to achieve our mission of "making the world's videos watchable in any language". The prospect of instant, high-quality dubbing has the power to democratize content, allowing ideas and stories to flow freely across linguistic and cultural boundaries.

However, as we work on these groundbreaking developments, we are acutely aware of the challenges and responsibilities that come with such powerful technology. Issues of ethical AI use, the preservation of cultural nuances, and the role of human expertise in the localization process will continue to be at the forefront of our work.

Conclusion: the transformative power of LLMs in localization

By replicating the nuanced process of human translation—from extracting context and creating custom guides to performing culturally sensitive adaptations—our AI can rapidly produce high-quality, contextually appropriate dubbing at scale. This system opens up new possibilities for content owners to reach global audiences more effectively and efficiently.

As powerful as our LLM-powered dubbing systems are, our human-in-the-loop process remains critical to achieving the accuracy and quality our customers and their audiences expect. That is why, at Papercup, we believe the future of localization is a blend of cutting-edge AI technology and human creativity and insight.

The possibilities of LLMs in localization are as vast as language itself. We're excited to be at the forefront of this transformation, driving forward into a future where every voice can be heard, understood, and appreciated across the globe.

To find out how AI dubbing can transform your business, drop us a line.
