What if AI didn’t just respond to your questions but observed, anticipated, and acted on your behalf? That’s the leap Google’s Gemini team is working on. With a model that sees the world across text, images, video, and beyond, Gemini is pushing past reactive assistants into the realm of intuitive collaborators.
In a recent episode of Google AI’s Release Notes podcast, Gemini Model Behavior Product Lead Ani Baddepudi joined host Logan Kilpatrick, Group Product Manager for AI Studio, in a video interview unpacking how Gemini’s native multimodal design is enabling the shift from passive to proactive AI, and what it means for the future of human-computer interaction.
At the core of Gemini’s evolution is a deliberate decision: to build the model from day one to understand the world the way humans do, through multiple senses, all at once.
This isn’t a patchwork of plug-ins. Gemini was trained on tightly interleaved sequences of text, images, audio, and video, all mapped into a shared token space as a unified representation that allows it to reason fluidly across modalities.
That’s what makes it powerful out of the box: not just seeing but interpreting, watching your screen, matching what it sees to context, and helping even before you ask.
“Imagine you had…an expert human looking over your shoulder and seeing what you can see, and helping you with things,” said Baddepudi.
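What a “shared token space” means can be sketched in miniature. The toy code below is a conceptual illustration only, with invented token IDs and modality labels; Gemini’s real tokenizers and training pipeline are not public. The point is simply that text, image, audio, and video pieces become one ordered sequence the model attends over jointly.

```python
# Conceptual sketch only: an interleaved multimodal sequence in one "token
# space". Token IDs and vocabularies are invented for illustration.

from dataclasses import dataclass

@dataclass
class Token:
    modality: str   # "text", "image", "audio", "video", ...
    token_id: int   # index into that modality's vocabulary or codebook

def interleave(document):
    """Flatten a mixed-media document into one ordered token sequence."""
    sequence = []
    for modality, token_ids in document:
        sequence.extend(Token(modality, tid) for tid in token_ids)
    return sequence

# A toy "document": caption text, an image discretized into patch tokens,
# then a follow-up question. The model reasons over the whole flat sequence.
doc = [
    ("text",  [17, 942, 310]),
    ("image", [5001, 5002, 5003, 5004]),
    ("text",  [88, 4121, 9]),
]

for token in interleave(doc):
    print(token)
```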
Gemini Moves Beyond Prompts and Into Observation
Kilpatrick offered a glimpse into how users increasingly want their AI to behave:
“I’d like to just write, ‘Here are the things that…you could do for me… Take action on these whenever this thing happens on the screen,'” commented Kilpatrick. “Like, I get an error in my terminal. I kind of want the model to just go and…find a bunch of stuff and give me suggestions to fix without me having to actually talk to the model.”
That’s exactly the vision: Gemini doesn’t need a prompt to be useful. It can simply observe what’s happening—whether that’s code in your terminal or a pot boiling on the stove—and proactively step in.
“One example that I’ve used Gemini Live for is…I was cooking and…previously I would’ve had to follow a step-by-step recipe and try and pattern match what I’m doing to the recipe and, more often than not, it doesn’t turn out exactly like what’s in the recipe. Something cool that Gemini can do is it looks at what you’re doing as you’re doing it, and then proactively, based on visual cues in the video, suggests things to do. I was boiling pasta…It was like, ‘Add the pasta now.'”
The interaction didn’t begin with a request. Gemini watched, understood the situation visually, and suggested the right next move, like a real-time assistant that sees what you’re doing and knows what should happen next.
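Developers can already approximate that pattern by pointing the model at whatever output they can capture, such as a build log, and firing off a request whenever something error-like appears. The sketch below is an illustration under stated assumptions, not a description of how Gemini’s proactive behavior works internally: it assumes the google-generativeai Python SDK, an API key in the GOOGLE_API_KEY environment variable, and placeholder names for the model and log file.

```python
# Hypothetical "watch my terminal" loop: when an error shows up in a log,
# ask the model for suggested fixes. Model name and log path are placeholders.

import time
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model name
LOG_PATH = "build.log"                           # placeholder log file

def watch_for_errors(path, poll_seconds=5):
    """Yield newly appended lines that look like errors."""
    seen = 0
    while True:
        with open(path) as f:
            lines = f.readlines()
        new_errors = [line for line in lines[seen:] if "error" in line.lower()]
        seen = len(lines)
        if new_errors:
            yield "".join(new_errors)
        time.sleep(poll_seconds)

for error_text in watch_for_errors(LOG_PATH):
    response = model.generate_content(
        "This error just appeared in my terminal. Suggest likely causes "
        "and concrete fixes:\n\n" + error_text
    )
    print(response.text)
```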
Multimodal by Design, Not as an Afterthought
To support this kind of intelligence, the team reworked the way video is processed. By optimizing frame tokenization—cutting it from 256 to 64 tokens per frame—Gemini can now process up to 6 hours of video in a single pass, all while maintaining high reasoning quality within its 2 million-token context window.
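The arithmetic behind that claim is easy to check. Assuming video is sampled at roughly one frame per second (an assumption; the episode doesn’t state the sampling rate), six hours of footage is 21,600 frames:

```python
# Back-of-the-envelope token budget for long video, assuming ~1 frame/second.
FRAMES = 6 * 60 * 60           # 6 hours -> 21,600 sampled frames
CONTEXT_WINDOW = 2_000_000     # 2 million-token context

for tokens_per_frame in (256, 64):
    total = FRAMES * tokens_per_frame
    verdict = "fits" if total < CONTEXT_WINDOW else "does not fit"
    print(f"{tokens_per_frame} tokens/frame -> {total:,} tokens ({verdict})")

# 256 tokens/frame -> 5,529,600 tokens (does not fit)
# 64 tokens/frame  -> 1,382,400 tokens (fits)
```

At 64 tokens per frame, a full six-hour pass lands around 1.4 million tokens, leaving room in the 2 million-token window for the prompt and the model’s answer; at 256 tokens per frame, it wouldn’t fit at all.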
That change opened up a wide range of vision-first applications:
- Analyze a golf swing, frame by frame
- Track objects across extended surveillance footage
- Catalog every item on a bookshelf with a simple walkthrough video (see the sketch below)
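The bookshelf walkthrough maps onto the public developer API fairly directly: upload a video, wait for it to finish processing, then ask one question over the whole clip. A minimal sketch, assuming the google-generativeai Python SDK and placeholder file and model names:

```python
# Hypothetical walkthrough-video cataloguing sketch. "bookshelf.mp4" and the
# model name are placeholders; assumes an API key in GOOGLE_API_KEY.

import time
import google.generativeai as genai

video = genai.upload_file(path="bookshelf.mp4")

# Uploaded videos are processed asynchronously; poll until they are ready.
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "List every book and object visible on the shelves, grouped by shelf, "
    "as a bulleted inventory.",
])
print(response.text)
```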
Gemini doesn’t just see images. It interprets space, motion, and structure, and can return results like:
- Bounding boxes and segmentation masks
- 3D coordinates and object relationships
- Visual answers grounded in spatial understanding
“You can say: ‘Which drink in this fridge has the fewest calories?’ and it draws a box around the water bottle,” explained Baddepudi.
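That fridge example translates almost directly into an API call: pass an image and the question, and ask for the grounding back in a structured form. A minimal sketch, assuming the google-generativeai Python SDK, a placeholder model name, and a photo saved as fridge.jpg; the bounding-box convention (often [ymin, xmin, ymax, xmax] on a 0-to-1000 scale) can vary by model version, so the requested format and parsing are illustrative rather than definitive.

```python
# Hypothetical spatial-grounding call. Model name, image path, and the JSON
# shape requested from the model are all placeholders for illustration.

import json
import PIL.Image
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")
image = PIL.Image.open("fridge.jpg")

response = model.generate_content(
    [
        image,
        "Which drink in this fridge has the fewest calories? "
        'Reply as JSON: {"item": <name>, "box_2d": [ymin, xmin, ymax, xmax]}',
    ],
    generation_config={"response_mime_type": "application/json"},
)

result = json.loads(response.text)
print(result["item"], result["box_2d"])  # e.g. the water bottle and its box
```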
The team is now pushing Gemini toward more intuitive, personality-driven AI. That means:
- Recognizing implied intent without explicit prompts
- Evolving interfaces from static text to dynamic, rich visuals
- Developing AI agents that don’t just respond but relate
It’s not just about answering questions. It’s about anticipating needs, operating across environments, and engaging users with presence and empathy.
“Everything is vision,” as Kilpatrick put it. And Gemini is being built to live in that world where your screen, your space, and your surroundings are the interface.