Google DeepMind Creates Smarter, Gemini AI-Integrated Robots
Google DeepMind on Thursday shared new developments in robotics and vision language models (VLMs). The tech giant’s artificial intelligence (AI) research arm has been working with advanced vision models to develop new capabilities in robots. In a new study, DeepMind highlighted that Gemini 1.5 Pro and its long context window have enabled breakthroughs in how its robots navigate and understand real-world environments. Earlier this year, Nvidia also unveiled new AI technology that powers advanced capabilities in humanoid robots.
Google DeepMind Uses Gemini AI to Improve Robots
In a post on X (formerly known as Twitter), Google DeepMind revealed that it is training its robots using Gemini 1.5 Pro’s 2 million token context window. A context window can be thought of as the window of knowledge visible to an AI model, which allows it to process tangential information surrounding the queried topic.
For example, if a user asks an AI model for the “most popular ice cream flavors”, the model searches for the keywords “ice cream” and “flavors” to find information relevant to that query. If this window of information is too small, the AI may only be able to respond with the names of different ice cream flavors. With a larger window, however, it can also see how many articles mention each flavor, find the one mentioned most often, and derive a “popularity factor”.
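To make the idea concrete, here is a minimal, illustrative Python sketch (not DeepMind’s code, and using a crude word count as a stand-in for real tokenisation) of how a context-window budget decides how much retrieved material a model actually gets to see for a query:

```python
# Minimal sketch: how a context-window budget limits what retrieved
# material reaches the model. Token counting here is a crude word split;
# real models use proper tokenizers.

def build_prompt(query: str, documents: list[str], max_context_tokens: int) -> str:
    """Pack as many retrieved documents as fit into the model's context window."""
    used = len(query.split())
    included = []
    for doc in documents:
        doc_tokens = len(doc.split())
        if used + doc_tokens > max_context_tokens:
            break  # a small window cuts off the remaining evidence
        included.append(doc)
        used += doc_tokens
    return query + "\n\n" + "\n\n".join(included)

articles = [
    "Vanilla remains the most frequently mentioned flavor across 120 articles...",
    "Chocolate features in 95 recent articles...",
    "Pistachio appears in 30 articles...",
]

# With a tiny window the model sees only flavor names; with a large one it can
# also weigh how often each flavor is discussed and infer "popularity".
small = build_prompt("most popular ice cream flavors", articles, max_context_tokens=20)
large = build_prompt("most popular ice cream flavors", articles, max_context_tokens=2_000_000)
print(len(small.split()), len(large.split()))
```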
DeepMind uses this long context window to train its robots in real-world environments. The division wants to see whether a robot can remember the details of an environment and then help users who ask about it in contextual or vague terms. In a video shared on Instagram, the AI division demonstrated that a robot could direct a user to a whiteboard when asked for a place to draw.
“Our robots, leveraging 1.5 Pro’s 1 million token context length, can successfully navigate a space using human-like instructions, video tours, and common sense,” Google DeepMind said in a statement.
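DeepMind has not released the code behind this demo, but the rough shape of the approach can be sketched: the full tour the robot has recorded is placed in the prompt of a long-context model, and a vague request is then resolved against everything the robot has seen. All names below, including call_long_context_model, are hypothetical placeholders rather than any real API:

```python
# Illustrative sketch only: a hypothetical stand-in for any multimodal model
# with a very large context window (such as Gemini 1.5 Pro).
from dataclasses import dataclass

@dataclass
class TourFrame:
    timestamp_s: float      # when the frame was captured during the tour
    description: str        # what the robot saw, e.g. produced by a vision model

def call_long_context_model(prompt: str) -> str:
    """Hypothetical stand-in for a long-context multimodal model call."""
    raise NotImplementedError("replace with a real model API")

def answer_navigation_request(tour: list[TourFrame], instruction: str) -> str:
    # Because the context window is large, the *entire* tour can be placed in
    # the prompt; the model can then match a vague request ("somewhere to draw")
    # against everything the robot has seen, rather than a short summary.
    tour_text = "\n".join(f"[{f.timestamp_s:.0f}s] {f.description}" for f in tour)
    prompt = (
        "You are guiding a robot through a building it has already toured.\n"
        f"Tour log:\n{tour_text}\n\n"
        f"User request: {instruction}\n"
        "Reply with the single best location from the tour log."
    )
    return call_long_context_model(prompt)

# Example: a vague request is resolved against what the robot remembers seeing.
tour = [
    TourFrame(12, "kitchen with a coffee machine"),
    TourFrame(48, "meeting room with a large whiteboard and markers"),
]
# answer_navigation_request(tour, "Where can I draw something?")
# -> expected to point the user to the whiteboard in the meeting room
```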
In a study published on arXiv (an online repository for papers that have not yet been peer reviewed), DeepMind explained the technology behind the breakthrough. In addition to Gemini, it is also using its own Robotic Transformer 2 (RT-2) model, a vision-language-action (VLA) model that learns from both web and robotics data. It uses computer vision to process real-world environments and turns that information into datasets, which the generative AI can later process to break down contextual commands and produce the desired outcomes.
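Described at that level, the pipeline resembles a perceive-then-plan loop. The sketch below is a loose, hypothetical illustration of that flow rather than RT-2 itself; in the real model, perception and action prediction happen inside a single network:

```python
# Loose sketch of the described pipeline, with hypothetical components:
# a vision step turns camera frames into text observations, the observations
# accumulate into a dataset, and a generative step later breaks a contextual
# command into low-level actions. This is not DeepMind's RT-2 code.

def describe_frame(frame_id: int) -> str:
    """Hypothetical vision step: map a camera frame to a text observation."""
    return {0: "hallway with exit sign", 1: "desk with a red stapler"}[frame_id]

def plan_actions(command: str, observations: list[str]) -> list[str]:
    """Hypothetical language/action step: in a real VLA model this would be one
    network emitting action tokens; here it is a trivial keyword lookup."""
    for i, obs in enumerate(observations):
        if "stapler" in obs and "stapler" in command:
            return [f"navigate_to(frame={i})", "pick(object='stapler')"]
    return ["report('object not found')"]

# 1) Computer vision builds the dataset of observations...
observations = [describe_frame(i) for i in range(2)]
# 2) ...which the generative step uses to ground a contextual command in actions.
print(plan_actions("bring me the red stapler", observations))
```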
Currently, Google DeepMind is using this architecture to train its robots in a broad category known as Multimodal Instruction Navigation (MIN), which includes environmental exploration and instruction-driven navigation. If the demonstration shared by the division is legitimate, this technology could further advance robotics.