Google DeepMind’s RT-2 AI can control robots

RT-2 is the latest leap in AI robotics. What is Google DeepMind working on behind closed doors, and what does RT-2 mean for our future?
Last Updated on August 3, 2023

The boundaries between digital and physical, science and science fiction, continue to blur. Google DeepMind’s latest development in AI robotics sees the RT-2 model bridge that gap further, allowing robots to learn to move by themselves.

What is Google DeepMind’s RT-2 AI model?

Google DeepMind’s RT-2 is an AI model that allows robots to follow instructions (written in natural human language) that they were never explicitly programmed to perform.

It follows its predecessor, RT-1, which already broke ground in robotic control research by showing how robots can learn from each other using Transformers, the same architecture behind GPT-4 (Generative Pre-trained Transformer 4). Try “What is ChatGPT – and what is it used for?” or “How to use ChatGPT on mobile” for further reading on ChatGPT.


DeepMind calls it a vision-language-action (VLA) model, meaning it can interpret data from the internet, generalise it, and apply anything relevant to its own control parameters – effectively generating general-purpose instructions for real-world scenarios it was never programmed to handle.

RT-2 itself stands for Robotics Transformer 2 and is based on the same machine learning concept as OpenAI’s GPT-4 – but it is not purely an LLM (Large Language Model). The transformer concept describes a neural network in which the meanings and relations of the instructions it receives are tokenised, much as colours are tokenised into hex codes for the benefit of a digital system. The concept is general-purpose and can be applied not only to natural language generation, but also to speech recognition, object classification, contextual awareness, and navigation of real-world environments.
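To make that analogy concrete, here is a minimal, purely illustrative Python sketch of tokenisation: a toy vocabulary maps each word of an instruction to an integer ID, much as a hex code maps a colour to a number a digital system can process. The vocabulary and IDs below are invented for this example; real transformers use learned subword tokenisers over vocabularies of tens of thousands of entries.

```python
# Toy illustration of tokenisation: words become integer IDs a model can process.
# This vocabulary is invented for the example; real transformers use learned
# subword tokenisers (e.g. byte-pair encoding) over far larger vocabularies.

VOCAB = {
    "<unk>": 0,   # fallback for words outside the vocabulary
    "pick": 1,
    "up": 2,
    "the": 3,
    "blue": 4,
    "block": 5,
    "move": 6,
    "to": 7,
    "mustard": 8,
}

def tokenise(instruction: str) -> list[int]:
    """Map each lower-cased word to its token ID, like a hex code for a colour."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in instruction.lower().split()]

print(tokenise("move the blue block to the mustard"))
# [6, 3, 4, 5, 7, 3, 8]
```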

Successful multi-task demonstrations have been observed across multiple models. The Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E), both prior Google projects, were used to power the robot. This is interesting considering that PaLI-X was not developed with robotic control in mind. The results found PaLM-E more reliable for mathematical reasoning than the “mostly visually pre-trained PaLI-X”.

RT-2: Vision-Language-Action Models

In the research paper “RT-2: Vision-Language-Action Models”, the AI division explains how “RT-2 can exhibit signs of chain-of-thought reasoning similarly to vision-language models.” This multi-stage semantic reasoning shows that RT-2 “is able to answer more sophisticated commands due to the fact that it is given a place to plan its actions in natural language first.”

“This is a promising direction that provides some initial evidence that using LLMs or VLMs as planners can be combined with low-level policies in a single VLA model.”

In the research paper, which credits 52 authors, DeepMind asserts that “RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.”

Google DeepMind’s RT-2 performing natural language instructions

This is all to say: Vision-Language-Action models transfer web knowledge to robotic control. In other words, type a simple prompt as you would with chatbots like ChatGPT, and the robot will follow those instructions. It takes an abstract task like “move the blue block to the mustard” and generates the coordinates necessary to make that happen in real time.
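As a rough sketch of what that loop might look like in code, consider the following. Everything here is a hypothetical stand-in – the `VLAModel` class, its `predict_action` method, and the action fields – since DeepMind has not published a public API for RT-2; the point is only the flow from camera frame plus text prompt to a concrete end-effector target.

```python
# Illustrative-only sketch of a VLA inference loop. `VLAModel` and
# `predict_action` are hypothetical stand-ins, not DeepMind's API.

from dataclasses import dataclass

@dataclass
class Action:
    x: float        # target end-effector position (metres)
    y: float
    z: float
    gripper: float  # 0.0 = open, 1.0 = closed

class VLAModel:
    def predict_action(self, image: bytes, instruction: str) -> Action:
        # A real VLA model would run the image and instruction through one
        # transformer and decode action tokens; this stub returns a fixed pose.
        return Action(x=0.42, y=-0.10, z=0.05, gripper=1.0)

model = VLAModel()
camera_frame = b"raw RGB frame from the robot's camera"
action = model.predict_action(camera_frame, "move the blue block to the mustard")
print(action)  # Action(x=0.42, y=-0.1, z=0.05, gripper=1.0)
```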

A video of this RT-2-powered robot at work can be found here.

Can Google’s new robot program itself? 

In a sense, yes. The goal here is a general-purpose robot that requires less programming and affords more versatility than any before it.

As Google itself puts it, DeepMind is studying “how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.”

“In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens.” The result is the ability to recognise objects not present in the training data.
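That recipe can be sketched in a few lines of Python. The paper describes discretising each continuous action dimension into 256 bins (a scheme carried over from RT-1) and writing the bin indices out as text; the dimension names (x, y, z, gripper), the [-1, 1] value ranges, and the space-separated string format below are illustrative assumptions rather than DeepMind’s exact configuration.

```python
# Sketch of "actions as text tokens": discretise each continuous action
# dimension into 256 bins and write the bin indices out as a string, so an
# action can sit in the training data like any other text. The dimension
# names and value ranges here are illustrative assumptions.

def to_bin(value: float, low: float, high: float, bins: int = 256) -> int:
    """Clamp `value` to [low, high] and map it to an integer bin index."""
    value = max(low, min(high, value))
    return round((value - low) / (high - low) * (bins - 1))

def action_to_tokens(x: float, y: float, z: float, gripper: float) -> str:
    """Encode one robot action as a space-separated string of bin indices."""
    return " ".join(str(to_bin(v, -1.0, 1.0)) for v in (x, y, z, gripper))

# The encoded action sits in a training example next to ordinary text:
example = {
    "instruction": "move the blue block to the mustard",
    "action": action_to_tokens(0.42, -0.10, 0.05, 1.0),
}
print(example["action"])  # "181 115 134 255"
```

Because the encoded action is just a string, no special output head is needed: the same transformer that answers questions in text can emit an action, which is what lets web-scale pretraining carry over to control.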

Steve is an AI Content Writer for PC Guide, writing about all things artificial intelligence. He currently leads the AI reviews on the website.