Google DeepMind’s RT-2 AI can control robots

RT-2 is the latest leap in AI robotics. What is Google DeepMind working on behind closed doors, and what does RT-2 mean for our future?

Google DeepMind's RT-2 AI

The boundaries between digital and physical, science and science fiction, continue to blur. Google DeepMind's latest development in AI robotics sees the RT-2 model bridge that gap further, allowing robots to learn to move by themselves.

What is Google DeepMind’s RT-2 AI model?

Google DeepMind’s RT-2 is an AI model that allows robots to follow instructions (written in natural human language) that they were never explicitly programmed to perform.

It follows its predecessor, RT-1, which already broke ground in robotic control research. RT-1 showed how robots can learn from each other using transformers, the same architecture behind models like GPT-4 (Generative Pre-trained Transformer 4).


DeepMind is calling it a vision-language-action (VLA) model, which allows it to interpret data from the internet, generalize it, and apply anything relevant to its own control parameters, effectively generating general-purpose instructions for real-world scenarios it was never programmed to handle.

RT-2 itself stands for Robotics Transformer 2 and is based on the same machine learning concept as OpenAI's GPT-4, though it is not purely an LLM (Large Language Model). The transformer concept describes a neural network in which the meanings and relations of the instructions it receives are tokenised, similarly to how colours are tokenised into hex codes for the benefit of a digital system. The concept is general-purpose and can be applied not only to natural language generation, but also to speech recognition, object classification, contextual awareness, and navigation of real-world environments.

Successful multi-task demonstrations have been observed on multiple models. Pathways Language and Image model (PaLI-X) and Pathways Language model Embodied (PaLM-E), both prior Google projects, were used to power the robot. This is interesting considering that PaLI-X was not developed with robotics control in mind. The results found PaLM-E more reliable for mathematical reasoning than the “mostly visually pre-trained PaLI-X”.

RT-2: Vision-Language-Action Models

In the research paper "RT-2: Vision-Language-Action Models", the AI division explains how "RT-2 can exhibit signs of chain-of-thought reasoning similarly to vision-language models." This multi-stage semantic reasoning shows that RT-2 "is able to answer more sophisticated commands due to the fact that it is given a place to plan its actions in natural language first. This is a promising direction that provides some initial evidence that using LLMs or VLMs as planners can be combined with low-level policies in a single VLA model."

In the research paper, which lists 52 authors, DeepMind asserts that "RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions."

Google’s DeepMind RT-2 performing natural language instructions

This is all to say: Vision-Language-Action models transfer web knowledge to robotic control. In other words, type a simple prompt as you would with chatbots like ChatGPT and the robot will follow those instructions. It takes abstract tasks like "move the blue block to the mustard" and generates the coordinates necessary to make them happen in real time.
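Conceptually, the model's output is a short string of integer tokens that a low-level controller converts back into real-valued commands. The sketch below is a hedged illustration of that de-discretization step; the function name, the [-1, 1] range, and the 256-bin resolution are assumptions for demonstration, not DeepMind's actual interface:

```python
# Hypothetical decoder: the VLA model emits an action as a text string of
# integer tokens (one per degree of freedom); the controller maps each
# token (0..bins-1) back to a float command in [low, high].

def detokenize(token_string, low=-1.0, high=1.0, bins=256):
    """De-discretize a space-separated token string into float commands."""
    step = (high - low) / (bins - 1)
    return [low + int(t) * step for t in token_string.split()]

action = detokenize("128 64 255")
print([round(a, 2) for a in action])  # → [0.0, -0.5, 1.0]
```

The point of this design is that the same transformer that "plans" in natural language can emit these action tokens with no separate output head: actions are just more text.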

Google DeepMind has published video of the RT-2-powered robot at work.

Can Google’s new robot program itself? 

In a sense, yes. The goal here is a general-purpose robot that requires less programming and affords more versatility than any before it.

As Google itself puts it, DeepMind is studying “how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.”

"In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens." The result is the ability to recognise objects not present in the robot's training data.
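The "actions as text tokens" recipe can be sketched as the inverse of decoding: each continuous action dimension is discretized into an integer bin and written out as plain text, so it can sit in the training data alongside ordinary language. The value range and bin count below are illustrative assumptions, not figures from the paper:

```python
# Sketch of expressing a robot action as text: discretize each dimension
# of a continuous action vector into one of `bins` integer buckets, then
# join the bucket indices into a plain string of "tokens".

def action_to_text(action, low=-1.0, high=1.0, bins=256):
    """Discretize each action dimension and render the result as text."""
    span = high - low
    ids = [min(bins - 1, int((a - low) / span * bins)) for a in action]
    return " ".join(str(i) for i in ids)

print(action_to_text([0.0, -0.5, 1.0]))  # → "128 64 255"
```

Once actions look like this, a training example is simply an image, an instruction, and a target string, which is exactly the format a vision-language model already consumes.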