The boundaries between digital and physical, science and science fiction, continue to blur. Google DeepMind’s latest development in AI robotics sees the RT-2 model bridge that gap further, allowing robots to work out how to move by themselves.
What is Google DeepMind’s RT-2 AI model?
Google DeepMind’s RT-2 is an AI model that allows robots to follow instructions (written in natural human language) that they were never explicitly programmed to perform.
It follows its predecessor, RT-1, which had already broken ground in robotic control research by showing how robots can learn from one another using Transformers – the same architecture behind GPT-4 (Generative Pre-trained Transformer 4). Try “What is ChatGPT – and what is it used for?” or “How to use ChatGPT on mobile” for further reading on ChatGPT.
DeepMind is calling it a vision-language-action (VLA) model, which allows it to interpret data from the internet, generalise it, and apply anything relevant to its own control parameters – effectively generating general-purpose instructions for real-world scenarios it was never programmed to handle.
RT-2 itself stands for Robotics Transformer 2 and is based on the same machine learning concept as OpenAI’s GPT-4 – but it is not purely an LLM (Large Language Model). The Transformer concept describes a neural network in which the meanings and relations of the instructions it receives are tokenised, much as colours are reduced to hex codes for the benefit of a digital system. The concept is general-purpose and can be applied not only to natural language generation but also to speech recognition, object classification, contextual awareness, and navigation of real-world environments.
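As a rough illustration of that tokenisation idea – a minimal sketch with a made-up vocabulary and function name, not DeepMind’s actual tokeniser – here is how an instruction might be reduced to the integer IDs a transformer works with:

```python
# A tiny, hypothetical vocabulary: each word maps to an integer ID,
# much as a colour is reduced to a hex code for a digital system.
VOCAB = {"pick": 0, "up": 1, "the": 2, "green": 3, "block": 4, "<unk>": 5}

def tokenise(instruction: str) -> list[int]:
    """Map each word to its integer ID, falling back to <unk> for unknown words."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in instruction.lower().split()]

print(tokenise("Pick up the green block"))  # [0, 1, 2, 3, 4]
```

A real model uses a vocabulary of tens of thousands of sub-word tokens rather than whole words, but the principle is the same: once everything is an integer sequence, the same network can process language, images, and – as RT-2 shows – robot actions.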
Successful multi-task demonstrations have been observed on multiple models. The Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E), both prior Google projects, were used as the backbones powering the robot. This is interesting considering that PaLI-X was not developed with robotic control in mind. The results found PaLM-E more reliable for mathematical reasoning than the “mostly visually pre-trained PaLI-X”.
RT-2: Vision-Language-Action Models
In the research paper “RT-2: Vision-Language-Action Models”, the AI division explains how “RT-2 can exhibit signs of chain-of-thought reasoning similarly to vision-language models.” This multi-stage semantic reasoning shows that RT-2 “is able to answer more sophisticated commands due to the fact that it is given a place to plan its actions in natural language first. This is a promising direction that provides some initial evidence that using LLMs or VLMs as planners can be combined with low-level policies in a single VLA model.”
In the paper, which credits 52 authors, DeepMind asserts that “RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.”
This is all to say that Vision-Language-Action models transfer web knowledge to robotic control. In other words, type a simple prompt as you would with a chatbot like ChatGPT, and the robot will follow those instructions. It takes an abstract task like “move the blue block to the mustard” and generates, in real time, the coordinates needed to make that happen.
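To make the “prompt in, movement out” idea concrete, here is a hypothetical sketch of what the interface to such a model looks like from the outside. The names, the Action fields, and the placeholder output are all assumptions for illustration, not DeepMind’s code:

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A simplified end-effector command: positional deltas (metres) and a gripper state."""
    dx: float
    dy: float
    dz: float
    gripper_closed: bool

def vla_policy(instruction: str, camera_image) -> Action:
    """Hypothetical stand-in for a vision-language-action model.

    A real VLA model tokenises the instruction and the camera image, runs them
    through a transformer, and decodes the predicted action tokens. This stub
    simply returns a fixed placeholder so the interface shape is visible.
    """
    return Action(dx=0.05, dy=-0.02, dz=0.0, gripper_closed=False)

# One step of the control loop: a plain-language prompt in, a motor-level command out.
command = vla_policy("move the blue block to the mustard", camera_image=None)
print(command)
```

In a real deployment this call would run repeatedly, once per control timestep, until the model signals that the task is complete.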
A video showing the RT-2-powered robot at work can be found here.
Can Google’s new robot program itself?
In a sense, yes. The goal here is a general-purpose robot that requires less programming and affords more versatility than any before it.
As Google itself puts it, DeepMind is studying “how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.”
“In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens.” The result is the ability to recognise objects that were not present in the robot’s training data.
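The “actions as text tokens” recipe quoted above can be sketched as follows. The 256-bin discretisation mirrors the approach described in the RT-2 paper, but the specific value ranges, field order, and function names here are assumptions for illustration only:

```python
def discretise(value: float, low: float, high: float, bins: int = 256) -> int:
    """Clamp a continuous value to [low, high] and map it to an integer bin."""
    value = max(low, min(high, value))
    return int((value - low) / (high - low) * (bins - 1))

def action_to_text(terminate: bool, delta_pos, delta_rot, gripper: float) -> str:
    """Flatten one robot action into a space-separated string of integers so it
    can sit in the training data alongside ordinary natural-language tokens."""
    tokens = [int(terminate)]                                   # 1 = end the episode
    tokens += [discretise(v, -0.1, 0.1) for v in delta_pos]     # end-effector shift (assumed range, metres)
    tokens += [discretise(v, -0.5, 0.5) for v in delta_rot]     # end-effector rotation (assumed range, radians)
    tokens.append(discretise(gripper, 0.0, 1.0))                # gripper openness
    return " ".join(str(t) for t in tokens)

# Prints "0 153 114 127 127 127 204 255" - a string the same tokeniser that reads prose can read.
print(action_to_text(False, (0.02, -0.01, 0.0), (0.0, 0.0, 0.3), 1.0))
```

Because actions end up looking like just another sentence, the model can be trained on web-scale vision-language data and robot trajectories together, which is what lets knowledge from the internet flow through to physical control.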