What are Q-Learning and Q*? – OpenAI’s secret AI models

What the Mira Murati letter reveals

On Wednesday, November 22nd, OpenAI CTO Mira Murati sent a letter to employees. The letter detailed a project known internally as Q* (pronounced “Q-Star”) or Q-Learning. This project was purported to be “one factor among a longer list of grievances by the board leading to Altman’s firing”, and could reportedly accelerate progress in mathematical reasoning on the way to AGI (Artificial General Intelligence). So, how does Q-Learning work, and what controversy (reportedly) led to the firing of OpenAI CEO Sam Altman?

OpenAI CTO Mira Murati and the internal letter to staff

Q* and Q-Learning are trending today due to references made by OpenAI’s Chief Technology Officer, Mira Murati, on Wednesday, November 22nd. It’s expected that this technology could be an ingredient for achieving AGI, or Artificial General Intelligence. According to an internal letter sent out by Murati to OpenAI employees, a “lack of consistently candid communication” about such a world-changing development played a part in the board’s decision to fire OpenAI CEO Sam Altman.

✓ Steve says

Was the writing on the wall?

There are plenty of conspiracy theories still circulating as to why the OpenAI board fired Sam Altman as CEO. To date, this seems to be the most probable reason, but it is still unconfirmed. It appears to fit the reasoning initially given by the board, that Altman was “not consistently candid in his communications with the board”, if what he held back was just how much progress had recently been made with the Q-Learning algorithm.

In fact, Altman said on stage on November 16th, the day before he was fired, that “4 times now in the history of OpenAI – the most recent time was just in the last couple of weeks – I’ve gotten to be in the room when we push the veil of ignorance back and the frontier of discovery forward”. It’s possible that this 4th breakthrough was none other than project Q*. Then again, it’s also possible this was in reference to GPT-4 Turbo or ChatGPT’s new voice capabilities.

What are project Q* and its Q-Learning algorithm?

To date, Q* and Q-Learning are being used synonymously. With very little documentation and few official references to these terms, we’re unable to definitively differentiate them. However, it’s possible that Q* is an internal project name, in reference to the optimal solution of a Bellman equation (which we’ll return to later). Q* may also be the name of a corresponding AI model yet to be announced by OpenAI, or at least a working title thereof. By contrast, Q-Learning is a mathematical concept: an algorithm that would presumably be used in this project and AI model.

Names aside, Q-Learning is a reinforcement learning algorithm. The model reportedly built with it is capable of “grade-school” level mathematics, and it is hoped to surpass OpenAI’s GPT-4 model in that field. Reinforcement learning is a machine learning technique wherein rewards are given for correct or optimal actions, and penalties are given for incorrect or suboptimal actions. A machine can learn the shortest path to an expected reward by exploring the available paths, finding a better route through trial and error, and making better decisions each time until it settles on an optimized strategy.
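The reward-and-penalty loop described above can be sketched as textbook tabular Q-learning. This is a minimal illustration and not anything confirmed about OpenAI’s project: the five-cell corridor, the reward values, and the settings ALPHA, GAMMA and EPSILON are all assumptions chosen to keep the example small.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # assumed learning rate, discount, exploration rate


def step(state, action):
    """Move inside the corridor: +1 reward at the goal, a -0.1 penalty otherwise."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else -0.1
    return next_state, reward


def train(episodes=500, seed=0):
    """Learn one score per (cell, action) pair by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Explore with probability EPSILON, otherwise take the best-known action.
            if rng.random() < EPSILON:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: q[state][i])
            next_state, reward = step(state, ACTIONS[a])
            # Nudge the old estimate towards reward + discounted best future score.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Best-known action in every non-goal cell.
policy = [ACTIONS[max((0, 1), key=lambda i: row[i])] for row in q_table[:-1]]
print(policy)
```

With these settings the learned policy comes out as “step right” (+1) in every cell, which is the shortest route to the goal in this toy corridor.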

But how does this all relate to Q*? Q-values, also known as action values, put a number on how effective a given action is in a given state. These values are stored in a Q-table, with one entry for every state-action pair. By comparing the entries for its current state, a machine can objectively rank its options: the action with the highest Q-value is the best one found so far by the algorithm.
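As a rough illustration of how a Q-table is read, the sketch below looks up the highest-scoring action for a state. The state names, action names and numbers are invented for the example and do not come from any real model.

```python
# Hypothetical Q-table: one score per (state, action) pair.
q_table = {
    "start":     {"left": 0.31, "right": 0.46},
    "midway":    {"left": 0.46, "right": 0.80},
    "near_goal": {"left": 0.62, "right": 1.00},
}


def best_action(state):
    """Return the action with the highest Q-value recorded for this state."""
    actions = q_table[state]
    return max(actions, key=actions.get)


print(best_action("midway"))  # right
```

Because the table only holds the best estimates found so far, the answer can change as the algorithm keeps learning and updates the numbers.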

The Bellman equation – the reinforcement learning math behind Q*

In mathematics, ℚ is used to denote the set of rational numbers, a rational number being “a number that can be expressed as the quotient or fraction of two integers”. In reinforcement learning, however, Q denotes the action-value function (often read as the “quality” of an action), and the asterisk marks its optimal version. OpenAI’s use of Q* may therefore refer to the optimal action-value function in the Bellman optimality equation. In other words, Q* is the optimal solution (by definition) of an optimization problem. It’s not hard to see how this kind of optimization relates to the work of OpenAI.

The Bellman equation is a formula that allows us (or a machine) to make the best-informed decision at each stage of a multi-stage process. Named after Richard E. Bellman, the award-winning Brooklyn-born mathematician, it helps to find a solution to a complex, multi-stage problem by making the best decision at each stage, given what is known at that stage. The person (or computer) running the algorithm can plug in a priority, which is called the objective function, such as “minimizing travel time, minimizing cost, maximizing profits, maximizing utility” etcetera. The algorithm will then dictate the best possible actions to take to achieve the desired result.
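To make the “best decision at each stage” idea concrete, here is a small sketch that uses Bellman’s recursion with “minimizing travel time” as the objective function. The road network, stop names and minutes are made up for illustration.

```python
from functools import lru_cache

# Hypothetical road network: travel time in minutes between stops.
ROADS = {
    "home":   {"bridge": 10, "tunnel": 15},
    "bridge": {"market": 12, "park": 30},
    "tunnel": {"market": 4},
    "market": {"office": 8},
    "park":   {"office": 3},
    "office": {},
}


@lru_cache(maxsize=None)
def best_time(stop):
    """Fewest minutes from this stop to the office (Bellman recursion)."""
    if stop == "office":
        return 0
    # Best value here = cheapest (time of next leg + best value from there on).
    return min(minutes + best_time(nxt) for nxt, minutes in ROADS[stop].items())


def best_route(stop):
    """Follow the choice that minimizes total remaining time at every stage."""
    route = [stop]
    while stop != "office":
        stop = min(ROADS[stop], key=lambda nxt: ROADS[stop][nxt] + best_time(nxt))
        route.append(stop)
    return route


print(best_time("home"))   # 27
print(best_route("home"))  # ['home', 'tunnel', 'market', 'office']
```

Note that the quickest first leg (the 10-minute bridge) is not part of the best route. The recursion weighs each choice together with everything that follows it, which is the point of the Bellman approach.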

What is a Bellman equation?

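A Bellman equation for the value of a state s under a policy π may be written as Vπ(s) = ∑ P(s′|s,a) [R(s,a,s′) + γ Vπ(s′)], with the sum taken over every possible next state s′. Here a is the action the policy chooses in state s, P(s′|s,a) is the probability of landing in s′, R(s,a,s′) is the reward for that transition, and γ is a discount factor between 0 and 1 that makes future rewards count for less.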

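Q* plays a role in the optimality form of this equation, where Q* is the ‘Optimal Function’, s stands for ‘State’ and a refers to ‘Action’. The primes in q∗(s′, a′) mark the next state and the next action.

(s′, a′) is a state-action pair, and q∗(s′, a′) is the best value achievable from that pair. It is fundamental to the Bellman optimality equation, which links the value of the current state (or given state) to the value of the next state in a process.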
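For reference, these are the standard textbook forms of the Bellman optimality equation for Q* and of the Q-learning update that approximates it. They are general reinforcement learning results, not anything confirmed about OpenAI’s project. The learning rate α and the observed reward r are symbols introduced here that do not appear in the equation above.

```latex
% Bellman optimality equation for the optimal action-value function
\[
Q^{*}(s,a) = \sum_{s'} P(s' \mid s,a)\,
  \Bigl[ R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a') \Bigr]
\]

% Q-learning update after observing one transition (s, a, r, s')
\[
Q(s,a) \leftarrow Q(s,a) + \alpha
  \Bigl[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Bigr]
\]
```

The first line defines what the best possible scores must satisfy. The second shows how an algorithm can move its current estimates towards them one experience at a time, without needing to know the probabilities P in advance.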

This concludes our maths lesson because I’m not the mathematician required to explain any of this.

Final thoughts

We will surely hear more about this mysterious ‘project Q*’ in the near future (and sooner than OpenAI intended, no doubt). For now, this is all that is publicly known about how OpenAI is pushing “the veil of ignorance back and the frontier of discovery forward” for Q-Learning, and its machine learning applications in AI. Perhaps Sam Altman will return with more power to reveal this secretive project soon. When OpenAI finally releases this new mathematical model, you’ll hear about it here first!