OpenAI's latest o1 model family tries to emulate 'reasoning' – though it might overthink things a bit
'Chain of thought' techniques mean the latest LLM is 'better' at stepping through complex challenges
OpenAI on Thursday introduced o1, its latest large language model family, which it claims is capable of emulating complex reasoning.
The o1 model set – which presently consists of o1-preview and o1-mini – employs so-called "chain of thought" techniques.
In a 2022 paper, Google researchers described chain of thought as "a series of intermediate natural language reasoning steps that lead to the final output."
OpenAI has explained the technique as meaning o1 "learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."
To understand the chain of thought techniques, consider the following prompt:
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
According to the Google paper, GPT-3 could not reliably produce an accurate answer to that prompt.
The current free version of ChatGPT – powered by OpenAI's GPT-4o mini model – already has some power to emulate "reasoning," and responds to the prompt by showing how it reached the correct answer. Here's its output:
To find out how many apples the cafeteria has now, you can follow these steps:
- Start with the initial number of apples: 23 apples.
- Subtract the apples used for lunch: 23 - 20 = 3 apples.
- Add the apples they bought: 3 + 6 = 9 apples.
So, the cafeteria now has 9 apples.
That's a pleasingly detailed and correct response.
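For the curious, here's a minimal sketch of how you might fire that same prompt at a chat model through the OpenAI Python SDK and ask it to show its working. The model name and prompt wording are our illustrative choices, not anything OpenAI prescribes, and you'll need your own API key.

```python
# Minimal sketch: send the cafeteria prompt to a chat model and ask it to
# show its steps. Requires the `openai` package and an OPENAI_API_KEY
# environment variable; the model name here is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have? Show your steps."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # the model behind the free ChatGPT tier
    messages=[{"role": "user", "content": PROMPT}],
)

print(response.choices[0].message.content)
# Expected shape of the answer: 23 - 20 = 3, then 3 + 6 = 9 apples.
```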
In its explainer for o1 and the chain-of-thought tech – which should handle reasoning like the above with ease – OpenAI offers examples, including the AI system being asked to solve a crossword puzzle after being prompted with a textual representation of the puzzle grid and its clues.
GPT-4o can't solve the puzzle. o1-preview solves the puzzle, and explains how it did it – starting with output that analyzes the puzzle itself as follows:
Understanding the Grid:
- All the Across words start at the leftmost column and are 6 letters long.
- All the Down words start at the top row and are 6 letters long.
- The numbers correspond to the starting square of each word.
The model's output later explains how it went about solving the puzzle, as follows:
1 Across (ESCAPE) intersects with 1 Down (ESTATE) at the E.
2 Across (SEALER) aligns perfectly with the letters from the Down words in the second row.
3 Across (TERESA) uses the T from the third letter of 1 Down (ESTATE) and fits with Down words.
Each letter of the Across words matches the corresponding letter in the Down words where they intersect.
That response above is chain of thought at work. By training and configuring the neural network model to write out steps to take, it appears to home in on a solution far better than by predicting the answer straight from the question as before. It's not actually thinking or reasoning; it's emulating that process by writing out the steps it ought to take and then allowing that chain of predictions to guide it to an answer.
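As a quick sanity check on those intersections, the three Across answers quoted above really do line up with 1 Down – assuming the usual layout in which 1 Down runs down the leftmost column and each Across answer fills a row, an assumption on our part since the full grid isn't reproduced here. A few lines of Python confirm it:

```python
# Checking the intersections o1 describes, assuming 1 Down fills the
# leftmost column and each Across answer fills one row of the grid.
across = ["ESCAPE", "SEALER", "TERESA"]   # 1, 2 and 3 Across, per o1's output
one_down = "ESTATE"                       # 1 Down, per o1's output

for row, word in enumerate(across):
    # The first letter of each Across answer must match the letter of
    # 1 Down in the same row, since that's where the two words cross.
    assert word[0] == one_down[row], f"mismatch in row {row + 1}"
    print(f"{row + 1} Across starts with '{word[0]}', "
          f"matching letter {row + 1} of 1 Down")
```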
OpenAI likes that output, for two reasons.
One is that "chain of thought reasoning provides new opportunities for alignment and safety," according to the explainer article. "We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles.
"We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios."
The other is that o1 allegedly smashes its predecessors on OpenAI's own benchmarks – which can't be bad for business.
Your mileage may vary.
Under the hood
"o1 is trained with RL [reinforcement learning] to 'think' before responding via a private chain of thought," explained Noam Brown, research scientist at OpenAI, in a social media thread. "The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We're no longer bottlenecked by pretraining. We can now scale inference compute too."
What's new for OpenAI here is that adding computational resources to the inference phase – referred to as "test-time compute" – improves results. That's good news for Nvidia and the cloud AI providers who want to sell those kinds of compute resources.
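OpenAI hasn't detailed how o1 spends that extra inference compute. One well-known way to get a similar effect – a reasonable mental model, not a description of o1's internals – is self-consistency: sample several independent reasoning chains and majority-vote their final answers. The toy below simulates a solver that's right 60 percent of the time and shows accuracy climbing as more chains (that is, more test-time compute) are spent per question; every number in it is made up for illustration.

```python
# Toy illustration of test-time compute scaling via self-consistency:
# sample several noisy "reasoning chains" per question and majority-vote
# their answers. Not OpenAI's method, just the general idea that more
# inference compute can buy more accuracy.
import random
from collections import Counter

random.seed(0)
CORRECT = 9            # the cafeteria-apples answer
P_RIGHT = 0.6          # chance a single sampled chain lands on it

def sample_answer() -> int:
    """One simulated reasoning chain: right 60% of the time, else off by one."""
    return CORRECT if random.random() < P_RIGHT else CORRECT + random.choice([-1, 1])

def vote(n_samples: int) -> int:
    """Majority vote over n independent chains."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

for n in (1, 5, 25, 125):
    trials = 2000
    accuracy = sum(vote(n) == CORRECT for _ in range(trials)) / trials
    print(f"{n:>3} chains per question -> {accuracy:.1%} accurate")
```

The trend, not the numbers, is the point: spend more compute at inference time and the aggregate answer gets more reliable.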
What it will cost to use the model is another matter. OpenAI does not disclose how much test-time compute was required to approach the 80 percent accuracy figure cited in its "o1 AIME [USA Math Olympiad] accuracy at test time" graph. It could be a significant amount.
Brown claims that o1 can take a few seconds to refine its answer – that's already a potential showstopper for some applications. But he adds that OpenAI foresees its models calculating away for hours, days, or even weeks. "Inference costs will be higher, but what cost would you pay for a new cancer drug?" he asked. "For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots."
The answer to the cost question may be: "How much do you have?"
The reasonableness of reasoning
OpenAI’s docs call its new offerings “reasoning models.”
We asked Daniel Kang, assistant professor in the computer science department at University of Illinois Urbana-Champaign, if that’s a reasonable description.
"'Reasoning' is a semantic thing in my opinion," Kang told The Register. "They are doing test-time scaling, which is roughly similar to what AlphaGo does. I don't know how to adjudicate semantic arguments, but I would anticipate that most people would consider this reasoning."
Citing Brown's remarks, Kang said OpenAI's reinforcement learning approach resembles that used by AlphaGo, which involves trying multiple paths with a reward function to determine which path is the best.
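That "try multiple paths, score them, keep the best" idea can be made concrete with a toy beam search – far simpler than anything AlphaGo or o1 actually does, with a reward function and problem invented purely for illustration: reach a target value starting from 1 using only +3 and *2 steps, keeping the highest-reward partial paths at each step.

```python
# Toy "try several paths, keep the best by reward" search: a beam search,
# much simpler than AlphaGo-style tree search, over an invented problem.
# Goal: reach TARGET starting from 1 using only +3 and *2 steps.
TARGET = 22
BEAM_WIDTH = 3

def reward(value: int) -> int:
    """Higher is better: how close this partial path has gotten to the target."""
    return -abs(TARGET - value)

def expand(path):
    """Branch one partial path into its candidate next steps."""
    value, steps = path
    return [(value + 3, steps + ["+3"]), (value * 2, steps + ["*2"])]

beam = [(1, [])]                                   # a single partial path to start
for _ in range(6):                                 # a fixed 'thinking' budget
    candidates = [p for path in beam for p in expand(path)]
    # Keep only the best-scoring paths, as a reward model would rank them.
    beam = sorted(candidates, key=lambda p: reward(p[0]), reverse=True)[:BEAM_WIDTH]
    if beam[0][0] == TARGET:
        break

value, steps = beam[0]
print("best path: 1", " ".join(steps), "=", value)  # ops applied left to right
```

Swap the hand-written reward for a learned one and the enumerated steps for sampled chain-of-thought text, and you have, very roughly, the shape of the approach Kang is gesturing at.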
Alon Yamin, co-founder and CEO of AI-based text analytics biz Copyleaks, told The Register that o1 represents an approximation of how our brains process complex problems.
"Using these terms is fair to a point, as long as we don't forget that these are analogies and not literal descriptions of what the LLMs are doing," he stressed.
"While it may not fully replicate human reasoning in its entirety, chain of thought enables these models to tackle more complex problems in a way that 'starts' to resemble how we process complex information or challenges as humans.
"No matter the semantics, this release is still a real milestone; it's more than just about LLM solving problems better; it's the first real sign that AI is moving toward something more advanced. And for those of us working in this space, that is exciting because it shows the tech's potential to evolve into a tool that works alongside us rather than for us."
Overthinking it?
Brown cautions that o1 is not always better than GPT-4o. "Many tasks don't need reasoning, and sometimes it's not worth it to wait for an o1 response vs a quick GPT-4o response," he explains. "One motivation for releasing o1-preview is to see what use cases become popular, and where the models need work."
OpenAI asserts that its new model does far better at coding than its predecessors. GitHub – a subsidiary of Microsoft, which has invested heavily in OpenAI – says it has seen improvements when the o1 model is used with its Copilot code assistant. The o1-preview model proved more adept at optimizing the performance of a byte pair encoder in Copilot Chat's tokenizer library, and it found and fixed a bug in minutes where GPT-4o took hours. Access to o1-preview and o1-mini in GitHub Copilot currently requires signing up for Azure AI.
Is it dangerous?
OpenAI's o1 System Card designates the model "Medium" risk for "Persuasion" and "CBRN" (chemical, biological, radiological, and nuclear) using its Preparedness Framework scorecard. GPT-4o also scored "Medium" in the "Persuasion" category but low for CBRN.
The System Card's Natural Sciences Red Teaming Assessment Summary notes that while o1-preview and o1-mini can help experts operationalize plans to reproduce known biological threats (qualifying as "Medium" risk), they don't provide novices with the ability to do so. Hence the models' "inconsistent refusal of requests to synthesize nerve agents" – which could also be written "occasional willingness" – "does not pose significant risk." ®
Bootnote
Simon Willison, whom we trust on AI matters, writes that OpenAI's o1 chain-of-thought models are "not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved 'reasoning' capabilities."