OpenAI's o-series models don't just predict the next word. They reason their way to an answer — and a new paper explains how, in terms that actually matter.
When o3 scored 87.5% on the ARC-AGI benchmark in its high-compute configuration, on a test designed to require genuine reasoning rather than pattern matching, the AI research community noticed. Not because of the number, but because of the approach: o3 was spending compute at test time to "think through" problems, rather than just retrieving a pre-learned answer.
A recent technical report from OpenAI (with key co-authors from the earlier chain-of-thought scaling work) describes what's really happening. Here's what matters.
What chain-of-thought actually is
Most language models answer directly. You ask a question and they start producing the answer immediately, with tokens appearing one by one, each conditioned on everything before it.
Chain-of-thought changes this. Before producing the final answer, the model generates intermediate reasoning steps. Not as decoration — as computation. Each step feeds into the next, and the model learns to use this reasoning space to handle multi-step problems it couldn't solve in a single pass.
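As a toy illustration of intermediate steps acting as computation, the sketch below solves a two-leg average-speed problem by writing each intermediate result to a scratchpad before answering. It is ordinary Python, not a model, and the function name `average_speed_with_steps` and the problem itself are made up for this example. The only point is that the final line depends on values produced by the earlier lines.

```python
def average_speed_with_steps(legs):
    """Return (answer, steps) for a list of (speed_kmh, hours) legs.

    Toy stand-in for a reasoning trace: each step is recorded, and later
    steps reuse the results of earlier ones.
    """
    steps = []

    # Step 1: distance covered on each leg.
    distances = [speed * hours for speed, hours in legs]
    steps.append(f"Distances per leg: {distances}")

    # Step 2: total distance, built from step 1.
    total_distance = sum(distances)
    steps.append(f"Total distance: {total_distance} km")

    # Step 3: total time.
    total_hours = sum(hours for _, hours in legs)
    steps.append(f"Total time: {total_hours} h")

    # Final answer, built from steps 2 and 3.
    answer = total_distance / total_hours
    steps.append(f"Average speed: {answer} km/h")
    return answer, steps


answer, steps = average_speed_with_steps([(60, 2), (90, 3)])
for line in steps:
    print(line)
```

A one-shot guess such as averaging the two speeds gives 75 km/h, which is wrong. The intermediate steps are what make the correct 78 km/h reachable.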
The paper describes this as a form of extended computation: when you give the model more "thinking tokens," it can handle harder problems. That is not because it has been trained on harder problems, but because it can spend compute at inference time to work through complexity it could not compress into a single pass.
Why this matters for how we think about intelligence
The old frame: models learn patterns, retrieve them at test time. Scale up the patterns, scale up the performance.
The new frame (from the o-series work): models learn to reason, and you can allocate more reasoning compute to harder problems. This isn't memorization — it's something closer to problem-solving.
The distinction matters practically. If o3 is just pattern-matching on steroids, more scale alone should eventually close the gap between it and other models. But if it's reasoning, then capability gaps might persist longer, and architectural choices (how you train the reasoning process itself) might matter as much as raw model size.
What the scaling data shows
The paper presents clean data showing that, for hard problems, compute spent on reasoning scales better than compute spent on model size. In other words: a smaller model that thinks harder often outperforms a bigger model that thinks less.
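To make the trade-off concrete, here is a toy simulation. It does not reproduce the paper's data or method: it swaps in majority voting over repeated samples, a much simpler way of spending extra inference compute than learned reasoning tokens. The per-attempt accuracies (0.6 for the "big" model, 0.4 for the "small" one), the sample counts, and every name in the code are assumptions chosen for illustration.

```python
import random
from collections import Counter

CORRECT = 42
WRONG = [7, 13, 99, 100]  # arbitrary wrong answers for the toy


def sample_answer(rng, p_correct):
    """One attempt by a toy 'model' that is right with probability p_correct."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(WRONG)


def majority_vote(rng, p_correct, n_samples):
    """Spend more inference compute: draw n_samples attempts, keep the most common."""
    votes = Counter(sample_answer(rng, p_correct) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(p_correct, n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(rng, p_correct, n_samples) == CORRECT for _ in range(trials)
    )
    return hits / trials


big_fast = accuracy(p_correct=0.6, n_samples=1)     # "bigger model, thinks less"
small_slow = accuracy(p_correct=0.4, n_samples=25)  # "smaller model, thinks harder"
print(f"big model, 1 sample:    {big_fast:.2f}")
print(f"small model, 25 samples: {small_slow:.2f}")
```

In this toy setup the weaker sampler overtakes the stronger one once it is allowed enough attempts, because its errors are spread across several wrong answers while its correct answers pile up on one. Real models do not behave this cleanly, so treat it as intuition for the shape of the claim, not evidence for it.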
This has real implications for anyone building with these models. If you can give users a "think harder" dial, you can get better results on hard tasks without needing to upgrade the underlying model. The API implications of this are significant — it suggests inference-time compute allocation will become a first-class parameter, like temperature or max tokens.
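What a "think harder" dial could look like from the caller's side is sketched below. Everything here is hypothetical: the field name `reasoning_budget_tokens`, the effort levels, and the token numbers are invented for illustration and do not describe any particular vendor's API. The sketch only shows the reasoning budget sitting alongside temperature and max tokens as an ordinary request parameter.

```python
from dataclasses import dataclass, asdict

# Hypothetical effort levels mapped to reasoning-token budgets.
# The names and numbers are illustrative, not taken from any real API.
EFFORT_BUDGETS = {"low": 1_000, "medium": 8_000, "high": 32_000}


@dataclass
class GenerationRequest:
    prompt: str
    temperature: float = 1.0
    max_output_tokens: int = 1_024
    reasoning_budget_tokens: int = EFFORT_BUDGETS["medium"]


def build_request(prompt, effort="medium", **overrides):
    """Build a request dict with the 'think harder' dial set by name."""
    if effort not in EFFORT_BUDGETS:
        raise ValueError(
            f"unknown effort {effort!r}; expected one of {sorted(EFFORT_BUDGETS)}"
        )
    request = GenerationRequest(
        prompt=prompt,
        reasoning_budget_tokens=EFFORT_BUDGETS[effort],
        **overrides,
    )
    return asdict(request)


easy = build_request("What is 2 + 2?", effort="low")
hard = build_request("Prove that sqrt(2) is irrational.", effort="high", temperature=0.2)
print(easy["reasoning_budget_tokens"], hard["reasoning_budget_tokens"])
```

Exposing the dial as a named level rather than a raw token count is a design choice: it lets the provider retune the underlying budgets without breaking callers.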
What's still unresolved
The paper is honest about the gaps. Chain-of-thought reasoning improves performance on structured problems — math, logic, code — but the gains on fuzzy, open-ended tasks are smaller. The model's reasoning is only as reliable as its world model. And the question of whether "thinking longer" introduces new failure modes (confident wrong answers that look well-reasoned) remains open.
The honest summary: this is a real capability step, not a benchmark trick. The reasoning process is doing something — and understanding it better is worth your time.