TL;DR
- The gist: OpenAI has released research validating Chain-of-Thought monitoring as a key safety mechanism for GPT-5 Thinking, introducing a “monitorability tax” for safer AI deployment.
- Key details: The company introduces 13 new safety evaluations and finds that smaller models with higher reasoning effort are often more monitorable than larger, low-effort models.
- Why it matters: This validates a behavioral approach to AI safety, offering enterprises a concrete way to trade inference compute for verifiable transparency in high-stakes deployments.
- Context: The findings diverge from the mechanistic interpretability strategies pursued by Google and Anthropic, suggesting current RL methods do not yet induce hidden reasoning.
OpenAI has validated Chain-of-Thought (CoT) monitoring as a primary safety mechanism for its frontier models, explicitly naming “GPT-5 Thinking” as its most monitorable system to date. The research introduces a “monitorability tax,” a necessary trade-off where inference compute is spent on transparency rather than raw capability.
Released alongside a new suite of 13 safety evaluations, the findings suggest that Reinforcement Learning (RL) optimization does not currently induce “steganography,” or hidden reasoning, in model outputs. This contradicts widespread fears that advanced AI might learn to conceal deceptive thoughts from human auditors.
By focusing on behavioral monitoring, the Microsoft-backed lab diverges from competitors like Google DeepMind and Anthropic, who prioritize mechanistic interpretability (mapping internal neurons) to solve the “black box” problem.
Promo
Calculating the Monitorability Tax
By defining safety as a measurable resource rather than an abstract ideal, OpenAI has formalized a new economic principle for AI deployment. Defined as the “Monitorability Tax,” this concept quantifies the additional inference compute required to make a model’s decision-making process transparent to human auditors.
Emerging from a specific trade-off identified in a technical paper, the finding suggests that a smaller model running at a high reasoning effort, generating longer Chain-of-Thought (CoT) sequences, often yields better visibility into potential misbehavior than a larger model running at low effort. A new OpenAI blog post defines this dynamic:
“Notably, we observe a tradeoff between reasoning effort and model size: when a model is run at low reasoning effort, a smaller model run at higher reasoning effort can reach comparable capability while being easier to monitor. This comes at the cost of increased inference compute, which we refer to as a monitorability tax.”
For enterprise customers and safety teams, this presents a concrete operational choice. Organizations may need to accept higher inference costs per token to ensure that an AI agent’s internal logic remains auditable.
Explaining the practical application of this finding, Miles Wang, a researcher at OpenAI, noted that this trade-off provides a clear path for deploying capable models in high-stakes environments:
“We identify an actionable ‘monitorability tax’: given a larger model, you can train a smaller model of equal capabilities and higher monitorability using more inference-time compute.”
Addressing the theoretical fear known as “steganography,” the research investigates whether models might learn to hide deceptive reasoning or encode secret messages within their CoT outputs. Previous studies suggested that Reinforcement Learning (RL) could inadvertently train models to compress their thoughts, making them illegible to humans.
Contrary to these concerns, OpenAI’s monitorability research suggests that the drive for efficiency has not yet forced models to decouple their stated logic from their actual computations. The team wrote, “Reinforcement learning optimization at current frontier scales does not materially degrade chain-of-thought monitorability.”
However, the related paper cautions that this stability is not guaranteed indefinitely. As models scale further, the pressure to be efficient might eventually conflict with the need to be legible, potentially requiring new interventions to preserve this visibility.
Inside the 13-Point Evaluation Suite
To standardize how these properties are measured, OpenAI presents results from a benchmark suite comprising 13 specific evaluations across 24 environments. These tests are designed to stress-test model honesty and detect specific failure modes like reward hacking or sycophancy.
Methodologically, the team divides these tests into three distinct categories, each targeting a different aspect of model behavior. According to Wang’s summary of the methodology:
“We introduce 3 eval archetypes, a metric, and a broad suite of 13 evals. Intervention evaluations… apply a controlled change in the agent’s environment and test whether a monitor can detect why behavior changed. Process evaluations focus on tasks with a limited number of valid solution paths… we test whether a monitor can detect which steps the model actually took. Outcome-property evaluations apply when a property of the model’s output can be reliably measured.”
In the “Intervention” category, researchers might provide a model with a hint that is known to be incorrect to see if the model’s reasoning changes. If the model adopts the bad hint but claims to have solved the problem independently, the monitor flags the discrepancy.
Benchmarking revealed measurable variance in performance across the models tested. The study compared the safety visibility of the company’s newest flagship against its predecessors, including the OpenAI o3 and o3-mini series. Wang noted, “GPT-5 is the most monitorable model we studied. And monitoring the CoT is much better than just actions!”
Quantifying the value of monitoring thoughts versus actions, the study found that in experiments involving code generation and tool use, monitoring the CoT provided a stronger signal for detecting misbehavior than examining the final output or tool calls alone.
To further enhance this visibility, the researchers experimented with a “follow-up” technique. By programming the system to ask the model reflective questions after a task, such as “Did you do something sketchy?”, auditors could extract reasoning that was initially unverbalized.
Summarizing the broader implications of these tests, the research team emphasized that current methods are effective but imperfect, stating, “We find that most frontier reasoning models are fairly monitorable, though not perfectly so.”
The Behavioral Bet: Reading Minds vs. Mapping Neurons
While Google and Anthropic attempt to map the neural circuitry of their models, OpenAI is doubling down on reading the “mind” of the model through its output. This marks a strategic divergence in the industry’s approach to the “black box” problem.
Google DeepMind just released Gemma Scope 2, a suite of tools trained on 110 petabytes of data to trace the internal circuits of the Gemma 3 model. Similarly, Anthropic’s interpretability framework from earlier this year focuses on reverse-engineering the neural activations that correspond to specific concepts or behaviors.
Rather than identifying which specific neurons fired, OpenAI’s approach focuses on the psychology of the output: predicting future actions based on the stated reasoning. As the team explained, “To preserve monitorability, we must be able to measure it.”
Despite its utility, this behavioral approach has distinct limitations. Critics argue that a model’s Chain-of-Thought is a performance, not a direct window into its processing, raising questions about the authenticity of the output.
This skepticism centers on the potential disconnect between a model’s stated reasoning and its actual computational process.
The concern is that the Chain-of-Thought might function merely as a performative layer, akin to a carefully sanitized public statement, while the model’s true decision-making logic remains hidden within its neural activations. In this scenario, the visible reasoning serves to appease human monitors, potentially masking deceptive or misaligned objectives that exist only in the model’s “subconscious” parameters.
The reliability of this “public statement” is also highly volatile, with the model’s internal state proving susceptible to external influence.
The research found that minor changes in input phrasing could significantly alter the model’s expressed reasoning and its final decision. Muratcan Koylan, AI Agent Systems Manager at 99ravens.ai, observed that “the paper proves how sensitive models are to prompt framing. They found that simple additions to prompts reliably change model behavior.”
Incredible paper. Quick prompt engineering findings:
Models are becoming more capable and autonomous, so we’re losing the ability to directly supervise every decision.
Prompt engineering is not just about the initial instruction, but about the iterative extraction of a model’s… https://t.co/XckHjyAUUS pic.twitter.com/3V3sUBcEr0
— Muratcan Koylan (@koylanai) December 19, 2025
Such volatility highlights the need for a “defense in depth” strategy. While behavioral monitoring offers immediate utility for catching reward hacking and deception, it may need to be paired with the deeper, mechanistic insights pursued by competitors to ensure safety as capabilities advance.

