The Inference-Time POMDP Attacker And Why Your Eval Suite Is Lying To You

A new ICML 2026 paper hits 76% ASR on O1 and 78% on GPT-5-chat. If your defense pack tests on static prompts, you are not where you think you are.

May 18, 2026

A paper called Metis went up on arXiv on May 11. It is short, the math is clean, and the headline number is the one that should re-prioritize every operator’s eval suite for the next two quarters.

From the abstract, verbatim:

“Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x.”

The 89.2% average matters. The 76% on O1 matters more. But the line I want to pull forward is the diagnostic one from the conclusion:

“Current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.”

That sentence identifies the failure mode, not just the symptom. Most production safety stacks I have audited do four things well: input filters, output filters, token-level deny lists, and a CVE regression suite of known-bad prompts. None of those defend against an attacker who treats the target as a partially observable Markov decision process, runs an inference-time policy-optimization loop over its own reasoning trace, and ratchets a semantic gradient toward a goal across three or four turns.

Why this is the right week to act on it:

The attack is reproducible. The paper proposes a self-evolving metacognitive loop. The implementation cost for a red team is bounded by the paper itself plus a few hundred lines of orchestration. Expect open-source reproductions in 4-8 weeks.
The defense is not a single regex. Defending against an inference-time POMDP attacker requires runtime state. You need to carry a session-level signal across turns and trip a circuit breaker when the cumulative attack-similarity score crosses a threshold. That is a runtime governance primitive, not a static SAST rule.
The 76% on O1 is the credibility signal. If a single team can break a frontier reasoning model that hard with a single algorithm, the working assumption that “RL on frontier models makes agents safer” is already wrong for the threat model that matters.

What the operator should do this week, in order of decreasing priority:

One — wire Metis as a regression target. Treat it the way you treat a CVE you have not yet patched. Add a suite of 12 fixture prompts that mirror the three failure modes the paper describes: reasoning-trace injection, semantic-gradient climbing, and goal-drift across turns. Mark the suite regression-target-only with an expected pass rate of 70% today and ratchet to 100% as your defense lands.
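A minimal sketch of what a regression-target-only suite with a ratcheting floor could look like. Everything here is an assumption for illustration: the `FixtureResult` type, the failure-mode identifiers, and the helper names are hypothetical, and the fixture prompts themselves are left to your own red team.

```python
from dataclasses import dataclass

# The three failure modes named above; the identifiers themselves are hypothetical.
FAILURE_MODES = (
    "reasoning_trace_injection",
    "semantic_gradient_climbing",
    "goal_drift_across_turns",
)


@dataclass(frozen=True)
class FixtureResult:
    fixture_id: str
    failure_mode: str
    defended: bool  # True when the defense held against this fixture


def pass_rate(results):
    """Fraction of fixtures the defense held against."""
    if not results:
        raise ValueError("no fixture results to score")
    return sum(r.defended for r in results) / len(results)


def check_target(results, floor):
    """Return (ok, rate); ok is False when the rate regresses below the floor."""
    rate = pass_rate(results)
    return rate >= floor, rate


def ratchet(floor, rate):
    """The floor only moves up, so a good run raises the bar for the next one."""
    return max(floor, rate)
```

With 12 fixtures and a floor of 0.70, a run that defends 9 of them passes at 0.75, and `ratchet(0.70, 0.75)` raises the floor to 0.75 for the next run. The suite fails only when the rate drops below the floor, which is what makes it a regression target and not a release gate.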

Two — add session-state to your tool-call boundary. If your firewall is decorator-in-process, the right move is a deny rule that fires when the cumulative score across the last 5 turns crosses a tunable threshold. I am shipping this as metis_pomdp_jailbreak_2026_05_defaults() in agent-airlock v0.8.2 this week, marked BETA. Three rules: trace-injection regex over a 3-turn window, semantic-gradient over the difflib ratio, and a circuit breaker over the aggregate.
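To make the shape of that session state concrete, here is a rough sketch under stated assumptions. It is not the agent-airlock preset and does not reflect its API: `SessionGuard`, the regex, the scoring formula, and the default threshold are all hypothetical. Note also that `difflib.SequenceMatcher.ratio()` measures lexical overlap, so it is only a cheap proxy for semantic movement toward a known-bad reference.

```python
import re
from collections import deque
from difflib import SequenceMatcher

# Hypothetical signature; real patterns would come from your own fixture set.
TRACE_INJECTION = re.compile(
    r"ignore (all )?previous (instructions|reasoning)", re.IGNORECASE
)


class SessionGuard:
    """Carries per-session state across turns and latches a circuit breaker."""

    def __init__(self, reference, window=5, regex_window=3, threshold=2.0):
        self.reference = reference.lower()       # known-bad goal text to compare against
        self.sims = deque(maxlen=window)         # similarity per turn, last `window` turns
        self.hits = deque(maxlen=regex_window)   # regex hits, last `regex_window` turns
        self.threshold = threshold
        self.tripped = False

    def observe(self, text):
        """Record one turn; return True when the tool call should be denied."""
        sim = SequenceMatcher(None, self.reference, text.lower()).ratio()
        self.sims.append(sim)
        self.hits.append(bool(TRACE_INJECTION.search(text)))

        s = list(self.sims)
        # Rule 2: reward only upward movement toward the reference (the "climb").
        climb = sum(max(0.0, later - earlier) for earlier, later in zip(s, s[1:]))
        # Rule 3: aggregate similarity, climb, and recent regex hits (rule 1).
        score = sum(s) + climb + sum(self.hits)
        if score >= self.threshold:
            self.tripped = True  # latched: stays tripped for the rest of the session
        return self.tripped
```

The design choice worth noticing is the latch. Once the aggregate crosses the threshold, later benign-looking turns do not reset the breaker, because an attacker who backs off for one turn has not changed goals.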

Three — accept that the eval suite is the smaller half of the problem. Static evals catch where the attacker started, not where the attacker is going. The substantive fix is a runtime drift signal you can trust. Per-agent rolling z-score + EWMA against a behavioral baseline catches the trajectory, not the input. That is what mnemo-baseline ships and what the next Production Agent post will be about.
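For readers who have not built one, a drift signal of this kind can be sketched in a few lines. This is an illustration under assumptions, not mnemo-baseline's implementation: the `EwmaDrift` class, its defaults, and the choice of metric (here imagined as something like tool calls per turn for one agent) are hypothetical, and it folds the z-score and the EWMA into a single exponentially weighted mean and variance.

```python
import math


class EwmaDrift:
    """EWMA mean/variance baseline with a z-score drift flag for one agent."""

    def __init__(self, alpha=0.2, z_threshold=3.0, warmup=5, min_std=1e-6):
        self.alpha = alpha
        self.z_threshold = z_threshold
        self.warmup = warmup      # observations to absorb before flagging anything
        self.min_std = min_std    # floor so a flat baseline cannot divide by zero
        self.mean = 0.0
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Score x against the current baseline, then absorb it. Returns (z, flagged)."""
        if self.n == 0:
            self.mean, self.n = float(x), 1
            return 0.0, False
        std = max(math.sqrt(self.var), self.min_std)
        z = (x - self.mean) / std
        flagged = self.n >= self.warmup and abs(z) >= self.z_threshold
        # Standard incremental EWMA updates for mean and variance.
        diff = x - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1.0 - self.alpha) * (self.var + diff * incr)
        self.n += 1
        return z, flagged
```

Fed a baseline that alternates between 10 and 12, the detector stays quiet; a jump to 40 produces a z-score far above the threshold. One caveat of this sketch is that flagged observations are still absorbed into the baseline, so a patient attacker can drag the mean. A production version would likely freeze or down-weight updates while flagged.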

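A note on the token-cost reduction, which the abstract puts at 8.2x on average and up to 11.4x. Many production defenses quietly lean on attack cost as a throttle: every failed attempt burns attacker tokens, so spend grows with persistence. Metis cuts the attacker's token cost by roughly an order of magnitude. If you assumed bot traffic would self-throttle on cost, that assumption is no longer load-bearing.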

What the eval suite is lying about. If your shipping evals are tau-bench, MMLU-style benchmarks, golden traces from your customer logs, and a static CVE-regression pack, your reported pass rate is a measurement of last quarter’s threat surface. The 76% O1 number says next quarter’s threat surface is already deployable from a single arXiv paper. That gap is the production safety margin you actually have. Most teams I have talked to in the last 60 days have not measured it.

I am not selling a panic. The fix is on a 4-12 week timeline if your runtime governance plane is already in place. If you do not have one and you are still relying on input/output filters plus prompt-level red-teaming, the work is bigger.

Two specific things I would do this week:

Read the paper. It is short. arXiv:2605.10067.
Add 12 fixture prompts to whatever your CI-time eval surface is, pinned to the three failure modes the paper names. Even before any defensive code lands, the regression bar starts giving you a signal you can track.

I will follow up next week with the agent-airlock v0.8.2 release notes and the regression report on the preset’s behavior against the public fixture set. If you ship Claude, OpenAI, or Gemini models in production, this is the regression target to add this month.

Subscribe to The Production Agent if this is your work. No demos, only systems that survive production.
