Claim
Deploy a separate grader agent in its own context window to evaluate whether output meets a defined success rubric. The grader runs independently of the generator and has no access to the generator's reasoning chain, only the output.
Mechanism
When a grader shares context with the generator, it inherits the generator's blind spots. A separate context window forces independent evaluation against the rubric; the rubric, not the generator, determines pass or fail. This closes the feedback loop: the generator produces, the grader measures, and the delta drives improvement without manual review hours scaling linearly with output volume.
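A minimal sketch of the split, assuming the Anthropic Python SDK as the substrate; the model name, rubric text, and prompts are illustrative, and this is not the Managed Agents / Outcomes API itself. The key property is that the grader call starts a fresh conversation containing only the rubric and the output.

```python
# Sketch: generator and grader run as two separate conversations.
import anthropic

client = anthropic.Anthropic()

RUBRIC = """Score the draft 1-5 on each criterion and answer PASS or FAIL overall:
1. Every claim is supported by the provided source material.
2. The summary is under 200 words.
3. No first-person language."""

def generate(task: str) -> str:
    # Generator call: its reasoning stays inside this conversation.
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text

def grade(output: str) -> str:
    # Grader call: a fresh context window that sees only the rubric and
    # the output, never the generator's reasoning chain.
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\n<draft>\n{output}\n</draft>",
        }],
    )
    return resp.content[0].text

draft = generate("Summarize the attached incident report for executives.")
verdict = grade(draft)  # the pass/fail verdict and per-criterion scores drive the next iteration
```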
Conditions
Holds when: the task has a defined, measurable success rubric. Output format is consistent enough for the grader to evaluate. The grader can access necessary ground truth or reference material.
Fails when: success criteria are vague or subjective. The grader and generator share overlapping context that smuggles in confirmation bias. The rubric itself is wrong.
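A sketch of what "defined, measurable" can look like in practice; the field names and criteria below are made up, not the product's rubric format. Criteria that can be checked in code are marked as such, and the rest become explicit yes/no questions for the grader, which is what keeps the rubric out of vague-or-subjective territory.

```python
# Illustrative rubric structure: programmatic checks where possible,
# explicit model-judged questions for the rest. Names are hypothetical.
import json

rubric = {
    "format": "docx report with an executive summary section",
    "checks": [
        {"id": "word_count", "type": "programmatic", "rule": "summary is 200 words or fewer"},
        {"id": "citations", "type": "programmatic", "rule": "every figure cites a source"},
        {"id": "tone", "type": "model_judged", "question": "Is the tone neutral and third-person?"},
    ],
    "pass_threshold": "all programmatic checks pass and tone is judged yes",
}

print(json.dumps(rubric, indent=2))  # this serialized rubric is what the grader receives
```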
Evidence
Announced at Code with Claude, May 6, 2026, as part of the Claude Managed Agents public beta. Internal testing showed a +8.4% improvement on docx file generation and +10.1% on pptx file generation after the Outcomes feature was added.
"Agents do their best work when they know what 'good' looks like."
Signals
- Output quality scores improve week-over-week without manual review hours scaling proportionally
- The grader catches the same error class repeatedly, pointing to a generator training target
- Task success rate and human-rated quality converge over time (grader is calibrated)
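One rough way to watch the convergence signal, assuming a weekly sample of outputs scored by both the grader and a human rater on the same 1-5 scale; the field names and sample values are hypothetical.

```python
# Calibration check: track the mean absolute gap between grader and human
# scores week over week. A shrinking gap means the grader is calibrating.
from statistics import mean

def weekly_gap(samples: list[dict]) -> float:
    """Mean absolute gap between grader and human scores for one week's sample."""
    return mean(abs(s["grader_score"] - s["human_score"]) for s in samples)

weeks = {  # hypothetical logged samples
    "W1": [{"grader_score": 4, "human_score": 2}, {"grader_score": 5, "human_score": 3}],
    "W2": [{"grader_score": 4, "human_score": 4}, {"grader_score": 3, "human_score": 3}],
}

for week, samples in sorted(weeks.items()):
    print(week, round(weekly_gap(samples), 2))
```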
Counter-evidence
For tasks without a verifiable rubric, adding a grader adds latency and cost with no quality signal. The grader itself can be miscalibrated if the rubric is underspecified.
Cross-references
- A trace alone teaches nothing; learning requires feedback attached to the trace (Harrison Chase): traces without outcome feedback are incomplete raw material
- Error analysis is the most-skipped step in AI evals and gives the most leverage per hour invested (Hamel Husain): the error classes the grader surfaces repeatedly are the raw material for that analysis
- Scheduled cross-session transcript reading extracts patterns and proposes memory updates for human review before they land (Anthropic): paired feature — Outcomes closes the output loop, Dreaming closes the memory loop