Appoint one trusted-taste expert as the eval benevolent dictator; committees stall the loop

Claim

For LLM eval work, appoint a single person whose taste the team trusts as the benevolent dictator on what counts as a failure. A committee bogged down debating the rubric never ships an eval; one trusted taste arbiter ships and the rubric tightens through use.

Mechanism

Open coding is judgment work. A committee needs consensus on the definition of "wrong" before any traces get coded, and that conversation rarely converges before the team loses momentum. A single arbiter encodes a coherent taste and writes it down. The rubric then evolves through actual reviews, not through abstract debate. Domain experts, often the product manager, make the best arbiters because they hold both the user perspective and the product reality.

Conditions

Holds when:

The arbiter has both domain expertise and team trust.
The team accepts that the rubric will drift, intentionally, as more traces are reviewed.
The arbiter is willing to be visibly wrong sometimes and revise.

Fails when:

The arbiter has authority but not taste (political appointment).
The team cannot live with one person's judgment and re-litigates every call.
Multiple stakeholders have legitimate but incompatible quality bars (competing customers, regulators).

Evidence

"When you're doing this open coding, a lot of teams get bogged down in having a committee... You can appoint one person whose taste that you trust."

Hamel and Shreya prefer the product manager for the role because PMs hold both user and product context.

· Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

Signals

Trace reviews ship weekly with one named owner.
Rubric documents are updated by the arbiter in flight, not in committee meetings.
The team stops debating the rubric and starts debating specific traces.
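
The first signal can be checked mechanically from a review log. The following is a minimal sketch, assuming a hypothetical log of (date, owner) pairs; the function name, log shape, and owner names are illustrative and not from the source. It reports who owned the reviews and which calendar weeks passed without one.

```python
from datetime import date, timedelta


def review_cadence(reviews):
    """Summarize a review log given as a list of (date, owner) tuples.

    Returns (owners, missed_weeks), where missed_weeks lists the Monday of
    every week between the first and last review that had no review at all.
    """
    if not reviews:
        return set(), []
    owners = {owner for _, owner in reviews}
    # Normalize each review date to the Monday of its week.
    weeks = {d - timedelta(days=d.weekday()) for d, _ in reviews}
    missed = []
    week = min(weeks)
    while week <= max(weeks):
        if week not in weeks:
            missed.append(week)
        week += timedelta(weeks=1)
    return owners, missed


# Hypothetical log: three reviews, all by the same arbiter.
log = [
    (date(2025, 3, 3), "pm_alex"),
    (date(2025, 3, 12), "pm_alex"),
    (date(2025, 3, 27), "pm_alex"),
]
owners, missed = review_cadence(log)
```

A single entry in owners and an empty missed list would match the signal; more than one owner, or gaps in the weekly cadence, suggests the loop is drifting back toward committee behavior.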

Counter-evidence

For high-stakes regulated domains, single-person arbitration may not be acceptable. Multiple-arbiter calibration (with disagreement tracked and resolved) is the safer path there.
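
One way to track disagreement between two arbiters is Cohen's kappa, a standard chance-corrected agreement statistic. The source does not name a specific metric, so treat this as a sketch under that assumption; the trace IDs, labels, and function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labeled the same traces."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of traces where the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each reviewer's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(trace_ids, labels_a, labels_b):
    """Trace IDs the two reviewers labeled differently, for resolution."""
    return [t for t, a, b in zip(trace_ids, labels_a, labels_b) if a != b]


# Hypothetical calibration round on six traces.
trace_ids = ["t1", "t2", "t3", "t4", "t5", "t6"]
reviewer_a = ["pass", "fail", "fail", "pass", "pass", "fail"]
reviewer_b = ["pass", "fail", "pass", "pass", "pass", "fail"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
to_resolve = disagreements(trace_ids, reviewer_a, reviewer_b)
```

The kappa value gives a trend to watch across calibration rounds, while the to_resolve list is what the arbiters actually discuss: specific traces rather than the rubric in the abstract.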

Cross-references

Sample 100+ traces, write one free-form note per trace, let an LLM cluster the notes, humans first, machines second (the workflow the arbiter runs)
When the decision-maker is unclear, you are it, be a force for positive momentum (Claire Hughes Johnson's structurally similar rule for cross-functional decisions)