a builder's codex

Appoint one trusted-taste expert as the eval benevolent dictator; committees stall the loop

By Hamel Husain · Independent ML consultant and Berkeley PhD researcher · 2026-04-28 · podcast · Evals as error analysis, the benevolent dictator, LLM judges

Tier B · TL;DR
Appoint one trusted-taste expert as the eval benevolent dictator; committees stall the loop

Claim

For LLM eval work, appoint a single person whose taste the team trusts as the benevolent dictator on what counts as a failure. A committee that gets bogged down debating the rubric rarely ships an eval; a single trusted arbiter ships one, and the rubric tightens through use.

Mechanism

Open coding is judgment work. A committee needs consensus on the definition of "wrong" before any traces get coded, and that conversation rarely converges before the team loses momentum. A single arbiter encodes a coherent taste and writes it down. The rubric then evolves through actual reviews, not through abstract debate. Domain experts, often the product manager, make the best arbiters because they hold both the user perspective and the product reality.
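
To make the "writes it down, then evolves through reviews" loop concrete, here is a minimal sketch of a rubric owned by one arbiter. It is an illustration, not something described in the podcast: the `Rubric`, `Review`, and `code_trace` names, the trace IDs, and the failure-mode labels are all hypothetical. The point it shows is that a new failure mode enters the rubric the first time the arbiter meets it in a real trace, with no vote required.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Rubric:
    """Written-down failure definitions owned by one arbiter (hypothetical structure)."""
    arbiter: str
    failure_modes: Dict[str, str] = field(default_factory=dict)  # name -> definition
    version: int = 0

    def add_failure_mode(self, name: str, definition: str) -> None:
        # The arbiter decides alone; every change bumps the version.
        self.failure_modes[name] = definition
        self.version += 1


@dataclass
class Review:
    trace_id: str
    passed: bool
    failure_mode: Optional[str] = None  # None when the trace passed


def code_trace(rubric: Rubric, trace_id: str, passed: bool,
               failure_mode: Optional[str] = None,
               definition: Optional[str] = None) -> Review:
    """Record one judgment; a failure mode is defined on first use."""
    if not passed:
        if failure_mode is None:
            raise ValueError("a failing trace needs a failure mode")
        if failure_mode not in rubric.failure_modes:
            if definition is None:
                raise ValueError("a new failure mode needs a written definition")
            rubric.add_failure_mode(failure_mode, definition)
    return Review(trace_id, passed, failure_mode)


# Illustrative data only: four made-up traces coded by a PM acting as arbiter.
rubric = Rubric(arbiter="product manager")
reviews = [
    code_trace(rubric, "t1", True),
    code_trace(rubric, "t2", False, "ignored_user_constraint",
               "Response ignores a constraint the user stated."),
    code_trace(rubric, "t3", False, "ignored_user_constraint"),
    code_trace(rubric, "t4", False, "made_up_policy",
               "Response cites a policy that does not exist."),
]

# Tally failures so the most common modes surface first.
failure_counts = Counter(r.failure_mode for r in reviews if not r.passed)
```

In this sketch the rubric version only advances when a review surfaces something new, which mirrors the claim that the rubric tightens through use and not through up-front debate.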

Conditions

Holds when:

Fails when:

Evidence

"When you're doing this open coding, a lot of teams get bogged down in having a committee... You can appoint one person whose taste that you trust."

Hamel and Shreya prefer the product manager for the role because PMs hold both user and product context.

· Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

Signals

Counter-evidence

For high-stakes regulated domains, single-person arbitration may not be acceptable. Multiple-arbiter calibration (with disagreement tracked and resolved) is the safer path there.
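
One way to track disagreement between two arbiters is sketched below. The source does not name a metric; Cohen's kappa is used here as an assumption because it is a common chance-corrected agreement measure. The function names, trace IDs, and labels are hypothetical and the data is made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of traces")
    n = len(labels_a)
    # Fraction of traces where the two arbiters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each arbiter's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both arbiters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def disagreements(trace_ids: Sequence[str], labels_a: Sequence[str],
                  labels_b: Sequence[str]) -> List[str]:
    """Trace IDs the two arbiters labeled differently, queued for resolution."""
    return [t for t, a, b in zip(trace_ids, labels_a, labels_b) if a != b]


# Illustrative data only: six made-up traces labeled by two arbiters.
trace_ids = ["t1", "t2", "t3", "t4", "t5", "t6"]
arbiter_a = ["pass", "fail", "fail", "pass", "fail", "pass"]
arbiter_b = ["pass", "fail", "pass", "pass", "fail", "fail"]

kappa = cohen_kappa(arbiter_a, arbiter_b)
to_resolve = disagreements(trace_ids, arbiter_a, arbiter_b)
```

The list of disagreeing traces matters as much as the score: those are the cases the arbiters would sit down and resolve, and each resolution can be written back into the rubric.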

Cross-references
