# One dispatch, 160 files classified, the rubric was the bug

> Dispatched an Explore agent to classify 160 wiki entries. The rubric the agent invented mapped to a phrase that appears in 142 of those files.

**Canonical URL**: https://agentcookbooks.com/blog/one-dispatch-160-files-classified-the-rubric-was-the-bug/

**Published**: 2026-05-12

**Tags**: claude-code, skills, subagents

---

The audit asked a simple-sounding question: across 160 skill pages on the wiki, which `## Receipts` sections are firsthand, which are placeholder TODOs, and which are plausible-generic? Reading all 160 sequentially in the main context would have burned around 150K tokens, enough to push the conversation toward compaction before the audit produced anything useful. Dispatching a single `Explore` agent saved roughly 80K of that.

The token win was real. The classification was the problem. The prompt told the agent to classify by *evidence* (firsthand session notes, dated activity, named files), but didn't enumerate what counted as evidence. The agent filled the gap with a rubric of its own and reported back with confidence: it cited a single stock phrase ("**Pattern that works:**") as the marker for "firsthand" in five entries, yet that phrase appears in 142 of the 160 files. The number it returned was technically correct against its own rubric; the rubric wasn't what the audit asked for.

The lesson saved afterward: when you delegate a semantic classification, prescribe the evidence, not the verdict.

## What I ran

Just one dispatch. An `Explore` agent with a tight prompt: read only the `## Receipts` section per file (grep for the header, read the next 30–50 lines), classify each as firsthand / TODO / generic with one-line evidence, tally against expected import waves so the agent could cross-check itself.
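The "grep for the header, read the next 30–50 lines" step can be sketched in a few lines of Python. This is an illustration of the instruction, not the agent's actual tooling; the function name `extract_section` and the 50-line default are assumptions.

```python
def extract_section(markdown: str, header: str = "## Receipts",
                    max_lines: int = 50) -> list[str]:
    """Return up to max_lines lines that follow the first exact header line.

    Hypothetical helper: it mirrors "find the header, read the next N lines"
    and deliberately does not try to detect where the section ends.
    """
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == header:
            return lines[i + 1 : i + 1 + max_lines]
    return []  # header not present in this file


if __name__ == "__main__":
    page = "# Some skill\n\nIntro text.\n\n## Receipts\n\nfirst note\nsecond note\n"
    for line in extract_section(page):
        print(line)
```

Capping the read at a fixed window is what keeps the per-file cost bounded, at the price of truncating unusually long sections.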

The expected counts were knowable from project memory:

- Haines marketing skills: 40 (bulk-imported with plausible-generic receipts by design)
- obra superpowers: 5 (TODO Receipts by policy)
- agency-agents personas: 8 (TODO by policy)
- Osmani agent-skills: 4 (TODO by policy)
- ppt-master: 1 (TODO by policy)

So the prompt could ask the agent to confirm the TODO bucket landed at 18. That's a greppable cross-check the agent could run on itself — if its TODO count came back off, something was wrong before it ever returned.
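As a minimal sketch, the expected TODO tally can be derived from the import waves listed above rather than typed in by hand. The wave names and counts come from the list; the dictionary layout and function names are hypothetical.

```python
# Import waves and the Receipts state each one was imported with.
WAVES = {
    "Haines marketing skills": (40, "generic"),
    "obra superpowers": (5, "TODO"),
    "agency-agents personas": (8, "TODO"),
    "Osmani agent-skills": (4, "TODO"),
    "ppt-master": (1, "TODO"),
}


def expected_count(waves: dict, bucket: str) -> int:
    """Sum the sizes of all waves imported into the given bucket."""
    return sum(size for size, state in waves.values() if state == bucket)


def cross_check(reported: int, waves: dict, bucket: str = "TODO") -> tuple[bool, int]:
    """Compare a reported tally with the expectation; return (ok, expected)."""
    expected = expected_count(waves, bucket)
    return reported == expected, expected


if __name__ == "__main__":
    print(cross_check(18, WAVES))
```

Only the TODO bucket is pinned down this way. The 40 Haines entries give a floor for the generic bucket, not an exact target.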

## What happened

The agent returned a 160-row table plus a summary. Bucket counts:

- TODO = **18** ✓ (matched expectation exactly — the 4 Osmani + 8 personas + 5 obra + 1 ppt-master)
- firsthand = **5**
- generic = **137** (residual)

Wall time: about 3 minutes. Tokens: roughly 70K in the agent's context (vs. an estimated 150K if read sequentially in the main session). Result, on the face of it, useful.

The TODO count being mechanically verifiable mattered. It was the trust anchor — if that number came back wrong, the rest of the table couldn't be relied on either. It came back exactly right, which was a real signal that the agent had read the right sections in the right files. The firsthand count was different.
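The mechanical part of that check can be sketched as follows, assuming the agent's table has been parsed into `(file, bucket, evidence)` tuples. That row shape and the function names are assumptions for illustration, not the format the agent actually returned.

```python
from collections import Counter


def tally(rows):
    """Count rows per bucket; rows are (file, bucket, evidence) tuples."""
    return Counter(bucket for _file, bucket, _evidence in rows)


def check_table(rows, total_files, anchors):
    """Return a list of mechanical problems; an empty list means none found.

    anchors maps a bucket name to the count known from elsewhere,
    for example {"TODO": 18}.
    """
    problems = []
    files = [f for f, _bucket, _evidence in rows]
    if len(set(files)) != len(files):
        problems.append("duplicate file rows")
    if len(files) != total_files:
        problems.append(f"expected {total_files} rows, got {len(files)}")
    counts = tally(rows)
    for bucket, expected in anchors.items():
        if counts[bucket] != expected:
            problems.append(f"{bucket}: expected {expected}, got {counts[bucket]}")
    return problems
```

A clean result here says the table is complete and the anchored bucket matches. It says nothing about whether the unanchored buckets were judged correctly.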

The agent's evidence column for firsthand listed five skills (`aeon`, `cold-email`, `hypothesis-generation`, `pytorch-lightning`, `write-a-skill`) on the basis of "**Pattern that works:**" phrasing in their receipts. That was the classifier the prompt had handed the agent — phrasing suggesting structured firsthand notes. The prompt did not say "verify that this phrase appears only in firsthand entries." It said "classify based on the phrasing and surrounding specificity."

When the verification step ran a grep for that exact phrase across all 160 files, the count came back as **142**. The phrase the agent had used to single out 5 firsthand entries was present in 89% of the wiki. The agent's actual classifier — the one it was running — wasn't phrasing. It was "phrasing + specificity in surrounding context," with "specificity" being a judgment call the prompt had implicitly delegated.
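The prevalence check is simple enough to express directly. The sketch below counts literal occurrences across a list of already-loaded section texts; `phrase_prevalence` is a hypothetical name, and reading the files from disk is left out.

```python
def phrase_prevalence(texts, phrase):
    """Return (hits, total, share) for a literal phrase across texts.

    Uses plain substring matching, so characters such as '*' are not
    interpreted as pattern syntax.
    """
    hits = sum(1 for text in texts if phrase in text)
    total = len(texts)
    share = hits / total if total else 0.0
    return hits, total, share


if __name__ == "__main__":
    sample = ["**Pattern that works:** keep it short", "TODO", "no marker here"]
    print(phrase_prevalence(sample, "**Pattern that works:**"))
```

With the counts from this run, 142 hits out of 160 files gives a share of 0.8875, which rounds to the 89% quoted above. A marker that common cannot, on its own, single out five entries.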

## Where it drifted

The skill — [`dispatching-parallel-agents`](/skills/dispatching-parallel-agents/) — protected main context as expected. That part worked. The drift was in the prompt, not the dispatch.

For **numeric-tally tasks** ("count files matching pattern X"), the rubric is grep. The dispatched agent runs the grep, returns the count, and the verification is mechanical. The TODO=18 number landed in this class. It came back exactly right.

For **semantic-classification tasks** ("is this firsthand or generic?"), the rubric is judgment. If the prompt doesn't specify the evidence requirements, the agent fills the gap with its own implicit rubric and reports back with confidence. The firsthand=5 number landed in this class. It came back as "5 entries where I judged the surrounding context was specific enough" — which the prompt didn't ask for and the verification couldn't cleanly check.

A tighter rubric would have said something like:

> Classify as `firsthand` only if the receipts section names at least two of: (a) a specific calendar date, (b) a specific file path under `receipts-drafts/`, (c) a specific git commit hash, (d) a measured number with units. Otherwise classify as `generic` or `TODO`.

That rubric is greppable. The agent's judgment grounds out in regex on the receipts content. The same dispatch would likely have returned zero firsthand from the bulk-imported Haines batch and surfaced the actual firsthand entries — the ones with `## 2026-XX-XX —` timestamps and specific file references.
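One way the quoted rubric might ground out in regex is sketched below. Everything specific here is an assumption: ISO-style dates, abbreviated or full lowercase hex commit hashes, a short list of units, and a literal "TODO" marker for the TODO bucket (the rubric itself does not say how to tell `generic` from `TODO`).

```python
import re

# Heuristic signal patterns; each one is an assumption, not a project rule.
SIGNALS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "path": re.compile(r"receipts-drafts/\S+"),
    # 7-40 lowercase hex chars containing at least one letter and one digit,
    # so plain numbers and plain words are not mistaken for hashes.
    "commit": re.compile(r"\b(?=[0-9a-f]*[a-f])(?=[0-9a-f]*\d)[0-9a-f]{7,40}\b"),
    "measure": re.compile(
        r"\b\d+(?:\.\d+)?\s?(?:%|(?:[Kk]\s?)?(?:tokens|files|ms|seconds|minutes|MB|GB)\b)"
    ),
}
TODO_RE = re.compile(r"\bTODO\b")


def classify(receipts: str, min_signals: int = 2) -> tuple[str, list[str]]:
    """Return (label, matched signal names) for one Receipts section."""
    signals = [name for name, rx in SIGNALS.items() if rx.search(receipts)]
    if len(signals) >= min_signals:
        return "firsthand", signals
    if not receipts.strip() or TODO_RE.search(receipts):
        return "TODO", signals
    return "generic", signals


if __name__ == "__main__":
    print(classify("2026-05-12: notes saved to receipts-drafts/audit.md"))
    print(classify("**Pattern that works:** keep prompts short and specific."))
```

Returning the matched signal names alongside the label gives the evidence column something checkable: every `firsthand` verdict can be re-derived from the text with the same patterns.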

There's a related issue worth flagging. The token win wasn't the lesson. Saving 80K tokens by dispatching is real, but the same 80K savings is available for any reasonably bounded task — that's just how subagent dispatch works. The lesson was that **the dispatch outsourced not just the reading, but the judgment**. The judgment is the part that needed an explicit rubric in the prompt. The reading didn't.

## What I'd change

Three things.

**Name the evidence requirements explicitly when the classification is semantic.** "Classify as firsthand if and only if X, Y, Z" rather than "classify based on specificity." The agent's confidence comes from the rubric; if the rubric is implicit, the confidence is unearned. A greppable rubric beats a judgment rubric whenever the task permits.

**Use cross-check tallies as a trust anchor.** Asking the agent to confirm the TODO bucket landed at 18 was the right move — it produced a verifiable number that the rest of the table could be sanity-checked against. If the cross-check fails, everything else is suspect. If it passes, you've ruled out one whole class of failure (the agent read the wrong files, or skipped sections, or hallucinated counts).

**Verify the rubric before trusting the classification, not just the number.** This run produced "firsthand = 5", a believable number. But `grep -lF '**Pattern that works:**' src/content/skills/*.md | wc -l` returned 142, and that gap is what surfaced the rubric problem. (The `-F` flag makes grep match the phrase literally; without it the pattern is parsed as a regular expression, in which `*` is a repetition operator.) The number alone isn't enough; the *predicate* the agent used to produce the number is what needs verifying. The [`verification-before-completion`](/skills/verification-before-completion/) skill is the natural pair here: run it on the dispatched output, not just on the operator's claims.
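Verifying the predicate rather than the number can be sketched as a small report: given the agent's labels and the set of files where the cited marker actually matches, how many matches fall outside the bucket the marker was supposed to justify? The function name and input shapes are hypothetical.

```python
def predicate_report(labels, matches, target="firsthand"):
    """Summarize how well a predicate separates one bucket from the rest.

    labels:  dict mapping file name to the bucket the agent assigned
    matches: set of file names where the cited predicate holds
    """
    in_target = {f for f, bucket in labels.items() if bucket == target}
    hit_target = len(matches & in_target)
    hit_other = len(matches - in_target)
    return {
        "target_size": len(in_target),
        "matches_in_target": hit_target,
        "matches_outside_target": hit_other,
        # A usable marker matches inside the bucket and nowhere else.
        "discriminates": hit_target > 0 and hit_other == 0,
    }
```

With 5 entries labeled firsthand and 142 files matching the phrase, at least 137 matches must sit outside the firsthand bucket, so the report would flag the marker as non-discriminating whatever the exact overlap.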