# Verification caught half the audit claims weren't greppable

> Ran evidence-before-assertion on a four-claim audit. Two passed grep. The other two were the agent's content read, accepted on trust.

**Canonical URL**: https://agentcookbooks.com/blog/verification-half-the-audit-claims-werent-greppable/

**Published**: 2026-05-12

**Tags**: claude-code, skills, evals

---

The [`verification-before-completion`](/skills/verification-before-completion/) skill says "evidence before assertions always" — run the verification command before claiming the work is done. Most of the time the verification is mechanical: did the test pass, does the file compile, is the count correct. On the day this run happened, the audit being verified had four numeric claims, and the skill caught that **only two of the four could be verified by grep**. The other two were the dispatched agent's content read, accepted on trust. The number didn't change. The epistemic class did — the two unverified claims got demoted from "measured" to "asserted," which is the difference between a receipt and a guess. The audit being verified was [the one-dispatch / 160-files Explore run](/blog/one-dispatch-160-files-classified-the-rubric-was-the-bug/), and the demotion is what kept its later writeup honest about which numbers were rubric-based judgment versus grep-derived fact.

## What I ran

The setup: a 160-skill receipts audit had just returned. The dispatched `Explore` agent classified each `## Receipts` section into one of three buckets and reported back with four numbers:

- Total skill files = 160
- TODO marker count = 18
- Firsthand count = 5
- Generic count = 137 (residual)

Before reporting those numbers to the operator as truth, I applied `verification-before-completion`. One command per claim. Each command had to either confirm the number or surface a delta.

## What happened

Two commands passed cleanly.

```bash
ls src/content/skills/*.md | wc -l
# → 160 ✓
```

```bash
grep -l "TODO — to be filled in" src/content/skills/*.md | wc -l
# → 18 ✓
```

The third command broke the assumption.

```bash
grep -l "**Pattern that works:**" src/content/skills/*.md | wc -l
# → 142 ✗  (expected 5)
```

The agent had classified 5 skills as firsthand based on "Pattern that works:" phrasing in the receipts. But that phrase is near-universal across the wiki — 142 of 160 files use it. Phrasing alone didn't distinguish firsthand from generic. The 5-firsthand classification wasn't *wrong*; the agent had been reading surrounding context (specificity, repeatable steps, named constraints) and applying judgment. But that judgment is a different epistemic class than grep.

The fourth number — generic = 137 — was the residual: 160 minus 18 minus 5. If you trust the agent's content read on the firsthand count, the residual holds. If you don't, the residual is also untrustworthy. Two numbers were verified by single command. Two needed a content re-read, or a demotion to "agent's judgment, accepted on trust."

## Where it drifted

The skill's evidence-before-assertion frame did exactly what it's designed to do: it forced an explicit verification step before reporting the audit numbers as truth. What it surfaced wasn't that any number was wrong. It was that **two of the four claims belonged to a different epistemic class than the other two**.

Greppable claims (TODO marker count, total file count) ground out in a single command. The verification *is* the proof. You can re-run it, share it, link to it. The number is reproducible.

Judgment claims (firsthand classification) ground out in a content read. The verification is the original read, and re-running it produces a *new* judgment, not a confirmation of the old one. If a different model or a different prompt classified the same 160 files, the firsthand count would likely move. The number is unstable in a way the greppable claims aren't.

That asymmetry is invisible if you treat all four numbers the same way. Saying "verified, all four numbers checked" is more confident than the evidence supports. Saying "two verified by grep, two are the agent's content read accepted on trust, here are the commands" is the actual situation.

## What I'd change

Three things.

**Specify the evidence class up front when reporting numbers.** Instead of "I classified 160 files as TODO=18 / firsthand=5 / generic=137," report "TODO=18 by grep, total=160 by `ls`, firsthand=5 by content read (specificity in surrounding context), generic=137 by residual." The reader sees which numbers are reproducible by command and which require accepting a judgment. The total claim is no less true. The reader's calibration improves.

**Demand greppable evidence in dispatch prompts when the rubric allows.** For the firsthand classification, the prompt could have said: "Classify as firsthand only if the section cites a specific date, a specific file path under `receipts-drafts/`, or a specific git commit." That rubric *is* greppable. The agent's judgment grounds out in regex on the receipts files. The skill surfaced the asymmetry; the dispatch prompt could have prevented it.

**Don't conflate "verified" with "the agent told me so."** The honesty cost of reporting two of four numbers as judgment instead of evidence is two extra sentences. The cost of *not* doing it is the chance a downstream decision treats the firsthand count as if it were the greppable count — and acts on a number that would change under re-classification. Two sentences vs. a wrong decision later is a trade you take every time.

The skill's official line is "evidence before assertions always." The implicit corollary, the one this run surfaced: when the evidence is the agent's content read rather than a command output, call it what it is. The number doesn't change. The shape of what's being claimed does.

A later [corpus-wide E-E-A-T audit](/blog/eeat-audit-anonymity-policy-as-authority-ceiling/) applied the same grep-vs-judgment discipline at scale — three of the headline findings (143-of-197, 8-of-20, 87-word) were greppable; the Authority sub-score was the judgment-class claim. Same epistemic asymmetry, different audit.