agent-eval

A Claude Code skill from Affaan M's everything-claude-code repo that wraps the agent-eval CLI for head-to-head comparison of coding agents (Claude Code, Aider, Codex) on YAML-defined tasks. Each task pins a commit, defines pass criteria via pytest / grep / LLM judges, runs every agent in an isolated git worktree, and reports pass rate, cost, time, and consistency across repeated runs.

Replace 'which coding agent feels better' with pass-rate + cost + time + consistency on your own codebase

Source Affaan M

License MIT

First documented 2026-05-12

Receipts TODO

Evals Agents

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

compare Claude Code vs Aider on my tasks
benchmark coding agents head to head
agent eval YAML task definition

What it does

agent-eval is the coding-agent comparison skill in Affaan M’s everything-claude-code — see skills/agent-eval. It wraps the upstream agent-eval CLI (referenced from github.com/joaquinhuigomez/agent-eval) to systematize head-to-head agent comparisons that otherwise run on vibes. Every “which agent is best?” claim ends up being one engineer’s anecdote — this skill replaces it with pass rate, cost, time, and consistency on real tasks.

Tasks are declarative YAML files: name, description, repo, files, prompt, a judge array with one or more criteria, and a pinned commit for reproducibility. Three judge types: code-based (deterministic — pytest, npm run build), pattern-based (grep matching a regex against files), and model-based (LLM-as-judge with a prompt asking whether the implementation satisfies a stated invariant). The skill recommends at least one deterministic judge per task so the result doesn’t depend entirely on LLM judgment.
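To make the judge types concrete, here is a minimal Python sketch of the two deterministic ones. It is not the agent-eval implementation: the function names code_judge and pattern_judge, their signatures, and the "match in at least one file" rule are assumptions made for illustration. A model-based judge would have the same pass/fail shape but would call an LLM, so it is omitted.

```python
import re
import subprocess
from pathlib import Path


def code_judge(command: list[str], workdir: Path) -> bool:
    """Code-based judge (hypothetical): pass if the command exits with status 0."""
    result = subprocess.run(command, cwd=workdir, capture_output=True)
    return result.returncode == 0


def pattern_judge(pattern: str, files: list[Path]) -> bool:
    """Pattern-based judge (hypothetical): pass if the regex matches in any file."""
    regex = re.compile(pattern)
    return any(regex.search(f.read_text(encoding="utf-8")) for f in files)
```

Under these assumptions, code_judge(["pytest", "-q"], worktree) would stand in for a pytest criterion, and pattern_judge(r"def retry\(", [worktree / "client.py"]) for a grep-style one; both paths are placeholders.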

Each agent run gets its own git worktree checked out from the pinned commit, so no Docker is needed. The agent receives the prompt and modifies files in its worktree, and the judges then run against the resulting worktree state. Because agents are non-deterministic, repeated runs (--runs 3) capture variance. The output table reports pass rate, cost, time, and consistency for each agent, which surfaces cases such as an agent with a 95% pass rate at 10× the cost, where a cheaper agent may be the better choice.
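The worktree-per-run idea can be sketched with plain git, assuming the standard git worktree add --detach command. The helper names and the agent-runN directory layout below are hypothetical; how the agent-eval CLI actually names or places its worktrees is not documented here.

```python
import subprocess
from pathlib import Path


def worktree_add_command(commit: str, dest: Path) -> list[str]:
    """Build the git command that checks out `commit` into a new detached worktree."""
    return ["git", "worktree", "add", "--detach", str(dest), commit]


def create_worktree(repo: Path, commit: str, dest: Path) -> None:
    """Create the worktree from `repo`; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_command(commit, dest), cwd=repo, check=True)


def run_dirs(base: Path, agents: list[str], runs: int) -> list[Path]:
    """One directory per agent per run (hypothetical layout), e.g. base/aider-run2."""
    return [base / f"{agent}-run{i}" for agent in agents for i in range(1, runs + 1)]
```

With two agents and three runs, run_dirs yields six separate directories, each of which would be passed to create_worktree with the same pinned commit, so every run starts from an identical tree.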

When to use it

Considering switching primary coding agent and need data, not anecdotes
Evaluating a model change or version bump (Claude Sonnet → Opus, model A → model B) on your real workload
Regression-checking an agent after a tool or model update
Producing a team-shareable agent-selection decision with cost and consistency in the table
Building a small task fixture set that lives in version control as test data

When not to reach for it:

One-off “is this agent good enough” gut checks — overhead exceeds benefit
Page / API performance benchmarking — that’s benchmark
Regression tests for human-written code — that’s ai-regression-testing
General LLM evaluation harness work — agent-eval is specifically for coding agents on coding tasks

Install

The skill lives in affaan-m/everything-claude-code at skills/agent-eval/. Copy that folder to ~/.claude/skills/agent-eval/. The skill itself is markdown; the runtime is the upstream agent-eval CLI from github.com/joaquinhuigomez/agent-eval, plus whichever coding agents you want to compare (Claude Code, Aider, Codex, etc.), installed and authenticated locally. Tasks live in a tasks/ directory at the project root.

What a session looks like

Pick 3–5 real tasks. The skill is explicit: real workload, not toy examples. Each task gets a YAML file with name, prompt, target files, pinned commit, and judge criteria.
Write the judge. At least one deterministic judge (pytest or npm run build). LLM judges add noise — useful as a second opinion, not as the only signal.
Run the agents. Running agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3 creates a fresh worktree per agent per run, hands the prompt over, and runs the judges.
Read the comparison table. Pass rate (3/3 vs 2/3), cost ($0.12 vs $0.08), time (45s vs 38s), consistency (100% vs 67%). The skill flags that a 95% agent at 10× the cost may not be the right choice.
Iterate the task definitions. They’re test fixtures — version them, treat them as code, expand the set as new categories of work appear.
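The comparison table in the steps above can be approximated with a small aggregation, sketched here in Python. The Run record and summarize function are hypothetical, not the CLI's output format. The source does not define consistency; this sketch assumes it is the share of runs that match the most common outcome, which agrees with the example figures (3/3 gives 100%, 2/3 gives 67%).

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One agent run on one task (hypothetical record shape)."""
    agent: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Aggregate per-agent pass rate, mean cost, mean time, and consistency."""
    by_agent: dict[str, list[Run]] = {}
    for run in runs:
        by_agent.setdefault(run.agent, []).append(run)

    table = {}
    for agent, agent_runs in by_agent.items():
        outcomes = Counter(r.passed for r in agent_runs)
        table[agent] = {
            "pass_rate": outcomes[True] / len(agent_runs),
            "mean_cost_usd": mean(r.cost_usd for r in agent_runs),
            "mean_seconds": mean(r.seconds for r in agent_runs),
            # Assumed definition: share of runs matching the most common outcome.
            "consistency": max(outcomes.values()) / len(agent_runs),
        }
    return table
```

Note that under this assumed definition an agent that fails every run is also 100% consistent, so consistency only makes sense when read next to pass rate.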

The discipline that makes it work: pinned commits and isolated worktrees. Without the pin, “agent X passed yesterday but fails today” is ambiguous: did the agent regress, or did the codebase change? Worktree isolation means a bad agent run can’t corrupt the base repo or interfere with parallel runs.

Receipts

TODO — to be filled in from a real session. Once the CLI has been run against a real task set, this section will capture: which agent actually won on pass rate vs. cost (and whether the win was decisive or marginal), the variance across --runs 3 for the same agent on the same task (this is where consistency shows up), which judge type was most predictive — deterministic pytest vs. grep vs. LLM — and whether the LLM judge agreed with the deterministic judge on borderline runs.

Source and attribution

From Affaan M’s everything-claude-code — an MIT-licensed skill collection covering harness construction, agent ops, video, payments, and platform-specific patterns.

License: MIT.

Quoting the cost-vs-pass-rate rule verbatim: “Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice.” That’s the wedge — pass-rate-only comparisons miss the operational reality of paid-per-call agents in CI loops.
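One way to put a number on that rule is expected cost per passing run: per-run cost divided by pass rate, assuming independent runs that are retried until one passes. The sketch below is an illustration, not part of the skill, and the function name cost_per_pass is hypothetical.

```python
def cost_per_pass(cost_per_run_usd: float, pass_rate: float) -> float:
    """Expected spend to get one passing run, assuming independent retries."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_run_usd / pass_rate
```

With made-up figures, an agent that passes 95% of the time at $1.00 per run costs about $1.05 per pass, while one that passes 90% of the time at $0.10 per run costs about $0.11 per pass.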