benchmark

A Claude Code skill from Affaan M's everything-claude-code repo for performance baselines and regression detection across four modes — page (Core Web Vitals via browser MCP), API (p50/p95/p99 latency), build (cold + HMR + test + lint time), and before/after comparison against a saved baseline. Stores baselines in .ecc/benchmarks/ as git-tracked JSON.

Measure performance baselines and detect regressions before and after a PR, with concrete numbers.

Source: Affaan M

License: MIT

First documented: 2026-05-12

Receipts: TODO

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

benchmark this page against my baseline
compare API latency before and after the refactor
is my build slower after the upgrade

What it does

benchmark is the performance baseline skill in Affaan M’s everything-claude-code — see skills/benchmark. It covers four measurement modes: page performance (Core Web Vitals via a browser MCP), API performance (latency percentiles + status codes + concurrency), build performance (cold build, HMR, test suite, type check, lint, Docker), and before/after comparison (/benchmark baseline then /benchmark compare after the change).

Page mode pulls LCP / CLS / INP / FCP / TTFB with explicit targets (LCP < 2.5s, CLS < 0.1, INP < 200ms, FCP < 1.8s, TTFB < 800ms), then resource sizes (total weight < 1MB, JS bundle < 200KB gzipped) and request counts. API mode hits each endpoint 100 times for p50/p95/p99, then runs 10 concurrent requests against the same endpoint. Build mode measures the dev feedback loop: the numbers a developer feels on every change.
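The two checks in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: the metric key names, the 1MB = 1000KB conversion, and the nearest-rank percentile method are all assumptions made for the example.

```python
import math

# Targets quoted in the text above; every metric is lower-is-better.
# Key names are hypothetical, chosen for this sketch only.
PAGE_TARGETS = {
    "lcp_ms": 2500,
    "cls": 0.1,
    "inp_ms": 200,
    "fcp_ms": 1800,
    "ttfb_ms": 800,
    "total_weight_kb": 1000,  # assumption: 1MB taken as 1000KB
    "js_gzip_kb": 200,
}


def check_page_targets(metrics):
    """Map each measured metric to True if it is strictly under its target."""
    return {
        name: metrics[name] < limit
        for name, limit in PAGE_TARGETS.items()
        if name in metrics
    }


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Smallest rank such that at least pct% of samples are at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms):
    """Summarise a list of latencies (milliseconds) as p50 / p95 / p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

With 100 samples per endpoint, as described above, p99 under nearest-rank is simply the 99th smallest sample, so a single slow request is enough to move it.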

Comparison mode is the one that turns “I think it’s faster” into a table: before, after, delta, verdict (BETTER / WARN / WORSE). Baselines live in .ecc/benchmarks/ as JSON, git-tracked so the whole team shares the same target. The skill calls out three integration paths: run /benchmark compare on every PR in CI, pair with /canary-watch for post-deploy monitoring, and pair with /browser-qa for a full pre-ship checklist.
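A rough sketch of how such a delta table could be built is shown here. The cutoffs that separate WARN from WORSE are not documented on this page, so the 20% limit, the treatment of an unchanged metric as BETTER, and all names (Row, verdict_for, compare, format_table) are assumptions for illustration only.

```python
from dataclasses import dataclass

# Assumption: a regression of up to 20% is WARN, anything larger is WORSE.
WARN_LIMIT = 0.20


@dataclass
class Row:
    metric: str
    before: float
    after: float
    delta: float
    verdict: str


def verdict_for(before, after, warn_limit=WARN_LIMIT):
    """Classify one lower-is-better metric as BETTER, WARN, or WORSE."""
    if after <= before:
        # Assumption: an unchanged metric is reported as BETTER (not regressed).
        return "BETTER"
    if before > 0 and (after - before) / before <= warn_limit:
        return "WARN"
    return "WORSE"


def compare(baseline, current):
    """Build one row per metric present in both measurements."""
    rows = []
    for metric in sorted(baseline.keys() & current.keys()):
        before, after = baseline[metric], current[metric]
        rows.append(
            Row(metric, before, after, after - before, verdict_for(before, after))
        )
    return rows


def format_table(rows):
    """Render the rows as a fixed-width text table."""
    lines = [f"{'metric':<12}{'before':>10}{'after':>10}{'delta':>10}  verdict"]
    for r in rows:
        lines.append(
            f"{r.metric:<12}{r.before:>10g}{r.after:>10g}{r.delta:>+10g}  {r.verdict}"
        )
    return "\n".join(lines)
```

Only metrics present in both the baseline and the current run are compared, so adding a new measurement does not produce a spurious regression on its first run.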

When to use it

Before and after a PR to measure performance impact concretely, not by feel
Setting up the initial performance baselines for a project so future PRs have something to compare against
“It feels slow” from users — get the numbers before guessing at causes
Pre-launch readiness — verify the targets (LCP, INP, bundle size) are actually met
Comparing your stack against an alternative (Next.js vs Remix, React vs Solid, etc.) on the same workload

When not to reach for it:

Comparing coding agents — that’s agent-eval
Catching functional regressions — that’s ai-regression-testing
One-off Lighthouse checks for a single page — overkill for that
Production monitoring — pair with a separate canary-watch skill

Install

The skill lives in affaan-m/everything-claude-code at skills/benchmark/. Drop the folder into ~/.claude/skills/benchmark/. Page mode requires a browser MCP server (Playwright or similar) configured in .mcp.json. API mode uses standard HTTP requests. Build mode runs the project’s existing build / test / lint commands. Baselines are stored in .ecc/benchmarks/ and committed alongside the code.
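To make "standard HTTP requests" concrete, the following is a minimal sketch of an API-mode run using only the Python standard library: a batch of sequential timed requests followed by a small concurrent burst. It is not the skill's implementation; the function names and the injectable fetch parameter are hypothetical, and the defaults of 100 and 10 simply mirror the counts described earlier on this page.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10):
    """Issue one GET and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; record the code instead of failing the run.
        return err.code


def timed_call(fetch, url):
    """Return (status, elapsed milliseconds) for a single request."""
    start = time.perf_counter()
    status = fetch(url)
    return status, (time.perf_counter() - start) * 1000


def sequential_run(url, n=100, fetch=fetch_status):
    """Hit the endpoint n times, one request after another."""
    return [timed_call(fetch, url) for _ in range(n)]


def concurrent_run(url, n=10, fetch=fetch_status):
    """Hit the endpoint with n requests in flight at once."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: timed_call(fetch, url), range(n)))
```

Passing fetch as a parameter keeps the timing logic testable without a network; the elapsed values from sequential_run are what a percentile summary would be computed from.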

What a session looks like

Establish a baseline. /benchmark baseline runs page metrics across the target URLs, API latency across the target endpoints, and build timings, then stores the results in .ecc/benchmarks/<date>.json.
Make the change. Refactor, dependency upgrade, new feature — whatever the PR is.
Compare. /benchmark compare runs the same measurements and produces the delta table.
Read the verdict. LCP +200ms = WARN. Bundle -5KB = BETTER. Build +2s = WARN. Whether to ship or block is the operator’s call; the skill just surfaces the deltas.
Drill into the worst row. If LCP regressed by 200ms, the page-mode breakdown shows which resource is bigger or slower than before.
Iterate or commit the baseline. Either fix the regression and re-compare, or accept the change and commit the new baseline as the next target.
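The baseline step and the final "commit the new baseline" step both come down to writing and reading a JSON file under .ecc/benchmarks/. A minimal sketch follows, assuming an ISO date for the <date> part of the file name and a flat metric-to-number schema; neither is confirmed by the source, and save_baseline and load_latest_baseline are hypothetical names.

```python
import json
from datetime import date
from pathlib import Path

# Directory named in the text above; the file naming below is an assumption.
DEFAULT_ROOT = Path(".ecc/benchmarks")


def save_baseline(metrics, root=DEFAULT_ROOT, day=None):
    """Write metrics to <root>/<YYYY-MM-DD>.json and return the path."""
    day = day or date.today()
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{day.isoformat()}.json"
    # Sorted keys and a trailing newline keep git diffs small and stable.
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return path


def load_latest_baseline(root=DEFAULT_ROOT):
    """Return the most recent baseline dict, or None if none exists."""
    if not root.is_dir():
        return None
    # ISO dates sort chronologically when sorted as plain strings.
    files = sorted(root.glob("*.json"))
    if not files:
        return None
    return json.loads(files[-1].read_text())
```

Writing keys in sorted order means a re-run that changes one metric shows up as a one-line change in review, which is the property a git-tracked baseline depends on.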

The discipline that makes it work: git-tracked baselines. A baseline that lives only on one developer’s machine is anecdote. A baseline committed to .ecc/benchmarks/ is the shared target that reviewers can hold a PR against.

Receipts

TODO — to be filled in from a real session. Once the benchmark has been run against a real project, this section will capture: which of the four modes was the most informative for the PR in question (page vs. API vs. build vs. before/after), what the actual delta looked like for a real regression vs. a real improvement, whether the targets the skill ships (LCP 2.5s, INP 200ms, bundle 200KB gz) needed adjustment for the project’s actual baseline, and whether the JSON baseline format played well with git diff for PR review.

Source and attribution

From Affaan M’s everything-claude-code — an MIT-licensed skill collection covering harness construction, agent ops, video, payments, and platform-specific patterns.

License: MIT.