# agent-harness-construction

> A Claude Code skill from Affaan M's everything-claude-code repo that codifies the four constraints on agent output quality — action space, observation, recovery, context budget — with rules for tool granularity (micro for high-risk, macro for round-trip-dominated), observation shape (status / summary / next_actions / artifacts), error-recovery contract, and ReAct vs function-calling pattern picks.

**Use case**: Design agent tools and observations so the agent has a fighting chance of finishing the task

**Canonical URL**: https://agentcookbooks.com/skills/agent-harness-construction/

**Topics**: claude-code, skills, agents

**Trigger phrases**: "design an agent tool space", "what should tool outputs look like", "improve my agent's completion rate"

**Source**: [Affaan M](https://github.com/affaan-m/everything-claude-code/tree/main/skills/agent-harness-construction)

**License**: MIT

---

## What it does

`agent-harness-construction` is the agent-design skill in [Affaan M's everything-claude-code](https://github.com/affaan-m/everything-claude-code) — see [skills/agent-harness-construction](https://github.com/affaan-m/everything-claude-code/tree/main/skills/agent-harness-construction). It frames agent output quality as constrained by four things — action space quality, observation quality, recovery quality, context budget quality — and ships rules for each.

Action space: stable explicit tool names, schema-first narrow inputs, deterministic output shapes, no catch-all tools. Granularity: micro-tools for high-risk operations (deploy, migration, permissions), medium tools for common edit/read/search loops, macro tools only when round-trip overhead dominates. Observations get a four-field shape — `status` (success/warning/error), `summary` (one-line result), `next_actions` (actionable follow-ups), `artifacts` (file paths or IDs). Error recovery requires three things per error path: root cause hint, safe retry instruction, explicit stop condition.
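
The four-field observation shape and the three-part recovery requirement can be sketched as typed records. This is a minimal illustration, not the skill's own code: the class names `Observation` and `Recovery`, the `recovery` attribute, and the validation rules are assumptions made for the example. The skill itself only names the four fields and the three recovery items.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Literal, Optional

Status = Literal["success", "warning", "error"]


@dataclass(frozen=True)
class Recovery:
    """The three items the skill requires per error path (names are hypothetical)."""
    root_cause_hint: str
    safe_retry: str
    stop_condition: str


@dataclass(frozen=True)
class Observation:
    """Four-field tool output; `recovery` is an assumed extra slot for error paths."""
    status: Status
    summary: str
    next_actions: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    recovery: Optional[Recovery] = None

    def __post_init__(self) -> None:
        if self.status not in ("success", "warning", "error"):
            raise ValueError(f"unknown status: {self.status!r}")
        if "\n" in self.summary:
            raise ValueError("summary must be a single line")
        # Assumption: an error without recovery guidance is rejected outright,
        # which rules out the "error-only output" anti-pattern by construction.
        if self.status == "error" and self.recovery is None:
            raise ValueError("error observations need a recovery contract")


failed_deploy = Observation(
    status="error",
    summary="deploy to staging failed: image not found",
    next_actions=["rebuild-image", "retry-deploy"],
    artifacts=["deploy-log.txt"],
    recovery=Recovery(
        root_cause_hint="image tag missing from the registry",
        safe_retry="rebuild the image, then call the deploy tool again",
        stop_condition="after two failed retries, hand off to an operator",
    ),
)
payload = asdict(failed_deploy)  # plain dict, ready to serialize for the agent
```

Making the record immutable and validating it at construction time is one way to keep output shapes deterministic; a JSON Schema check at the tool boundary would serve the same purpose.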

Architecture pattern guidance: ReAct for exploratory tasks with uncertain paths, function-calling for structured deterministic flows, hybrid (recommended) — ReAct planning with typed tool execution. Benchmarks to track: completion rate, retries per task, pass@1 and pass@3, cost per successful task. Anti-patterns called out explicitly: too many tools with overlapping semantics, opaque tool output without recovery hints, error-only output without next steps, context overloading with irrelevant references.
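
The four benchmarks can be computed from a per-task attempt log. The sketch below rests on definitions the skill does not spell out, so treat them as assumptions: a retry is any attempt after the first, pass@k means "at least one of the first k attempts succeeded" (the simple empirical form, not the unbiased estimator used in code-generation papers), and cost per successful task divides total spend, including failed tasks, by the number of completed tasks. `TaskRun` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRun:
    attempts: List[bool]  # outcome of each attempt in order; True means the task completed
    cost_usd: float       # total spend across all attempts for this task


def pass_at_k(runs: List[TaskRun], k: int) -> float:
    """Fraction of tasks with at least one success within the first k attempts."""
    return sum(any(run.attempts[:k]) for run in runs) / len(runs)


def summarize(runs: List[TaskRun]) -> Dict[str, float]:
    if not runs:
        raise ValueError("need at least one task run")
    completed = [run for run in runs if any(run.attempts)]
    total_cost = sum(run.cost_usd for run in runs)
    return {
        "completion_rate": len(completed) / len(runs),
        "retries_per_task": sum(len(run.attempts) - 1 for run in runs) / len(runs),
        "pass@1": pass_at_k(runs, 1),
        "pass@3": pass_at_k(runs, 3),
        # Failed tasks still cost money, so they stay in the numerator.
        "cost_per_successful_task": (
            total_cost / len(completed) if completed else float("inf")
        ),
    }


runs = [
    TaskRun(attempts=[True], cost_usd=0.05),
    TaskRun(attempts=[False, True], cost_usd=0.10),
    TaskRun(attempts=[False, False, False], cost_usd=0.15),
    TaskRun(attempts=[False, False, False, True], cost_usd=0.20),
]
report = summarize(runs)
```

Running the same fixed task set before and after a harness change, and comparing the two reports, is what makes the numbers meaningful.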

## When to use it

- Designing a new agent — what tools to expose, what observation shape to enforce, how to handle errors
- Diagnosing an agent that doesn't converge — too many tools? bad observations? no recovery contract?
- Choosing between ReAct and function-calling architecture for a new flow
- Setting up agent benchmarks (completion rate, retries, pass@k, cost-per-success)
- Code review of agent harness changes: does a new tool violate the granularity rules?

When *not* to reach for it:

- Implementing the agent in code — this is patterns and constraints, not a framework choice
- Comparing existing coding agents — that's `agent-eval`
- Building specific agent personas — that's a different family
- One-off prompt tweaks — the skill is about durable structure, not prompt-engineering tactics

## Install

The skill lives in [affaan-m/everything-claude-code](https://github.com/affaan-m/everything-claude-code) at `skills/agent-harness-construction/`. Copy that folder to `~/.claude/skills/agent-harness-construction/`. The skill is a constraints-and-patterns reference, so there is no separate runtime dependency. Pair it with `agent-eval` to measure the impact of harness changes on real tasks.

## What a session looks like

1. **State the agent goal.** "I'm building an agent that runs migrations against staging and production databases."
2. **Pick granularity.** Migrations are high-risk → micro-tools. Don't expose `run_migration(env)` — expose `dry_run_migration(env, migration_id)`, `apply_migration(env, migration_id)`, `rollback_migration(env, migration_id)` separately.
3. **Define observation shape.** Each tool returns `status`, `summary`, `next_actions`, `artifacts`. A failed dry-run gets `status: warning`, a one-line summary, `next_actions: ["fix-syntax-error", "rerun-dry-run"]`, and the migration file path as the artifact.
4. **Write the error-recovery contract.** For each error path: root cause hint ("syntax error at line 42"), safe retry instruction ("fix the SQL then call dry_run_migration"), stop condition ("if dry-run fails three times, escalate to operator").
5. **Pick the architecture.** Migration flow is structured and deterministic → function-calling for execution, with ReAct-style planning reserved for the "pick which migration" step (the hybrid pattern).
6. **Set the benchmark targets.** Completion rate ≥ 95% on a fixed task set. pass@1 ≥ 80%. Retries per task ≤ 1.5. Cost per successful task ≤ $0.10.
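
Steps 2 through 4 can be sketched as three separate micro-tools that share one observation shape and enforce the stop condition. This is an illustration under stated assumptions, not the skill's implementation: `MigrationTools`, the injected `validate` callback (a stand-in for a real database dry run), the `migrations/<id>.sql` path, and the rule that `apply_migration` refuses to run without a passing dry run are all hypothetical. The `warning` status and the `next_actions` values mirror the walkthrough.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

MAX_DRY_RUN_FAILURES = 3  # stop condition from step 4

Key = Tuple[str, str]  # (env, migration_id)


def observation(status: str, summary: str,
                next_actions: List[str], artifacts: List[str]) -> Dict[str, object]:
    """Build the four-field observation every tool returns."""
    return {"status": status, "summary": summary,
            "next_actions": next_actions, "artifacts": artifacts}


class MigrationTools:
    def __init__(self, validate: Callable[[str], Optional[str]]) -> None:
        # validate(migration_id) returns an error message, or None if the SQL is fine.
        self._validate = validate
        self._failures: Dict[Key, int] = {}
        self._dry_run_ok: Set[Key] = set()
        self._applied: Set[Key] = set()

    def dry_run_migration(self, env: str, migration_id: str) -> Dict[str, object]:
        key = (env, migration_id)
        path = f"migrations/{migration_id}.sql"  # hypothetical layout
        error = self._validate(migration_id)
        if error is None:
            self._dry_run_ok.add(key)
            return observation("success", f"dry run of {migration_id} on {env} passed",
                               ["apply_migration"], [path])
        self._failures[key] = self._failures.get(key, 0) + 1
        if self._failures[key] >= MAX_DRY_RUN_FAILURES:
            # Explicit stop: do not invite another retry.
            return observation("error",
                               f"dry run failed {self._failures[key]} times: {error}",
                               ["escalate-to-operator"], [path])
        return observation("warning", f"dry run failed: {error}",
                           ["fix-syntax-error", "rerun-dry-run"], [path])

    def apply_migration(self, env: str, migration_id: str) -> Dict[str, object]:
        key = (env, migration_id)
        if key not in self._dry_run_ok:
            return observation("error", "apply refused: no passing dry run on record",
                               ["dry_run_migration"], [])
        self._applied.add(key)
        return observation("success", f"applied {migration_id} on {env}",
                           [], [migration_id])

    def rollback_migration(self, env: str, migration_id: str) -> Dict[str, object]:
        key = (env, migration_id)
        if key not in self._applied:
            return observation("error", f"{migration_id} is not applied on {env}",
                               [], [])
        self._applied.discard(key)
        self._dry_run_ok.discard(key)  # require a fresh dry run before re-applying
        return observation("success", f"rolled back {migration_id} on {env}",
                           ["dry_run_migration"], [migration_id])


tools = MigrationTools(
    validate=lambda mid: "syntax error at line 42" if mid == "bad" else None
)
first_failure = tools.dry_run_migration("staging", "bad")
```

Because each tool does one thing, a failed dry run can never apply a migration by accident, and the third failure changes the suggested next action from retrying to escalating.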

The discipline that makes it work: design before code. The four-constraint frame is a checklist — going to code without naming the action space, observation shape, and recovery contract leaves the agent in the "trying to find any error message that helps" failure mode the skill is built to prevent.

## Receipts

_TODO — to be filled in from a real session. Once an agent has been built using these constraints, this section will capture: which of the four constraints (action space / observation / recovery / context) drove the most design decisions, whether the four-field observation shape actually surfaced enough next-action signal for the agent to converge, the measured completion rate / retries / cost-per-success on a fixed task set after the design was applied, and which anti-pattern showed up most often in the pre-skill version of the harness._

## Source and attribution

From [Affaan M's everything-claude-code](https://github.com/affaan-m/everything-claude-code/tree/main/skills/agent-harness-construction) — an MIT-licensed skill collection covering harness construction, agent ops, video, payments, and platform-specific patterns.

License: MIT.

Quoting the core model verbatim: *"Agent output quality is constrained by: 1. Action space quality. 2. Observation quality. 3. Recovery quality. 4. Context budget quality."* The four-item frame is the wedge: most agent-design conversations focus on prompts, but what a prompt can achieve is bounded by these four upstream constraints.