Harness Engineering: The Invisible Force That Makes AI Agents Production-Ready
If you've spent any real time wiring up an AI coding agent — debugging weird failures, reading traces, wondering why the model did something obviously wrong — you've probably had the same realization I did: the model isn't the bottleneck.
The framing that finally made it click for me came from Addy Osmani's writing on agent harness engineering, which itself draws on a line from Viv Trivedy:
"Agent = Model + Harness. If you're not the model, you're the harness."
And the corollary that follows from it:
"A decent model with a great harness beats a great model with a bad harness."
A lot of this post is my attempt to internalize and extend that framework, with the components I've found most worth understanding, the mistakes I keep watching people make, and how I'd approach learning this from scratch today.
What Even Is a Harness?
Here's the cleanest mental model I've found: Agent = Model + Harness.
The model is the LLM — GPT-4o, Claude, Gemini, whatever. It does the reasoning.
The harness is everything else:
- The system prompt and documentation files (AGENTS.md, CLAUDE.md)
- The tools the agent can call — bash, file system, browser, MCP servers
- The execution environment — sandboxed or not, what permissions exist
- The orchestration logic — how subagents are spawned and coordinated
- The hooks — scripts that intercept tool calls and enforce constraints
- The memory system — how context is managed across a long session
- The observability layer — logging, tracing, cost metering
Most people building agents obsess over prompt engineering and model selection. The harness is what they treat as boilerplate. That's the mistake.
The harness is the product. The model is a commodity that gets swapped out.
The Mindset Shift That Changes Everything
Here's something I had to unlearn: when an agent fails, my first instinct used to be "this model isn't good enough." That framing is almost always wrong.
The right framing is: agent mistakes are configuration problems.
When the agent deletes something it shouldn't, that's a permissions problem — your harness didn't restrict destructive commands. When it goes in circles, that's a context problem — your harness didn't enforce checkpointing. When it produces the wrong output format, that's a prompt problem — your harness's AGENTS.md doesn't encode that constraint clearly enough.
Osmani calls this the "skill issue" reframe, and it matters because it tells you what to do next:
- Agent made a wrong assumption → add a convention to your documentation
- Agent tried to run rm -rf → add a hook that blocks it
- Agent got lost on a long task → refactor how you decompose tasks
- Agent hit a dead end and stalled → wire back-pressure signals into the loop
Every failure is actionable. Every mistake is a signal that the harness needs a tighter constraint.
The Ratchet Principle
Related to that — and this is one of Osmani's framings I think more people should adopt — every agent mistake should become a permanent signal. He calls it the Ratchet Principle: it only moves in one direction. When something breaks, you encode the fix. Not in your memory. In the harness.
The best AGENTS.md files I've seen are sparse, specific, and every single line traces back to something that went wrong in the past. There's no fluff, no aspirational rules about "think step by step." Every line earned its place through a real failure.
If your AGENTS.md has 200 lines of generic advice about "being careful" — that's a sign the harness grew through speculation rather than observation. Start over. Keep only what you can trace to a real incident.
Under 60 lines is a good target. If you need more, your task decomposition probably needs work.
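One way to make the ratchet mechanical is to put it in a script. Here's a minimal sketch, assuming rules live one per line in AGENTS.md. `add_rule` refuses to append a rule that doesn't cite an incident, and `check_budget` complains once the file passes the 60-line target. The function names and the `(incident: ...)` suffix are my own conventions for illustration, not something any particular tool expects.

```shell
# add_rule FILE RULE INCIDENT
# Appends one rule to FILE. Refuses when no incident reference is given,
# so every line can be traced back to a real failure.
add_rule() {
    if [ -z "${3:-}" ]; then
        echo "RATCHET: refusing to add a rule without an incident reference" >&2
        return 1
    fi
    printf '%s\n' "- $2 (incident: $3)" >> "$1"
}

# check_budget FILE [MAX_LINES]
# Silent when FILE is within budget (default: 60 non-blank lines).
# Prints a complaint to stderr and returns 1 otherwise.
check_budget() {
    budget_max=${2:-60}
    # grep -c still prints 0 when nothing matches, but exits non-zero.
    budget_count=$(grep -c '[^[:space:]]' "$1" 2>/dev/null || true)
    if [ "${budget_count:-0}" -gt "$budget_max" ]; then
        echo "BUDGET: $1 has $budget_count non-blank lines (limit $budget_max)" >&2
        return 1
    fi
}
```

Something like `check_budget AGENTS.md` could run as a pre-commit check, so the file can't quietly grow past the point where it stops being read.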
The Core Harness Components
Let me go through the actual building blocks. These are the things you need to understand to engineer a harness well.
Filesystem and Git
The filesystem is the foundational primitive. Everything else builds on it.
It gives the agent durable state without token overhead: the agent can write a plan to disk, come back to it later, branch work across files, and commit progress incrementally. Without this, you're trying to run long-horizon tasks in a sliding context window, and you will lose.
Git is the other half. It enables rollback, diff-based verification, and the ability for the agent to reason about what changed. A good harness uses git aggressively: commit after each meaningful unit of work, not just at the end.
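A small checkpoint helper is one way to make frequent commits cheap. This is a minimal sketch, assuming it runs inside an existing git repository with a commit identity already configured. The `checkpoint` name and the `checkpoint:` message prefix are my own choices, not a convention from any agent runtime.

```shell
# checkpoint MESSAGE
# Commits all pending changes in the current repository.
# Does nothing, and prints nothing, when the working tree is clean.
checkpoint() {
    # `git status --porcelain` prints one line per changed path,
    # and nothing at all when there is nothing to commit.
    if [ -z "$(git status --porcelain)" ]; then
        return 0
    fi
    git add -A &&
    git commit --quiet -m "checkpoint: $1"
}
```

Called after each unit of work (`checkpoint "add parser tests"`), this leaves a history the agent can diff against or roll back to one step at a time.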
Tools and Bash Access
A general-purpose bash shell beats a collection of specialized pre-built tools almost every time. Here's why: specialized tools create seams. The agent has to translate its intent into the tool's interface. A shell lets it compose arbitrary commands to solve arbitrary problems.
That said, bash access requires the harness to compensate with constraints. You want the agent to be able to do almost anything — except the things that are catastrophically irreversible. That's what hooks are for.
Sandboxes
Isolation is load-bearing. An agent with full system access is one mistake away from something that can't be undone. A sandboxed agent can explore freely because the blast radius is bounded.
Good sandboxes come pre-loaded: language runtimes, git, browser automation, network access with egress controls. The agent doesn't think about the sandbox — it just runs. The harness handles the containment.
The Hook System
Hooks are scripts that run at lifecycle points — before a tool call, after an edit, before a commit. They're where the harness enforces invariants that shouldn't be negotiable.
A few representative examples of what these look like in practice:
# Hook 1: block destructive commands at the tool level.
# Assumes the harness passes the proposed command in $COMMAND.
if printf '%s\n' "$COMMAND" | grep -qiE 'rm -rf|drop table|truncate'; then
    echo "BLOCKED: This command requires explicit human approval"
    exit 1
fi

# Hook 2: run the typechecker after every edit.
if ! npx tsc --noEmit 2>&1; then
    echo "TYPE ERROR: Fix before proceeding"
    exit 1
fi

# Hook 3: verify tests pass before commit.
# Output is captured so that a passing run stays silent.
if ! test_output=$(npm test 2>&1); then
    printf '%s\n' "$test_output"
    echo "TESTS FAILED: Cannot commit broken code"
    exit 1
fi
The principle: success is silent, failures are verbose. A passing hook produces no output. A failing hook screams, because the agent needs an unambiguous signal to course-correct.
Hooks turn soft expectations into hard constraints. They're the difference between "the agent should check types" and "the agent cannot proceed without passing types."
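The silent-on-success rule is easy to break by accident, because most test runners and linters print plenty when they pass. A small wrapper can enforce it for any check. This is a sketch under my own assumptions: `run_quiet` is a hypothetical helper name, and I'm assuming the harness shows the agent whatever the hook writes to stderr.

```shell
# run_quiet LABEL COMMAND [ARGS...]
# Runs COMMAND with its output captured. On success, prints nothing.
# On failure, prints a labelled header plus the captured output to stderr
# and returns 1 so the caller can block the agent's next step.
run_quiet() {
    rq_label=$1
    shift
    rq_output=$("$@" 2>&1) && return 0
    rq_status=$?
    {
        echo "$rq_label FAILED (exit $rq_status):"
        printf '%s\n' "$rq_output"
    } >&2
    return 1
}
```

A hook body then shrinks to one line per check, for example `run_quiet "typecheck" npx tsc --noEmit || exit 1`.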
Memory and Context Management
This is where most agent harnesses fall apart in practice. Context windows are finite. Long-horizon tasks aren't.
Here are the techniques that actually work:
Compaction — When the context window fills, you don't just drop old tokens. You summarize them intelligently, preserving decisions and discoveries while dropping the raw intermediate work that got you there.
Tool-call offloading — Large outputs (logs, file contents, command output) go to disk immediately. The agent gets a path, not the content. This keeps your context full of meaningful tokens instead of megabytes of output it already processed.
Progressive disclosure — Don't load all instructions and tools at the start. Surface them based on what phase of the task the agent is in. A harness that loads everything upfront is burning context on capabilities the agent won't need for another hour.
Context resets — For truly long tasks (multi-hour, multi-session), don't try to maintain a single running context. Build handoff files — compact summaries of state, decisions made, work remaining — and restart into them. The new session picks up from the handoff, not from a bloated and degraded history.
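Of the four techniques, tool-call offloading is the simplest to sketch. The version below is illustrative only: `offload` and `OFFLOAD_DIR` are names I made up, and the one-line receipt format is an assumption rather than anything a real runtime emits. The command's full output goes to a file, and the only thing that would enter the context is the exit status, a line count, and the path.

```shell
# offload COMMAND [ARGS...]
# Runs COMMAND, stores its combined stdout and stderr in a new file under
# $OFFLOAD_DIR, and prints a one-line receipt instead of the output itself.
# Returns the command's own exit status.
offload() {
    off_dir=${OFFLOAD_DIR:-.agent/outputs}
    mkdir -p "$off_dir" || return 1
    off_file=$(mktemp "$off_dir/out.XXXXXX") || return 1
    off_status=0
    "$@" > "$off_file" 2>&1 || off_status=$?
    off_lines=$(wc -l < "$off_file" | tr -d ' ')
    echo "exit=$off_status lines=$off_lines path=$off_file"
    return "$off_status"
}
```

The agent can still read the file with `grep` or `tail` when it needs a detail, which is usually far cheaper than carrying the whole log in context.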
Orchestration: Subagents and Planner/Evaluator Splits
Single-agent architectures have a fundamental limitation: the same model that produces work also evaluates it. That's like grading your own exam — the bias runs deep.
Better pattern: split planning, generation, and evaluation across separate agents.
Planner Agent
│ Decomposes goal into steps
│ Writes sprint contract ("done" conditions)
│
Generator Agent
│ Executes individual steps
│ Writes output to disk
│
Evaluator Agent
Reviews generator's output against contract
Returns pass/fail with specific feedback
If fail → Generator re-runs with feedback injected
The evaluator sees the contract and the output, not the generation process. That distance is what makes it useful.
Sprint contracts matter here: before any work begins, the planner negotiates a precise definition of "done" — what outputs exist, what tests pass, what constraints are satisfied. This prevents the most common long-horizon failure: the agent deciding it's done when it isn't.
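To make the contract idea concrete, here is a minimal sketch of the generate-then-evaluate loop. It makes several simplifying assumptions that are mine, not part of the pattern itself: the contract is a text file with one shell check per line, the evaluator is a plain function rather than a second model, and `stub_generator` is a hypothetical stand-in for a real agent call. What it does preserve is the separation: the evaluator only ever sees the contract and the output directory.

```shell
# evaluate CONTRACT_FILE WORKDIR
# Each non-empty contract line is a shell check run inside WORKDIR.
# Prints "FAIL: <check>" per unmet condition; silent when all pass.
evaluate() {
    ev_failed=0
    while IFS= read -r ev_check || [ -n "$ev_check" ]; do
        [ -n "$ev_check" ] || continue
        if ! (cd "$2" && sh -c "$ev_check") </dev/null >/dev/null 2>&1; then
            echo "FAIL: $ev_check"
            ev_failed=1
        fi
    done < "$1"
    return "$ev_failed"
}

# run_sprint GENERATOR CONTRACT_FILE WORKDIR [MAX_ATTEMPTS]
# Calls GENERATOR with (WORKDIR, FEEDBACK) until the evaluator passes
# or the attempt budget (default 3) runs out.
run_sprint() {
    sp_feedback=""
    sp_attempt=1
    while [ "$sp_attempt" -le "${4:-3}" ]; do
        "$1" "$3" "$sp_feedback"
        if sp_feedback=$(evaluate "$2" "$3"); then
            echo "PASS after $sp_attempt attempt(s)"
            return 0
        fi
        sp_attempt=$((sp_attempt + 1))
    done
    {
        echo "GAVE UP after $((sp_attempt - 1)) attempt(s):"
        printf '%s\n' "$sp_feedback"
    } >&2
    return 1
}

# Hypothetical generator: forgets the test file on its first try and
# only writes it once the evaluator's feedback mentions it.
stub_generator() {
    echo "export const add = (a, b) => a + b" > "$1/add.ts"
    case $2 in
        *add.test.ts*) echo "// tests go here" > "$1/add.test.ts" ;;
    esac
}
```

With a contract containing `test -s add.ts` and `test -s add.test.ts`, the stub fails the first evaluation, receives the failing check as feedback, and passes on the second attempt. A real generator would be an agent session, but the control flow stays the same.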
Long-Horizon: Ralph Loops
For tasks that span hours or even days, you need a mechanism to sustain execution across context resets.
The pattern I've seen called "Ralph Loops" works like this: a hook intercepts the agent's attempt to signal completion, writes the current state to a handoff file, terminates the session cleanly, and spawns a new session initialized from the handoff. The agent never runs for more than a single context window — but the task continues across windows without losing progress.
This is the harness doing something the model can't do for itself: managing its own lifecycle.
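Stripped down, the loop looks something like the sketch below. It simplifies the pattern in one important way: instead of a hook intercepting the completion signal, each session is assumed to rewrite the handoff file itself before it exits. The handoff format (`status:` on the first line), the `ralph_loop` name, and `stub_session` are all hypothetical and stand in for a real agent invocation.

```shell
# ralph_loop SESSION_CMD HANDOFF_FILE [MAX_SESSIONS]
# Runs SESSION_CMD repeatedly, passing it the handoff file path. Each
# session reads prior state from the file and overwrites it before exiting.
# Stops when the handoff's first line is "status: done", or when the
# session budget (default 10) is exhausted.
ralph_loop() {
    rl_session=1
    while [ "$rl_session" -le "${3:-10}" ]; do
        "$1" "$2" || return 1
        if [ "$(head -n 1 "$2")" = "status: done" ]; then
            echo "done after $rl_session session(s)"
            return 0
        fi
        rl_session=$((rl_session + 1))
    done
    echo "session budget exhausted; handoff left at $2" >&2
    return 1
}

# Hypothetical stand-in for one agent session: completes one unit of
# work per run and records what remains in the handoff file.
stub_session() {
    st_remaining=$(sed -n 's/^remaining: //p' "$1")
    st_remaining=$((st_remaining - 1))
    if [ "$st_remaining" -le 0 ]; then
        printf 'status: done\nremaining: 0\n' > "$1"
    else
        printf 'status: in-progress\nremaining: %s\n' "$st_remaining" > "$1"
    fi
}
```

The session budget matters as much as the loop itself: without it, a task that never reaches "done" restarts forever.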
What a Production Harness Actually Looks Like
Here's a rough architecture of how a mature agent harness layers together:
┌─────────────────────────────────────────────────┐
│ INPUT LAYER │
│ Session mgmt · Permission gates · UI │
└─────────────────────┬───────────────────────────┘
│
┌─────────────────────▼───────────────────────────┐
│ KNOWLEDGE LAYER │
│ Skill registry · Context compression │
│ Memory stores · AGENTS.md │
└─────────────────────┬───────────────────────────┘
│
┌─────────────────────▼───────────────────────────┐
│ ORCHESTRATION LAYER │
│ Planner · Generator · Evaluator │
│ Subagent spawning · Routing │
└─────────────────────┬───────────────────────────┘
│
┌─────────────────────▼───────────────────────────┐
│ EXECUTION LAYER │
│ Tool dispatch · Hook enforcement │
│ Sandbox runtime · Prompt cache │
└─────────────────────┬───────────────────────────┘
│
┌─────────────────────▼───────────────────────────┐
│ OBSERVABILITY LAYER │
│ Event bus · Tracing · Cost metering │
│ Background execution · Audit log │
└─────────────────────────────────────────────────┘
Every layer is part of the harness. None of this is the model. And if any layer is missing or weak, you feel it in production.
Why the Harness Gap Is the Real Gap
The difference between what today's AI models can do and what you actually see them doing is largely a harness gap.
The models are more capable than most demos suggest. They fail not because they can't reason about the problem, but because:
- They weren't given the right context at the right time
- There was no enforcement mechanism to prevent the wrong action
- They couldn't persist state across a long task
- They had no reliable way to know when they were done
- The evaluation of their work was done by themselves
Every one of those is a harness problem.
This is also why the top coding agents — Claude Code, Cursor, Codex, Aider — increasingly resemble each other despite running different underlying models. The convergence isn't happening at the model layer. It's happening at the harness layer. The industry is discovering which scaffolding patterns are actually load-bearing, and they're the same patterns regardless of which model you're wrapping.
The Harness-as-a-Service Shift
The industry is in the middle of a platform shift. A year ago, building an agent meant calling the completions API and writing glue code. Now it means choosing a harness runtime and configuring it.
Claude Agent SDK, OpenAI Agents SDK, Codex SDK — these are harness frameworks. They give you execution loops, tool calling, context management, hook systems, and sandbox primitives out of the box. Your job is to configure the domain-specific layer on top.
This is the right abstraction. You shouldn't be rebuilding the context compaction logic or the tool dispatch mechanism — those are solved problems. You should be focused on:
- What tools does your domain need?
- What constraints must be enforced at the hook layer?
- How should context be managed for your specific task horizon?
- What does "done" look like for your specific use case?
The harness is becoming a platform. Harness engineering is the skill of working effectively on that platform.
The Co-Training Loop
Here's something that doesn't get talked about enough: models and harnesses co-evolve.
When a harness pattern proves useful (filesystem operations, writing plans to disk, subagent dispatch), that pattern gets standardized. Next-generation models get post-training specifically optimized for those actions. The model gets better at using those primitives. The harness patterns that were compensating for model weaknesses can be simplified. New capabilities unlock new harness patterns. Repeat.
This means harness engineering isn't a fixed target. The harness that works well with Claude 3 Sonnet is not the same as the harness that works well with Claude Sonnet 4.6. When a model gets better at something, you remove the scaffolding you were using to compensate. When it unlocks new capabilities, you build new scaffolding to take advantage.
A harness is a living system. It has to be.
How I'd Learn Harness Engineering Today
If I were starting from scratch, this is the path I'd take:
Step 1: Run Claude Code or a comparable agent on a real task and read the traces. Don't just watch it succeed. Watch it fail. Look at what it tried, where it got stuck, what information it was missing. That's your first lesson in what a harness needs to provide.
Step 2: Write your first AGENTS.md. Pick a project. Start with a blank file. Run the agent on a few tasks. Every time it does something wrong, add one line to AGENTS.md encoding the constraint it violated. After ten tasks, you have your first real harness document.
Step 3: Write a hook. Start with something simple — a pre-commit hook that runs your linter. Understand the lifecycle: when does the hook run, what input does it receive, what does the exit code mean. Then add a hook that blocks a class of dangerous commands.
Step 4: Build a multi-agent pipeline. Take a task you've been running with a single agent and split it. Planner writes a spec to disk. Generator executes against the spec. Evaluator reads both and returns a verdict. Wire them together with a simple script. This teaches you more about context management and state handoff than any tutorial will.
Step 5: Add observability. Log everything: tool calls, hook results, context sizes, costs. You can't improve what you can't see. A structured log that captures the agent's full session is invaluable when something goes wrong.
Step 6: Read the docs and the SDKs. The Claude Agent SDK, OpenAI Agents SDK, and Codex SDK are open or well-documented. Their hook lifecycles, permission models, and context strategies are worth studying not to copy but to understand the trade-offs being made. The Claude Code docs on hooks and the broader agent SDK documentation are particularly concrete.
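For Step 5, the structured log doesn't need to be sophisticated to be useful. Here's a minimal sketch that appends one JSON object per line. The `log_event` name, the `AGENT_LOG` variable, and the field names are assumptions of mine, and the escaping only handles backslashes and double quotes, so multi-line details would need a real JSON encoder.

```shell
# log_event KIND DETAIL
# Appends one JSON line with a UTC timestamp to $AGENT_LOG.
# Escapes backslashes and double quotes in DETAIL; does not handle newlines.
log_event() {
    le_detail=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"ts":"%s","kind":"%s","detail":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$le_detail" \
        >> "${AGENT_LOG:-agent.log.jsonl}"
}
```

Calling it from every hook and tool wrapper (`log_event hook_result "typecheck: pass"`) gives you a single file you can grep when a session goes wrong.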
Resources I'd point at:
- Addy Osmani's piece on agent harness engineering — the clearest articulation of the concept I've come across
- The Claude Code hooks documentation — implementation-focused and specific
- Public AGENTS.md / CLAUDE.md files in open-source repos — real-world examples of constraints that earned their place
- Building something small and reading your own traces — there's no substitute
What Gets Easier, What Gets Harder
As models improve, harnesses don't disappear — they evolve.
The scaffolding you built to handle a model's weaknesses becomes unnecessary as those weaknesses are addressed. Osmani gives a good example: Claude Opus 4.6 largely eliminated the "context anxiety" failures earlier models showed — where the agent would rush toward completion as the context window filled. The harness layers built specifically to mitigate that failure mode can now be simplified or removed.
But stronger models unlock longer horizons. And longer horizons demand new harness patterns: multi-day memory policies, multi-agent coordination protocols, harnesses that can orchestrate dozens of parallel agents on the same codebase.
The open problems in harness engineering right now are genuinely hard:
- How do you orchestrate many parallel agents on a shared codebase without them clobbering each other's changes?
- How do you build harnesses that analyze their own traces to identify and fix configuration failures automatically?
- How do you assemble tools and context dynamically — just-in-time — instead of pre-configuring everything upfront?
That last one is particularly interesting. A harness that assembles itself based on what the task actually needs at runtime is closer to a compiler than a config file. We're not there yet, but the trajectory is clear.
The Bet Worth Making
I think harness engineering is one of the highest-leverage skills in software right now. Not because it's flashy, but because it's the part of the agent you actually control.
Model capabilities are improving fast, but they're improving uniformly across everyone who uses the same models. The harness is where individual engineering judgment compounds. The gap between a team that treats agent failures as configuration problems and one that treats them as model limitations — that gap widens every quarter.
The model is a commodity. The harness is the moat.
If I were advising someone on where to invest their learning time in AI engineering, I'd say: understand the models well enough to use them effectively, but spend your energy on the harness. That's where durable skills accumulate.
Most of the framework here — Agent = Model + Harness, the Skill Issue reframe, the Ratchet Principle, the HaaS lens — comes from Addy Osmani's writing on agent harness engineering, which I'd recommend reading in full. What I've tried to do here is distill it, extend it where I have something to add, and frame it the way I'd want it explained to me when I first started building agents. If this gave you something useful — go build a small harness and break it. That's where the rest of the learning lives.