Mitchell Hashimoto calls it harness engineering. Birgitta Böckeler writes about it on martinfowler.com. Anthropic's own engineering posts keep circling back to the same idea. HumanLayer is building products around it. The vocabulary is still settling, but the thing itself is now clear enough to have a definition.

Harness engineering is the practice of shaping what an AI coding agent sees, what it may do, how its work is checked, how state survives failure, and how recurring mistakes become structural fixes.

The case for the harness as the binding constraint has stacked up fast. Can Bölük changed only the edit-tool format on a coding benchmark, using the same model throughout, and watched one configuration go from 6.7% to 68.3%. A ten-fold improvement. No model change. Just a different way of expressing the same action to the same model.

If that doesn't sound like the model is the bottleneck, it's because it isn't.

The industry consensus

In February 2026, OpenAI's Codex team published a post about a five-month experiment: shipping a product to internal users with zero human-written code. About a million lines of code, roughly 1,500 pull requests, three engineers steering agents. Their post-mortem reads like independent confirmation of the harness literature. The same concerns. The same observations:

"Give the agent a map, not a 1,000-page instruction manual. Enforce invariants rather than specific implementations. Anything the agent can't see in its context at runtime effectively doesn't exist." - OpenAI Codex team

The team's own summary of why the first few months were slow is worth quoting directly: "Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified."

The same week, LangChain published a harness experiment with harder numbers. They took their coding agent from Top 30 to Top 5 on Terminal Bench 2.0, a 13.7-point jump from 52.8 to 66.5. The model stayed the same: GPT-5.2-Codex, start to finish. Only the harness changed. If you want one empirical data point for "the harness is the product, not the model," that's it.

Stripe is running the pattern at production scale. Their Minions system ships over a thousand pull requests per week through a hybrid architecture: deterministic harness nodes for CI and linting, agentic nodes for implementation, a harness-enforced limit of two CI rounds before escalation to a human, and, per agent, a curated subset drawn from a catalogue of over 400 MCP tools rather than the full set. The tool curation isn't decoration. Over-tooling degrades performance, so the harness solves it with scope restriction instead of asking the model to restrain itself.
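The escalation budget is the kind of rule a harness can enforce in a dozen lines. Here is a minimal sketch of the generic pattern, not Stripe's code: `run_ci`, `ask_agent_to_fix`, and `escalate_to_human` are hypothetical stand-ins for whatever your pipeline actually calls.

```python
from dataclasses import dataclass

@dataclass
class CIResult:
    passed: bool
    log: str = ""

MAX_CI_ROUNDS = 2  # harness-enforced budget before a human takes over

def run_with_escalation(task, run_ci, ask_agent_to_fix, escalate_to_human):
    result = run_ci(task)                      # deterministic node: run the pipeline
    rounds = 0
    while not result.passed and rounds < MAX_CI_ROUNDS:
        ask_agent_to_fix(task, result.log)     # agentic node: one repair attempt
        result = run_ci(task)                  # re-check deterministically
        rounds += 1
    if result.passed:
        return result
    return escalate_to_human(task, result)     # budget spent; the harness stops the loop
```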

Three teams, three architectures. Same finding: the model is not the binding constraint. The harness is.

What a harness is (and isn't)

A harness is the system around the model. The files it can read. The commands it can run. The rules it must obey. The extended memory it keeps across sessions. It's the thing that turns a language model into an agent you can trust enough to walk away from.

Worth clearing up what a harness is not. Frameworks like LangChain, AutoGen, or LlamaIndex are the SDKs you build an agent with. A harness is the runtime that executes it, enforces the hardcoded boundaries, and persists state. You can build a harness with a framework, without one, or directly on your agent CLI of choice. goat-flow is a harness.

If it helps, think of it as an operating system. The model is the CPU. The context window is the RAM. The harness is the OS. The agent is the application running on top. The model doesn't reason about whether it's allowed to read .env. The OS simply refuses the syscall.
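To make the syscall analogy concrete, here is a minimal sketch of that kind of refusal, assuming a harness that routes every file read through its own guard. The deny patterns and the `guarded_read` name are illustrative, not any particular CLI's API.

```python
from fnmatch import fnmatch
from pathlib import Path

# Files the agent may never read, no matter what the prompt asks for.
DENY_PATTERNS = [".env", ".env.*", "*.pem", "id_rsa", "id_ed25519"]

def guarded_read(path: str) -> str:
    name = Path(path).name
    if any(fnmatch(name, pattern) for pattern in DENY_PATTERNS):
        # The refusal lives here, in the harness, not in the model's reasoning.
        raise PermissionError(f"harness policy: {path} is never shown to the agent")
    return Path(path).read_text()
```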

The practitioners actually shipping production agents have converged on the same observation. The LLM is a commodity. Everyone has Claude Code. Everyone has Codex. Gemini CLI. Copilot CLI. If the model is the differentiator, there is no differentiator. What makes one team ship fast and safe with AI while the team next door ships fast and broken? The scaffolding.

Most teams don't know they're doing harness engineering. They think they're writing a CLAUDE.md file and adding a few lint rules. They're right. Those are harness components. The mistake is thinking that's a complete harness.


Three layers of agent development

Harness engineering is the third stop on a road that most teams walk in order.

  1. Prompt engineering (2022-2024): The lever was the text you put in front of the model: phrasing, few-shot examples, chain-of-thought structure. One exchange, one output. The question was: What do I say to get a better answer?
  2. Context engineering (2024-2025): As sessions got longer and tool calls got noisier, the question shifted. Document retrieval, history compression, which tool results to include and how to format them, what to drop when the window fills. The question became: What should the model be able to see right now?
  3. Harness engineering (now): Once the model is capable enough to handle a long task but still unreliable in production, the question changes again. It's no longer about the input or the information around the input. It's about the whole operational envelope: tools, guardrails, verification, recovery, and persistent memory. The question now is: What does the model need around it to do real work?

Each layer subsumes the one before it. Prompt engineering lives inside context engineering. Context engineering lives inside harness engineering.


The five concerns

Read enough of the harness engineering literature and five themes come up again and again. These are the core audit lenses used in goat-flow.

| Concern | Definition | Primary failure |
|---|---|---|
| Context | What the agent reads before it acts | Fabrication, prose bloat |
| Constraints | What the agent may never do | Destructive or irreversible actions |
| Verification | How work is checked after the agent acts | Silent regressions, unverified claims |
| Recovery | How state survives failure | Lost plot after compaction or crash |
| Feedback loop | How recurring mistakes become permanent fixes | Same bug, different day |

1. Context

The map the agent carries into a task. Instructions, files, progress notes from previous sessions. Too thin and the agent fabricates; too fat and it drowns.

A 1,000-line instruction file is a decoy dressed as context. Just as human engineers suffer from a form of "attention residue" when burdened with too much cross-chatter, models suffer when everything is marked important. Agents pattern-match locally instead of navigating intentionally, miss the key constraint buried in line 847, and the file rots faster than anyone is willing to maintain it.

Good context is lean and current. A short entry point that points to deeper sources of truth beats one monolithic manual every time.

2. Constraints

Deterministic rules that steer the agent before it acts. Linters, deny-hooks, required instruction sections, permission boundaries. Anything the model doesn't have to reason about, because the rule fires automatically. Constraints are cheaper than reasoning.

The useful framing here is enforce invariants, not implementations. Require that data is parsed at boundaries, without prescribing which library does the parsing. Require structured logging, without prescribing the log format. The harness encodes the rule; the agent chooses how to satisfy it.
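One possible shape of an invariant check, sketched as a CI gate that rejects bare `print()` output without prescribing which structured-logging library replaces it. The `src/` layout and the rule itself are assumptions for illustration.

```python
import re
import sys
from pathlib import Path

def check_no_bare_prints(root: str = "src") -> int:
    """Fail CI if application code emits bare print() output."""
    violations = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if re.match(r"\s*print\(", line):
                violations.append(f"{path}:{lineno}: bare print(); use structured logging")
    for violation in violations:
        print(violation, file=sys.stderr)
    return 1 if violations else 0

if __name__ == "__main__":
    sys.exit(check_no_bare_prints())
```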

There's a quieter kind of constraint that gets missed: what you don't give the agent. Vercel removed 80% of the tools from its internal d0 data agent and saw task success climb from 80% to 100%. Removing capability improved outcomes. More tools mean more places to get lost. A tight, task-appropriate toolset is the cheapest reliability gain on the table.
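Scope restriction is mechanically simple. A sketch, with made-up tool names and task profiles:

```python
FULL_CATALOGUE = {
    "read_file", "write_file", "run_tests", "git_commit",
    "query_warehouse", "send_email", "deploy_prod", "browse_web",
}

TASK_PROFILES = {
    "bugfix": {"read_file", "write_file", "run_tests", "git_commit"},
    "analytics": {"read_file", "query_warehouse"},
}

def tools_for(task_kind: str) -> set[str]:
    # Scope is decided by the harness up front; the model never sees deploy_prod.
    return FULL_CATALOGUE & TASK_PROFILES.get(task_kind, set())
```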

3. Verification

Structural checks the agent runs to prove its own work. Typechecking. Re-reading files after editing them. Post-action hooks that refuse to return success unless the build passes.

Models can improve their own work, but they do not start the loop on their own. LangChain's trace analysis found the most common failure mode was the agent writing a solution, re-reading its own code, confirming it looked OK, and stopping. It never wrote a test. The agent was capable of verifying its work; it simply never started. The harness has to force the loop.
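Forcing the loop can be as blunt as a gate that refuses to accept "done" until independent checks pass. A sketch, assuming a Python project where `mypy` and `pytest` happen to be the project's own check commands:

```python
import subprocess

# The project's own check commands; swap in whatever typecheck/test suite applies.
CHECKS = [
    ["python", "-m", "mypy", "."],
    ["python", "-m", "pytest", "-q"],
]

def verify_or_reject() -> bool:
    for cmd in CHECKS:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            # Feed the failure back into the agent's context instead of
            # accepting "it looks fine" as a completion signal.
            print(proc.stdout + proc.stderr)
            return False
    return True
```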

Anthropic's long-running harness work points at a related problem: the same agent that built the thing is a weak final judge of it. The fix borrows from GANs: separate the generator from the evaluator, then let adversarial tension force quality up.

There's also the "doom loop." Agents can be myopic once committed to a plan, making 10+ small variations of the same broken approach. Track edit counts per file, inject a "consider reconsidering your approach" flag after N edits, and let the agent escape its own tunnel. The harness supplies the self-awareness the model lacks.
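A minimal sketch of that counter; the threshold and the nudge wording are arbitrary choices, not a known-good setting.

```python
from collections import Counter

RECONSIDER_AFTER = 6          # arbitrary threshold; tune per project
_edit_counts: Counter[str] = Counter()

def record_edit(path: str) -> str | None:
    """Call on every file edit; returns a nudge for the agent's context when due."""
    _edit_counts[path] += 1
    if _edit_counts[path] == RECONSIDER_AFTER:
        return (f"You have edited {path} {RECONSIDER_AFTER} times. "
                "Step back and consider whether the approach itself is wrong.")
    return None
```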

4. Recovery

Session durability. What happens when the context window fills? When the agent crashes mid-task? A harness without recovery loses progress the moment anything interesting happens.

Recovery also covers a weirder failure mode. The Cognition team documented a phenomenon called "context anxiety." As the context window fills, the model becomes aware of the approaching limit and starts taking shortcuts, wrapping up tasks prematurely. Their fix was pure harness engineering: enable the 1M-token beta, cap actual usage at 200K, and trick the model into believing it has ample runway. The anxiety vanished. No model change required, just a smarter environment.
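The plainer half of recovery, surviving a crash or a compaction, comes down to keeping progress somewhere the context window can't lose it. A sketch, with an illustrative file name and fields:

```python
import json
from pathlib import Path

CHECKPOINT = Path(".agent/progress.json")   # illustrative location

def save_checkpoint(task: str, done: list[str], next_step: str) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(
        {"task": task, "done": done, "next_step": next_step}, indent=2))

def resume() -> dict | None:
    # On session start the harness injects this back into context, if present.
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else None
```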

5. Feedback loop

Persistent memory that turns every mistake into next session's input. Footguns. Lessons. Decisions. Session logs. Without this, every Tuesday the agent writes the same function in a new file, and every Tuesday you catch it.

Two corollaries:

  1. If it isn't in-context, it doesn't exist. An architectural agreement reached in a Slack thread or a senior engineer's head is invisible to a fresh session. If it isn't persisted into the repo, the agent doesn't have it.
  2. Persisted memory rots. Lessons go stale. A mature harness includes something that cleans up its own memory, a periodic garden pass that consolidates or removes entries that no longer reflect reality; one possible shape of that pass is sketched below.
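A minimal sketch of such a garden pass, assuming lessons are stored as JSON entries with a timezone-aware `last_confirmed` timestamp. The file format and the 90-day horizon are assumptions, not a recommendation.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

LESSONS = Path(".agent/lessons.json")   # illustrative location
STALE_AFTER = timedelta(days=90)        # assumed horizon

def garden_pass() -> int:
    """Drop lessons that haven't been confirmed recently; return how many were removed."""
    if not LESSONS.exists():
        return 0
    lessons = json.loads(LESSONS.read_text())
    now = datetime.now(timezone.utc)
    fresh = [entry for entry in lessons
             if now - datetime.fromisoformat(entry["last_confirmed"]) < STALE_AFTER]
    LESSONS.write_text(json.dumps(fresh, indent=2))
    return len(lessons) - len(fresh)
```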

Why this compounds

Harness engineering compounds in two directions.

First, across sessions. A good model without a harness gets you a good session. A good model with a good harness gets you a good session, and every next session is a little better, because the feedback loop captured what went wrong and the extended memory stored what was decided.

Second, across model releases. Teams that over-engineer hand-coded control flow into their agents watch their systems break on every upgrade. A thin harness abstracts the infrastructure away from the model. When the next major model drops, you swap the CPU without rewriting the OS.

This is why opinionated, minimal harnesses beat thick framework-heavy ones. The more you build into the harness, the more that has to survive the next model release.

But there is nuance: principles transfer, tunings don't. Structural changes (better verification loops, tighter context management) transfer cleanly across models. Exact prompt wording and precise tool descriptions do not. LangChain's configuration that scored 66.5% with Codex scored 59.6% with Claude Opus running the exact same setup. Every new model gets its own round of harness iteration. Budget for it.

Every component in a well-built harness is waiting to be made redundant by a smarter model. Each one encodes a current model limitation. The first question after any major model upgrade is what can be removed. Build components that are designed to be deleted.

The harness is what you control across model releases, prompt fashions, tool changes, and platform shifts. It's where the craft lives.


How goat-flow implements the five concerns

goat-flow is the opinionated harness for teams shipping with Claude Code, Codex, Gemini CLI, and Copilot CLI. It implements the five concerns as a single CLI that audits, scores, and installs everything an agent needs to operate reliably.

The CLI audits every installed harness and scores it across the five concerns, so you know exactly where the gaps are before the agent starts working.