Mitchell Hashimoto calls it harness engineering. Birgitta Böckeler writes about it on martinfowler.com. Anthropic's own engineering posts keep circling back to the same idea. HumanLayer is building products around it. The vocabulary is still settling, but the thing itself is now clear enough to have a definition.

Harness engineering is the practice of shaping what an AI coding agent sees, what it may do, how its work is checked, how state survives failure, and how recurring mistakes become structural fixes.

The case for the harness as the binding constraint has stacked up fast. Can Bölük changed only the edit-tool format on a coding benchmark, using the same model throughout, and watched one configuration go from 6.7% to 68.3%. A ten-fold improvement. No model change. Just a different way of expressing the same action to the same model.

If that doesn't sound like the model is the bottleneck, it's because it isn't.

The industry consensus

In February 2026, OpenAI's Codex team published a post about a five-month experiment: shipping a product to internal users with zero human-written code. About a million lines of code, roughly 1,500 pull requests, three engineers steering agents. Their post-mortem reads like independent confirmation of the harness literature. The same concerns. The same observations:

"Give the agent a map, not a 1,000-page instruction manual. Enforce invariants rather than specific implementations. Anything the agent can't see in its context at runtime effectively doesn't exist." - OpenAI Codex team

The team's own summary of why the first few months were slow is worth quoting directly: "Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified."

The same week, LangChain published a harness experiment with harder numbers. They took their coding agent from Top 30 to Top 5 on Terminal Bench 2.0, a 13.7-point jump from 52.8 to 66.5. The model stayed the same: GPT-5.2-Codex, start to finish. Only the harness changed. If you want one empirical data point for "the harness is the product, not the model," that's it.

Stripe is running the pattern at production scale. Their Minions system ships over a thousand pull requests per week through a hybrid architecture: deterministic harness nodes for CI and linting, agentic nodes for implementation, a harness-enforced limit of two CI rounds before escalation to a human, and, per agent, a curated subset drawn from a catalogue of over 400 MCP tools rather than the full set. The tool curation isn't decoration. Over-tooling degrades performance, so the harness solves it with scope restriction instead of asking the model to restrain itself.
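The escalation budget is the kind of rule a harness can enforce in a dozen lines. Here is a minimal sketch of the generic pattern, not Stripe's code: `run_ci`, `ask_agent_to_fix`, and `escalate_to_human` are hypothetical stand-ins for whatever your pipeline actually calls.

```python
from dataclasses import dataclass

@dataclass
class CIResult:
    passed: bool
    log: str = ""

MAX_CI_ROUNDS = 2  # harness-enforced budget before a human takes over

def run_with_escalation(task, run_ci, ask_agent_to_fix, escalate_to_human):
    result = run_ci(task)                      # deterministic node: run the pipeline
    rounds = 0
    while not result.passed and rounds < MAX_CI_ROUNDS:
        ask_agent_to_fix(task, result.log)     # agentic node: one repair attempt
        result = run_ci(task)                  # re-check deterministically
        rounds += 1
    if result.passed:
        return result
    return escalate_to_human(task, result)     # budget spent; the harness stops the loop
```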

Three teams, three architectures. Same finding: the model is not the binding constraint. The harness is.

What a harness is (and isn't)

A harness is the system around the model. The files it can read. The commands it can run. The rules it must obey. The extended memory it keeps across sessions. It's the thing that turns a language model into an agent you can trust enough to walk away from.

Worth clearing up what a harness is not. Frameworks like LangChain, AutoGen, or LlamaIndex are the SDKs you build an agent with. A harness is the runtime that executes it, enforces the hardcoded boundaries, and persists state. You can build a harness with a framework, without one, or directly on your agent CLI of choice. goat-flow is a harness.

If it helps, think of it as an operating system. The model is the CPU. The context window is the RAM. The harness is the OS. The agent is the application running on top. The model doesn't reason about whether it's allowed to read .env. The OS simply refuses the syscall.
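To make the syscall analogy concrete, here is a minimal sketch of that kind of refusal, assuming a harness that routes every file read through its own guard. The deny patterns and the `guarded_read` name are illustrative, not any particular CLI's API.

```python
from fnmatch import fnmatch
from pathlib import Path

# Files the agent may never read, no matter what the prompt asks for.
DENY_PATTERNS = [".env", ".env.*", "*.pem", "id_rsa", "id_ed25519"]

def guarded_read(path: str) -> str:
    name = Path(path).name
    if any(fnmatch(name, pattern) for pattern in DENY_PATTERNS):
        # The refusal lives here, in the harness, not in the model's reasoning.
        raise PermissionError(f"harness policy: {path} is never shown to the agent")
    return Path(path).read_text()
```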

The practitioners actually shipping production agents have converged on the same observation. The LLM is a commodity. Everyone has Claude Code. Everyone has Codex. Gemini CLI. Copilot CLI. If the model is the differentiator, there is no differentiator. What makes one team ship fast and safe with AI while the team next door ships fast and broken? The scaffolding.

Most teams don't know they're doing harness engineering. They think they're writing a CLAUDE.md file and adding a few lint rules. They're right. Those are harness components. The mistake is thinking that's a complete harness.


Three layers of agent development

Harness engineering is the third stop on a road that most teams walk in order.

  1. Prompt engineering (2022-2024): The lever was the text you put in front of the model: phrasing, few-shot examples, chain-of-thought structure. One exchange, one output. The question was: What do I say to get a better answer?
  2. Context engineering (2024-2025): As sessions got longer and tool calls got noisier, the question shifted. Document retrieval, history compression, which tool results to include and how to format them, what to drop when the window fills. The question became: What should the model be able to see right now?
  3. Harness engineering (now): Once the model is capable enough to handle a long task but still unreliable in production, the question changes again. It's no longer about the input or the information around the input. It's about the whole operational envelope: tools, guardrails, verification, recovery, and persistent memory. The question now is: What does the model need around it to do real work?

Each layer subsumes the one before it. Prompt engineering lives inside context engineering. Context engineering lives inside harness engineering.


The five concerns

Read enough of the harness engineering literature and five themes come up again and again. These are the core audit lenses used in goat-flow.

| Concern | Definition | Primary failure |
|---|---|---|
| Context | What the agent reads before it acts | Fabrication, prose bloat |
| Constraints | What the agent may never do | Destructive or irreversible actions |
| Verification | How work is checked after the agent acts | Silent regressions, unverified claims |
| Recovery | How state survives failure | Lost plot after compaction or crash |
| Feedback loop | How recurring mistakes become permanent fixes | Same bug, different day |

1. Context

The map the agent carries into a task. Instructions, files, progress notes from previous sessions. Too thin and the agent fabricates; too fat and it drowns.

A 1,000-line instruction file is a decoy dressed as context. Just as human engineers suffer from a form of "attention residue" when burdened with too much cross-chatter, models suffer when everything is marked important. Agents pattern-match locally instead of navigating intentionally, miss the key constraint buried in line 847, and the file rots faster than anyone is willing to maintain it.

Good context is lean and current. A short entry point that points to deeper sources of truth beats one monolithic manual every time.

2. Constraints

Deterministic rules that steer the agent before it acts. Linters, deny-hooks, required instruction sections, permission boundaries. Anything the model doesn't have to reason about, because the rule fires automatically. Constraints are cheaper than reasoning.

The useful framing here is enforce invariants, not implementations. Require that data is parsed at boundaries, without prescribing which library does the parsing. Require structured logging, without prescribing the log format. The harness encodes the rule; the agent chooses how to satisfy it.
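One possible shape of an invariant check, sketched as a CI gate that rejects bare `print()` output without prescribing which structured-logging library replaces it. The `src/` layout and the rule itself are assumptions for illustration.

```python
import re
import sys
from pathlib import Path

def check_no_bare_prints(root: str = "src") -> int:
    """Fail CI if application code emits bare print() output."""
    violations = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if re.match(r"\s*print\(", line):
                violations.append(f"{path}:{lineno}: bare print(); use structured logging")
    for violation in violations:
        print(violation, file=sys.stderr)
    return 1 if violations else 0

if __name__ == "__main__":
    sys.exit(check_no_bare_prints())
```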

There's a quieter kind of constraint that gets missed: what you don't give the agent. Vercel removed 80% of the tools from its internal d0 data agent and saw task success climb from 80% to 100%. Removing capability improved outcomes. More tools mean more places to get lost. A tight, task-appropriate toolset is the cheapest reliability gain on the table.
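Scope restriction is mechanically simple. A sketch, with made-up tool names and task profiles:

```python
FULL_CATALOGUE = {
    "read_file", "write_file", "run_tests", "git_commit",
    "query_warehouse", "send_email", "deploy_prod", "browse_web",
}

TASK_PROFILES = {
    "bugfix": {"read_file", "write_file", "run_tests", "git_commit"},
    "analytics": {"read_file", "query_warehouse"},
}

def tools_for(task_kind: str) -> set[str]:
    # Scope is decided by the harness up front; the model never sees deploy_prod.
    return FULL_CATALOGUE & TASK_PROFILES.get(task_kind, set())
```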

3. Verification

Structural checks the agent runs to prove its own work. Typechecking. Re-reading files after editing them. Post-action hooks that refuse to return success unless the build passes.

Models can improve their own work, but they do not start the loop on their own. LangChain's trace analysis found the most common failure mode was the agent writing a solution, re-reading its own code, confirming it looked OK, and stopping. It never wrote a test. The agent was capable of verifying its work; it simply never started. The harness has to force the loop.
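Forcing the loop can be as blunt as a gate that refuses to accept "done" until independent checks pass. A sketch, assuming a Python project where `mypy` and `pytest` happen to be the project's own check commands:

```python
import subprocess

# The project's own check commands; swap in whatever typecheck/test suite applies.
CHECKS = [
    ["python", "-m", "mypy", "."],
    ["python", "-m", "pytest", "-q"],
]

def verify_or_reject() -> bool:
    for cmd in CHECKS:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            # Feed the failure back into the agent's context instead of
            # accepting "it looks fine" as a completion signal.
            print(proc.stdout + proc.stderr)
            return False
    return True
```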

Anthropic's long-running harness work points at a related problem: the same agent that built the thing is a weak final judge of it. The fix borrows from GANs: separate the generator from the evaluator, then let adversarial tension force quality up.

There's also the "doom loop." Agents can be myopic once committed to a plan, making 10+ small variations of the same broken approach. Track edit counts per file, inject a "consider reconsidering your approach" flag after N edits, and let the agent escape its own tunnel. The harness supplies the self-awareness the model lacks.
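A minimal sketch of that counter; the threshold and the nudge wording are arbitrary choices, not a known-good setting.

```python
from collections import Counter

RECONSIDER_AFTER = 6          # arbitrary threshold; tune per project
_edit_counts: Counter[str] = Counter()

def record_edit(path: str) -> str | None:
    """Call on every file edit; returns a nudge for the agent's context when due."""
    _edit_counts[path] += 1
    if _edit_counts[path] == RECONSIDER_AFTER:
        return (f"You have edited {path} {RECONSIDER_AFTER} times. "
                "Step back and consider whether the approach itself is wrong.")
    return None
```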

4. Recovery

Session durability. What happens when the context window fills? When the agent crashes mid-task? A harness without recovery loses progress the moment anything interesting happens.

Recovery also covers a weirder failure mode. The Cognition team documented a phenomenon called "context anxiety." As the context window fills, the model becomes aware of the approaching limit and starts taking shortcuts, wrapping up tasks prematurely. Their fix was pure harness engineering: enable the 1M-token beta, cap actual usage at 200K, and trick the model into believing it has ample runway. The anxiety vanished. No model change required, just a smarter environment.
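The plainer half of recovery, surviving a crash or a compaction, comes down to keeping progress somewhere the context window can't lose it. A sketch, with an illustrative file name and fields:

```python
import json
from pathlib import Path

CHECKPOINT = Path(".agent/progress.json")   # illustrative location

def save_checkpoint(task: str, done: list[str], next_step: str) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(
        {"task": task, "done": done, "next_step": next_step}, indent=2))

def resume() -> dict | None:
    # On session start the harness injects this back into context, if present.
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else None
```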

5. Feedback loop

Persistent memory that turns every mistake into next session's input. Footguns. Lessons. Decisions. Session logs. Without this, every Tuesday the agent writes the same function in a new file, and every Tuesday you catch it.

Two corollaries:

  1. If it isn't in-context, it doesn't exist. An architectural agreement reached in a Slack thread or a senior engineer's head is invisible to a fresh session. If it isn't persisted into the repo, the agent doesn't have it.
  2. Persisted memory rots. Lessons go stale. A mature harness includes something that cleans up its own memory, a periodic garden pass that consolidates or removes entries that no longer reflect reality; one possible shape of that pass is sketched below.
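A minimal sketch of such a garden pass, assuming lessons are stored as JSON entries with a timezone-aware `last_confirmed` timestamp. The file format and the 90-day horizon are assumptions, not a recommendation.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

LESSONS = Path(".agent/lessons.json")   # illustrative location
STALE_AFTER = timedelta(days=90)        # assumed horizon

def garden_pass() -> int:
    """Drop lessons that haven't been confirmed recently; return how many were removed."""
    if not LESSONS.exists():
        return 0
    lessons = json.loads(LESSONS.read_text())
    now = datetime.now(timezone.utc)
    fresh = [entry for entry in lessons
             if now - datetime.fromisoformat(entry["last_confirmed"]) < STALE_AFTER]
    LESSONS.write_text(json.dumps(fresh, indent=2))
    return len(lessons) - len(fresh)
```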

Why this compounds

Harness engineering compounds in two directions.

First, across sessions. A good model without a harness gets you a good session. A good model with a good harness gets you a good session, and every next session is a little better, because the feedback loop captured what went wrong and the extended memory stored what was decided.

Second, across model releases. Teams that over-engineer hand-coded control flow into their agents watch their systems break on every upgrade. A thin harness abstracts the infrastructure away from the model. When the next major model drops, you swap the CPU without rewriting the OS.

This is why opinionated, minimal harnesses beat thick framework-heavy ones. The more you build into the harness, the more that has to survive the next model release.

But there is nuance: principles transfer, tunings don't. Structural changes (better verification loops, tighter context management) transfer cleanly across models. Exact prompt wording and precise tool descriptions do not. LangChain's configuration that scored 66.5% with Codex scored 59.6% with Claude Opus running the exact same setup. Every new model gets its own round of harness iteration. Budget for it.

Every component in a well-built harness is waiting to be made redundant by a smarter model. Each one encodes a current model limitation. The first question after any major model upgrade is what can be removed. Build components that are designed to be deleted.

The harness is what you control across model releases, prompt fashions, tool changes, and platform shifts. It's where the craft lives.


How goat-flow implements the five concerns

goat-flow is the opinionated harness for teams shipping with Claude Code, Codex, Gemini CLI, and Copilot CLI. It implements the five concerns as a single CLI that audits, scores, and installs everything an agent needs to operate reliably.

The CLI audits every installed harness and scores it across the five concerns, so you know exactly where the gaps are before the agent starts working.