AI harness engineering

Guardrails and memory
for your AI coding agent.

AI coding agents can skip verification, "accidentally" run harmful commands, and repeat the same mistakes at the worst time. That's not a prompting issue; it's a harness problem. goat-flow is an opinionated harness for Claude Code, Codex, Gemini CLI, and Copilot CLI.

$ npm install -g goat-flow

Terminal output showing goat-flow audit results: 12 of 12 setup checks passing, harness scores for Claude Code (94%), Codex (91%), Gemini CLI (87%), and Copilot CLI (85%), plus five-concern coverage for context, constraints, verification, recovery, and feedback loop.

Why harness engineering?

The model isn't the product. The harness is.

Every serious practitioner has converged on the same insight: the LLM is a commodity, the scaffolding around it isn't. Files it can read, commands it can run, rules it must obey, memory it keeps across sessions. That's the harness. goat-flow gives you one, opinionated, out of the box.

Supports Claude Code, Codex, Gemini CLI, and Copilot CLI
The system

Four pieces. One harness.

Audit tells you what's missing. Skills give the agent workflows. Hooks stop dangerous actions. The learning loop remembers what happened.

01 / Audit

Pass/fail checks, no wiggle room

Validates every file, skill, and hook the agent needs. Either it's installed or it isn't. Three scopes: goat-flow setup, per-agent configuration, and harness completeness across the five concerns.

goat-flow audit --harness
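
To make "either it's installed or it isn't" concrete, here is a minimal Python sketch of a pass/fail check. It is not goat-flow's implementation: `Check`, `run_checks`, and `EXAMPLE_CHECKS` are hypothetical names, and the only details taken from this page are the `.goat-flow` directory paths.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Check:
    name: str  # human-readable label for the report
    path: str  # path that must exist, relative to the project root


def run_checks(root: Path, checks: list[Check]) -> dict[str, bool]:
    """Return a strict pass/fail result per check: the path exists or it does not."""
    return {check.name: (root / check.path).exists() for check in checks}


# Hypothetical check list (an assumption); the real audit defines its own.
EXAMPLE_CHECKS = [
    Check("hooks directory", ".goat-flow/hooks"),
    Check("lessons directory", ".goat-flow/lessons"),
    Check("footguns directory", ".goat-flow/footguns"),
    Check("decisions directory", ".goat-flow/decisions"),
]
```

Calling `run_checks(Path("."), EXAMPLE_CHECKS)` returns one boolean per check, with no partial credit.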
02 / Skills

Structured slash commands

Seven workflows with defined phases, named artifacts, and stopping points. Debug, plan, review, critique, security, QA - plus a dispatcher that routes your intent to the right skill.

/goat, /goat-debug, /goat-plan...
03 / Hooks

Safety nets that can't be skipped

Pre-action guards fire before the agent can hurt anything; post-action guards check the result afterwards. deny-dangerous ships by default, blocking rm -rf, force-push, secret exfiltration, and six other patterns.

.goat-flow/hooks/
04 / Learning loop

Persistent memory across sessions

Four kinds of records turn every mistake into next session's context. Footguns, lessons, decisions, session logs. The compounding bet: every session that hits a problem makes the next one harder to trip.

.goat-flow/lessons, /footguns, /decisions
Under the hood

The execution loop

Every agent action follows four steps. Each one prevents a specific failure mode that free-running agents reliably hit.

READ

Load the files first

Pull in the actual code before reasoning about it.

Prevents fabrication - inventing APIs that don't exist.
SCOPE

Declare what changes

List files that will be touched, and files that won't.

Prevents scope creep - editing files the task never asked for.
ACT

Make the change

Edit only within the declared scope. Nothing else.

Prevents off-target edits - changes made because they seemed related.
VERIFY

Prove it works

Run linters, re-read changed files, confirm nothing drifted.

Prevents silent breakage - passing the task but breaking the build.
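
The SCOPE and ACT steps boil down to a set comparison: what was declared versus what was actually touched. A minimal Python sketch, assuming both are available as sets of file paths (`out_of_scope` and `verify_scope` are hypothetical helpers, not part of goat-flow):

```python
def out_of_scope(declared: set[str], touched: set[str]) -> set[str]:
    """Files that were edited but never declared in the SCOPE step."""
    return touched - declared


def verify_scope(declared: set[str], touched: set[str]) -> None:
    """Raise if the ACT step wandered outside the declared scope."""
    extra = out_of_scope(declared, touched)
    if extra:
        raise ValueError(f"edited outside declared scope: {sorted(extra)}")
```

A check like this could run as part of VERIFY, so an off-target edit fails loudly instead of slipping through.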
Seven skills

Workflows, not suggestions.

Free-form prompting is how agents get lost. Skills are structured slash commands with defined phases and clear stopping points. Use /goat as the default entry point and it routes to the right one.

/goat-debug     Debug     Diagnose bugs, explore code, investigate unfamiliar areas
/goat-plan      Plan      Plan features, refactors, and milestones with complexity routing
/goat-review    Review    Review diffs and audit code quality with negative verification
/goat-critique  Critique  Multi-lens critique to surface blind spots before shipping
/goat-security  Security  Threat model, dependency audit, and compliance checks
/goat-qa        QA        Generate test plans with automated, AI-verified, and manual steps
Hooks

Block dangerous actions before they run.

Ships with sensible defaults

deny-dangerous catches the patterns agents hit most often when they go off-script: destructive filesystem commands, force-pushes, secret file reads, subshell escapes, and database truncation.

Extend with your own

Drop linters, format-on-save, custom validators, or project-specific rules into the hooks directory. They register automatically and run in parallel with the defaults.

deny-dangerous Pre-action
✗ rm -rf            destructive
✗ git push --force  history rewrite
✗ cat .env          secret read
✗ curl | sh         exfiltration
✗ eval, bash -c     subshell escape
✗ DROP TABLE        data loss
✗ > file            truncation
✗ $(...)            recursive sub
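
A pre-action deny hook can be sketched as a list of patterns checked against the command before it runs. The Python below is an illustration under stated assumptions, not the shipped deny-dangerous hook: the regular expressions, `DENY_PATTERNS`, and `check_command` are all hypothetical, and they cover only a few of the patterns listed above.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions); the labels mirror the list above.
DENY_PATTERNS = [
    # rm with both -r and -f in one flag group, in either order
    (re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"), "destructive"),
    (re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"), "history rewrite"),
    (re.compile(r"\bcat\s+\S*\.env\b"), "secret read"),
    (re.compile(r"\bcurl\b.*\|\s*(sh|bash)\b"), "exfiltration"),
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), "data loss"),
]


def check_command(command: str) -> Optional[str]:
    """Return the reason a command is denied, or None if no pattern matches."""
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    return None
```

Regex matching like this is easy to evade with quoting or indirection, so treat it as a first line of defence rather than a sandbox.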
Learning loop

The harness gets smarter every session.

Agents forget everything between runs. Four kinds of persistent records make sure the same mistake doesn't happen twice.

Footguns

Architectural traps captured with file:line evidence. Stops the agent from hitting the same code landmine twice.

.goat-flow/footguns/

Lessons

Behavioural mistakes the agent made - logged so the same error pattern is recognised and avoided next time.

.goat-flow/lessons/

Decisions

Architecture Decision Records. Captures why a choice was made so future agents don't quietly reverse it.

.goat-flow/decisions/

Session logs

End-of-session summaries provide continuity between work sessions - across agents, across days, across context compactions.

.goat-flow/logs/sessions/
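
As a rough picture of how a record becomes next session's context, the Python sketch below writes a footgun to disk and reads all of them back. The `.goat-flow/footguns/` path comes from this page. The `Footgun` fields, the one-file-per-record Markdown layout, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Footgun:
    slug: str      # short identifier, used as the file name
    summary: str   # what the trap is
    evidence: str  # file:line reference, e.g. "src/db.py:42"


def save_footgun(root: Path, footgun: Footgun) -> Path:
    """Write one record under .goat-flow/footguns/ and return its path."""
    directory = root / ".goat-flow" / "footguns"
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{footgun.slug}.md"
    target.write_text(
        f"{footgun.summary}\n\nEvidence: {footgun.evidence}\n", encoding="utf-8"
    )
    return target


def load_footguns(root: Path) -> list[str]:
    """Read every saved record so it can be fed into the next session's context."""
    directory = root / ".goat-flow" / "footguns"
    return [p.read_text(encoding="utf-8") for p in sorted(directory.glob("*.md"))]
```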
Background

The five concerns of AI harness engineering.

The common ground across the public harness engineering literature. goat-flow scores every installed harness against these five.

Context: Give the agent a map, not a 1,000-page manual. Concise instructions, the right files, progress notes across sessions.
Constraints: Deterministic rules that steer before the agent acts. Linters, deny-hooks, permissions, required sections.
Verification: Structural checks the agent runs to prove its own work. Tests, typecheck, post-action hooks, back-pressure.
Recovery: Session durability and restart paths. Checkpoint and resume, compaction handlers, milestone checkboxes, loop detection.
Feedback loop: Capture every mistake as persistent context so the next session doesn't repeat it. Footguns, lessons, decisions, logs.

Sources: Mitchell Hashimoto, Birgitta Böckeler (martinfowler.com), Anthropic engineering, and HumanLayer. goat-flow synthesises these into a working system with strong defaults, rather than a framework you have to assemble yourself.
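
Scoring a harness against the five concerns could be as simple as counting passing checks. The Python sketch below assumes an unweighted percentage over per-concern pass/fail results. That formula and the `harness_score` name are assumptions, since this page does not say how goat-flow weights its checks.

```python
CONCERNS = ("context", "constraints", "verification", "recovery", "feedback loop")


def harness_score(results: dict[str, list[bool]]) -> int:
    """Percentage of passing checks across the five concerns, rounded to an integer."""
    checks = [passed for concern in CONCERNS for passed in results.get(concern, [])]
    if not checks:
        return 0  # nothing was checked, so nothing is covered
    return round(100 * sum(checks) / len(checks))
```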

Get started

Three commands and you're running.

Install globally, set it up on any project, and start running skills through your agent of choice.

1. npm install -g goat-flow
2. goat-flow setup . --agent claude
3. goat-flow audit --harness

Supports Claude Code, Codex, Gemini CLI, and Copilot CLI. Read the CLI docs →