AI Engineering · April 1, 2026 · 11 min read

iSkill: Designing a Harness for Autonomous AI Agents That Actually Ship

How we built a GAN-inspired harness — planner, builder, and adversarial evaluator in a structured loop — that turns Claude Code into a deterministic software factory. The methodology behind spec-driven development, the Ralph Loop, and Beads.

#harness-design #claude-code #senna #beads #ralph-loop #autonomous-agents

Anthropic published a blog post in March 2026 on harness design for long-running AI development — the idea that autonomous coding agents need more than a good model. They need structured orchestration: a planner that decomposes work, a builder that executes, and an evaluator that catches what the builder misses. The architecture is GAN-inspired — adversarial pressure between builder and evaluator produces better output than either alone.

I've been building real systems around this pattern. iSkill is a suite of Claude Code skills that implements this harness as a repeatable engineering methodology. Not a framework you install — a set of slash commands that turn any project into a structured, build-evaluate-fix pipeline with deterministic outcomes.

The question that drove this work: can you make AI agents reliable enough that you'd trust them to ship production code without a human reviewing every line? The answer is yes — but only if you stop treating the agent as a solo developer and start treating it as a system of specialized agents with adversarial incentives.

The problem with single-agent development

Watch how most people use AI coding assistants today. They describe a feature, the AI writes code, they eyeball it, maybe run it, and move on. This works for small tasks. It fails catastrophically for anything substantial, for a predictable reason: the agent that writes the code is the same agent that judges whether it's correct.

LLMs are generous self-evaluators. Ask Claude to write a function and then ask it to review that function — it will find the code excellent. This isn't a model flaw. It's a structural problem. The builder has context about its own intentions that makes it blind to its own mistakes. The same cognitive bias exists in human developers, which is why code review exists.

The fix isn't better prompting. It's separation of concerns: the agent that builds should not be the agent that evaluates.

The harness architecture

iSkill implements five skills that form a complete pipeline:

/spec-interview  →  Detailed spec (interactive interview)
       ↓
/senna-criteria  →  Grading criteria for this project
       ↓
/senna-plan      →  Flat atomic tasks with dependencies
       ↓
/pre-race        →  Builder + Evaluator negotiate contract
       ↓
/senna-drive     →  Build with GAN-style evaluate→fix loop

Each skill is a Claude Code slash command — a markdown file that expands into a structured prompt when invoked. They compose into a pipeline but work independently. You can use /senna-drive without /pre-race, or /senna-qa without the rest of the harness.

/spec-interview — Turning vague ideas into executable specs

Most project failures trace back to ambiguous specifications. "Build a blog" means different things to different people. /spec-interview conducts a structured interview, asking specific questions about scope, requirements, edge cases, and non-requirements. The output is a spec document detailed enough to decompose into tasks.

The key design decision: the interview is convergent, not divergent. It doesn't ask open-ended questions that expand scope. It asks binary and multiple-choice questions that narrow scope: "Do you need authentication? Yes/No." "Which blog format: MDX, Markdown, or CMS?" "Is pagination needed for the first version?"

/senna-criteria — Defining what "good" means before you build

Before any code exists, you define grading criteria. Not vague quality aspirations — weighted, calibrated rubrics with hard thresholds:

## Functionality (Weight: 5, Threshold: 7/10)
Does it work when you actually use it?

Calibration:
- Score 2: Features exist in code but crash or produce wrong output
- Score 5: Happy path works, edge cases fail
- Score 8: All specified workflows complete, edge cases handled

## Code Quality (Weight: 3, Threshold: 6/10)
Would a senior engineer approve this PR?

## Design Quality (Weight: 4, Threshold: 7/10)  [UI projects]
Visual coherence. Penalizes generic AI aesthetic patterns.

The threshold is the critical concept. A criterion with a score below its threshold causes the entire evaluation to FAIL — regardless of how high other scores are. This prevents the common failure where good scores in easy categories mask failure in hard ones.
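The threshold rule is simple to state in code. A minimal sketch, assuming a rubric shaped like the one above — the `Criterion` class and `evaluate` function here are illustrative, not iSkill's actual interfaces:

```python
# Sketch of threshold-gated weighted scoring. Any criterion under its
# threshold fails the whole evaluation, regardless of the weighted average.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: int      # relative importance in the overall score
    threshold: int   # hard floor: scoring below this fails everything
    score: int       # evaluator's 0-10 score

def evaluate(criteria: list[Criterion]) -> tuple[str, float]:
    if any(c.score < c.threshold for c in criteria):
        return "FAIL", 0.0
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.score * c.weight for c in criteria) / total_weight
    return "PASS", round(weighted, 2)

# A 9/10 in Code Quality cannot mask a 5/10 in Functionality:
result = evaluate([
    Criterion("Functionality", weight=5, threshold=7, score=5),
    Criterion("Code Quality", weight=3, threshold=6, score=9),
])
# result == ("FAIL", 0.0)
```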

/senna-plan — Atomic tasks with dependency graphs

/senna-plan decomposes the spec into flat, atomic tasks. Each task has:

  • A clear scope (one endpoint, one component, one integration)
  • Acceptance criteria (specific tests that must pass)
  • Dependencies (what must be built first)
  • A verification command

The dependency graph is tracked in Beads — a graph-based issue tracker purpose-built for AI agent workflows. Beads maintains the task graph outside any agent's context, which matters when work spans multiple sessions:

bd list

○ iWebsite-u6h  [epic] Personal Portfolio Website
├── ✓ iWebsite-u6h.1  Project Setup
├── ✓ iWebsite-u6h.2  Core Layout
├── ○ iWebsite-375s   Flagship Video: Hakimo AI
│   → depends on: iWebsite-52er
├── ○ iWebsite-52er   Remotion Video Template
│   → depends on: iWebsite-u6h.1
└── ...

The dependency graph isn't decoration. It's enforced — no agent can start a task whose dependencies aren't complete. This prevents the most common long-horizon failure: building features on top of foundations that don't exist yet.
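The enforcement rule itself is small. A sketch, assuming a dependency dict mirroring the Beads graph above — the `ready_tasks` helper is hypothetical, not part of the `bd` CLI:

```python
# A task is ready only when it isn't done and every dependency is complete.
deps = {
    "iWebsite-u6h.1": [],
    "iWebsite-u6h.2": ["iWebsite-u6h.1"],
    "iWebsite-52er": ["iWebsite-u6h.1"],
    "iWebsite-375s": ["iWebsite-52er"],
}

def ready_tasks(deps: dict[str, list[str]], done: set[str]) -> set[str]:
    """Tasks whose dependencies are all complete and that aren't done yet."""
    return {t for t, reqs in deps.items()
            if t not in done and all(r in done for r in reqs)}

# With setup and layout done, the video template unblocks, but the
# flagship video stays blocked behind it:
print(ready_tasks(deps, done={"iWebsite-u6h.1", "iWebsite-u6h.2"}))
# {'iWebsite-52er'}
```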

/pre-race — Contract negotiation before building

This is where the GAN-inspired architecture begins. Before any code is written, the builder and evaluator negotiate a contract:

  1. Builder proposes: What it will build, with numbered testable criteria
  2. Evaluator challenges: Are criteria specific enough? Missing edge cases? Integration gaps?
  3. They iterate via teammate messaging until both agree
  4. Output: CONTRACT.md with 15-30 testable criteria

The evaluator's challenge is structurally adversarial. It's not trying to approve the plan — it's trying to find gaps:

  • "Criterion #12 says 'chatbot responds with relevant details.' What counts as relevant? Rewrite as: 'response contains mention of Kubernetes, Docker, or orchestration with specific resume details.'"
  • "No criterion verifies the chat widget is actually mounted in the root layout. A component that exists but isn't wired into the production code path passes all other criteria but doesn't work."
  • "7 custom Remotion videos in one batch is too much scope. Reduce to 3 flagship + 4 template-based."

That last point — scope management — is something evaluators do better than builders. Builders are optimistic about what they can deliver. Evaluators are skeptical. The negotiation produces realistic contracts.

/senna-drive — The GAN loop

This is the execution engine. It runs the full build-evaluate-fix cycle:

Build phase:

  • Reads the task graph from Beads
  • Groups independent tasks for parallel execution (Claude Code teams)
  • Each builder teammate follows TDD: write test → verify it fails → implement → verify it passes → run full suite → no regressions → commit
  • Tasks are executed in dependency order
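Grouping independent tasks for parallel execution amounts to scheduling the dependency graph in waves. A sketch under the same assumptions — a plain dependency dict, with `schedule_waves` as an illustrative helper rather than iSkill's scheduler:

```python
# Each wave contains tasks whose dependencies were satisfied by earlier
# waves, so everything within a wave can run in parallel.
def schedule_waves(deps: dict[str, list[str]]) -> list[set[str]]:
    done: set[str] = set()
    waves: list[set[str]] = []
    remaining = set(deps)
    while remaining:
        wave = {t for t in remaining if all(r in done for r in deps[t])}
        if not wave:
            raise ValueError("dependency cycle detected")
        waves.append(wave)
        done |= wave
        remaining -= wave
    return waves

deps = {
    "setup": [],
    "layout": ["setup"],
    "video-template": ["setup"],
    "flagship-video": ["video-template"],
}
waves = schedule_waves(deps)
# Wave 1: setup alone; wave 2: layout and video-template in parallel;
# wave 3: flagship-video once its template exists.
```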

Evaluate phase:

  • After a batch completes, spawns an evaluator teammate running /senna-qa
  • Evaluator tests the running application, not just the code
  • Tests against both CRITERIA.md (scored) and CONTRACT.md (pass/fail)
  • Produces a structured QA report with bugs, scores, and verdict

Fix phase (if FAIL):

  • Builder receives the evaluator's prioritized fix list
  • Fixes issues in priority order
  • Evaluator re-tests
  • Up to 3 rounds — if still failing, escalates to user

Build batch → Evaluate → PASS? → Next batch
                 ↓
               FAIL
                 ↓
           Fix issues → Re-evaluate → PASS? → Next batch
                             ↓
                           FAIL
                             ↓
                      Fix again → Re-evaluate → PASS? → Next batch
                                       ↓
                                     FAIL
                                       ↓
                                Escalate to user

The 3-round limit is deliberate. Infinite loops waste context. Three rounds is enough for the builder to address genuine issues. If it can't fix the problems in three rounds, the issues are likely architectural and need human judgment.
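The loop's control flow, including the 3-round cap, can be sketched as follows. The `build`, `evaluate`, and `fix` callables stand in for the builder and evaluator teammates; this is an illustration of the structure, not iSkill's actual code:

```python
# Evaluate→fix loop with a hard round limit. After three failed
# evaluations, the batch escalates instead of looping forever.
MAX_ROUNDS = 3

def drive_batch(build, evaluate, fix) -> str:
    build()
    for round_num in range(1, MAX_ROUNDS + 1):
        report = evaluate()              # evaluator tests the running app
        if report["verdict"] == "PASS":
            return "NEXT_BATCH"
        if round_num < MAX_ROUNDS:
            fix(report["issues"])        # builder works the prioritized list
    # Three failed rounds usually means an architectural problem,
    # not a fixable bug: hand it back to a human.
    return "ESCALATE"
```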

The evaluator: adversarial by design

The evaluator is the most important component in the harness, and the most counterintuitive. Its prompt is deliberately one-sided:

NEVER praise work that has obvious issues. Lead with problems.
NEVER say "minor issue" for something that breaks user workflow.
NEVER approve display-only or stub features. If it looks like it
works but doesn't actually do anything → FAIL.
NEVER accept "0 tests ran" as a passing test suite.
Test like a skeptical user, not a supportive colleague.

These aren't guidelines — they're hardcoded anti-generosity rules. Without them, the evaluator defaults to the LLM's natural agreeableness, producing evaluations like "Overall excellent work with a few minor suggestions." That evaluation is useless. It tells you nothing about whether the software actually works.

The most valuable evaluator output is the integration wiring check — verifying that every feature function is actually called from the production entry point. Functions that exist but are never wired into the main code path are the number one post-build bug category. The evaluator greps for call sites:

"Chat widget component exists in components/ChatWidget.tsx but
is never imported in app/layout.tsx. The widget will not appear
on any page. FAIL on criterion #19."

This failure pattern — component exists, isn't mounted — shows up in nearly every non-trivial build. The evaluator catches it because it's structurally incentivized to look for exactly this kind of gap.
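A wiring check of this kind can be approximated with a grep over the source tree. A simplified sketch for a Next.js-style layout — the paths and the import-matching regex are assumptions, and the real evaluator's check is richer than this:

```python
# Flag components that exist in components/ but are never imported by
# any source file outside that directory.
import re
from pathlib import Path

def unwired_components(root: str) -> list[str]:
    root_path = Path(root)
    components = list(root_path.glob("components/*.tsx"))
    # All source files that could mount a component (app/, pages/, etc.)
    sources = [p.read_text() for p in root_path.rglob("*.tsx")
               if "components" not in p.parts]
    orphans = []
    for comp in components:
        name = comp.stem                  # e.g. "ChatWidget"
        pattern = re.compile(rf"import\s+.*\b{name}\b")
        if not any(pattern.search(src) for src in sources):
            orphans.append(name)          # exists but never imported → flag
    return orphans
```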

Context resets and long-horizon work

Real projects exceed a single context window. The harness handles this through HANDOFF.md — a structured artifact written when context gets large:

# Handoff Artifact

## Completed Tasks
- iWebsite-u6h.1: Project Setup — commit 96a9a2b
- iWebsite-u6h.2: Core Layout — commit 63a2065

## Current Application State
Next.js app with Navbar, Footer, dark theme. npm run dev works.

## Remaining Tasks
- iWebsite-375s: Hakimo video — status: open
- iWebsite-52er: Remotion template — status: open

## Key Decisions Made
- Using @remotion/player for inline video (not pre-rendered MP4)
- Dark theme only (no toggle)

## How to Run
npm run dev → http://localhost:3000

A fresh agent reads HANDOFF.md and resumes from where the previous session left off. The agent doesn't need to reconstruct context from git history — the handoff artifact provides exactly what it needs: what's done, what's next, and what decisions were already made.

Beads complements this by maintaining the task graph independently. Even without HANDOFF.md, an agent can run bd list and know the current state of all tasks.

The Ralph Loop: a name for the methodology

The full methodology — spec-driven decomposition, adversarial evaluation, structured context resets — is what I call the Ralph Loop:

  1. Specify — Define what correct looks like, concretely and measurably
  2. Decompose — Break into atomic, testable tasks with dependency ordering
  3. Contract — Builder and evaluator agree on what "done" looks like
  4. Build — AI agents implement with TDD, one task at a time
  5. Evaluate — Separate evaluator tests the running application adversarially
  6. Fix — Address evaluator findings, re-verify
  7. Reset — When context gets large, write handoff, fresh agent resumes

Steps 4-6 are the inner loop — they repeat until the evaluator passes or escalation occurs. Steps 1-3 happen once per project or feature. Step 7 happens as needed across long builds.

The name comes from the realization that this isn't just a process — it's a feedback loop with a specific structure. The evaluator's output directly drives the builder's next action. The builder's output directly drives the evaluator's next assessment. Each iteration through the loop produces measurably better output than the last.

What this looks like in practice

For the iWebsite project (my personal portfolio), the harness produced:

  • 47 contract criteria negotiated between builder and evaluator
  • 16 tasks decomposed with dependency ordering
  • 3 parallel builder agents executing simultaneously
  • 578 lines of contract criteria tested by the evaluator
  • 2 evaluation rounds (1 fix round for 2 minor issues)
  • 43/47 criteria passing (4 require live environment)
  • Total build time: single session

The two issues caught in evaluation were exactly the kind the harness is designed to catch: play/pause controls invisible on mobile (CSS hover-only, no touch equivalent), and a sample blog post missing an image that the contract required. Neither would have been caught by automated tests. Both were caught by an evaluator testing the running application against specific contract criteria.

Why this matters now

The capability frontier of AI coding agents is advancing rapidly. But capability without reliability is demo-ware. An agent that can write impressive code 90% of the time and subtly broken code 10% of the time is less useful than an agent that writes simpler code 100% correctly.

The harness is what bridges that gap. It doesn't make the model smarter — it makes the system more reliable by ensuring that every output survives adversarial scrutiny before it ships. The builder doesn't need to be perfect. It needs to be good enough that three rounds of adversarial evaluation and fixing produce a correct result.

This is the same insight that makes GAN training work: neither the generator nor the discriminator needs to be perfect. The adversarial dynamic between them drives both toward better output. In our case, the builder gets better at anticipating what the evaluator will flag. The evaluator gets better at finding the specific classes of bugs that builders produce.

The tools are open source. iSkill contains the skill definitions. Beads handles task orchestration. The methodology is documented in the spec. If you're building with AI agents and want deterministic outcomes instead of probabilistic ones, the harness is the missing piece.