Anthropic published a harness design article last week. It’s basically a retro of the patterns I’ve been using - and it also tells me what to rip out as models get better.
For context: a harness is the orchestration layer wrapped around a Claude agent. The prompts, tools, evaluation loops, context management, multi-agent coordination - everything you build around the model to make it reliable. Not the model itself.
My Alcanah AI workspace is a harness for long-running, multi-project agent work. Here’s how it maps to the patterns in the Anthropic article.
The Mapping
| Anthropic concept | My implementation |
|---|---|
| Context degradation (models lose coherence as windows fill) | Two-tier logging system: temp logs roll up into monthly master logs. When context compacts or a new session starts, the agent picks up from status files and session logs, not from memory of the prior conversation. |
| Generator-Evaluator separation (self-evaluation fails - agents praise their own work) | Specialized agents per role. goals-weekly-review evaluates progress against key results. rollout-runner has a verification phase separate from deployment. workspace-packager has an audit check. The reviewer is never the doer. |
| Sprint contracts (define “done” before implementation) | Product Brief -> Design Doc -> PRD pipeline in Workspace Development/. Same purpose. Prevents cascading specification errors. |
| Planner agent (1-4 sentence prompts -> full specs) | The Plan subagent type + codex-planner skill. High-level direction without over-specifying implementation. |
| Structured criteria over subjective judgment | Rules in .claude/rules/ turn “work well” into concrete, gradable terms. Checklist enforcement. Code modification policy. Session discipline. |
| Context resets with handoff artifacts | Every session starts fresh but reads {project}-status.md, CLAUDE.md, MEMORY.md, and recent logs. Clean-slate context reset with structured handoff. |
The pattern is consistent: the workspace makes assumptions about what the model can’t do reliably, and builds infrastructure to compensate. That’s what every harness does.
The Meta-Principle (and What It Means for Me)
Anthropic’s framing:
Every component in a harness encodes assumptions about model limitations. Those assumptions warrant stress-testing as models improve.
This is the part I’m chewing on. My workspace has accumulated rules, agents, and processes that were necessary when I built them. As Opus jumped from 4.5 to 4.6 and beyond, some of those guardrails may now be overhead.
The article’s example: when Opus 4.6 launched, the researcher systematically removed components rather than redesigning. Sprint decomposition was no longer needed - the model handled coherent work without it. The evaluator’s value became task-dependent: high-value at capability boundaries, unnecessary overhead otherwise.
Stress-test list for my own workspace:
- Heavy TodoWrite checklist enforcement for simple tasks. Originally added because agents would silently skip the boring steps. With newer models, does this still help, or am I just adding ceremony?
- The session-init hooks reminding the agent to log work. Originally added because logging-as-you-go is critical. With newer models, does the AI now do this naturally?
- The “iron rule” pattern from GSD (split tasks that don’t fit in one context). With larger context windows, when is this real vs. theatrical?
The Generator-Evaluator Pattern I Want to Steal
The one pattern I haven’t formalized: the sprint contract. Have the implementing agent and a QA/evaluator agent negotiate what “done” means before work starts, rather than discovering gaps after. My product pipeline already does this between humans and AI. I could formalize it between agents.
That’s the next experiment.
What the Results Looked Like
The article reports numbers worth noting:
- Solo agent (Opus 4.5): 20 minutes, $9, broken core functionality.
- Full harness (Opus 4.5): 6 hours, $200, working gameplay with polished UI and AI integration.
The harness costs 22x in money and 18x in time. It also produces something that works. The space of useful harness combinations doesn’t shrink as models improve - it relocates. Engineers keep finding new combinations at new capability boundaries.
Source
Anthropic Engineering: Harness Design for Long-Running AI Applications, March 2026. Read it as a retro of patterns you’re probably already using if you’ve built anything substantial on Claude Code.