<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Harness on Wodaixin</title><link>https://wodaixin.github.io/blog/tags/harness/</link><description>Recent content in Harness on Wodaixin</description><generator>Hugo -- gohugo.io</generator><language>zh</language><lastBuildDate>Tue, 24 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wodaixin.github.io/blog/tags/harness/index.xml" rel="self" type="application/rss+xml"/><item><title>Harness Design for Long-Running Application Development</title><link>https://wodaixin.github.io/blog/p/harness-design-long-running/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://wodaixin.github.io/blog/p/harness-design-long-running/</guid><description>
 &lt;blockquote&gt;
 &lt;p&gt;Author: Prithvi Rajasekaran (Anthropic Labs Team)&lt;br&gt;
Published: March 24, 2026&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;Over the past several months I&amp;rsquo;ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude&amp;rsquo;s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.&lt;/p&gt;
&lt;p&gt;To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like &amp;ldquo;is this design good?&amp;rdquo; into concrete, gradable terms.&lt;/p&gt;
&lt;p&gt;I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.&lt;/p&gt;
&lt;h2 id="why-naive-implementations-fall-short"&gt;Why naive implementations fall short
&lt;/h2&gt;&lt;p&gt;We&amp;rsquo;ve previously shown that harness design has a substantial impact on the effectiveness of long-running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the &amp;ldquo;Ralph Wiggum&amp;rdquo; method using hooks or scripts to keep agents in continuous iteration cycles.&lt;/p&gt;
&lt;p&gt;But some problems persisted. On more complex tasks, the agent still tends to go off the rails over time. When we broke the problem down, we observed two common failure modes in agents executing these sorts of tasks.&lt;/p&gt;
&lt;p&gt;The first is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit &amp;ldquo;context anxiety,&amp;rdquo; in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent&amp;rsquo;s state and the next steps—address both of these issues.&lt;/p&gt;
&lt;p&gt;This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn&amp;rsquo;t give the agent a clean slate, which means context anxiety can still persist. A reset provides that clean slate, at the cost of requiring the handoff artifact to carry enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn&amp;rsquo;t sufficient for strong long-task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.&lt;/p&gt;
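&lt;p&gt;A handoff artifact can be as simple as a structured file that the outgoing session writes and the fresh session reads in place of the old transcript. The sketch below is illustrative only; the field names and prompt shape are assumptions, not our production format:&lt;/p&gt;

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical handoff artifact written before a context reset and read by
# the next, fresh agent session instead of the full conversation history.
@dataclass
class Handoff:
    completed: list        # features finished so far
    in_progress: str       # task the next session should resume
    next_steps: list       # ordered remaining work
    notes: str = ""        # gotchas discovered during the session

def write_handoff(path: str, h: Handoff) -> None:
    with open(path, "w") as f:
        json.dump(asdict(h), f, indent=2)

def read_handoff(path: str) -> Handoff:
    with open(path) as f:
        return Handoff(**json.load(f))

def resume_prompt(h: Handoff) -> str:
    # The fresh session starts from this compact summary, not the old context.
    return (
        f"Completed: {', '.join(h.completed)}\n"
        f"Resume: {h.in_progress}\n"
        f"Then: {'; '.join(h.next_steps)}\n"
        f"Notes: {h.notes}"
    )
```

&lt;p&gt;The key property is that the artifact is small and self-contained: the next agent never sees the previous context window, only this summary.&lt;/p&gt;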
&lt;p&gt;A second issue, which we haven&amp;rsquo;t previously addressed, is self-evaluation. When asked to evaluate work they&amp;rsquo;ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.&lt;/p&gt;
&lt;p&gt;However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn&amp;rsquo;t immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.&lt;/p&gt;
&lt;h2 id="frontend-design-making-subjective-quality-gradable"&gt;Frontend design: making subjective quality gradable
&lt;/h2&gt;&lt;p&gt;I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.&lt;/p&gt;
&lt;p&gt;Two insights shaped the harness I built for frontend design. First, while aesthetics can&amp;rsquo;t be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. &amp;ldquo;Is this design beautiful?&amp;rdquo; is hard to answer consistently, but &amp;ldquo;does this follow our principles for good design?&amp;rdquo; gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.&lt;/p&gt;
&lt;p&gt;With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Design quality:&lt;/strong&gt; Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Originality:&lt;/strong&gt; Is there evidence of custom decisions, or is this assembled from template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Craft:&lt;/strong&gt; Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Functionality:&lt;/strong&gt; Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?&lt;/p&gt;
&lt;p&gt;I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic &amp;ldquo;AI slop&amp;rdquo; patterns, and weighting design and originality more heavily pushed the model toward more aesthetic risk-taking.&lt;/p&gt;
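&lt;p&gt;Concretely, the emphasis can be expressed as a weighted scoring function. In the actual harness the criteria lived in the prompts as prose; the weights and the 0&amp;ndash;10 scale below are assumptions for illustration:&lt;/p&gt;

```python
# Illustrative weights: design quality and originality are emphasized over
# craft and functionality, which Claude already handles well by default.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}

def weighted_score(scores: dict) -> float:
    """scores maps each criterion to a 0-10 rating from the evaluator."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
```

&lt;p&gt;Under this weighting, a design that scores 9 on craft and functionality but only 4 on design quality and originality lands at 5.5 overall, so strong technical execution alone cannot carry a generic design.&lt;/p&gt;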
&lt;p&gt;I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator&amp;rsquo;s judgment aligned with my preferences, and reduced score drift across iterations.&lt;/p&gt;
&lt;p&gt;I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator&amp;rsquo;s critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn&amp;rsquo;t working.&lt;/p&gt;
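&lt;p&gt;Stripped of the SDK details, the orchestration reduces to a short loop. In this sketch, &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; stand in for the two agent sessions (the evaluator&amp;rsquo;s including Playwright access); the function names, scoring scale, and stopping rule are assumptions:&lt;/p&gt;

```python
def run_loop(prompt, generate, evaluate, max_iters=15, target=9.0):
    # generate(prompt, feedback) -> design artifact (e.g. path to HTML/CSS/JS)
    # evaluate(design) -> (overall score, written critique)
    feedback = None
    best = None
    for _ in range(max_iters):
        design = generate(prompt, feedback)   # build, or refine/pivot per critique
        score, critique = evaluate(design)    # navigate the live page, then grade
        if best is None or score > best[0]:
            best = (score, design)            # keep the best-scoring iteration
        if score >= target:
            break
        feedback = critique                   # feeds the next generation round
    return best
```

&lt;p&gt;Returning the best-scoring iteration rather than simply the last one hedges against later rounds regressing, since scores do not always improve monotonically.&lt;/p&gt;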
&lt;p&gt;Across runs, the evaluator&amp;rsquo;s assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.&lt;/p&gt;
&lt;p&gt;The wording of the criteria steered the generator in ways I didn&amp;rsquo;t fully anticipate. Including phrases like &amp;ldquo;the best designs are museum quality&amp;rdquo; pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.&lt;/p&gt;
&lt;p&gt;While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator&amp;rsquo;s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.&lt;/p&gt;
&lt;p&gt;In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn&amp;rsquo;t seen before from a single-pass generation.&lt;/p&gt;
&lt;h2 id="scaling-to-full-stack-coding"&gt;Scaling to full-stack coding
&lt;/h2&gt;&lt;p&gt;With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.&lt;/p&gt;
&lt;h3 id="the-architecture"&gt;The architecture
&lt;/h3&gt;&lt;p&gt;In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the &amp;ldquo;context anxiety&amp;rdquo; tendency mentioned earlier, and a harness that worked well across resets was essential to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK&amp;rsquo;s automatic compaction handling context growth along the way.&lt;/p&gt;
&lt;p&gt;For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I&amp;rsquo;d observed in prior runs. The system contained the following agent personas:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Planner:&lt;/strong&gt; Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generator:&lt;/strong&gt; The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evaluator:&lt;/strong&gt; Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.&lt;/p&gt;
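&lt;p&gt;The pass/fail logic can be sketched as a simple gate. The criterion names and thresholds below are illustrative stand-ins for what lived in the evaluator&amp;rsquo;s prompt:&lt;/p&gt;

```python
# Hypothetical sprint gate: every criterion has a hard floor, and a single
# miss (or any filed bug) fails the sprint and routes detailed feedback back
# to the generator, rather than letting a weighted average hide the problem.
THRESHOLDS = {
    "product_depth": 7,
    "functionality": 8,
    "visual_design": 7,
    "code_quality": 7,
}

def grade_sprint(scores: dict, bugs: list) -> dict:
    failed = [c for c, floor in THRESHOLDS.items() if scores[c] < floor]
    return {
        "passed": not failed and not bugs,
        "failed_criteria": failed,
        "bugs": bugs,   # each filed bug doubles as actionable feedback
    }
```

&lt;p&gt;The hard floors matter: with averaging, a sprint with one broken core feature could still pass on the strength of the others.&lt;/p&gt;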
&lt;p&gt;Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what &amp;ldquo;done&amp;rdquo; looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.&lt;/p&gt;
&lt;p&gt;Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.&lt;/p&gt;
&lt;h3 id="running-the-harness"&gt;Running the harness
&lt;/h3&gt;&lt;p&gt;For the first version of this harness, I used Claude Opus 4.5, our best coding model when I began these experiments, and ran user prompts against both the full harness and a single-agent system for comparison.&lt;/p&gt;
&lt;p&gt;I wrote the following prompt to generate a retro video game maker:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;The table below shows each harness type, how long it ran, and the total cost.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Harness&lt;/th&gt;
 &lt;th&gt;Duration&lt;/th&gt;
 &lt;th&gt;Cost&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Solo&lt;/td&gt;
 &lt;td&gt;20 min&lt;/td&gt;
 &lt;td&gt;$9&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Full harness&lt;/td&gt;
 &lt;td&gt;6 hr&lt;/td&gt;
 &lt;td&gt;$200&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The harness was over 20x more expensive, but the difference in output quality was immediately apparent.&lt;/p&gt;
&lt;p&gt;I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run&amp;rsquo;s output, and the initial application seemed in line with those expectations.&lt;/p&gt;
&lt;p&gt;As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.&lt;/p&gt;
&lt;p&gt;After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the sprint&amp;rsquo;s specific implementation details and the behaviors that would be tested to verify completion.&lt;/p&gt;
&lt;p&gt;The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I&amp;rsquo;d seen in the solo run did remain—the workflow still didn&amp;rsquo;t make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model&amp;rsquo;s product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.&lt;/p&gt;
&lt;p&gt;Working through the editors, the new run&amp;rsquo;s advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.&lt;/p&gt;
&lt;p&gt;Because I&amp;rsquo;d asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.&lt;/p&gt;
&lt;p&gt;The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI&amp;rsquo;s game level construction. There was a large wall that I wasn&amp;rsquo;t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.&lt;/p&gt;
&lt;p&gt;Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract&amp;rsquo;s test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator&amp;rsquo;s findings were specific enough to act on without extra investigation. The table below shows several examples of issues our evaluator identified:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Contract criterion&lt;/th&gt;
 &lt;th&gt;Evaluator finding&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Rectangle fill tool allows click-drag to fill a rectangular area with selected tile&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt; — Tool only places tiles at drag start/end points instead of filling the region. &lt;code&gt;fillRectangle&lt;/code&gt; function exists but isn&amp;rsquo;t triggered properly on mouseUp.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;User can select and delete placed entity spawn points&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt; — Delete key handler at &lt;code&gt;LevelEditor.tsx:892&lt;/code&gt; requires both &lt;code&gt;selection&lt;/code&gt; and &lt;code&gt;selectedEntityId&lt;/code&gt; to be set, but clicking an entity only sets &lt;code&gt;selectedEntityId&lt;/code&gt;. Condition should accept either &lt;code&gt;selection&lt;/code&gt; or &lt;code&gt;selectedEntityId&lt;/code&gt;.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;User can reorder animation frames via API&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt; — &lt;code&gt;PUT /frames/reorder&lt;/code&gt; route defined after &lt;code&gt;/{frame_id}&lt;/code&gt; routes. FastAPI matches &amp;lsquo;reorder&amp;rsquo; as a frame_id integer and returns 422: &amp;ldquo;unable to parse string as an integer.&amp;rdquo;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
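&lt;p&gt;The last finding reflects a general FastAPI behavior worth unpacking: routes are matched in declaration order, so a parameterized route declared first will capture a literal path declared after it. The snippet below reproduces that matcher behavior in miniature with plain regexes rather than FastAPI itself:&lt;/p&gt;

```python
import re

# Routes are tried top to bottom; the first full match wins. This is how
# /frames/{frame_id} can swallow /frames/reorder and trigger a 422 when
# "reorder" fails to parse as an integer.
def first_match(routes, path):
    for pattern, handler in routes:
        if re.fullmatch(pattern, path):
            return handler
    return None

buggy = [
    (r"/frames/[^/]+", "update_frame"),      # declared first: captures "reorder"
    (r"/frames/reorder", "reorder_frames"),
]
fixed = [
    (r"/frames/reorder", "reorder_frames"),  # literal path declared first
    (r"/frames/[^/]+", "update_frame"),
]

first_match(buggy, "/frames/reorder")  # "update_frame" (the bug)
first_match(fixed, "/frames/reorder")  # "reorder_frames"
```

&lt;p&gt;The fix the evaluator&amp;rsquo;s finding implies is simply to declare the literal &lt;code&gt;/frames/reorder&lt;/code&gt; route before the parameterized &lt;code&gt;/{frame_id}&lt;/code&gt; routes.&lt;/p&gt;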
&lt;p&gt;Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren&amp;rsquo;t a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator&amp;rsquo;s logs, find examples where its judgment diverged from mine, and update the evaluator&amp;rsquo;s prompt to address those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model&amp;rsquo;s QA capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn&amp;rsquo;t exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn&amp;rsquo;t work, the lift was obvious.&lt;/p&gt;
&lt;h2 id="iterating-on-the-harness"&gt;Iterating on the harness
&lt;/h2&gt;&lt;p&gt;The first harness produced encouraging results, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can&amp;rsquo;t do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as &amp;ldquo;find the simplest solution possible, and only increase complexity when needed,&amp;rdquo; and it&amp;rsquo;s a pattern that shows up consistently for anyone maintaining an agent harness.&lt;/p&gt;
&lt;p&gt;In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn&amp;rsquo;t able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.&lt;/p&gt;
&lt;p&gt;As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: &amp;ldquo;[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes.&amp;rdquo; It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.&lt;/p&gt;
&lt;h3 id="removing-the-sprint-construct"&gt;Removing the sprint construct
&lt;/h3&gt;&lt;p&gt;I started by removing the sprint construct entirely. The sprint structure had helped decompose the work into chunks the model could handle coherently. Given the improvements in Opus 4.6, there was good reason to believe the model could handle the job natively, without this sort of decomposition.&lt;/p&gt;
&lt;p&gt;I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.&lt;/p&gt;
&lt;p&gt;With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, it changed how load-bearing the evaluator was for certain runs, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model&amp;rsquo;s raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator&amp;rsquo;s check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator&amp;rsquo;s capabilities, the evaluator continued to give real lift.&lt;/p&gt;
&lt;p&gt;The practical implication is that including an evaluator is not a fixed yes-or-no decision: it is worth the cost when the task sits beyond what the current model does reliably solo.&lt;/p&gt;
&lt;p&gt;Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app&amp;rsquo;s own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude&amp;rsquo;s training data covers it thinly. But with enough tuning, the generator was building agents correctly.&lt;/p&gt;
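&lt;p&gt;&amp;ldquo;Building a proper agent&amp;rdquo; here means the generated app exposes its own functionality as tools that the model can invoke. A minimal sketch of that dispatch layer follows; the registry pattern and tool names are invented for illustration, and &lt;code&gt;tool_calls&lt;/code&gt; stands in for whatever the model actually emits:&lt;/p&gt;

```python
# Hypothetical in-app agent plumbing: app features register themselves as
# tools, and model-emitted tool calls are dispatched against the registry.
TOOLS = {}

def tool(fn):
    TOOLS[fn.__name__] = fn   # register under the function's name
    return fn

@tool
def set_tempo(bpm: int) -> dict:
    return {"tempo": bpm}

@tool
def add_note(track: str, pitch: str, beat: float) -> dict:
    return {"track": track, "pitch": pitch, "beat": beat}

def dispatch(tool_calls: list) -> list:
    # In the real app, tool_calls come from the model's responses; here they
    # arrive as plain data so the dispatch loop stays model-agnostic.
    return [TOOLS[c["name"]](**c["args"]) for c in tool_calls]
```

&lt;p&gt;Keeping the tool layer this thin is what lets the same app functionality serve both the human UI and the in-app agent.&lt;/p&gt;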
&lt;h3 id="results-from-the-updated-harness"&gt;Results from the updated harness
&lt;/h3&gt;&lt;p&gt;To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;Build a fully featured DAW in the browser using the Web Audio API.&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;The run was still lengthy and expensive, at about 4 hours and $124 in token costs. Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Agent &amp;amp; Phase&lt;/th&gt;
 &lt;th&gt;Duration&lt;/th&gt;
 &lt;th&gt;Cost&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Planner&lt;/td&gt;
 &lt;td&gt;4.7 min&lt;/td&gt;
 &lt;td&gt;$0.46&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Build (Round 1)&lt;/td&gt;
 &lt;td&gt;2 hr 7 min&lt;/td&gt;
 &lt;td&gt;$71.08&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;QA (Round 1)&lt;/td&gt;
 &lt;td&gt;8.8 min&lt;/td&gt;
 &lt;td&gt;$3.24&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Build (Round 2)&lt;/td&gt;
 &lt;td&gt;1 hr 2 min&lt;/td&gt;
 &lt;td&gt;$36.89&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;QA (Round 2)&lt;/td&gt;
 &lt;td&gt;6.8 min&lt;/td&gt;
 &lt;td&gt;$3.09&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Build (Round 3)&lt;/td&gt;
 &lt;td&gt;10.9 min&lt;/td&gt;
 &lt;td&gt;$5.88&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;QA (Round 3)&lt;/td&gt;
 &lt;td&gt;9.6 min&lt;/td&gt;
 &lt;td&gt;$4.06&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Total V2 Harness&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;3 hr 50 min&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$124.70&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.&lt;/p&gt;
&lt;p&gt;That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can&amp;rsquo;t be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren&amp;rsquo;t edge cases — they&amp;rsquo;re the core interactions that make a DAW usable, and the spec explicitly calls for them.&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;In its second round feedback, it again caught several functionality gaps:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;Remaining gaps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Audio recording is still stub-only (button toggles but no mic capture)&lt;/li&gt;
&lt;li&gt;Clip resize by edge drag and clip split not implemented&lt;/li&gt;
&lt;li&gt;Effect visualizations are numeric sliders, not graphical (no EQ curve)&lt;/li&gt;
&lt;/ul&gt;

 &lt;/blockquote&gt;
&lt;p&gt;The generator was still liable to miss details or stub out features when left to its own devices, and the QA agent still added value by catching those last-mile issues for the generator to fix.&lt;/p&gt;
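&lt;p&gt;The build-then-QA loop described above can be sketched in outline. In this minimal sketch, &lt;code&gt;run_generator&lt;/code&gt; and &lt;code&gt;run_qa&lt;/code&gt; are hypothetical stand-ins for real agent invocations, stubbed so the control flow is runnable:&lt;/p&gt;

```python
# Minimal sketch of the build -> QA -> fix loop.
# run_generator and run_qa are hypothetical stand-ins for real agent
# invocations; they are stubbed here so the control flow is runnable.

MAX_ROUNDS = 3

def run_generator(spec, feedback):
    # Stand-in: a real harness would invoke the generator agent,
    # which builds the app and addresses prior QA feedback.
    return {"spec": spec, "fixed": list(feedback)}

def run_qa(build, known_gaps):
    # Stand-in: a real QA agent would click through the running app
    # (e.g. via Playwright) and report any gaps it finds.
    return [gap for gap in known_gaps if gap not in build["fixed"]]

def harness(spec, known_gaps):
    feedback = []
    for round_num in range(1, MAX_ROUNDS + 1):
        build = run_generator(spec, feedback)
        feedback = run_qa(build, known_gaps)
        if not feedback:  # QA found nothing left to fix
            break
    return round_num, build

rounds, final = harness("browser DAW", ["clip drag", "mic capture"])
print(rounds)  # 2: the stubbed gaps are fixed in the second round
```

&lt;p&gt;The real loop terminates the same way: once QA reports no remaining gaps against the contract, the round count stops growing and the build is accepted.&lt;/p&gt;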
&lt;p&gt;Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.&lt;/p&gt;
&lt;p&gt;The app is far from a professional music production program, and the agent&amp;rsquo;s song composition skills could clearly use a lot of work. Additionally, Claude can&amp;rsquo;t actually hear, which made the QA feedback loop less effective with respect to musical taste.&lt;/p&gt;
&lt;p&gt;But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it&amp;rsquo;s not pitch-perfect yet—but it&amp;rsquo;s getting there.&lt;/p&gt;
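&lt;p&gt;To make the tool-driven workflow concrete, the kind of tool surface the embedded agent drove might look like the following. All class and method names here are illustrative assumptions, not the app&amp;rsquo;s actual API:&lt;/p&gt;

```python
# Hypothetical sketch of the tool surface an in-app composition agent
# could drive. Names are illustrative assumptions, not the app's API.

class DAW:
    def __init__(self):
        self.tempo = 120
        self.key = "C major"
        self.tracks = []

    def set_tempo(self, bpm):
        self.tempo = bpm

    def set_key(self, key):
        self.key = key

    def add_track(self, kind, notes):
        self.tracks.append({"kind": kind, "notes": notes, "effects": []})

    def add_effect(self, track_index, effect):
        self.tracks[track_index]["effects"].append(effect)

# The end-to-end composition described above, expressed as tool calls:
daw = DAW()
daw.set_tempo(96)
daw.set_key("A minor")
daw.add_track("melody", ["A4", "C5", "E5", "C5"])
daw.add_track("drums", ["kick", "snare", "kick", "snare"])
daw.add_effect(0, "reverb")
print(daw.tempo, len(daw.tracks))  # 96 2
```

&lt;p&gt;Exposing the app&amp;rsquo;s own primitives as tools like these is what lets the agent set tempo and key, lay down tracks, and apply effects without touching the UI directly.&lt;/p&gt;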
&lt;h2 id="what-comes-next"&gt;What comes next
&lt;/h2&gt;&lt;p&gt;As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.&lt;/p&gt;
&lt;p&gt;With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you&amp;rsquo;re building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.&lt;/p&gt;
&lt;p&gt;From this work, my conviction is that the space of interesting harness combinations doesn&amp;rsquo;t shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.&lt;/p&gt;
&lt;h2 id="acknowledgements"&gt;Acknowledgements
&lt;/h2&gt;&lt;p&gt;Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.&lt;/p&gt;
&lt;p&gt;Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.&lt;/p&gt;
&lt;h2 id="appendix"&gt;Appendix
&lt;/h2&gt;&lt;p&gt;Example plan generated by planner agent:&lt;/p&gt;
&lt;p&gt;RetroForge - 2D Retro Game Maker&lt;/p&gt;
&lt;p&gt;Overview&lt;/p&gt;
&lt;p&gt;RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.&lt;/p&gt;
&lt;p&gt;The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.&lt;/p&gt;
&lt;p&gt;RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.&lt;/p&gt;
&lt;p&gt;Features&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Project Dashboard &amp;amp; Management
The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;User Stories: As a user, I want to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a new game project with a name and description, so that I can begin designing my game&lt;/li&gt;
&lt;li&gt;See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work&lt;/li&gt;
&lt;li&gt;Open any project to enter the full game editor workspace, so that I can work on my game&lt;/li&gt;
&lt;li&gt;Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized&lt;/li&gt;
&lt;li&gt;Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Project Data Model: Each project contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project metadata (name, description, created/modified timestamps)&lt;/li&gt;
&lt;li&gt;Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)&lt;/li&gt;
&lt;li&gt;Tile size configuration (8x8, 16x16, or 32x32 pixels)&lt;/li&gt;
&lt;li&gt;Color palette selection&lt;/li&gt;
&lt;li&gt;All associated sprites, tilesets, levels, and entity definitions&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>长期应用开发的Harness设计</title><link>https://wodaixin.github.io/blog/p/harness-design-long-running-application-development-cn/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://wodaixin.github.io/blog/p/harness-design-long-running-application-development-cn/</guid><description>
 &lt;blockquote&gt;
 &lt;p&gt;作者：Prithvi Rajasekaran (Anthropic Labs Team)&lt;br&gt;
发布日期：2026年3月24日&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;在过去几个月里，我一直在研究两个相互关联的问题：让 Claude 产出高质量的前端设计，以及让它在无需人工干预的情况下构建完整的应用程序。这项工作源于我们早期在前端设计技能和长期编码智能体框架上的努力，我和同事们通过提示工程和框架设计将 Claude 的性能提升到远超基线水平——但两者最终都遇到了瓶颈。&lt;/p&gt;
&lt;p&gt;为了突破这一瓶颈，我寻找了适用于两个截然不同领域的新型 AI 工程方法，一个由主观品味定义，另一个由可验证的正确性和可用性定义。受生成对抗网络（GANs）的启发，我设计了一个包含生成器和评估器智能体的多智能体结构。构建一个能够可靠地——并且有品味地——评分输出的评估器，意味着首先要开发一套标准，能够将&amp;ldquo;这个设计好吗？&amp;rdquo;这样的主观判断转化为具体的、可评分的术语。&lt;/p&gt;
&lt;p&gt;然后，我将这些技术应用于长期自主编码，延续了我们早期框架工作中的两个经验：将构建分解为可处理的块，以及使用结构化工件在会话之间传递上下文。最终结果是一个三智能体架构——规划器、生成器和评估器——在多小时的自主编码会话中产出了丰富的全栈应用程序。&lt;/p&gt;
&lt;h2 id="为什么简单实现会失败"&gt;为什么简单实现会失败
&lt;/h2&gt;&lt;p&gt;我们之前已经展示过，框架设计对长期智能体编码的有效性有着重大影响。在早期的实验中，我们使用初始化智能体将产品规格分解为任务列表，以及一个编码智能体逐个功能实现任务，然后传递工件以在会话之间传递上下文。更广泛的开发者社区也趋同于类似的见解，例如使用钩子或脚本让智能体保持持续迭代循环的&amp;quot;Ralph Wiggum&amp;quot;方法。&lt;/p&gt;
&lt;p&gt;但一些问题仍然存在。对于更复杂的任务，智能体随着时间推移仍然倾向于偏离轨道。在分解这个问题时，我们观察到智能体执行此类任务时的两种常见失败模式。&lt;/p&gt;
&lt;p&gt;首先是模型在冗长任务中随着上下文窗口填满而失去连贯性（参见我们关于上下文工程的文章）。一些模型还表现出&amp;ldquo;上下文焦虑&amp;rdquo;，即当它们接近自己认为的上下文限制时，会过早地开始收尾工作。上下文重置——完全清除上下文窗口并启动一个新的智能体，结合传递前一个智能体状态和下一步骤的结构化交接——解决了这两个问题。&lt;/p&gt;
&lt;p&gt;这与压缩不同，压缩是将对话的早期部分就地总结，以便同一智能体可以在缩短的历史记录上继续工作。虽然压缩保持了连续性，但它不会给智能体一个干净的起点，这意味着上下文焦虑仍然可能持续存在。重置提供了一个干净的起点，代价是交接工件必须有足够的状态让下一个智能体能够顺利接手工作。在我们早期的测试中，我们发现 Claude Sonnet 4.5 表现出足够强的上下文焦虑，以至于仅靠压缩不足以实现强大的长任务性能，因此上下文重置成为框架设计的关键。这解决了核心问题，但为每次框架运行增加了编排复杂性、令牌开销和延迟。&lt;/p&gt;
&lt;p&gt;第二个问题是自我评估，我们之前没有解决过。当被要求评估自己产出的工作时，智能体倾向于自信地赞扬这些工作——即使对人类观察者来说，质量明显平庸。这个问题在设计等主观任务上尤为突出，因为没有类似可验证软件测试的二元检查。布局是否感觉精致或普通是一个判断性问题，而智能体在评分自己的工作时可靠地倾向于积极评价。&lt;/p&gt;
&lt;p&gt;然而，即使在确实有可验证结果的任务上，智能体有时仍然表现出糟糕的判断力，这会妨碍它们完成任务时的性能。将执行工作的智能体与评判工作的智能体分离，被证明是解决这个问题的有力杠杆。这种分离本身并不能立即消除那种宽容；评估器仍然是一个倾向于对 LLM 生成的输出慷慨的 LLM。但调整一个独立的评估器使其持怀疑态度，结果证明比让生成器批评自己的工作要容易得多，而一旦存在外部反馈，生成器就有了具体的迭代目标。&lt;/p&gt;
&lt;h2 id="前端设计让主观质量可评分"&gt;前端设计：让主观质量可评分
&lt;/h2&gt;&lt;p&gt;我从前端设计开始实验，因为自我评估问题在这里最为明显。在没有任何干预的情况下，Claude 通常倾向于安全、可预测的布局，这些布局在技术上是功能性的，但在视觉上并不出众。&lt;/p&gt;
&lt;p&gt;两个见解塑造了我为前端设计构建的框架。首先，虽然美学不能完全简化为分数——个人品味总是会有所不同——但可以通过编码设计原则和偏好的评分标准来改进它们。&amp;ldquo;这个设计漂亮吗？&amp;rdquo;很难一致地回答，但&amp;ldquo;这是否遵循我们的良好设计原则？&amp;rdquo;给了 Claude 一些具体的评分依据。其次，通过将前端生成与前端评分分离，我们可以创建一个反馈循环，推动生成器产出更强的输出。&lt;/p&gt;
&lt;p&gt;考虑到这一点，我编写了四个评分标准，并将它们提供给生成器和评估器智能体的提示中：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;设计质量：&lt;/strong&gt; 设计是否感觉像一个连贯的整体，而不是部分的集合？这方面的强大工作意味着颜色、排版、布局、图像和其他细节结合起来创造出独特的氛围和身份。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;原创性：&lt;/strong&gt; 是否有自定义决策的证据，还是这只是模板布局、库默认值和 AI 生成的模式？人类设计师应该能够识别出深思熟虑的创意选择。未修改的库存组件——或 AI 生成的明显迹象，如白色卡片上的紫色渐变——在这里会失败。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;工艺：&lt;/strong&gt; 技术执行：排版层次、间距一致性、色彩和谐、对比度。这是能力检查而不是创造力检查。大多数合理的实现默认情况下在这里表现良好；失败意味着基础被破坏。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;功能性：&lt;/strong&gt; 独立于美学的可用性。用户能否理解界面的功能，找到主要操作，并在不猜测的情况下完成任务？&lt;/p&gt;
&lt;p&gt;我强调设计质量和原创性而不是工艺和功能性。Claude 在工艺和功能性上默认得分就很好，因为所需的技术能力往往是模型自然具备的。但在设计和原创性方面，Claude 经常产出充其量只能说是平淡的输出。这些标准明确惩罚高度通用的&amp;quot;AI 垃圾&amp;quot;模式，通过更重视设计和原创性，它推动模型进行更多的美学冒险。&lt;/p&gt;
&lt;p&gt;我使用带有详细分数分解的少样本示例来校准评估器。这确保了评估器的判断与我的偏好一致，并减少了迭代之间的分数漂移。&lt;/p&gt;
&lt;p&gt;我在 Claude Agent SDK 上构建了这个循环，这使得编排变得简单明了。生成器智能体首先根据用户提示创建 HTML/CSS/JS 前端。我给评估器提供了 Playwright MCP，让它在评分每个标准和撰写详细评论之前直接与实时页面交互。在实践中，评估器会自行浏览页面，在产生评估之前截图并仔细研究实现。该反馈作为下一次迭代的输入流回生成器。我每次生成运行 5 到 15 次迭代，每次迭代通常会随着生成器响应评估器的批评而将其推向更独特的方向。由于评估器是主动浏览页面而不是对静态截图评分，每个周期都需要实际的时钟时间。完整运行最长可达四个小时。我还指示生成器在每次评估后做出战略决策：如果分数趋势良好则完善当前方向，或者如果方法不起作用则完全转向不同的美学方向。&lt;/p&gt;
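&lt;p&gt;上述生成器-评估器循环的控制流可以示意如下。其中 &lt;code&gt;generate&lt;/code&gt; 与 &lt;code&gt;evaluate&lt;/code&gt; 为假设性占位函数——真实框架中它们是对模型（以及 Playwright 浏览页面）的调用，此处的分数走势也仅为演示：&lt;/p&gt;

```python
# 生成器-评估器循环的最小示意（占位实现，仅演示控制流）。

def generate(brief, feedback):
    # 占位：真实的生成器会依据评估反馈产出新的前端实现
    return {"brief": brief, "version": len(feedback)}

def evaluate(design):
    # 占位：真实的评估器会用 Playwright 浏览页面后按标准打分
    return {"design_quality": min(10, 5 + design["version"]),
            "originality": min(10, 4 + design["version"])}

def loop(brief, iterations=15, threshold=8):
    feedback = []
    for _ in range(iterations):
        design = generate(brief, feedback)
        scores = evaluate(design)
        if min(scores.values()) >= threshold:
            break  # 分数达标即停止迭代
        feedback.append(scores)  # 反馈流回生成器，驱动下一次迭代
    return design, scores

design, scores = loop("荷兰艺术博物馆网站")
print(design["version"], min(scores.values()))  # 4 8
```

&lt;p&gt;真实运行中每次迭代都需要实际的时钟时间，因为评估器是在主动浏览页面，而不是像这个占位实现一样即时返回分数。&lt;/p&gt;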
&lt;p&gt;在各次运行中，评估器的评估在迭代中改善，然后趋于平稳，仍有改进空间。一些生成逐步完善。其他生成在迭代之间采取了急剧的美学转变。&lt;/p&gt;
&lt;p&gt;标准的措辞以我没有完全预料到的方式引导了生成器。包含&amp;quot;最好的设计是博物馆级别的&amp;quot;这样的短语将设计推向了特定的视觉趋同，表明与标准相关的提示直接塑造了输出的特征。&lt;/p&gt;
&lt;p&gt;虽然分数通常在迭代中提高，但模式并不总是清晰的线性。后期的实现往往整体上更好，但我经常看到我更喜欢中间迭代而不是最后一个的情况。实现复杂性也倾向于在各轮中增加，生成器响应评估器的反馈而寻求更雄心勃勃的解决方案。即使在第一次迭代中，输出也明显优于完全没有提示的基线，这表明标准和相关语言本身在任何评估器反馈导致进一步完善之前就将模型引导远离了通用默认值。&lt;/p&gt;
&lt;p&gt;在一个值得注意的例子中，我提示模型为一家荷兰艺术博物馆创建一个网站。到第九次迭代时，它为一个虚构的博物馆制作了一个干净的深色主题登陆页面。该页面在视觉上很精致，但基本符合我的预期。然后，在第十个周期，它完全放弃了这种方法，将网站重新想象为一种空间体验：一个用 CSS 透视渲染的带有棋盘地板的 3D 房间，艺术品以自由形式的位置挂在墙上，以及基于门道的画廊房间之间的导航，而不是滚动或点击。这是我以前从未在单次生成中见过的那种创造性飞跃。&lt;/p&gt;
&lt;h2 id="扩展到全栈编码"&gt;扩展到全栈编码
&lt;/h2&gt;&lt;p&gt;有了这些发现，我将这种受 GAN 启发的模式应用于全栈开发。生成器-评估器循环自然地映射到软件开发生命周期，其中代码审查和 QA 与设计评估器扮演相同的结构角色。&lt;/p&gt;
&lt;h3 id="架构"&gt;架构
&lt;/h3&gt;&lt;p&gt;在我们早期的长期运行框架中，我们通过初始化智能体、逐个功能工作的编码智能体以及会话之间的上下文重置来解决连贯的多会话编码问题。上下文重置是一个关键突破：该框架使用 Sonnet 4.5，它表现出前面提到的&amp;quot;上下文焦虑&amp;quot;倾向。创建一个在上下文重置中运行良好的框架是保持模型专注于任务的关键。Opus 4.5 在很大程度上自行消除了这种行为，因此我能够完全从这个框架中删除上下文重置。智能体在整个构建过程中作为一个连续会话运行，Claude Agent SDK 的自动压缩处理了上下文增长。&lt;/p&gt;
&lt;p&gt;对于这项工作，我在原始框架的基础上构建了一个三智能体系统，每个智能体都解决了我在之前运行中观察到的特定差距。该系统包含以下智能体角色：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;规划器：&lt;/strong&gt; 我们之前的长期运行框架要求用户预先提供详细的规格。我想自动化这一步骤，所以我创建了一个规划器智能体，它接受一个简单的 1-4 句提示并将其扩展为完整的产品规格。我提示它对范围要有雄心，并专注于产品上下文和高层技术设计，而不是详细的技术实现。这种强调是因为担心如果规划器试图预先指定细粒度的技术细节并出错，规格中的错误会级联到下游实现中。让智能体专注于要产出的交付物并让它们在工作时找出路径似乎更明智。我还要求规划器寻找将 AI 功能融入产品规格的机会。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;生成器：&lt;/strong&gt; 早期框架中的逐个功能方法在范围管理方面效果很好。我在这里应用了类似的模型，指示生成器以冲刺方式工作，从规格中一次选择一个功能。每个冲刺使用 React、Vite、FastAPI 和 SQLite（后来是 PostgreSQL）堆栈实现应用程序，生成器被指示在每个冲刺结束时自我评估其工作，然后交给 QA。它还有 git 用于版本控制。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;评估器：&lt;/strong&gt; 早期框架的应用程序通常看起来令人印象深刻，但当你实际尝试使用它们时仍然有真正的错误。为了捕获这些错误，评估器使用 Playwright MCP 像用户一样点击运行中的应用程序，测试 UI 功能、API 端点和数据库状态。然后，它根据发现的错误和一套标准对每个冲刺进行评分，这套标准以前端实验为模型，在这里适应涵盖产品深度、功能性、视觉设计和代码质量。每个标准都有一个硬阈值，如果任何一个低于它，冲刺就会失败，生成器会得到关于出了什么问题的详细反馈。&lt;/p&gt;
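&lt;p&gt;上文提到的硬阈值判定可以用一个极简的门控函数示意（标准名称与阈值数值均为示意性假设）：&lt;/p&gt;

```python
# 硬阈值门控：任一标准低于阈值即判定冲刺失败（名称与数值为示意）。

THRESHOLDS = {"产品深度": 7, "功能性": 8, "视觉设计": 7, "代码质量": 7}

def sprint_passes(scores):
    # 返回 (是否通过, 未达标的标准列表)
    failed = [name for name, bar in THRESHOLDS.items()
              if scores.get(name, 0) < bar]
    return (not failed), failed

ok, failed = sprint_passes(
    {"产品深度": 9, "功能性": 6, "视觉设计": 8, "代码质量": 8})
print(ok, failed)  # False ['功能性']
```

&lt;p&gt;这种&amp;ldquo;一票否决&amp;rdquo;式的门控意味着生成器无法用某一项的高分掩盖另一项的缺陷，未达标的标准列表则直接构成下一轮修复的反馈。&lt;/p&gt;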
&lt;p&gt;在每个冲刺之前，生成器和评估器协商一个冲刺合同：在编写任何代码之前就该工作块的&amp;quot;完成&amp;quot;标准达成一致。这样做是因为产品规格是有意保持高层次的，我想要一个步骤来弥合用户故事和可测试实现之间的差距。生成器提出它将构建什么以及如何验证成功，评估器审查该提案以确保生成器正在构建正确的东西。两者迭代直到达成一致。&lt;/p&gt;
&lt;p&gt;通信通过文件处理：一个智能体写入一个文件，另一个智能体读取后在该文件内回应，或写入一个供前者读取的新文件。随后生成器根据商定的合同进行构建，再将工作交给 QA。这使工作忠实于规格，而不会过早地过度指定实现。&lt;/p&gt;
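&lt;p&gt;这种基于文件的交接可以用如下最小示意表达（文件名与审查逻辑均为假设性占位）：&lt;/p&gt;

```python
# 基于文件的智能体交接：生成器写入合同提案，评估器读取并回写审查结果。
# 文件名与审查逻辑为示意性假设。

import json
import os
import tempfile

def propose_contract(path, criteria):
    # 生成器：写出冲刺合同提案
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"criteria": criteria, "approved": False}, f,
                  ensure_ascii=False)

def review_contract(path):
    # 评估器：读取提案，审查后回写同一文件（此处的审查逻辑为占位）
    with open(path, encoding="utf-8") as f:
        contract = json.load(f)
    contract["approved"] = bool(contract["criteria"])
    with open(path, "w", encoding="utf-8") as f:
        json.dump(contract, f, ensure_ascii=False)
    return contract

path = os.path.join(tempfile.gettempdir(), "sprint_contract.json")
propose_contract(path, ["片段可在时间轴上拖动", "EQ 曲线可编辑"])
contract = review_contract(path)
print(contract["approved"])  # True
```

&lt;p&gt;文件在这里同时充当通信信道和持久化工件：即使某个智能体的会话结束，商定的合同仍然留在磁盘上，供后续步骤读取。&lt;/p&gt;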
&lt;h3 id="运行框架"&gt;运行框架
&lt;/h3&gt;&lt;p&gt;对于这个框架的第一个版本，我使用了 Claude Opus 4.5，针对完整框架和单智能体系统运行用户提示进行比较。我使用 Opus 4.5 是因为这是我开始这些实验时我们最好的编码模型。&lt;/p&gt;
&lt;p&gt;我编写了以下提示来生成一个复古视频游戏制作器：&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;创建一个 2D 复古游戏制作器，功能包括关卡编辑器、精灵编辑器、实体行为和可玩测试模式。&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;下表显示了框架类型、运行时长和总成本。&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;框架&lt;/th&gt;
 &lt;th&gt;时长&lt;/th&gt;
 &lt;th&gt;成本&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;单智能体&lt;/td&gt;
 &lt;td&gt;20 分钟&lt;/td&gt;
 &lt;td&gt;$9&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;完整框架&lt;/td&gt;
 &lt;td&gt;6 小时&lt;/td&gt;
 &lt;td&gt;$200&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;框架的成本超过 20 倍，但输出质量的差异立即显现。&lt;/p&gt;
&lt;p&gt;我期望的是一个界面，我可以在其中构建关卡及其组成部分（精灵、实体、瓦片布局），然后点击播放来实际玩关卡。我首先打开了单智能体运行的输出，初始应用程序似乎符合这些期望。&lt;/p&gt;
&lt;p&gt;然而，当我点击浏览时，问题开始出现。布局浪费空间，固定高度的面板使大部分视口空着。工作流程很僵硬。尝试填充关卡会提示我首先创建精灵和实体，但 UI 中没有任何东西引导我进入该序列。更重要的是，实际的游戏是坏的。我的实体出现在屏幕上，但没有任何东西响应输入。深入代码发现，实体定义和游戏运行时之间的连接是断开的，没有表面迹象表明问题出在哪里。&lt;/p&gt;
&lt;p&gt;评估完单智能体运行后，我将注意力转向框架运行。这次运行从相同的一句话提示开始，但规划器步骤将该提示扩展为分布在十个冲刺中的 16 个功能规格。它远远超出了单智能体运行尝试的范围。除了核心编辑器和播放模式外，规格还要求精灵动画系统、行为模板、音效和音乐、AI 辅助的精灵生成器和关卡设计器，以及带有可共享链接的游戏导出。我给了规划器访问我们前端设计技能的权限，它阅读并使用它来为应用程序创建视觉设计语言作为规格的一部分。对于每个冲刺，生成器和评估器协商一个合同，定义冲刺的具体实现细节，以及将被测试以验证完成的可测试行为。&lt;/p&gt;
&lt;p&gt;该应用程序立即显示出比单智能体运行更多的精致和流畅性。画布使用了完整的视口，面板大小合理，界面具有与规格中的设计方向一致的一致视觉身份。我在单智能体运行中看到的一些笨拙确实仍然存在——工作流程仍然没有明确表示你应该在尝试填充关卡之前构建精灵和实体，我不得不通过摸索来弄清楚这一点。这被解读为基础模型产品直觉的差距，而不是框架旨在解决的问题，尽管它确实表明了框架内有针对性的迭代可以进一步改善输出质量的地方。&lt;/p&gt;
&lt;p&gt;浏览编辑器时，新运行相对于单智能体的优势变得更加明显。精灵编辑器更丰富、功能更全面，具有更清晰的工具调色板、更好的颜色选择器和更可用的缩放控件。&lt;/p&gt;
&lt;p&gt;因为我要求规划器将 AI 功能融入其规格中，该应用程序还配备了内置的 Claude 集成，让我可以通过提示生成游戏的不同部分。这大大加快了工作流程。&lt;/p&gt;
&lt;p&gt;最大的区别在于播放模式。我实际上能够移动我的实体并玩游戏。物理效果有一些粗糙的边缘——我的角色跳到平台上但最终与它重叠，这在直觉上感觉不对——但核心功能是有效的，而单智能体运行没有做到这一点。移动了一会儿后，我确实遇到了 AI 游戏关卡构建的一些限制。有一堵大墙我无法跳过，所以我被困住了。这表明框架可以处理一些常识性改进和边缘情况以进一步完善应用程序。&lt;/p&gt;
&lt;p&gt;阅读日志，很明显评估器使实现与规格保持一致。每个冲刺，它都会遍历冲刺合同的测试标准，并通过 Playwright 执行运行中的应用程序，对任何偏离预期行为的内容提交错误。合同是细粒度的——仅 Sprint 3 就有 27 个涵盖关卡编辑器的标准——评估器的发现足够具体，可以在不进行额外调查的情况下采取行动。下表显示了我们的评估器识别的几个问题示例：&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;合同标准&lt;/th&gt;
 &lt;th&gt;评估器发现&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;矩形填充工具允许点击拖动以用选定的瓦片填充矩形区域&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;失败&lt;/strong&gt; — 工具仅在拖动开始/结束点放置瓦片，而不是填充区域。&lt;code&gt;fillRectangle&lt;/code&gt; 函数存在但在 mouseUp 时未正确触发。&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;用户可以选择和删除放置的实体生成点&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;失败&lt;/strong&gt; — &lt;code&gt;LevelEditor.tsx:892&lt;/code&gt; 的删除键处理程序需要同时设置 &lt;code&gt;selection&lt;/code&gt; 和 &lt;code&gt;selectedEntityId&lt;/code&gt;，但点击实体只设置 &lt;code&gt;selectedEntityId&lt;/code&gt;。条件应该是 `selection&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;用户可以通过 API 重新排序动画帧&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;失败&lt;/strong&gt; — &lt;code&gt;PUT /frames/reorder&lt;/code&gt; 路由在 &lt;code&gt;/{frame_id}&lt;/code&gt; 路由之后定义。FastAPI 将 &amp;lsquo;reorder&amp;rsquo; 匹配为 frame_id 整数并返回 422：&amp;ldquo;无法将字符串解析为整数。&amp;rdquo;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;让评估器达到这个水平需要工作。开箱即用，Claude 是一个糟糕的 QA 智能体。在早期运行中，我看到它识别出合法的问题，然后说服自己决定它们不是什么大问题并批准工作。它还倾向于表面测试，而不是探测边缘情况，因此更微妙的错误经常漏掉。调整循环是阅读评估器的日志，找到其判断与我的判断不同的示例，并更新 QA 的提示以解决这些问题。经过几轮这样的开发循环，评估器才以我认为合理的方式进行评分。即便如此，框架输出显示了模型 QA 能力的局限性：小的布局问题、在某些地方感觉不直观的交互，以及评估器没有彻底执行的更深层嵌套功能中未发现的错误。显然还有更多的验证空间可以通过进一步调整来捕获。但与单智能体运行相比，应用程序的核心功能根本不起作用，提升是显而易见的。&lt;/p&gt;
&lt;h2 id="迭代框架"&gt;迭代框架
&lt;/h2&gt;&lt;p&gt;第一组框架结果令人鼓舞，但它也很笨重、缓慢且昂贵。下一个合乎逻辑的步骤是找到简化框架而不降低其性能的方法。这部分是常识，部分是一个更普遍原则的功能：框架中的每个组件都编码了关于模型自身无法做什么的假设，这些假设值得压力测试，既因为它们可能不正确，也因为随着模型的改进它们可能很快过时。我们的博客文章《构建有效的智能体》将基本思想框定为&amp;ldquo;找到尽可能简单的解决方案，只有在需要时才增加复杂性&amp;rdquo;，这是任何维护智能体框架的人都会一致看到的模式。&lt;/p&gt;
&lt;p&gt;在我第一次尝试简化时，我大幅削减了框架并尝试了一些创造性的新想法，但我无法复制原始框架的性能。也很难判断框架设计的哪些部分实际上是承重的，以及以什么方式。基于这一经验，我转向了一种更有条理的方法，一次删除一个组件并审查它对最终结果的影响。&lt;/p&gt;
&lt;p&gt;当我经历这些迭代周期时，我们还发布了 Opus 4.6，这为减少框架复杂性提供了进一步的动力。有充分的理由期望 4.6 需要比 4.5 更少的脚手架。从我们的发布博客：&amp;ldquo;[Opus 4.6] 计划更仔细，更长时间地维持智能体任务，可以在更大的代码库中更可靠地运行，并具有更好的代码审查和调试技能来捕获自己的错误。&amp;rdquo;它在长上下文检索方面也有了实质性改进。这些都是框架旨在补充的能力。&lt;/p&gt;
&lt;h3 id="移除冲刺结构"&gt;移除冲刺结构
&lt;/h3&gt;&lt;p&gt;我首先完全移除了冲刺结构。冲刺结构有助于将工作分解为块，以便模型能够连贯地工作。鉴于 Opus 4.6 的改进，有充分的理由相信模型可以在没有这种分解的情况下原生处理工作。&lt;/p&gt;
&lt;p&gt;我保留了规划器和评估器，因为它们都继续增加明显的价值。没有规划器，生成器会缩小范围：给定原始提示，它会在没有首先规划其工作的情况下开始构建，最终创建的应用程序功能不如规划器丰富。&lt;/p&gt;
&lt;p&gt;移除冲刺结构后，我将评估器移至运行结束时的单次通过，而不是每个冲刺评分。由于模型的能力大大增强，它改变了评估器对某些运行的承重程度，其有用性取决于任务相对于模型可以单独可靠完成的位置。在 4.5 上，该边界很近：我们的构建处于生成器单独可以做好的边缘，评估器在整个构建中捕获了有意义的问题。在 4.6 上，模型的原始能力增加了，因此边界向外移动。过去需要评估器检查才能连贯实现的任务现在通常在生成器单独处理良好的范围内，对于该边界内的任务，评估器成为不必要的开销。但对于仍处于生成器能力边缘的构建部分，评估器继续提供真正的提升。&lt;/p&gt;
&lt;p&gt;实际含义是评估器不是一个固定的是或否决定。当任务超出当前模型单独可靠完成的范围时，它值得付出成本。&lt;/p&gt;
&lt;p&gt;除了结构简化之外，我还添加了提示以改进框架如何将 AI 功能构建到每个应用程序中，特别是让生成器构建一个可以通过工具驱动应用程序自身功能的适当智能体。这需要真正的迭代，因为相关知识足够新，以至于 Claude 的训练数据覆盖得很少。但经过足够的调整，生成器正确地构建了智能体。&lt;/p&gt;
&lt;h3 id="更新框架的结果"&gt;更新框架的结果
&lt;/h3&gt;&lt;p&gt;为了测试更新的框架，我使用以下提示生成了一个数字音频工作站（DAW），这是一个用于作曲、录音和混音歌曲的音乐制作程序：&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;使用 Web Audio API 在浏览器中构建一个功能齐全的 DAW。&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;运行仍然冗长且昂贵，大约 4 小时和 124 美元的令牌成本。大部分时间都花在了构建器上，它在没有 Opus 4.5 需要的冲刺分解的情况下连贯地运行了两个多小时。&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;智能体和阶段&lt;/th&gt;
 &lt;th&gt;时长&lt;/th&gt;
 &lt;th&gt;成本&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;规划器&lt;/td&gt;
 &lt;td&gt;4.7 分钟&lt;/td&gt;
 &lt;td&gt;$0.46&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;构建（第 1 轮）&lt;/td&gt;
 &lt;td&gt;2 小时 7 分钟&lt;/td&gt;
 &lt;td&gt;$71.08&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;QA（第 1 轮）&lt;/td&gt;
 &lt;td&gt;8.8 分钟&lt;/td&gt;
 &lt;td&gt;$3.24&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;构建（第 2 轮）&lt;/td&gt;
 &lt;td&gt;1 小时 2 分钟&lt;/td&gt;
 &lt;td&gt;$36.89&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;QA（第 2 轮）&lt;/td&gt;
 &lt;td&gt;6.8 分钟&lt;/td&gt;
 &lt;td&gt;$3.09&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;构建（第 3 轮）&lt;/td&gt;
 &lt;td&gt;10.9 分钟&lt;/td&gt;
 &lt;td&gt;$5.88&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;QA（第 3 轮）&lt;/td&gt;
 &lt;td&gt;9.6 分钟&lt;/td&gt;
 &lt;td&gt;$4.06&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;V2 框架总计&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;3 小时 50 分钟&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;$124.70&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;与之前的框架一样，规划器将一行提示扩展为完整的规格。从日志中，我可以看到生成器模型在规划应用程序和智能体设计、连接智能体以及在交给 QA 之前测试它方面做得很好。&lt;/p&gt;
&lt;p&gt;话虽如此，QA 智能体仍然捕获了真正的差距。在其第一轮反馈中，它指出：&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;这是一个强大的应用程序，具有出色的设计保真度、可靠的 AI 智能体和良好的后端。主要失败点是功能完整性——虽然应用程序看起来令人印象深刻，AI 集成工作良好，但几个核心 DAW 功能只是显示而没有交互深度：片段无法在时间轴上拖动/移动，没有乐器 UI 面板（合成器旋钮、鼓垫），也没有视觉效果编辑器（EQ 曲线、压缩器仪表）。这些不是边缘情况——它们是使 DAW 可用的核心交互，规格明确要求它们。&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;在其第二轮反馈中，它再次捕获了几个功能差距：&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;剩余差距：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;音频录制仍然只是存根（按钮切换但没有麦克风捕获）&lt;/li&gt;
&lt;li&gt;通过边缘拖动调整片段大小和片段分割未实现&lt;/li&gt;
&lt;li&gt;效果可视化是数字滑块，而不是图形（没有 EQ 曲线）&lt;/li&gt;
&lt;/ul&gt;

 &lt;/blockquote&gt;
&lt;p&gt;生成器在自行处理时仍然容易遗漏细节或存根功能，QA 在捕获这些最后一英里问题以供生成器修复方面仍然增加了价值。&lt;/p&gt;
&lt;p&gt;根据提示，我期望的是一个程序，我可以在其中创建旋律、和声和鼓模式，将它们编排成一首歌曲，并在此过程中从集成的智能体获得帮助。下面的视频显示了结果。&lt;/p&gt;
&lt;p&gt;该应用程序远非专业的音乐制作程序，智能体的歌曲创作技能显然还需要大量工作。此外，Claude 实际上听不到声音，这使得 QA 反馈循环在音乐品味方面效果较差。&lt;/p&gt;
&lt;p&gt;但最终的应用程序具有功能性音乐制作程序的所有核心部分：在浏览器中运行的工作编排视图、混音器和传输。除此之外，我能够完全通过提示组合一个简短的歌曲片段：智能体设置了速度和调性，铺设了旋律，构建了鼓轨道，调整了混音器电平，并添加了混响。歌曲创作的核心原语都存在，智能体可以自主驱动它们，使用工具从头到尾创建一个简单的作品。你可能会说它还不够完美——但它正在接近。&lt;/p&gt;
&lt;h2 id="接下来是什么"&gt;接下来是什么
&lt;/h2&gt;&lt;p&gt;随着模型的不断改进，我们可以大致预期它们能够工作更长时间，并处理更复杂的任务。在某些情况下，这意味着围绕模型的脚手架随着时间的推移变得不那么重要，开发人员可以等待下一个模型并看到某些问题自行解决。另一方面，模型越好，就有越多的空间来开发能够完成超出模型基线能力的复杂任务的框架。&lt;/p&gt;
&lt;p&gt;考虑到这一点，这项工作中有几个值得继续发扬的经验教训。实验你正在构建的模型、阅读其在现实问题上的跟踪并调整其性能以实现你期望的结果始终是良好的实践。在处理更复杂的任务时，有时可以通过分解任务并将专门的智能体应用于问题的每个方面来获得改进空间。当新模型发布时，重新审查框架通常是良好的实践，剥离不再对性能承重的部分，并添加新部分以实现以前可能无法实现的更大能力。&lt;/p&gt;
&lt;p&gt;从这项工作中，我的信念是，随着模型的改进，有趣的框架组合空间不会缩小。相反，它会移动，AI 工程师的有趣工作是不断寻找下一个新颖的组合。&lt;/p&gt;
&lt;h2 id="致谢"&gt;致谢
&lt;/h2&gt;&lt;p&gt;特别感谢 Alex Albert、Erik Schluntz、Mike Krieger 和 Zack Witten 对这项工作的贡献和反馈。&lt;/p&gt;</description></item></channel></rss>