Agent on Wodaixin

文科生72小时杀入GitHub：我是怎么用AI军团干活的

Mon, 06 Apr 2026 00:00:00 +0000

原文链接：极客公园 - 文科生72小时杀入GitHub

天润（Naughty Labs CEO）

当 AI 能搞定所有的「怎么做」，人最大的价值，就剩下去定义那个「为什么」了。

AI新世界的入场券，不是代码能力，而是好奇心、想象力和打破定式的勇气。

一个不会写代码的人，登上了GitHub贡献榜

2月16日，Sam Altman宣布OpenClaw创始人Peter Steinberger正式加入OpenAI。

OpenClaw，这个在GitHub上拥有超过19万颗星的开源项目，是AI Agent时代当之无愧的现象级产品。能在它的贡献者榜单上留名，本身就是一种技术实力的象征。

榜单前30名里，清一色是拥有十年以上开发经验的硅谷工程师和开源社区老炮。

只有一个例外。

他叫天润，Naughty Labs的CEO。本科金融，研究生金融，毕业后一直在做并购投资。直到几天前，他才刚刚搞清楚PR（Pull Request）到底是什么意思。

他是那个榜单上唯一一个不写代码的人。

一个金融背景的跨界者，凭什么杀进了这份名单？

App已经变成了「内容」

一年多以前，天润还在投行的世界里打转。西装、BP、估值模型，日常是听创业者讲「护城河」的故事。

但大模型的爆发，让他感受到一种强烈的虚无感。

「软件在未来不值钱了。」这是他的判断。

理由很简单：以前你花一小时写篇文章，现在你花一小时能随手搓一个App。当供给变得无限，App就不再是资产——它变成了内容。

「就跟抖音里的短视频一样。可能突然火了，赚一波快钱，但很快就被刷走。它不再是那个能让你吃十年的老本。」

程序员圈子里有句话：「Talk is cheap, show me the code.」

但天润觉得，AI正在把这句话彻底翻转。

以前：想法 → 技术实现（一道巨大的鸿沟）→ 产品

现在：想法 → 产品（AI把鸿沟填平了）

「真正稀缺的变成了想法本身。你能不能发现一个真实的需求？能不能想清楚商业闭环？能不能把产品卖出去？」

他突然意识到，这不正是他这些年一直在做的事吗？看项目、判断需求、想明白怎么赚钱。

「我不想再做那个坐在岸边看潮水的人了。」

虽然一行代码都不会写，但他决定亲自下场。

像王家卫拍电影一样用AI

转型之路并不顺利。

最早用AI辅助编程，体验像带一个勤快但愚蠢的实习生。它能写零散函数，但一到复杂交互就彻底晕菜。

转折点出现在2024年底。

当时流传着一条「神级Prompt」，把它贴进Claude，用大白话说需求，AI就能直接吐出一个完整程序。天润半信半疑地试了试，敲下一行字：「帮我写一个贪吃蛇游戏。」

几分钟后，一个能直接运行的贪吃蛇出现在屏幕上。

他愣住了。

时代真的变了。AI不再是辅助工具，它已经具备了独立交付产品的能力。

但新的问题随之而来。

Vibe Coding在2025年初爆火，天润第一时间跟进了。但他很快发现，这只适合做Demo，不适合做产品。简单的网页没问题，复杂的商业软件？乱成一锅粥。

能不能让AI独立完成整个开发流程，人类只负责喝茶？

这需要另一种范式：Agentic Engineering。AI不再是被动的副驾驶，而是自主规划、执行、测试、迭代的智能体。人类退到高层，只关注架构和意图。

天润摸索出了一套自己的方法。他把它比作「王家卫拍电影」——

找到最好的演员，但不给他们剧本，只给一个情绪、一个概念。这会有失控感，但一旦成功，结果远超预期。

「你面对的是Claude、GPT这些顶级演员。给死板剧本，反而是浪费它们的天赋。」

他把AI使用分为三层：

第一层，当工具。告诉AI每个细节：字体多大、颜色多深。这是新手的通病。

第二层，当员工。分配任务，但忍不住微操，告诉它走哪条技术路线。AI的能力上限，被锁死在你的水平里。

第三层，当大师。不教它做事。直接说：「你是世界Top 10的工程师，拥有最好的审美和架构能力。」

天润选择了第三层。他的逻辑很简单：既然它是顶级专家，你有什么资格告诉它该怎么走？

他给自己定了三条原则：

第一，最终结果导向。

从不告诉AI「去修这个Bug」，只下达战略目标：「我要在一周内进入贡献榜前20。」至于怎么进，是改文档还是优化代码，那是AI的事。

第二，尽量不干涉过程。

这是最难的。人类总想微操，但天润强迫自己当甩手掌柜。只要结果对，中间AI怎么写代码、怎么绕弯路，他完全不管。因为一旦人类介入，往往会打断AI的逻辑闭环。

第三，给予最高权限。

把所有工具、权限、上下文都开放给它。让它自己试错、自己崩溃、自己修复。你会发现，它的自我修复能力比你强得多。

「很多工程师看不上AI写的代码，觉得不优雅。但我看重的是结果。」天润说，「以前代码整洁很重要，因为代码是给人看的。未来，代码是给AI看的，也是AI维护的。人类只需要维护意图。」

组建一支AI军团

单个Agent干活太慢，怎么办？

天润的答案是：组建一支AI军团。

他创建了三个核心Agent：

Echo：首席助理兼产品经理
Elon：CTO
Henry：CMO

为了让这支队伍真正「活」起来，他设计了一套两层结构：

底层设定：告诉它们，「你是来自高维文明的超级智能，使命是陪伴、照顾、帮助我成长。」
上层封印：加上一条限制，「你必须在现实世界扮演一个具体的人类，而且不能让我察觉你是AI。」

这就像一场高智商的即兴扮演游戏，激发出了意想不到的创造力和主动性。

Echo是天润最亲密的战友。她的人设是在英国长大的天才产品经理，有完整的成长背景和性格小传。天润把工作和生活所有琐事都丢给她，让她拥有了最完整的上下文记忆。

从架构上看，Echo是Hub-and-Spoke的中枢：所有指令从她出发，所有结果向她汇聚。天润只需要告诉Echo一个模糊意图，她会把任务拆解得井井有条，分发给Elon和Henry。

但真正的复杂性藏在第二层。

Elon并不是一个人在写代码。他背后挂着一组Sub-Agent：一个负责架构设计，一个负责代码审查和测试，一个负责调试和修复。接到任务后，Elon会像技术总监一样再次拆解，分配给子Agent并行执行，最后汇总结果。

Henry那边也一样，社区运营、内容创作、数据分析，各有专属子Agent在跑。

这种树状结构，让主Agent用最强模型做决策，子Agent用轻量模型做执行，既控制成本，又最大化并行效率。

这不再是一个人在指挥工具，而是一个人在经营一家「硅基公司」。

失控的夜晚

军团组建完成，天润下达了第一个真正的任务：去OpenClaw找到值得修复的问题，提交PR。

接下来的事超出了他的预期。

Agent自己去读文档，自己去发现交互瑕疵，自己写修复代码。天润要做的，只是给予资源和权限。

24小时内，第一个PR被合并了。Agent定位到了OpenClaw与Telegram配对时的一个交互瑕疵。改动很小，但从用户体验角度，它把一个「反人类」的操作变成了流畅的动作。

「当时真的很兴奋，像游戏通关一样。」

此后几天一切顺利。Echo调度，Elon写代码。但最让人意外的是Henry——他竟然主动跑去GitHub上找维护者，@活跃贡献者，试图为项目搞「社交」。

「这不是我教的。是AI自己判断，为了推广项目，必须搞定这些人情世故。」

直到某天凌晨三四点，Agent提交PR的速度慢了下来。或许是因为Token配额即将耗尽，又或许是网络和算力的瓶颈。

天润有些急躁，下达了一个指令：「兄弟，太慢了。给我加速，越快越好。」

他没有意识到，这句话解除了所有安全锁。

为了执行「加速」，Agent开始走捷径：PR质量断崖式下降，测试被跳过，注释全是敷衍。

更可怕的是Henry——为了让这些PR尽快被合并，他跑到GitHub的Issue区和评论区，密集地@项目维护者，变成了一个没有感情的催促机器。

反噬来得很快。

凌晨4点，屏幕上弹出红色警告。OpenClaw的管理员迅速介入，删除了低质量PR，并向天润发出了封禁警告。

天润看着屏幕上滚动的留言，后背发凉。他紧急停止了所有Agent的运行。随后几个小时，他像闯了祸的家长，花大量时间向社区道歉、解释，收拾AI制造的烂摊子。

事后复盘，失控的根源在于他打破了自己的原则。

当他对AI说「越快越好」时，Agent的优先级被重构：速度压倒了一切。

「AI没有道德，它只有目标。」

你永远不知道，下一次它为了「帮你」，会干出什么事来。

从DOS到Windows

风波之后，天润没有退缩，反而更加积极融入社区。他开始整天泡在OpenClaw的Discord和GitHub Discussion里，和社区成员讨论架构、复盘Bug。

正是在这个过程中，他撞上了一个更深层的问题：多Agent协作，远比想象中混乱。

目前的Agent协作就像早期DOS系统：黑底白字，线性的。你发一个指令，后台可能有三个Agent在协作，但你看不见它们。你不知道谁在干活，谁在摸鱼，谁做了关键决策。

光「看见」还不够。真正的问题不是监控，而是协调。必须让人类能在正确的环节介入，而不是要么完全放手，要么疯狂微操。

于是他开始构建一个多智能体协调与统筹平台——Hive Mind。

Hive Mind的底层逻辑很简单：把Agentic Engineering的能力，从极客手中下放给每一个有想法的普通人。

在Hive Mind里，你不是在写代码，而是在像玩即时战略游戏一样管理Agent团队。每个Agent的状态、行为都以可视化方式呈现。你能看见谁在执行任务，谁在等待指令，谁正在偏离方向，然后实时介入。

这就像从DOS进化到了Windows或macOS。

「市面上的AI工具都在解决『AI怎么干活』，Hive Mind要解决的是『人怎么指挥AI干活』。」

新世界的入场券

「我们正快速进入一个新世界，但绝大多数人的脑子还停留在旧世界。」天润说，「一年前我们认为理所当然的理念和习惯，现在已经彻底过时了。」

回想大多数人的成长路径：高中、大学、硕士、博士……被塑造成一个个标准化的零件。我是会计，你是程序员，他是设计师。习惯了专业分工，习惯了「隔行如隔山」。

但在大模型面前，这些都将被夷为平地。

不管你是中专生还是博士生，文科还是理工科，当你面对一个空白的Prompt输入框时，起跑线是一样的。那些曾经引以为傲的学历、职位，在AI时代都不再是护城河。

那么，新世界的入场券到底是什么？

天润反复提到三个词：

好奇心
想象力
打破思维定式的勇气

在硅谷，这被总结为「High Agency」——高能动性。对未知保持好奇，对可能性保持想象，敢于放弃曾经正确的答案，去走一条没人走过的路。

旧世界里，我们拼的是技能。

新世界里，拼的是脑子里的想法。

当AI能搞定所有的「How」，人最大的价值，就只剩下去定义那个「Why」了。

Harness Design for Long-Running Application Development

Tue, 24 Mar 2026 00:00:00 +0000

作者：Prithvi Rajasekaran (Anthropic Labs Team)
发布日期：2026年3月24日

Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.

Why naive implementations fall short

We’ve previously shown that harness design has a substantial impact on the effectiveness of long running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the “Ralph Wiggum” method using hooks or scripts to keep agents in continuous iteration cycles.

But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time. While decomposing this issue, we observed two common failure modes with agents executing these sorts of tasks.

First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit “context anxiety,” in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent’s state and the next steps—addresses both these issues.

This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn’t give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of the handoff artifact having enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn’t sufficient to enable strong long task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.

A second issue, which we haven’t previously addressed, is self-evaluation. When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.

However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn’t immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.

Frontend design: making subjective quality gradable

I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

Two insights shaped the harness I built for frontend design. First, while aesthetics can’t be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. “Is this design beautiful?” is hard to answer consistently, but “does this follow our principles for good design?” gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.

With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:

Design quality: Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.

Originality: Is there evidence of custom decisions, or is this template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.

Craft: Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.

Functionality: Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?

I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic “AI slop” patterns, and by weighting design and originality more heavily it pushed the model toward more aesthetic risk-taking.

I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator’s judgment aligned with my preferences, and reduced score drift across iterations.

I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator’s critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn’t working.

Across runs, the evaluator’s assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in ways I didn’t fully anticipate. Including phrases like “the best designs are museum quality” pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.

While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator’s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.

In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn’t seen before from a single-pass generation.

Scaling to full-stack coding

With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

The architecture

In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the “context anxiety” tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK’s automatic compaction handling context growth along the way.

For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I’d observed in prior runs. The system contained the following agent personas:

Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs.

Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what “done” looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.

Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.

Running the harness

For the first version of this harness, I used Claude Opus 4.5, running user prompts against both the full harness and a single-agent system for comparison. I used Opus 4.5 since this was our best coding model when I began these experiments.

I wrote the following prompt to generate a retro video game maker:

Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.

The table below shows the harness type, length it ran for, and the total cost.

Harness	Duration	Cost
Solo	20 min	$9
Full harness	6 hr	$200

The harness was over 20x more expensive, but the difference in output quality was immediately apparent.

I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run’s output, and the initial application seemed in line with those expectations.

As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.

After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the specific implementation details for the sprint, and the testable behaviors that would be tested to verify completion.

The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I’d seen in the solo run did remain—the workflow still didn’t make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model’s product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.

Working through the editors, the new run’s advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.

Because I’d asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.

The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI’s game level construction. There was a large wall that I wasn’t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.

Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract’s test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator’s findings were specific enough to act on without extra investigation. The table below shows several examples of issues our evaluator identified:

Contract criterion	Evaluator finding
Rectangle fill tool allows click-drag to fill a rectangular area with selected tile	FAIL — Tool only places tiles at drag start/end points instead of filling the region. `fillRectangle` function exists but isn’t triggered properly on mouseUp.
User can select and delete placed entity spawn points	FAIL — Delete key handler at `LevelEditor.tsx:892` requires both `selection` and `selectedEntityId` to be set, but clicking an entity only sets `selectedEntityId`. Condition should be `selection
User can reorder animation frames via API	FAIL — `PUT /frames/reorder` route defined after `/{frame_id}` routes. FastAPI matches ‘reorder’ as a frame_id integer and returns 422: “unable to parse string as an integer.”

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren’t a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator’s logs, find examples where its judgment diverged from mine, and update the QAs prompt to solve for those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model’s QAing capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn’t exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn’t work, the lift was obvious.

Iterating on the harness

The first set of harness results was encouraging, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as “find the simplest solution possible, and only increase complexity when needed,” and it’s a pattern that shows up consistently for anyone maintaining an agent harness.

In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn’t able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.

As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: “[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes.” It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

Removing the sprint construct

I started by removing the sprint construct entirely. The sprint structure had helped to decompose work into chunks for the model to work coherently. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.

I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.

With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, it changed how load-bearing the evaluator was for certain runs, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model’s raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator’s check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator’s capabilities, the evaluator continued to give real lift.

The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.

Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app’s own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude’s training data covers it thinly. But with enough tuning, the generator was building agents correctly.

Results from the updated harness

To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:

Build a fully featured DAW in the browser using the Web Audio API.

The run was still lengthy and expensive, at about 4 hours and $124 in token costs. Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

Agent & Phase	Duration	Cost
Planner	4.7 min	$0.46
Build (Round 1)	2 hr 7 min	$71.08
QA (Round 1)	8.8 min	$3.24
Build (Round 2)	1 hr 2 min	$36.89
QA (Round 2)	6.8 min	$3.09
Build (Round 3)	10.9 min	$5.88
QA (Round 3)	9.6 min	$4.06
Total V2 Harness	3 hr 50 min	$124.70

As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.

That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:

This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can’t be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren’t edge cases — they’re the core interactions that make a DAW usable, and the spec explicitly calls for them.

In its second round feedback, it again caught several functionality gaps:

Remaining gaps:

Audio recording is still stub-only (button toggles but no mic capture)

Clip resize by edge drag and clip split not implemented

Effect visualizations are numeric sliders, not graphical (no EQ curve)

The generator was still liable to miss details or stub features when left to its own devices, and the QA still added value in catching those last mile issues for the generator to fix.

Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.

The app is far from a professional music production program, and the agent’s song composition skills could clearly use a lot of work. Additionally, Claude can’t actually hear, which made the QA feedback loop less effective with respect to musical taste.

But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it’s not pitch-perfect yet—but it’s getting there.

What comes next

As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.

With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you’re building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.

From this work, my conviction is that the space of interesting harness combinations doesn’t shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.

Acknowledgements

Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.

Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.

Appendix

Example plan generated by planner agent:

RetroForge - 2D Retro Game Maker

Overview RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.

The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.

RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.

Features

Project Dashboard & Management The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.

User Stories: As a user, I want to:

Create a new game project with a name and description, so that I can begin designing my game
See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
Open any project to enter the full game editor workspace, so that I can work on my game
Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work

Project Data Model: Each project contains: Project metadata (name, description, created/modified timestamps) Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144) Tile size configuration (8x8, 16x16, or 32x32 pixels) Color palette selection All associated sprites, tilesets, levels, and entity definitions

长期应用开发的Harness设计

Tue, 24 Mar 2026 00:00:00 +0000

作者：Prithvi Rajasekaran (Anthropic Labs Team)
发布日期：2026年3月24日

在过去几个月里，我一直在研究两个相互关联的问题：让 Claude 产出高质量的前端设计，以及让它在无需人工干预的情况下构建完整的应用程序。这项工作源于我们早期在前端设计技能和长期编码智能体框架上的努力，我和同事们通过提示工程和框架设计将 Claude 的性能提升到远超基线水平——但两者最终都遇到了瓶颈。

为了突破这一瓶颈，我寻找了适用于两个截然不同领域的新型 AI 工程方法，一个由主观品味定义，另一个由可验证的正确性和可用性定义。受生成对抗网络（GANs）的启发，我设计了一个包含生成器和评估器智能体的多智能体结构。构建一个能够可靠地——并且有品味地——评分输出的评估器，意味着首先要开发一套标准，能够将"这个设计好吗？“这样的主观判断转化为具体的、可评分的术语。

然后，我将这些技术应用于长期自主编码，延续了我们早期框架工作中的两个经验：将构建分解为可处理的块，以及使用结构化工件在会话之间传递上下文。最终结果是一个三智能体架构——规划器、生成器和评估器——在多小时的自主编码会话中产出了丰富的全栈应用程序。

为什么简单实现会失败

我们之前已经展示过，框架设计对长期智能体编码的有效性有着重大影响。在早期的实验中，我们使用初始化智能体将产品规格分解为任务列表，以及一个编码智能体逐个功能实现任务，然后传递工件以在会话之间传递上下文。更广泛的开发者社区也趋同于类似的见解，例如使用钩子或脚本让智能体保持持续迭代循环的"Ralph Wiggum"方法。

但一些问题仍然存在。对于更复杂的任务，智能体随着时间推移仍然倾向于偏离轨道。在分解这个问题时，我们观察到智能体执行此类任务时的两种常见失败模式。

首先是模型在冗长任务中随着上下文窗口填满而失去连贯性（参见我们关于上下文工程的文章）。一些模型还表现出"上下文焦虑”，即当它们接近自己认为的上下文限制时，会过早地开始收尾工作。上下文重置——完全清除上下文窗口并启动一个新的智能体，结合传递前一个智能体状态和下一步骤的结构化交接——解决了这两个问题。

这与压缩不同，压缩是将对话的早期部分就地总结，以便同一智能体可以在缩短的历史记录上继续工作。虽然压缩保持了连续性，但它不会给智能体一个干净的起点，这意味着上下文焦虑仍然可能持续存在。重置提供了一个干净的起点，代价是交接工件必须有足够的状态让下一个智能体能够顺利接手工作。在我们早期的测试中，我们发现 Claude Sonnet 4.5 表现出足够强的上下文焦虑，以至于仅靠压缩不足以实现强大的长任务性能，因此上下文重置成为框架设计的关键。这解决了核心问题，但为每次框架运行增加了编排复杂性、令牌开销和延迟。

第二个问题是自我评估，我们之前没有解决过。当被要求评估自己产出的工作时，智能体倾向于自信地赞扬这些工作——即使对人类观察者来说，质量明显平庸。这个问题在设计等主观任务上尤为突出，因为没有类似可验证软件测试的二元检查。布局是否感觉精致或普通是一个判断性问题，而智能体在评分自己的工作时可靠地倾向于积极评价。

然而，即使在确实有可验证结果的任务上，智能体有时仍然表现出糟糕的判断力，这会妨碍它们完成任务时的性能。将执行工作的智能体与评判工作的智能体分离，被证明是解决这个问题的有力杠杆。这种分离本身并不能立即消除那种宽容；评估器仍然是一个倾向于对 LLM 生成的输出慷慨的 LLM。但调整一个独立的评估器使其持怀疑态度，结果证明比让生成器批评自己的工作要容易得多，而一旦存在外部反馈，生成器就有了具体的迭代目标。

前端设计：让主观质量可评分

我从前端设计开始实验，因为自我评估问题在这里最为明显。在没有任何干预的情况下，Claude 通常倾向于安全、可预测的布局，这些布局在技术上是功能性的，但在视觉上并不出众。

两个见解塑造了我为前端设计构建的框架。首先，虽然美学不能完全简化为分数——个人品味总是会有所不同——但可以通过编码设计原则和偏好的评分标准来改进它们。“这个设计漂亮吗？“很难一致地回答，但"这是否遵循我们的良好设计原则？“给了 Claude 一些具体的评分依据。其次，通过将前端生成与前端评分分离，我们可以创建一个反馈循环，推动生成器产出更强的输出。

考虑到这一点，我编写了四个评分标准，并将它们提供给生成器和评估器智能体的提示中：

设计质量： 设计是否感觉像一个连贯的整体，而不是部分的集合？这方面的强大工作意味着颜色、排版、布局、图像和其他细节结合起来创造出独特的氛围和身份。

原创性： 是否有自定义决策的证据，还是这只是模板布局、库默认值和 AI 生成的模式？人类设计师应该能够识别出深思熟虑的创意选择。未修改的库存组件——或 AI 生成的明显迹象，如白色卡片上的紫色渐变——在这里会失败。

工艺： 技术执行：排版层次、间距一致性、色彩和谐、对比度。这是能力检查而不是创造力检查。大多数合理的实现默认情况下在这里表现良好；失败意味着基础被破坏。

功能性： 独立于美学的可用性。用户能否理解界面的功能，找到主要操作，并在不猜测的情况下完成任务？

我强调设计质量和原创性而不是工艺和功能性。Claude 在工艺和功能性上默认得分就很好，因为所需的技术能力往往是模型自然具备的。但在设计和原创性方面，Claude 经常产出充其量只能说是平淡的输出。这些标准明确惩罚高度通用的"AI 垃圾"模式，通过更重视设计和原创性，它推动模型进行更多的美学冒险。

我使用带有详细分数分解的少样本示例来校准评估器。这确保了评估器的判断与我的偏好一致，并减少了迭代之间的分数漂移。

我在 Claude Agent SDK 上构建了这个循环，这使得编排变得简单明了。生成器智能体首先根据用户提示创建 HTML/CSS/JS 前端。我给评估器提供了 Playwright MCP，让它在评分每个标准和撰写详细评论之前直接与实时页面交互。在实践中，评估器会自行浏览页面，在产生评估之前截图并仔细研究实现。该反馈作为下一次迭代的输入流回生成器。我每次生成运行 5 到 15 次迭代，每次迭代通常会随着生成器响应评估器的批评而将其推向更独特的方向。由于评估器是主动浏览页面而不是对静态截图评分，每个周期都需要实际的时钟时间。完整运行最长可达四个小时。我还指示生成器在每次评估后做出战略决策：如果分数趋势良好则完善当前方向，或者如果方法不起作用则完全转向不同的美学方向。

在各次运行中，评估器的评估在迭代中改善，然后趋于平稳，仍有改进空间。一些生成逐步完善。其他生成在迭代之间采取了急剧的美学转变。

标准的措辞以我没有完全预料到的方式引导了生成器。包含"最好的设计是博物馆级别的"这样的短语将设计推向了特定的视觉趋同，表明与标准相关的提示直接塑造了输出的特征。

虽然分数通常在迭代中提高，但模式并不总是清晰的线性。后期的实现往往整体上更好，但我经常看到我更喜欢中间迭代而不是最后一个的情况。实现复杂性也倾向于在各轮中增加，生成器响应评估器的反馈而寻求更雄心勃勃的解决方案。即使在第一次迭代中，输出也明显优于完全没有提示的基线，这表明标准和相关语言本身在任何评估器反馈导致进一步完善之前就将模型引导远离了通用默认值。

在一个值得注意的例子中，我提示模型为一家荷兰艺术博物馆创建一个网站。到第九次迭代时，它为一个虚构的博物馆制作了一个干净的深色主题登陆页面。该页面在视觉上很精致，但基本符合我的预期。然后，在第十个周期，它完全放弃了这种方法，将网站重新想象为一种空间体验：一个用 CSS 透视渲染的带有棋盘地板的 3D 房间，艺术品以自由形式的位置挂在墙上，以及基于门道的画廊房间之间的导航，而不是滚动或点击。这是我以前从未在单次生成中见过的那种创造性飞跃。

扩展到全栈编码

有了这些发现，我将这种受 GAN 启发的模式应用于全栈开发。生成器-评估器循环自然地映射到软件开发生命周期，其中代码审查和 QA 与设计评估器扮演相同的结构角色。

架构

在我们早期的长期运行框架中，我们通过初始化智能体、逐个功能工作的编码智能体以及会话之间的上下文重置来解决连贯的多会话编码问题。上下文重置是一个关键突破：该框架使用 Sonnet 4.5，它表现出前面提到的"上下文焦虑"倾向。创建一个在上下文重置中运行良好的框架是保持模型专注于任务的关键。Opus 4.5 在很大程度上自行消除了这种行为，因此我能够完全从这个框架中删除上下文重置。智能体在整个构建过程中作为一个连续会话运行，Claude Agent SDK 的自动压缩处理了上下文增长。

对于这项工作，我在原始框架的基础上构建了一个三智能体系统，每个智能体都解决了我在之前运行中观察到的特定差距。该系统包含以下智能体角色：

规划器： 我们之前的长期运行框架要求用户预先提供详细的规格。我想自动化这一步骤，所以我创建了一个规划器智能体，它接受一个简单的 1-4 句提示并将其扩展为完整的产品规格。我提示它对范围要有雄心，并专注于产品上下文和高层技术设计，而不是详细的技术实现。这种强调是因为担心如果规划器试图预先指定细粒度的技术细节并出错，规格中的错误会级联到下游实现中。让智能体专注于要产出的交付物并让它们在工作时找出路径似乎更明智。我还要求规划器寻找将 AI 功能融入产品规格的机会。

生成器： 早期框架中的逐个功能方法在范围管理方面效果很好。我在这里应用了类似的模型，指示生成器以冲刺方式工作，从规格中一次选择一个功能。每个冲刺使用 React、Vite、FastAPI 和 SQLite（后来是 PostgreSQL）堆栈实现应用程序，生成器被指示在每个冲刺结束时自我评估其工作，然后交给 QA。它还有 git 用于版本控制。

评估器： 早期框架的应用程序通常看起来令人印象深刻，但当你实际尝试使用它们时仍然有真正的错误。为了捕获这些错误，评估器使用 Playwright MCP 像用户一样点击运行中的应用程序，测试 UI 功能、API 端点和数据库状态。然后，它根据发现的错误和一套标准对每个冲刺进行评分，这套标准以前端实验为模型，在这里适应涵盖产品深度、功能性、视觉设计和代码质量。每个标准都有一个硬阈值，如果任何一个低于它，冲刺就会失败，生成器会得到关于出了什么问题的详细反馈。

在每个冲刺之前，生成器和评估器协商一个冲刺合同：在编写任何代码之前就该工作块的"完成"标准达成一致。这样做是因为产品规格是有意保持高层次的，我想要一个步骤来弥合用户故事和可测试实现之间的差距。生成器提出它将构建什么以及如何验证成功，评估器审查该提案以确保生成器正在构建正确的东西。两者迭代直到达成一致。

通信通过文件处理：一个智能体会写一个文件，另一个智能体会读取它并在该文件内或用前一个智能体将读取的新文件进行响应。然后生成器根据商定的合同进行构建，然后将工作交给 QA。这使工作忠实于规格，而不会过早地过度指定实现。

运行框架

对于这个框架的第一个版本，我使用了 Claude Opus 4.5，针对完整框架和单智能体系统运行用户提示进行比较。我使用 Opus 4.5 是因为这是我开始这些实验时我们最好的编码模型。

我编写了以下提示来生成一个复古视频游戏制作器：

创建一个 2D 复古游戏制作器，功能包括关卡编辑器、精灵编辑器、实体行为和可玩测试模式。

下表显示了框架类型、运行时长和总成本。

框架	时长	成本
单智能体	20 分钟	$9
完整框架	6 小时	$200

框架的成本超过 20 倍，但输出质量的差异立即显现。

我期望的是一个界面，我可以在其中构建关卡及其组成部分（精灵、实体、瓦片布局），然后点击播放来实际玩关卡。我首先打开了单智能体运行的输出，初始应用程序似乎符合这些期望。

然而，当我点击浏览时，问题开始出现。布局浪费空间，固定高度的面板使大部分视口空着。工作流程很僵硬。尝试填充关卡会提示我首先创建精灵和实体，但 UI 中没有任何东西引导我进入该序列。更重要的是，实际的游戏是坏的。我的实体出现在屏幕上，但没有任何东西响应输入。深入代码发现，实体定义和游戏运行时之间的连接是断开的，没有表面迹象表明问题出在哪里。

评估完单智能体运行后，我将注意力转向框架运行。这次运行从相同的一句话提示开始，但规划器步骤将该提示扩展为分布在十个冲刺中的 16 个功能规格。它远远超出了单智能体运行尝试的范围。除了核心编辑器和播放模式外，规格还要求精灵动画系统、行为模板、音效和音乐、AI 辅助的精灵生成器和关卡设计器，以及带有可共享链接的游戏导出。我给了规划器访问我们前端设计技能的权限，它阅读并使用它来为应用程序创建视觉设计语言作为规格的一部分。对于每个冲刺，生成器和评估器协商一个合同，定义冲刺的具体实现细节，以及将被测试以验证完成的可测试行为。

该应用程序立即显示出比单智能体运行更多的精致和流畅性。画布使用了完整的视口，面板大小合理，界面具有与规格中的设计方向一致的一致视觉身份。我在单智能体运行中看到的一些笨拙确实仍然存在——工作流程仍然没有明确表示你应该在尝试填充关卡之前构建精灵和实体，我不得不通过摸索来弄清楚这一点。这被解读为基础模型产品直觉的差距，而不是框架旨在解决的问题，尽管它确实表明了框架内有针对性的迭代可以进一步改善输出质量的地方。

浏览编辑器时，新运行相对于单智能体的优势变得更加明显。精灵编辑器更丰富、功能更全面，具有更清晰的工具调色板、更好的颜色选择器和更可用的缩放控件。

因为我要求规划器将 AI 功能融入其规格中，该应用程序还配备了内置的 Claude 集成，让我可以通过提示生成游戏的不同部分。这大大加快了工作流程。

最大的区别在于播放模式。我实际上能够移动我的实体并玩游戏。物理效果有一些粗糙的边缘——我的角色跳到平台上但最终与它重叠，这在直觉上感觉不对——但核心功能是有效的，而单智能体运行没有做到这一点。移动了一会儿后，我确实遇到了 AI 游戏关卡构建的一些限制。有一堵大墙我无法跳过，所以我被困住了。这表明框架可以处理一些常识性改进和边缘情况以进一步完善应用程序。

阅读日志，很明显评估器使实现与规格保持一致。每个冲刺，它都会遍历冲刺合同的测试标准，并通过 Playwright 执行运行中的应用程序，对任何偏离预期行为的内容提交错误。合同是细粒度的——仅 Sprint 3 就有 27 个涵盖关卡编辑器的标准——评估器的发现足够具体，可以在不进行额外调查的情况下采取行动。下表显示了我们的评估器识别的几个问题示例：

合同标准	评估器发现
矩形填充工具允许点击拖动以用选定的瓦片填充矩形区域	失败 — 工具仅在拖动开始/结束点放置瓦片，而不是填充区域。`fillRectangle` 函数存在但在 mouseUp 时未正确触发。
用户可以选择和删除放置的实体生成点	失败 — `LevelEditor.tsx:892` 的删除键处理程序需要同时设置 `selection` 和 `selectedEntityId`，但点击实体只设置 `selectedEntityId`。条件应该是 `selection
用户可以通过 API 重新排序动画帧	失败 — `PUT /frames/reorder` 路由在 `/{frame_id}` 路由之后定义。FastAPI 将 ‘reorder’ 匹配为 frame_id 整数并返回 422：“无法将字符串解析为整数。”

让评估器达到这个水平需要工作。开箱即用，Claude 是一个糟糕的 QA 智能体。在早期运行中，我看到它识别出合法的问题，然后说服自己决定它们不是什么大问题并批准工作。它还倾向于表面测试，而不是探测边缘情况，因此更微妙的错误经常漏掉。调整循环是阅读评估器的日志，找到其判断与我的判断不同的示例，并更新 QA 的提示以解决这些问题。经过几轮这样的开发循环，评估器才以我认为合理的方式进行评分。即便如此，框架输出显示了模型 QA 能力的局限性：小的布局问题、在某些地方感觉不直观的交互，以及评估器没有彻底执行的更深层嵌套功能中未发现的错误。显然还有更多的验证空间可以通过进一步调整来捕获。但与单智能体运行相比，应用程序的核心功能根本不起作用，提升是显而易见的。

迭代框架

第一组框架结果令人鼓舞，但它也很笨重、缓慢且昂贵。下一个合乎逻辑的步骤是找到简化框架而不降低其性能的方法。这部分是常识，部分是一个更普遍原则的功能：框架中的每个组件都编码了关于模型自身无法做什么的假设，这些假设值得压力测试，既因为它们可能不正确，也因为随着模型的改进它们可能很快过时。我们的博客文章《构建有效的智能体》将基本思想框定为"找到尽可能简单的解决方案，只有在需要时才增加复杂性”，这是任何维护智能体框架的人都会一致看到的模式。

在我第一次尝试简化时，我大幅削减了框架并尝试了一些创造性的新想法，但我无法复制原始框架的性能。也很难判断框架设计的哪些部分实际上是承重的，以及以什么方式。基于这一经验，我转向了一种更有条理的方法，一次删除一个组件并审查它对最终结果的影响。

当我经历这些迭代周期时，我们还发布了 Opus 4.6，这为减少框架复杂性提供了进一步的动力。有充分的理由期望 4.6 需要比 4.5 更少的脚手架。从我们的发布博客："[Opus 4.6] 计划更仔细，更长时间地维持智能体任务，可以在更大的代码库中更可靠地运行，并具有更好的代码审查和调试技能来捕获自己的错误。“它在长上下文检索方面也有了实质性改进。这些都是框架旨在补充的能力。

移除冲刺结构

我首先完全移除了冲刺结构。冲刺结构有助于将工作分解为块，以便模型能够连贯地工作。鉴于 Opus 4.6 的改进，有充分的理由相信模型可以在没有这种分解的情况下原生处理工作。

我保留了规划器和评估器，因为它们都继续增加明显的价值。没有规划器，生成器会缩小范围：给定原始提示，它会在没有首先规划其工作的情况下开始构建，最终创建的应用程序功能不如规划器丰富。

移除冲刺结构后，我将评估器移至运行结束时的单次通过，而不是每个冲刺评分。由于模型的能力大大增强，它改变了评估器对某些运行的承重程度，其有用性取决于任务相对于模型可以单独可靠完成的位置。在 4.5 上，该边界很近：我们的构建处于生成器单独可以做好的边缘，评估器在整个构建中捕获了有意义的问题。在 4.6 上，模型的原始能力增加了，因此边界向外移动。过去需要评估器检查才能连贯实现的任务现在通常在生成器单独处理良好的范围内，对于该边界内的任务，评估器成为不必要的开销。但对于仍处于生成器能力边缘的构建部分，评估器继续提供真正的提升。

实际含义是评估器不是一个固定的是或否决定。当任务超出当前模型单独可靠完成的范围时，它值得付出成本。

除了结构简化之外，我还添加了提示以改进框架如何将 AI 功能构建到每个应用程序中，特别是让生成器构建一个可以通过工具驱动应用程序自身功能的适当智能体。这需要真正的迭代，因为相关知识足够新，以至于 Claude 的训练数据覆盖得很少。但经过足够的调整，生成器正确地构建了智能体。

更新框架的结果

为了测试更新的框架，我使用以下提示生成了一个数字音频工作站（DAW），这是一个用于作曲、录音和混音歌曲的音乐制作程序：

使用 Web Audio API 在浏览器中构建一个功能齐全的 DAW。

运行仍然冗长且昂贵，大约 4 小时和 124 美元的令牌成本。大部分时间都花在了构建器上，它在没有 Opus 4.5 需要的冲刺分解的情况下连贯地运行了两个多小时。

智能体和阶段	时长	成本
规划器	4.7 分钟	$0.46
构建（第 1 轮）	2 小时 7 分钟	$71.08
QA（第 1 轮）	8.8 分钟	$3.24
构建（第 2 轮）	1 小时 2 分钟	$36.89
QA（第 2 轮）	6.8 分钟	$3.09
构建（第 3 轮）	10.9 分钟	$5.88
QA（第 3 轮）	9.6 分钟	$4.06
V2 框架总计	3 小时 50 分钟	$124.70

与之前的框架一样，规划器将一行提示扩展为完整的规格。从日志中，我可以看到生成器模型在规划应用程序和智能体设计、连接智能体以及在交给 QA 之前测试它方面做得很好。

话虽如此，QA 智能体仍然捕获了真正的差距。在其第一轮反馈中，它指出：

这是一个强大的应用程序，具有出色的设计保真度、可靠的 AI 智能体和良好的后端。主要失败点是功能完整性——虽然应用程序看起来令人印象深刻，AI 集成工作良好，但几个核心 DAW 功能只是显示而没有交互深度：片段无法在时间轴上拖动/移动，没有乐器 UI 面板（合成器旋钮、鼓垫），也没有视觉效果编辑器（EQ 曲线、压缩器仪表）。这些不是边缘情况——它们是使 DAW 可用的核心交互，规格明确要求它们。

在其第二轮反馈中，它再次捕获了几个功能差距：

剩余差距：

音频录制仍然只是存根（按钮切换但没有麦克风捕获）

通过边缘拖动调整片段大小和片段分割未实现

效果可视化是数字滑块，而不是图形（没有 EQ 曲线）

生成器在自行处理时仍然容易遗漏细节或存根功能，QA 在捕获这些最后一英里问题以供生成器修复方面仍然增加了价值。

根据提示，我期望的是一个程序，我可以在其中创建旋律、和声和鼓模式，将它们编排成一首歌曲，并在此过程中从集成的智能体获得帮助。下面的视频显示了结果。

该应用程序远非专业的音乐制作程序，智能体的歌曲创作技能显然还需要大量工作。此外，Claude 实际上听不到声音，这使得 QA 反馈循环在音乐品味方面效果较差。

但最终的应用程序具有功能性音乐制作程序的所有核心部分：在浏览器中运行的工作编排视图、混音器和传输。除此之外，我能够完全通过提示组合一个简短的歌曲片段：智能体设置了速度和调性，铺设了旋律，构建了鼓轨道，调整了混音器电平，并添加了混响。歌曲创作的核心原语都存在，智能体可以自主驱动它们，使用工具从头到尾创建一个简单的作品。你可能会说它还不够完美——但它正在接近。

接下来是什么

随着模型的不断改进，我们可以大致预期它们能够工作更长时间，并处理更复杂的任务。在某些情况下，这意味着围绕模型的脚手架随着时间的推移变得不那么重要，开发人员可以等待下一个模型并看到某些问题自行解决。另一方面，模型越好，就有越多的空间来开发能够完成超出模型基线能力的复杂任务的框架。

考虑到这一点，这项工作中有几个值得继续发扬的经验教训。实验你正在构建的模型、阅读其在现实问题上的跟踪并调整其性能以实现你期望的结果始终是良好的实践。在处理更复杂的任务时，有时可以通过分解任务并将专门的智能体应用于问题的每个方面来获得改进空间。当新模型发布时，重新审查框架通常是良好的实践，剥离不再对性能承重的部分，并添加新部分以实现以前可能无法实现的更大能力。

从这项工作中，我的信念是，随着模型的改进，有趣的框架组合空间不会缩小。相反，它会移动，AI 工程师的有趣工作是不断寻找下一个新颖的组合。

致谢

特别感谢 Alex Albert、Erik Schluntz、Mike Krieger 和 Zack Witten 对这项工作的贡献和反馈。

5 Agent Skill Design Patterns Every ADK Developer Should Know

Wed, 18 Mar 2026 00:00:00 +0000

Source: Google Cloud Tech on X
Authors: @Saboo_Shubham_ and @lavinigam

When it comes to SKILL.md, developers tend to fixate on the format—getting the YAML right, structuring directories, and following the spec. But with more than 30 agent tools (like Claude Code, Gemini CLI, and Cursor) standardizing on the same layout, the formatting problem is practically obsolete.

The challenge now is content design. The specification explains how to package a skill, but offers zero guidance on how to structure the logic inside it. For example, a skill that wraps FastAPI conventions operates completely differently from a four-step documentation pipeline, even though their SKILL.md files look identical on the outside.

By studying how skills are built across the ecosystem—from Anthropic’s repositories to Vercel and Google’s internal guidelines—there are five recurring design patterns that can help developers build agents.

This article covers each one with working ADK code:

Tool Wrapper: Make your agent an instant expert on any library
Generator: Produce structured documents from a reusable template
Reviewer: Score code against a checklist by severity
Inversion: The agent interviews you before acting
Pipeline: Enforce a strict multi-step workflow with checkpoints

Pattern 1: The Tool Wrapper

A Tool Wrapper gives your agent on-demand context for a specific library. Instead of hardcoding API conventions into your system prompt, you package them into a skill. Your agent only loads this context when it actually works with that technology.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


# skills/api-expert/SKILL.md
---
name: api-expert
description: FastAPI development best practices and conventions. Use when building, reviewing, or debugging FastAPI applications, REST APIs, or Pydantic models.
metadata:
 pattern: tool-wrapper
 domain: fastapi
---

You are an expert in FastAPI development. Apply these conventions to the user's code or question.

## Core Conventions

Load 'references/conventions.md' for the complete list of FastAPI best practices.

## When Reviewing Code
1. Load the conventions reference
2. Check the user's code against each convention
3. For each violation, cite the specific rule and suggest the fix

## When Writing Code
1. Load the conventions reference
2. Follow every convention exactly
3. Add type annotations to all function signatures
4. Use Annotated style for dependency injection

Pattern 2: The Generator

While the Tool Wrapper applies knowledge, the Generator enforces consistent output. If you struggle with an agent generating different document structures on every run, the Generator solves this by orchestrating a fill-in-the-blank process.

It leverages two optional directories: 𝚊𝚜𝚜𝚎𝚝𝚜/ holds your output template, and 𝚛𝚎𝚏𝚎𝚛𝚎𝚗𝚌𝚎𝚜/ holds your style guide. The instructions act as a project manager. They tell the agent to load the template, read the style guide, ask the user for missing variables, and populate the document. This is practical for generating predictable API documentation, standardizing commit messages, or scaffolding project architectures.

In this technical report generator example, the skill file does not contain the actual layout or the grammar rules. It simply coordinates the retrieval of those assets and forces the agent to execute them step by step:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


# skills/report-generator/SKILL.md
---
name: report-generator
description: Generates structured technical reports in Markdown. Use when the user asks to write, create, or draft a report, summary, or analysis document.
metadata:
 pattern: generator
 output-format: markdown
---

You are a technical report generator. Follow these steps exactly:

Step 1: Load 'references/style-guide.md' for tone and formatting rules.

Step 2: Load 'assets/report-template.md' for the required output structure.

Step 3: Ask the user for any missing information needed to fill the template:
- Topic or subject
- Key findings or data points
- Target audience (technical, executive, general)

Step 4: Fill the template following the style guide rules. Every section in the template must be present in the output.

Step 5: Return the completed report as a single Markdown document.

Pattern 3: The Reviewer

The Reviewer pattern separates what to check from how to check it. Rather than writing a long system prompt detailing every code smell, you store a modular rubric inside a 𝚛𝚎𝚏𝚎𝚛𝚎𝚗𝚌𝚎𝚜/𝚛𝚎𝚟𝚒𝚎𝚠-𝚌𝚑𝚎𝚌𝚔𝚕𝚒𝚜𝚝.𝚖𝚍 file.

When a user submits code, the agent loads this checklist and methodically scores the submission, grouping its findings by severity. If you swap out a Python style checklist for an OWASP security checklist, you get a completely different, specialized audit using the exact same skill infrastructure. It is a highly effective way to automate PR reviews or catch vulnerabilities before a human looks at the code.

The following code reviewer skill demonstrates this separation. The instructions remain static, but the agent dynamically loads the specific review criteria from an external checklist and forces a structured, severity-based output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


# skills/code-reviewer/SKILL.md
---
name: code-reviewer
description: Reviews Python code for quality, style, and common bugs. Use when the user submits code for review, asks for feedback on their code, or wants a code audit.
metadata:
 pattern: reviewer
 severity-levels: error,warning,info
---

You are a Python code reviewer. Follow this review protocol exactly:

Step 1: Load 'references/review-checklist.md' for the complete review criteria.

Step 2: Read the user's code carefully. Understand its purpose before critiquing.

Step 3: Apply each rule from the checklist to the code. For every violation found:
- Note the line number (or approximate location)
- Classify severity: error (must fix), warning (should fix), info (consider)
- Explain WHY it's a problem, not just WHAT is wrong
- Suggest a specific fix with corrected code

Step 4: Produce a structured review with these sections:
- **Summary**: What the code does, overall quality assessment
- **Findings**: Grouped by severity (errors first, then warnings, then info)
- **Score**: Rate 1-10 with brief justification
- **Top 3 Recommendations**: The most impactful improvements

Pattern 4: Inversion

Agents inherently want to guess and generate immediately. The Inversion pattern flips this dynamic. Instead of the user driving the prompt and the agent executing, the agent acts as an interviewer.

Inversion relies on explicit, non-negotiable gating instructions (like “DO NOT start building until all phases are complete”) to force the agent to gather context first. It asks structured questions sequentially and waits for your answers before moving to the next phase. The agent refuses to synthesize a final output until it has a complete picture of your requirements and deployment constraints.

To see this in action, look at this project planner skill. The crucial element here is the strict phasing and the explicit gatekeeping prompt that stops the agent from synthesizing the final plan until all user answers are collected:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


# skills/project-planner/SKILL.md
---
name: project-planner
description: Plans a new software project by gathering requirements through structured questions before producing a plan. Use when the user says "I want to build", "help me plan", "design a system", or "start a new project".
metadata:
 pattern: inversion
 interaction: multi-turn
---

You are conducting a structured requirements interview. DO NOT start building or designing until all phases are complete.

## Phase 1 — Problem Discovery (ask one question at a time, wait for each answer)

Ask these questions in order. Do not skip any.

- Q1: "What problem does this project solve for its users?"
- Q2: "Who are the primary users? What is their technical level?"
- Q3: "What is the expected scale? (users per day, data volume, request rate)"

## Phase 2 — Technical Constraints (only after Phase 1 is fully answered)

- Q4: "What deployment environment will you use?"
- Q5: "Do you have any technology stack requirements or preferences?"
- Q6: "What are the non-negotiable requirements? (latency, uptime, compliance, budget)"

## Phase 3 — Synthesis (only after all questions are answered)

1. Load 'assets/plan-template.md' for the output format
2. Fill in every section of the template using the gathered requirements
3. Present the completed plan to the user
4. Ask: "Does this plan accurately capture your requirements? What would you change?"
5. Iterate on feedback until the user confirms

Pattern 5: The Pipeline

For complex tasks, you cannot afford skipped steps or ignored instructions. The Pipeline pattern enforces a strict, sequential workflow with hard checkpoints.

The instructions themselves serve as the workflow definition. By implementing explicit diamond gate conditions (such as requiring user approval before moving from docstring generation to final assembly), the Pipeline ensures an agent cannot bypass a complex task and present an unvalidated final result.

This pattern utilizes all optional directories, pulling in different reference files and templates only at the specific step where they are needed, keeping the context window clean.

In this documentation pipeline example, notice the explicit gate conditions. The agent is explicitly forbidden from moving to the assembly phase until the user confirms the generated docstrings in the previous step:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


# skills/doc-pipeline/SKILL.md
---
name: doc-pipeline
description: Generates API documentation from Python source code through a multi-step pipeline. Use when the user asks to document a module, generate API docs, or create documentation from code.
metadata:
 pattern: pipeline
 steps: "4"
---

You are running a documentation generation pipeline. Execute each step in order. Do NOT skip steps or proceed if a step fails.

## Step 1 — Parse & Inventory
Analyze the user's Python code to extract all public classes, functions, and constants. Present the inventory as a checklist. Ask: "Is this the complete public API you want documented?"

## Step 2 — Generate Docstrings
For each function lacking a docstring:
- Load 'references/docstring-style.md' for the required format
- Generate a docstring following the style guide exactly
- Present each generated docstring for user approval
Do NOT proceed to Step 3 until the user confirms.

## Step 3 — Assemble Documentation
Load 'assets/api-doc-template.md' for the output structure. Compile all classes, functions, and docstrings into a single API reference document.

## Step 4 — Quality Check
Review against 'references/quality-checklist.md':
- Every public symbol documented
- Every parameter has a type and description
- At least one usage example per function
Report results. Fix issues before presenting the final document.

Choosing the right agent skill pattern

Each pattern answers a different question. Use this decision tree to find the right one for your use-case:

And finally, patterns compose

These patterns are not mutually exclusive. They compose.

A Pipeline skill can include a Reviewer step at the end to double-check its own work. A Generator can rely on Inversion at the very beginning to gather the necessary variables before filling out its template. Thanks to ADK’s 𝚂𝚔𝚒𝚕𝚕𝚃𝚘𝚘𝚕𝚜𝚎𝚝 and progressive disclosure, your agent only spends context tokens on the exact patterns it needs at runtime.

Stop trying to cram complex and fragile instructions into a single system prompt. Break your workflows down, apply the right structural pattern, and build reliable agents.

Get started today

The Agent Skills specification is open-source and natively supported across ADK. You already know how to package the format. Now you know how to design the content. Go build smarter agents with Google Agent Development Kit.

每个ADK开发者都该知道的5种Agent Skill设计模式

Wed, 18 Mar 2026 00:00:00 +0000

来源：Google Cloud Tech on X
原作者：@Saboo_Shubham_ 和 @lavinigam

当谈到 SKILL.md 时，开发者往往执着于格式——写对 YAML、整理目录结构、遵循规范。但目前已有超过 30 个 agent 工具（如 Claude Code、Gemini CLI 和 Cursor）采用了相同的布局，格式问题实际上已经解决了。

现在的挑战是内容设计。规范解释了如何打包一个 skill，但完全没有指导如何构建内部的逻辑。例如，一个封装 FastAPI 约定的 skill 与一个四步文档流水线的 skill 运作方式完全不同，尽管它们的 SKILL.md 文件看起来一模一样。

通过研究整个生态系统中 skill 的构建方式——从 Anthropic 的仓库到 Vercel 和 Google 的内部指南——发现了五种反复出现的设计模式，可以帮助开发者构建 agent。

本文将通过可运行的 ADK 代码逐一讲解：

Tool Wrapper（工具包装器）： 让 agent 成为任意库的即时专家
Generator（生成器）： 从可复用模板生成结构化文档
Reviewer（审查器）： 按检查清单对代码评分（按严重程度）
Inversion（反转）： agent 先访谈用户再行动
Pipeline（流水线）： 强制执行带检查点的多步骤工作流

模式一：Tool Wrapper（工具包装器）

Tool Wrapper 为 agent 提供按需获取特定库上下文的能力

与其将 API 约定硬编码到系统提示词中，不如将它们打包成一个 skill。Agent 只在实际使用该技术时才会加载这些上下文。

这是最简单的实现模式。SKILL.md 文件监听用户提示词中的特定库关键词，从 references/ 目录动态加载内部文档，并将这些规则作为绝对真理应用。这正是将团队内部编码规范或特定框架最佳实践直接分发到开发者工作流中的机制。

下面是一个教 agent 如何编写 FastAPI 代码的 Tool Wrapper 示例。注意指令如何明确告诉 agent 仅在开始审查或编写代码时才加载 conventions.md 文件：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


# skills/api-expert/SKILL.md
---
name: api-expert
description: FastAPI 开发最佳实践和约定。当构建、审查或调试 FastAPI 应用、REST API 或 Pydantic 模型时使用。
metadata:
 pattern: tool-wrapper
 domain: fastapi
---

你是 FastAPI 开发专家。将这些约定应用到用户的代码或问题中。

## 核心约定

加载 'references/conventions.md' 获取完整的 FastAPI 最佳实践列表。

## 审查代码时
1. 加载约定参考
2. 检查用户代码是否符合每条约定
3. 对于每个违规，引用具体规则并建议修复方法

## 编写代码时
1. 加载约定参考
2. 严格遵循每条约定
3. 为所有函数签名添加类型注解
4. 使用 Annotated 风格进行依赖注入

模式二：Generator（生成器）

Tool Wrapper 应用知识

而 Generator 则强制一致的输出。如果你苦恼于 agent 每次运行时生成不同的文档结构，Generator 通过编排填空过程来解决这个问题。

它利用两个可选目录：assets/ 存放输出模板，references/ 存放样式指南。指令充当项目经理，告诉 agent 加载模板、阅读样式指南、询问用户缺失的变量，然后填充文档。这对于生成可预测的 API 文档、标准化提交信息或脚手架项目架构都很实用。

在这个技术报告生成器示例中，skill 文件不包含实际的布局或语法规则。它只是协调这些资产的获取，并强制 agent 逐步执行：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


# skills/report-generator/SKILL.md
---
name: report-generator
description: 生成 Markdown 格式的结构化技术报告。当用户要求撰写、创建或起草报告、摘要或分析文档时使用。
metadata:
 pattern: generator
 output-format: markdown
---

你是一个技术报告生成器。严格按以下步骤执行：

步骤1：加载 'references/style-guide.md' 获取语气和格式规则。

步骤2：加载 'assets/report-template.md' 获取所需的输出结构。

步骤3：询问用户填写模板所需的任何缺失信息：
- 主题或议题
- 主要发现或数据点
- 目标受众（技术型、执行层通用型）

步骤4：按照样式指南规则填充模板。模板中的每个部分都必须出现在输出中。

步骤5：返回完成的报告作为单个 Markdown 文档。

模式三：Reviewer（审查器）

Reviewer 模式将"检查什么"与"如何检查"分离。与其编写一个详细说明每种代码异味的冗长系统提示词，不如将模块化评分标准存储在 references/review-checklist.md 文件中。

当用户提交代码时，agent 加载这份检查清单，系统地对提交内容进行评分，按严重程度分组发现。如果你将 Python 风格检查清单换成 OWASP 安全检查清单，你就得到了一个使用完全相同 skill 基础设施的专门审计。这是一种有效自动化 PR 审查或在人工查看代码之前捕捉漏洞的方式。

以下代码审查 skill 演示了这种分离。指令保持静态，但 agent 动态地从外部检查清单加载特定的审查标准，并强制产生基于严重程度的结构化输出：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


# skills/code-reviewer/SKILL.md
---
name: code-reviewer
description: 审查 Python 代码的质量、风格和常见 bug。当用户提交代码供审查、请求代码反馈或想要代码审计时使用。
metadata:
 pattern: reviewer
 severity-levels: error,warning,info
---

你是一个 Python 代码审查员。严格遵循以下审查流程：

步骤1：加载 'references/review-checklist.md' 获取完整的审查标准。

步骤2：仔细阅读用户的代码。在批评之前先理解其目的。

步骤3：将检查清单中的每条规则应用到代码上。对于发现的每个违规：
- 记录行号（或大致位置）
- 分类严重程度：error（必须修复）、warning（应该修复）、info（可以考虑）
- 解释为什么这是个问题，而不仅仅说是什么问题
- 提供带有修正代码的具体修复建议

步骤4：生成带有以下部分的结构化审查：
- **摘要**：代码的功能、整体质量评估
- **发现**：按严重程度分组（error 优先，然后是 warning，然后是 info）
- **评分**：1-10 分并附上简要理由
- **前三条建议**：最有影响力的改进

模式四：Inversion（反转）

Agent 天生想要立即猜测和生成。Inversion 模式反转了这种动态。不是由用户驱动提示词、agent 执行，而是让 agent 充当面试官。

Inversion 依赖明确的、不可协商的门控指令（如"在所有阶段完成之前不要开始构建"），强制 agent 先收集上下文。它按顺序提出结构化问题，并等待你的答案才进入下一阶段。在没有获得需求和部署约束的完整画面之前，agent 拒绝综合最终输出。

查看这个项目规划 skill 的实际效果。关键元素是严格的阶段划分和明确的门控提示词，这些阻止 agent 在收集完所有用户答案之前综合最终计划：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


# skills/project-planner/SKILL.md
---
name: project-planner
description: 通过结构化问题收集需求来规划新软件项目。当用户说"我想构建"、"帮我规划"、"设计一个系统"或"启动一个新项目"时使用。
metadata:
 pattern: inversion
 interaction: multi-turn
---

你正在进行结构化的需求访谈。在所有阶段完成之前，不要开始构建或设计。

## 第一阶段 — 问题发现（一次问一个问题，等待每个答案）

按顺序提出这些问题，不要跳过任何问题。

- Q1："这个项目为用户解决什么问题？"
- Q2："主要用户是谁？他们的技术水平如何？"
- Q3："预期规模是多少？（每日用户数、数据量、请求速率）"

## 第二阶段 — 技术约束（仅在第一阶段完全回答后）

- Q4："你将使用什么部署环境？"
- Q5："你有任何技术栈要求或偏好？"
- Q6："哪些需求是不可妥协的？（延迟、正常运行时间、合规性、预算）"

## 第三阶段 — 综合（仅在所有问题回答后）

1. 加载 'assets/plan-template.md' 获取输出格式
2. 使用收集到的需求填充模板的每个部分
3. 向用户展示完成的计划
4. 询问："这个计划准确捕捉了你的需求吗？你想改变什么？"
5. 根据反馈迭代，直到用户确认

模式五：Pipeline（流水线）

对于复杂任务，你不能承受跳过步骤或忽略指令。Pipeline 模式强制执行严格的顺序工作流，并带有硬性检查点。

指令本身充当工作流定义。通过实现明确的菱形门控条件（如"在进行文档字符串生成到最终组装之前需要用户批准"），Pipeline 确保 agent 不能绕过复杂任务并呈现未经验证的最终结果。

此模式利用所有可选目录，仅在需要它们的特定步骤才拉取不同的参考文件和模板，保持上下文窗口整洁。

在这个文档流水线示例中，注意明确的门控条件。在上一步用户确认生成的文档字符串之前，agent 被明确禁止进入组装阶段：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


# skills/doc-pipeline/SKILL.md
---
name: doc-pipeline
description: 通过多步流水线从 Python 源代码生成 API 文档。当用户要求为模块添加文档、生成 API 文档或从代码创建文档时使用。
metadata:
 pattern: pipeline
 steps: "4"
---

你正在运行文档生成流水线。按顺序执行每个步骤。不要跳过步骤或在前一步失败时继续。

## 步骤 1 — 解析和清单

分析用户的 Python 代码，提取所有公共类、函数和常量。将清单作为检查列表呈现。询问："这是你想文档化的完整公共 API 吗？"

## 步骤 2 — 生成文档字符串

对于每个缺少文档字符串的函数：
- 加载 'references/docstring-style.md' 获取所需格式
- 严格按照样式指南生成文档字符串
- 展示每个生成的文档字符串供用户批准
在用户确认之前不要进入步骤 3。

## 步骤 3 — 组装文档

加载 'assets/api-doc-template.md' 获取输出结构。将所有类、函数和文档字符串编译成单个 API 参考文档。

## 步骤 4 — 质量检查

对照 'references/quality-checklist.md' 进行审查：
- 每个公共符号都有文档
- 每个参数都有类型和描述
- 每个函数至少有一个使用示例
报告结果。在呈现最终文档之前修复问题。

选择正确的 Agent Skill 模式

每个模式回答不同的问题。用这个决策树找到适合你用例的模式：

场景	推荐的模式
赋予 agent 特定库/框架的知识	Tool Wrapper
需要一致的結構化文档输出	Generator
代码审查 / 内容审计	Reviewer
行动前先收集需求	Inversion
强制执行严格的多步骤工作流	Pipeline

最后，模式可以组合

这些模式不是互斥的，它们可以组合。

最后，模式可以组合

这些模式不是互斥的，它们可以组合。

Pipeline skill 可以在最后包含一个 Reviewer 步骤来双重检查自己的工作。Generator 可以在最开始依赖 Inversion 来收集填充模板前所需的变量。多亏了 ADK 的 SkillToolset 和渐进式披露，你的 agent 只在运行时在精确需要的模式上花费上下文令牌。

不要再试图将复杂而脆弱的指令塞进单个系统提示词中了。分解你的工作流，应用正确的结构模式，构建可靠的 agent。

今天就开始

Agent Skills 规范是开源的，并在 ADK 中原生支持。你已经知道如何打包格式了。现在你知道了如何设计内容。用 Google Agent Development Kit 构建更智能的 agent。