Tech on Wodaixin

I Couldn't Find a Good Linux Clipboard Manager, So I Built an Open-Source One with AI

Thu, 09 Apr 2026 00:00:00 +0000

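Project: github.com/wodaixin/ClipGenius

An unexpected upside of being locked out by a Windows update

One day in 2026, an automatic Windows update left my computer locked.

It wasn't a virus or a hardware failure; something had gone wrong with Microsoft's activation mechanism. I spent several days trying every fix I could find, and none of them worked. That was when it hit me: I didn't have full control over my own computer.

Around the same time, I wanted to try self-hosted setups of AI tools such as OpenClaw and Claude, which run better on Linux and feel more "private". So I made a decision: switch to Ubuntu completely.

The switch went fairly smoothly overall, but one small thing kept bothering me: the clipboard.

On Windows I relied on two tools:

The built-in Win+V clipboard history
The open-source project PasteEx, which quickly saves clipboard content as a file

For e-commerce work like mine, which involves constantly handling images and text, these two tools were indispensable.

On Ubuntu I tried many clipboard tools: Clipman, CopyQ, Diodon, and others. Each was either too bare-bones, had a dated interface, or lacked cloud sync. As someone working in e-commerce, every day I need to:

Copy large amounts of product copy and descriptions
Collect competitors' image assets
Save video links from the Facebook Ad Library
Sync all of this across multiple devices

Not finding a tool I was happy with took a real toll on my productivity.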

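An unexpected turning point: Google AI Studio

Some time ago I started using AI in Google AI Studio to help with my work. I only wanted it to tidy up some copy, but Google also surfaced a few suggested use cases for its AI features.

That gave me an idea: if I couldn't find a good clipboard tool, why not build one myself? And with AI being this capable, why not make the clipboard a bit smarter too?

That is how ClipGenius was born.

The first version was very simple: a web app that recorded clipboard history. As development went on, I gradually added:

AI-powered automatic content analysis
Multimodal chat
Cloud sync
Image generation
Voice conversation

Looking back, it has gone far beyond what I originally had in mind.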

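What can ClipGenius do?

1. Smart content classification

ClipGenius automatically detects the type of content you copy and sorts it into six types:

Type	Detection logic	Use cases
Image	File type detection	Product photos, design drafts, screenshots
Video	File type detection	Local video files
URL	Regex matching	Web links, reference material
Markdown	Feature detection (#, **, ```, etc.)	Documents, notes
Code	Recognizes 15+ languages	Code snippets, config files
Text	Default type	Copy, remarks

There is no manual sorting: content is filed automatically as soon as it is copied. For someone like me who handles dozens to hundreds of items a day, this saves a lot of time.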
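
The table can be read as an ordered chain of checks. Below is a minimal sketch of how such a chain might be wired together; the name `classifyClip`, the `ClipType` union, and the specific regexes are illustrative assumptions rather than ClipGenius's actual code, and code detection is reduced to a placeholder check.

```typescript
type ClipType = "image" | "video" | "url" | "markdown" | "code" | "text";

interface ClipInput {
  mimeType?: string; // set when the clipboard holds a file
  text?: string;     // set when the clipboard holds plain text
}

// Hypothetical markers; the project's real rule set is not shown in this post.
const URL_RE = /^https?:\/\/\S+$/i;
const MARKDOWN_RE = /^#{1,6} |\*\*[^*]+\*\*|^```/m;
const CODE_RE = /^(def |func |fn |import |const |class )/m;

function classifyClip(input: ClipInput): ClipType {
  // Files are classified by MIME type before any text rule runs.
  if (input.mimeType?.startsWith("image/")) return "image";
  if (input.mimeType?.startsWith("video/")) return "video";

  const text = (input.text ?? "").trim();
  if (URL_RE.test(text)) return "url";
  if (MARKDOWN_RE.test(text)) return "markdown";
  if (CODE_RE.test(text)) return "code";
  return "text"; // default type
}
```

The order matters: a Python comment such as `# note` would be caught by the Markdown rule in this sketch, which is the kind of overlap a real rule set has to resolve.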

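2. Automatic AI analysis

After you sign in and enable automatic analysis, the AI generates for each item:

A smart file name: for example img_20260409_143052 or product_description_summary
A content summary: one paragraph covering the key points

This is especially useful for organizing assets. I used to save piles of images named IMG_1234.jpg with no idea what they were. Now the AI generates meaningful names such as nike_running_shoes_ad, so I can tell at a glance.

AI analysis supports two providers:

Gemini: supports text, images, and video (recommended)
Minimax: text only

3. Multimodal AI chat

This is one of my favorite features.

You can attach any clipboard item to a conversation and then chat with the AI about it:

Analyze an image: "What is the design style of this product photo?"
Improve copy: "Make this description more appealing"
Explain code: "What does this code do?"
Translate content: "Translate this English text into Chinese"

The AI streams its answers and can even show its "thinking process" (Gemini's thinking feature). Conversation history is saved automatically so you can revisit it at any time.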

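4. Special feature: one-click download of FB ad videos

Working in e-commerce, I often need to study competitors' Facebook ad creatives. The FB Ad Library is public, but downloading videos from it is a hassle: you usually have to open the developer tools, find the video's CDN link, and download it by hand.

ClipGenius simplifies the process:

Find the video in the FB Ad Library
Copy the video's CDN link (usually starting with scontent-xxx.xx.fbcdn.net)
Paste it into ClipGenius
It is recognized automatically and downloaded as a local video

The download logic falls back in three stages:

First, a direct request, which goes through the system proxy if you use a tool such as Clash
Next, a CORS proxy
If both fail, the item is saved as a URL link

Note: this feature currently supports only Facebook CDN links. I plan to integrate tools such as yt-dlp in the future to support more video platforms.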

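5. Cloud sync (optional)

Important: ClipGenius works without signing in, and every feature works locally. Signing in only enables cloud sync.

If you need cross-device sync, you can use Firebase:

Real-time sync: copy on your computer and see it on your phone right away
Local-first: all data is written to IndexedDB first, so it works offline
Guest mode: usable without signing in, with data stored only locally
Free tier: Firestore's free quota is generous and more than enough for personal use

Sync limits:

✅ Supported: text, URLs, Markdown, code
✅ Supported: images (base64-encoded, each under 1MB)
❌ Not supported: video files (over the 1MB limit)

Firestore limits a single document to 1MB, so large files such as videos cannot be synced. They are stored only in the local IndexedDB.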
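
Because images are stored base64-encoded, the size that counts against the limit is the encoded size, not the raw file size. A minimal sketch of a pre-sync check follows; `canSyncBinary` and the 4 KiB `overheadBytes` allowance for the document's other fields are assumptions made for illustration, not values taken from ClipGenius.

```typescript
// 1 MiB: the per-document size cap documented for Firestore.
const FIRESTORE_DOC_LIMIT = 1_048_576;

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// `overheadBytes` is an assumed allowance for the other fields in the document.
function canSyncBinary(byteLength: number, overheadBytes = 4096): boolean {
  return base64Length(byteLength) + overheadBytes <= FIRESTORE_DOC_LIMIT;
}
```

Under these assumptions a raw image has to stay below roughly 780 KB to fit after encoding, so an image just under 1MB on disk would still be kept local only.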

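Extension ideas:

If you have the technical skills, you can extend it yourself:

Swap in other cloud storage (such as AWS S3 or Alibaba Cloud OSS)
Integrate more AI providers (Claude, OpenAI, etc.)
Add file compression and chunked uploads
Implement incremental sync

The project's code structure is clear and not hard to modify. That is the point of open source.

The sync strategy is "dual write": local changes are written to IndexedDB immediately and uploaded to Firestore at the same time, while remote changes overwrite the local copy (the cloud wins). This keeps the UI responsive and guards against data loss by keeping the data in two places.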
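
The dual-write rule can be sketched with in-memory stand-ins for the two stores. `DualWriteStore`, `ClipItem`, and the `pendingUploads` queue are hypothetical names for this illustration; the real project talks to IndexedDB and Firestore instead.

```typescript
interface ClipItem {
  id: string;
  content: string;
  updatedAt: number;
}

// In-memory stand-ins for IndexedDB and the Firestore upload path (assumptions).
class DualWriteStore {
  local = new Map<string, ClipItem>();
  pendingUploads: ClipItem[] = [];

  // Local change: write locally first, then queue the upload.
  saveLocal(item: ClipItem): void {
    this.local.set(item.id, item);
    this.pendingUploads.push(item);
  }

  // Remote change: the cloud copy overwrites whatever is stored locally.
  applyRemote(item: ClipItem): void {
    this.local.set(item.id, item);
  }
}
```

One consequence of "cloud wins" is visible here: `applyRemote` never compares `updatedAt`, so an older remote copy can replace a newer local edit. Comparing timestamps before overwriting would be one alternative policy.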

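6. Image generation

Text-to-image generation based on Gemini:

Standard mode: uses the free Gemini API
Pro mode: uses a paid AI Studio key

I don't use this often, but it is handy when I occasionally need an illustration quickly.

7. Voice conversation

Real-time voice interaction based on Gemini 3.1 Flash Live. You can talk to the AI directly, which suits driving or other situations where typing is inconvenient.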

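Real-world scenarios

Scenario 1: Managing product copy

In e-commerce I need different versions of product descriptions for different platforms. I used to paste them into a notepad; now I use ClipGenius directly:

Copy several versions of the copy
The AI generates summaries automatically (such as "short version" or "detailed description")
Search and reuse them quickly when needed
Cloud sync makes them viewable on my phone too

Scenario 2: Collecting assets for competitor analysis

When researching competitors I collect a lot of material:

Product images: classified automatically as image
Ad videos: FB links are downloaded automatically
Landing page links: saved as URL
Copy and descriptions: saved as text

Everything gets an AI-generated summary, so later cleanup is easy.

Scenario 3: Managing code snippets

I'm not a professional programmer, but while building ClipGenius I accumulated quite a few frequently used snippets:

Firebase configuration code
React component templates
CSS style snippets

ClipGenius detects the code language automatically and applies syntax highlighting, which I find more intuitive than traditional snippet tools.

Scenario 4: Cross-device workflow (requires sign-in)

If you have configured Firebase and signed in, you get cross-device sync:

In the morning, spot a good ad example on your phone and copy the link
By the time you reach the office, ClipGenius on your computer has already synced it
Open the link, analyze the creative, and take notes
The notes sync back to your phone, where you can view them anytime

Note: video files are not synced and stay local. If you mostly handle text, images, and links, sync is very convenient.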

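Technical implementation

Tech stack

Category	Choice	Reason
Frontend framework	React 19 + Vite	Latest features, good developer experience
Styling	Tailwind CSS v4	Fast to develop with, easy to customize
AI SDK	@google/genai	Official SDK, full-featured
Backend	Firebase	Sufficient free tier, real-time sync
Local storage	IndexedDB	Works offline, large capacity
Animation	motion/react	Smooth interactions
Internationalization	i18next	Switching between Chinese and English

Architecture

ClipGenius uses a "local-first" architecture: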

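User action
  ↓
Write to IndexedDB immediately (local)
  ↓
Upload to Firestore asynchronously (cloud)
  ↓
Other devices receive the update in real time

Benefits of this design:

Fast response: no waiting on network requests
Works offline: guest mode is fully local
Data safety: the data is kept both locally and in the cloud

Smart content detection

Content type detection is one of the core features. Taking code detection as an example: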


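function detectCodeLanguage(text: string): string | null {
  // Rules are checked in order; the first match wins.
  // The `m` flag lets `^` match at the start of any line, not only the first,
  // so a snippet that opens with a comment or blank line is still detected.
  const rules: [RegExp, string][] = [
    [/^\s*\{[\s\S]*\}\s*$/, "json"],
    [/^def |^from .+ import/m, "python"],
    [/^func |^package |:= /m, "go"],
    [/^fn |^let mut |^impl /m, "rust"],
    // ... rules for 15+ languages
  ];

  for (const [regex, lang] of rules) {
    if (regex.test(text)) return lang;
  }
  return null;
}

Matching language features with regular expressions is a simple heuristic, and in my use its accuracy has been high.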

FB video download mechanism

The video download implementation is rather interesting:


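async function downloadFBVideo(url: string): Promise<Blob> {
  // 1. Try a direct fetch first (this goes through the system proxy)
  try {
    const response = await fetch(url);
    if (response.ok) {
      return await response.blob();
    }
  } catch (err) {
    console.log("Direct fetch failed", err);
  }

  // 2. Fall back to a CORS proxy
  const proxyUrl = `https://corsproxy.io/?${encodeURIComponent(url)}`;
  const response = await fetch(proxyUrl);
  if (!response.ok) {
    // Throw so the caller can fall back to saving the item as a URL link
    throw new Error(`Proxy fetch failed with status ${response.status}`);
  }
  return await response.blob();
}

This "graceful degradation" strategy lets the download keep working across different network environments.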
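
The function above covers the first two stages; the third stage, saving the item as a plain URL when every download attempt fails, belongs to the caller. A minimal sketch of that outer layer is below. `downloadWithFallback`, `Fetcher`, and `DownloadResult` are hypothetical names, and the fetchers are passed in so the chain can be exercised without a network.

```typescript
type DownloadResult =
  | { kind: "video"; bytes: Uint8Array }
  | { kind: "url"; url: string };

// A strategy that resolves with the video bytes or rejects on failure.
type Fetcher = (url: string) => Promise<Uint8Array>;

// Try each strategy in order; if all fail, keep the link as a URL clip.
async function downloadWithFallback(
  url: string,
  fetchers: Fetcher[],
): Promise<DownloadResult> {
  for (const fetcher of fetchers) {
    try {
      return { kind: "video", bytes: await fetcher(url) };
    } catch {
      // Fall through to the next strategy.
    }
  }
  return { kind: "url", url };
}
```

Keeping the strategies in an array makes the order explicit and lets a new stage, such as a different proxy, be added without touching the control flow.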

Getting started

Deployment

ClipGenius is an open-source project that you deploy yourself. The simplest way is Docker:


# Clone the project
git clone https://github.com/wodaixin/ClipGenius.git
cd ClipGenius

# Configure environment variables
cp .env.example .env
# Edit .env and fill in the Firebase and Gemini API settings

# Start with Docker Compose
docker-compose up -d

Then open http://localhost:8080.

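Configuration

You need two sets of credentials:

Firebase configuration (only needed for cloud sync)
- Go to the Firebase Console
- Create a new project
- Enable Authentication (Google sign-in) and Firestore
- Copy the configuration values into .env
- Note: Firestore's free tier is more than enough for personal use, but video files are not synced to the cloud
Gemini API key (only needed for AI features)
- Go to Google AI Studio
- Create an API key
- Set it as VITE_GEMINI_API_KEY in .env

If you only want local use, you can skip Firebase; every feature other than cloud sync still works.

See the project documentation for detailed setup steps.

Why not offer a hosted service?

Some may ask: why not make it a SaaS and offer it online?

There are several reasons:

Cost: AI API calls and Firebase storage cost money, so offering them for free is not realistic
Privacy: clipboard content may contain sensitive information, and self-hosting is safer
Open-source spirit: I want this to be a way of sharing an approach, not a commercial product
Customizability: with self-hosting you can modify the code and add features as needed
Project positioning: this is a personal experiment, not a mature product

If you have a technical background, self-hosting is quite simple. If not, you can also deploy to a platform such as Google Cloud Run at very low cost (possibly within the free tier).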

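What I learned during development

1. AI is not a cure-all, but it is very useful

At first I had high expectations and thought AI could do all the development work on its own. In practice:

AI is good at repetitive code (component templates, CRUD logic)
AI is not good at architecture design or complex logic
A human has to steer direction and quality

Even so, AI greatly improved my development speed. Many questions that used to mean digging through documentation can now be answered by asking the AI.

2. Documentation matters for open-source projects

I spent a long time on ClipGenius's documentation, which includes:

A quick-start guide
Feature usage guides
Architecture design docs
An API reference
A deployment guide

Good documentation helps users, and it also helps you clarify your own thinking.

3. Modular design makes extension easier

The project has a clear modular structure:

Pluggable AI providers (Gemini, Minimax)
A replaceable storage layer (IndexedDB, Firestore)
Highly decoupled components

So if you want to:

Swap in other cloud storage (AWS S3, Alibaba Cloud OSS)
Add new AI providers (Claude, OpenAI)
Change the UI styling and interactions

you can do so without large-scale refactoring.

4. User feedback is valuable

The project has only recently been open-sourced, but I have already received some feedback:

Some people suggested supporting more video platforms
Some want tagging
Some want an export feature

This feedback reminded me that a tool's value lies in solving real problems, not in piling up features.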

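Roadmap

There is still a lot to improve in ClipGenius:

Short term

Support more video platforms (YouTube, Douyin, etc.)
Add tags and folders
Support batch export
Improve the mobile experience

Long term

Integrate more AI providers (Claude, OpenAI, etc.)
Support team collaboration
Build a browser extension
Build a desktop client (Electron)

If any of these interest you, contributions and issues are welcome.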

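Closing thoughts

Going from a locked Windows machine, to switching to Ubuntu, to building ClipGenius taught me something:

Sometimes a constraint is where creation starts.

If the Windows update hadn't gone wrong, I might not have switched to Linux; if Linux hadn't lacked a good clipboard tool, I might not have built ClipGenius; if I hadn't had a real need, I might not have dug into React and AI.

I'm not a professional frontend developer or an AI expert. But with a real pain point and help from AI, I was still able to build a reasonably good tool.

About the state of the project

To be candid: ClipGenius still has many bugs, and the code is far from polished.

I open-sourced it mainly to share an approach: how to develop with AI assistance, and how to combine a clipboard, AI, and cloud sync. Code quality and feature completeness still have a long way to go.

About ongoing maintenance:

I may not keep fixing bugs in the public version
Even when I do fix something, it may only be in my private version
If you find a problem, feel free to open an issue, but I can't promise a timely response
Better still, fork it and improve it yourself; that is what open source is for

This is not a "product" but an "experiment". If it gives you some ideas or solves a problem for you, that is enough.

If you also use Linux and struggle to find a good clipboard tool, give ClipGenius a try. And if you also want to try self-hosted setups of AI tools such as OpenClaw and Claude, Ubuntu really is a good choice.

If you have suggestions or questions, issues and PRs on GitHub are welcome. Please understand, though, that this is a personal project and not a commercial product.

The point of open source is not how perfect the code is, but sharing ideas and possibilities.

Project links:

GitHub: github.com/wodaixin/ClipGenius
Docs: project documentation

Tech stack: React 19 · Vite · Firebase · Gemini · Tailwind CSS · TypeScript

License: MIT License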

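An In-Depth Analysis of Karpathy's Method for Building a Personal Knowledge Base

Tue, 07 Apr 2026 00:00:00 +0000

Source: Andrej Karpathy on X

Author: Andrej Karpathy

I. Core ideas

Andrej Karpathy has proposed an AI-based method for building a personal knowledge base. Its core innovations are:

Automatic maintenance and updates: the knowledge base updates and refines itself without continuous manual upkeep
A reinforcing loop: the result of every query is filed back into the knowledge base, creating a positive cycle of accumulation
From "memorizing" to "retrieving": instead of relying on a large context window, it builds a knowledge system that can be searched over the long term

Karpathy himself says that most of his tokens now go to running the knowledge base rather than to writing code, which reflects an important shift in how AI is used.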

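II. Core architecture: from a "black-box model" to an "explicit knowledge system"

Before going into the concrete steps, it helps to understand the architectural philosophy behind the method.

Decoupling storage from compute

The traditional approach in AI applications is to stuff everything into the context window and have the model "remember" it. But long contexts have an underestimated problem, Lost in the Middle: once the context exceeds a certain length, the model's attention to content in the middle drops significantly.

At its core, Karpathy's approach is an architectural decoupling:

Knowledge base = external disk: persistent storage that can be indexed and version-controlled
LLM = CPU: reads on demand, reasons, and writes results back

With this design, knowledge no longer disappears when a session ends. The context of each conversation is temporary, but the knowledge layer persists.

Paradigm shift: from memorization to searchable retrieval

Old paradigm	New paradigm
Make the model "remember" all information	Make the system able to find information quickly
Depends on a very large context window	Depends on high-quality indexes and summaries
Token cost grows linearly with the amount of knowledge	Token cost is decoupled from the amount of knowledge
Knowledge evaporates when the session ends	Every query deposits knowledge

Cost efficiency: why this approach is more economical

This is the hard metric that technically minded readers care about. Assume a knowledge base of 400,000 words (about 300K tokens):

Approach	Tokens per query	Estimated cost	Effective attention coverage
Entire knowledge base in context	~300K tokens	$0.9/query (Claude 3.5)	Content in the middle is likely to be "forgotten"
RAG retrieval over the wiki	~5K tokens (summaries + index)	$0.015/query	Hits the relevant content precisely

Retrieving wiki summaries through RAG cuts the cost of a single inference by more than 90% (about 98% with the figures in the table). In addition, XML tag isolation (described later) lets you instruct the model to attend only to the content of <summary> tags, which mitigates attention decay over long text.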
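
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a price of $3 per million input tokens, which is consistent with the $0.9 figure for 300K tokens; that price, and ignoring output tokens, are assumptions of this example rather than something the table states.

```typescript
// Assumed input price in USD per million tokens (matches $0.9 for 300K tokens).
const USD_PER_MILLION_INPUT_TOKENS = 3;

function queryCostUsd(inputTokens: number): number {
  return (inputTokens / 1_000_000) * USD_PER_MILLION_INPUT_TOKENS;
}

const fullContextCost = queryCostUsd(300_000); // whole knowledge base in context
const retrievalCost = queryCostUsd(5_000);     // summaries + index only

// Fraction of the per-query input cost saved by retrieval.
const savings = 1 - retrievalCost / fullContextCost;
```

Because both costs scale with the same price, the savings ratio depends only on the token counts: 5K out of 300K leaves about 1.7% of the original input cost.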

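Core technical value: Karpathy's approach trades the model's cheap reasoning for expensive manual curation. By building a structured knowledge layer that maintains and heals itself, we are in effect reserving a "high-level operating interface" for future fully autonomous agents. This is not note-taking but writing a knowledge-driven operating system.

Data flow: the compilation pipeline from raw/ to wiki/

Described from an engineering point of view, the data flows as follows: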


raw/ (raw data)
  │
  ▼
┌─────────────────────────────────────┐
│ Compiler Agent                      │
│ - Semantic chunking                 │
│ - Entity extraction                 │
│ - Embedding generation              │
│ - Backlink creation                 │
└─────────────────────────────────────┘
  │
  ▼
wiki/ (structured knowledge base)
  │
  ├── index.json      # search index
  ├── embeddings.db   # vector store
  └── entries/*.md    # Markdown source files

The standard schema for a compiled entry (from a TypeScript point of view):


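interface WikiEntry {
  id: string;                  // semantic hash ID derived from the content
  title: string;
  summary: string;             // AI-generated summary
  source_ref: string[];        // paths of the original files, for traceability
  entities: string[];          // extracted core entities (for the knowledge graph)
  backlinks: string[];         // IDs of the entries targeted by backlinks
  embedding_v: number[];       // 1536-dim semantic vector for similarity search
  last_heal_timestamp: number; // Unix timestamp of the last self-heal
  version: number;             // entry version, incremented on every Heal
}

This schema gives the knowledge base:

Traceability: source_ref leads back to the original material
Searchability: embedding_v enables semantic search
Maintainability: last_heal_timestamp tracks data freshness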
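
To make the searchability point concrete, here is a minimal sketch of a semantic lookup over `embedding_v` using cosine similarity and a brute-force scan. `IndexedEntry`, `cosineSimilarity`, and `topK` are illustrative names, and a real system with many entries would more likely use a vector index than a full sort.

```typescript
// Only the fields needed for retrieval; a subset of the schema above.
interface IndexedEntry {
  id: string;
  embedding_v: number[];
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the k entries most similar to the query vector, best first.
function topK(query: number[], entries: IndexedEntry[], k: number): IndexedEntry[] {
  return [...entries]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding_v) -
        cosineSimilarity(query, x.embedding_v),
    )
    .slice(0, k);
}
```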

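III. Implementation

Step 1: Import the data

Key points:

Put all raw material (articles, papers, code, etc.) into one folder (raw/)
No manual sorting or categorizing beforehand
Use an LLM to compile the raw material into a structured, Wikipedia-style format

What the AI handles automatically:

Generates a summary for each source
Builds backlinks between pieces of content ([[Backlinks]])
Classifies and files concepts
Writes new synthesized content based on the existing material

Recommended tool:

Obsidian Web Clipper: converts a web page to a Markdown file in one click and downloads its images locally

Step 2: View the data in a frontend

What you view:

The raw data (the raw/ folder)
The compiled wiki
Generated visualizations and charts

Recommended tool:

Obsidian: serves as the browsing panel and supports many plugins (such as Marp for generating slides)

Key characteristics:

Almost all wiki content is written and maintained by the LLM
Humans rarely edit the content directly

Technical enhancement: annotating data with XML tag isolation

When presenting structured data, the XML tag isolation approach recommended by Anthropic can noticeably improve how precisely an agent locates content: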


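<kb_entry id="rag_basics">
  <title>RAG basics</title>
  <summary>The three layers of retrieval-augmented generation: Embedding, Retrieval, Generation</summary>
  <backlinks>[[Vector database]] [[Semantic search]]</backlinks>
  <last_updated>2026-04-01</last_updated>
</kb_entry>

Engineering advantages of XML tags:

Clear boundaries: the agent can identify exactly where an entry starts and ends, avoiding cross-talk between entries
Parseable fields: each tag has a definite meaning, reducing parsing ambiguity
Focused attention: the prompt can say "read only the <summary> tags" to steer the model past redundant information
Easy batch processing: later Lint passes can process content by tag type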
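
A minimal sketch of producing and consuming this format follows: one function renders an entry into the tagged layout, and another pulls out only the `<summary>` text to build a compact prompt. `KbEntry`, `renderEntry`, and `extractSummaries` are hypothetical helpers, and the regex-based extraction assumes well-formed, non-nested `<summary>` tags.

```typescript
interface KbEntry {
  id: string;
  title: string;
  summary: string;
  backlinks: string[];
  lastUpdated: string; // ISO date, e.g. "2026-04-01"
}

// Escape the characters that would otherwise break the markup.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function renderEntry(e: KbEntry): string {
  const links = e.backlinks.map((b) => `[[${escapeXml(b)}]]`).join(" ");
  return [
    `<kb_entry id="${escapeXml(e.id)}">`,
    `  <title>${escapeXml(e.title)}</title>`,
    `  <summary>${escapeXml(e.summary)}</summary>`,
    `  <backlinks>${links}</backlinks>`,
    `  <last_updated>${escapeXml(e.lastUpdated)}</last_updated>`,
    `</kb_entry>`,
  ].join("\n");
}

// Collect the text of every <summary> element in a rendered document.
function extractSummaries(xml: string): string[] {
  const re = /<summary>([\s\S]*?)<\/summary>/g;
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) out.push(m[1]);
  return out;
}
```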

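Step 3: Use it and keep improving it in a loop

How to use it:

Ask the AI a question directly; it searches the knowledge base and answers
Karpathy's practice: a knowledge base of 100 articles (about 400,000 words), with no complex RAG machinery
The key: as long as the LLM keeps the index files and summaries well maintained, retrieval is efficient

Technical enhancement: optimizing the retrieval layer

Under high query volume, the design of the retrieval layer directly affects the user experience. The following engineering practices are worth considering:

Index cache layer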


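Query request
  │
  ▼
┌─────────────────┐
│ Redis Cache     │ ← hit rate > 80% for frequent queries
└─────────────────┘
  │ (miss)
  ▼
┌─────────────────┐
│ Embedding DB    │ ← semantic similarity search
└─────────────────┘
  │
  ▼
Return Top-K entries + update cache (TTL: 1h)

Cache hot entries in Redis: for recurring query patterns, cache the retrieval results
TTL policy: adjust expiry dynamically based on how often entries are updated
Prefetching: when an entry is accessed frequently, preload the content linked through its Backlinks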
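
The read path in the diagram is a cache-aside lookup with a TTL. The sketch below uses an in-memory map as a stand-in for Redis so that it stays self-contained; `TtlCache` and `cachedSearch` are illustrative names, and the injectable clock exists only to make expiry testable.

```typescript
// In-memory stand-in for Redis (an assumption of this sketch).
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.store.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Cache-aside: serve from cache on a hit, otherwise search and store the result.
async function cachedSearch<V>(
  query: string,
  cache: TtlCache<V>,
  search: (q: string) => Promise<V>,
): Promise<V> {
  const cached = cache.get(query);
  if (cached !== undefined) return cached;
  const result = await search(query);
  cache.set(query, result);
  return result;
}
```

With a one-hour TTL as in the diagram, the cache would be created as `new TtlCache<string[]>(60 * 60 * 1000)`.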

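Concurrent writes under Git version control

Because the knowledge base is a "running system", several agents may run Heal operations at the same time. Managing versions with Git is a clean way to handle this: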


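# Underlying Git flow when the AI runs a Heal operation
git checkout -b heal/rag-definition-20260409
# ... AI edits files ...
# Stage the edits so the commit has something to record
git add -A
git commit -m "[Heal] Unify the RAG definition and resolve the conflict with the vector retrieval section"
git push origin heal/rag-definition-20260409
# Triggers CI to rebuild the index

Benefits:

Traceable operations: every AI edit has a commit hash
Reversible changes: if a Heal introduces an error, git revert undoes it
Concurrency safety: branches keep multiple agents from writing to the main branch at the same time

The reinforcing loop:

The output of every query is filed back into the wiki
Your own explorations and questions keep accumulating in the knowledge base
The knowledge base keeps growing and improving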

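IV. Engineering core: the automated Lint + Heal loop

This is the most elegantly engineered part of Karpathy's approach. It is not just "periodic tidying" but a complete automated quality assurance system.

Breaking down the mechanisms

Mechanism	Implementation path	Expected effect
Lint (scan and detect)	Periodically start an agent that walks the whole Markdown library and uses embedding similarity and semantic analysis to find inconsistencies between entries	Automatically finds logical conflicts and outdated information
Heal (self-repair)	When missing or conflicting information is detected, automatically call an external search tool (such as the Perplexity API or a web scraper) to fill in the data and rewrite the conflicting entries	Keeps the knowledge base accurate and up to date
Backlinks optimization	Following Wikipedia's logic, the AI automatically creates bidirectional links between concepts, increasing the connection density of the knowledge graph	Improves associative path-finding during RAG retrieval

Example workflow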


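[Scheduled job triggers Lint]
  ↓
Scan all .md files under /wiki/
  ↓
Anomalies found:
- "RAG definition" and "How vector retrieval works" describe the similarity algorithm inconsistently
- A paper link cited by one entry is dead
- 3 new raw/ documents have not yet been compiled into the wiki
  ↓
[Trigger the Heal flow]
  ↓
- Call the search tool to consult authoritative sources and unify the algorithm description
- Automatically look for an archived copy of the dead link or an alternative source
- Compile the new raw/ material into the wiki using the template
  ↓
[Generate a repair report] → saved to /changelog/

This mechanism turns the knowledge base from "static documents" into a "living system" that finds problems and works out how to fix them on its own.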
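
Part of a Lint pass is purely mechanical and needs no model at all. The sketch below covers two such checks, backlinks that point at missing entries and entries that have not been healed recently; semantic conflict detection, which the workflow above also lists, would need embeddings or an LLM and is not shown. `lintEntries`, `LintIssue`, and the `maxAgeSec` parameter are assumptions made for this example.

```typescript
// Subset of the entry schema that these checks need.
interface LintEntry {
  id: string;
  backlinks: string[];
  last_heal_timestamp: number; // Unix seconds
}

interface LintIssue {
  entryId: string;
  kind: "dangling_backlink" | "stale";
  detail: string;
}

function lintEntries(
  entries: LintEntry[],
  nowSec: number,
  maxAgeSec: number,
): LintIssue[] {
  const ids = new Set(entries.map((e) => e.id));
  const issues: LintIssue[] = [];
  for (const e of entries) {
    // A backlink is dangling when its target ID is not in the wiki.
    for (const target of e.backlinks) {
      if (!ids.has(target)) {
        issues.push({ entryId: e.id, kind: "dangling_backlink", detail: target });
      }
    }
    // An entry is stale when its last Heal is older than the allowed age.
    const age = nowSec - e.last_heal_timestamp;
    if (age > maxAgeSec) {
      issues.push({ entryId: e.id, kind: "stale", detail: `${age}s since last heal` });
    }
  }
  return issues;
}
```

The returned list is the natural input to a Heal step: each issue names the entry and the kind of repair it needs.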

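V. Advanced integration: the knowledge base as an agent's "dynamic weights"

If you are building an agent system, this methodology can be coupled more deeply with current frameworks.

Combining with MemGPT: dynamic context injection

MemGPT introduced the idea of "memory paging": treat the LLM's context window as physical memory and external storage as virtual memory, and use paging to get past the context length limit.

Karpathy's knowledge base fits naturally as MemGPT's persistent storage layer:

When the agent runs a task, it retrieves relevant entries from the knowledge base based on the current topic
The entries are swapped into the active window (much like paging in an operating system)
When the task ends, the resulting insights are written back to the knowledge base

This gives the agent "long-term memory" that spans sessions.

Combining with DSPy: a baseline for prompt compilation

The core idea of DSPy is to turn prompt engineering into an optimizable compilation process.

High-quality summaries in the knowledge base can serve as a pool of few-shot examples:

The DSPy compiler automatically selects the historical cases most relevant to the current task from the knowledge base
It assembles them into a few-shot prompt dynamically
After several rounds of optimization, it compiles system instructions tuned for the specific task

Here the knowledge base plays the role of "training data", except that it is alive and keeps evolving.

VI. The mathematical logic of the reinforcing loop

The value of the knowledge base does not grow linearly; it compounds: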


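Input archived → Query triggered → Feedback deposited → Index improved → Next query more precise
      ↑                                                                             ↓
      └──────────────────────── lower cost, higher accuracy ←───────────────────────┘

A quantitative view

Assume:

The initial knowledge base has N entries and an average retrieval precision of P₀
Each query writes M new insights back into the knowledge base
After K queries:
- Number of entries: N + K×M
- Knowledge network density: potential connections on the order of O((N + K×M)²)
- Retrieval precision: assumed to increase monotonically with density
- Token cost per query: falls as index quality improves

Using it is building it. That is the most elegant part of this methodology: it folds the "maintenance cost" into the "benefit of use".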
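
The two countable quantities in this model are easy to compute directly. The sketch below evaluates the entry count N + K×M and the number of potential pairwise links, n(n−1)/2, which is what the O(n²) figure refers to; the function names and the sample values in the comments are illustrative only. The claims about precision and token cost are assumptions of the model and are not something this code demonstrates.

```typescript
// Entries after k queries, starting from n entries with m insights per query.
function entriesAfter(n: number, k: number, m: number): number {
  return n + k * m;
}

// Number of unordered pairs among n entries: n(n-1)/2, which grows as O(n^2).
function potentialLinks(n: number): number {
  return (n * (n - 1)) / 2;
}

// Example: 100 entries, 50 queries, 2 insights written back per query.
const grown = entriesAfter(100, 50, 2);
const linksBefore = potentialLinks(100);
const linksAfter = potentialLinks(grown);
```

In this example the entry count doubles while the number of potential links roughly quadruples, which is the sense in which growth compounds.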

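VII. Core value and impact

1. From "storage tool" to "running system"

Traditional knowledge base	Karpathy-style knowledge base
Needs constant human maintenance	Continuously organized and updated by an LLM
Static storage	Evolves dynamically
Goes stale easily	Self-optimizing

2. Impact on agents

Key advantages pointed out in online commentary:

A persistent knowledge layer: no longer temporary shared memory, but a living knowledge base
No need for unlimited context: good file organization and indexing are enough
More economical: cheaper and more scalable than giant prompts
Inspectability: easier to understand and debug

3. A positive cycle of knowledge accumulation

"Every query makes the wiki better. It keeps accumulating, and by now it is like a self-building second brain."

VIII. Summary

Karpathy's personal knowledge base method points to a new direction for AI applications:

Architectural decoupling: storage and compute are separated, and the knowledge layer persists
Automated loop: Lint + Heal provide self-maintenance
Paradigm shift: from memorization to searchable retrieval
Cost advantage: RAG retrieval saves 90%+ of tokens compared with putting everything in context
Engineering completeness: schema design, a cache layer, and Git version control give it the skeleton of a production-grade system
Agent infrastructure: a persistent knowledge layer for agents

The core technical value of Karpathy's approach is that it trades the model's cheap reasoning for expensive manual curation. By building a structured knowledge layer that maintains and heals itself, we are in effect reserving a "high-level operating interface" for future fully autonomous agents. This is not note-taking but writing a knowledge-driven operating system.

Appendix: glossary of technical terms

Term	Explanation
RAG	Retrieval-Augmented Generation
Backlinks	Reverse links; a Wikipedia-style bidirectional reference mechanism
Lint	A code/document quality scanning tool; here, consistency checking of the knowledge base
Token	The smallest unit of text an LLM processes
Context Window	The maximum context length an LLM can process at once
Embedding	A technique for converting text into a vector representation
Few-shot	Providing a small number of examples in the prompt to guide the model's output
TTL	Time To Live, the cache expiry time
Lost in the Middle	The phenomenon where a model's attention to content in the middle of a long context decays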


Harness Design for Long-Running Application Development

Tue, 24 Mar 2026 00:00:00 +0000

Author: Prithvi Rajasekaran (Anthropic Labs Team)
Published: March 24, 2026

Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.

Why naive implementations fall short

We’ve previously shown that harness design has a substantial impact on the effectiveness of long running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the “Ralph Wiggum” method using hooks or scripts to keep agents in continuous iteration cycles.

But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time. While decomposing this issue, we observed two common failure modes with agents executing these sorts of tasks.

First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit “context anxiety,” in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets (clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent’s state and the next steps) address both of these issues.

This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn’t give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of the handoff artifact having enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn’t sufficient to enable strong long task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.

A second issue, which we haven’t previously addressed, is self-evaluation. When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.

However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn’t immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.

Frontend design: making subjective quality gradable

I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

Two insights shaped the harness I built for frontend design. First, while aesthetics can’t be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. “Is this design beautiful?” is hard to answer consistently, but “does this follow our principles for good design?” gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.

With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:

Design quality: Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.

Originality: Is there evidence of custom decisions, or is this template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.

Craft: Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.

Functionality: Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?

I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic “AI slop” patterns, and by weighting design and originality more heavily it pushed the model toward more aesthetic risk-taking.

I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator’s judgment aligned with my preferences, and reduced score drift across iterations.

I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator’s critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn’t working.
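
The shape of this loop can be sketched independently of any SDK. In the sketch below, `generate` and `evaluate` are stubs standing in for the generator agent and the Playwright-driven evaluator; their signatures, the `runLoop` name, and the numeric weights are assumptions for illustration (the post says only that design quality and originality were weighted more heavily, not by how much).

```typescript
interface Scores {
  designQuality: number;
  originality: number;
  craft: number;
  functionality: number;
}

interface Evaluation {
  scores: Scores;
  critique: string;
}

// Stand-ins for the two agents; real implementations would call a model.
type Generate = (prompt: string, feedback: string | null) => Promise<string>;
type Evaluate = (artifact: string) => Promise<Evaluation>;

// Illustrative weights only: design quality and originality count for more.
const WEIGHTS: Scores = {
  designQuality: 0.35,
  originality: 0.35,
  craft: 0.15,
  functionality: 0.15,
};

function weightedScore(s: Scores): number {
  return (
    s.designQuality * WEIGHTS.designQuality +
    s.originality * WEIGHTS.originality +
    s.craft * WEIGHTS.craft +
    s.functionality * WEIGHTS.functionality
  );
}

// Each round feeds the evaluator's critique back to the generator.
// The best-scoring artifact is kept, since the last round is not always the best.
async function runLoop(
  prompt: string,
  generate: Generate,
  evaluate: Evaluate,
  iterations: number,
): Promise<{ artifact: string; score: number }> {
  let feedback: string | null = null;
  let best = { artifact: "", score: -Infinity };
  for (let i = 0; i < iterations; i++) {
    const artifact = await generate(prompt, feedback);
    const evaluation = await evaluate(artifact);
    const score = weightedScore(evaluation.scores);
    if (score > best.score) best = { artifact, score };
    feedback = evaluation.critique;
  }
  return best;
}
```

Tracking the best round rather than returning the final one is a design choice of this sketch, motivated by the observation that a middle iteration is sometimes preferable to the last.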

Across runs, the evaluator’s assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in ways I didn’t fully anticipate. Including phrases like “the best designs are museum quality” pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.

While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator’s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.

In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn’t seen before from a single-pass generation.

Scaling to full-stack coding

With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

The architecture

In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the “context anxiety” tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK’s automatic compaction handling context growth along the way.

For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I’d observed in prior runs. The system contained the following agent personas:

Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs.

Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.
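
The hard-threshold rule described for the evaluator is simple to state in code: a sprint passes only if every criterion meets its own threshold. In this sketch the criterion keys follow the four areas named above, while `gradeSprint` and any threshold values passed to it are hypothetical, since the post does not give the actual numbers.

```typescript
type Criterion = "productDepth" | "functionality" | "visualDesign" | "codeQuality";
type SprintScores = Record<Criterion, number>;

// A sprint fails if any single criterion falls below its threshold.
function gradeSprint(
  scores: SprintScores,
  thresholds: SprintScores,
): { passed: boolean; failing: Criterion[] } {
  const failing = (Object.keys(thresholds) as Criterion[]).filter(
    (c) => scores[c] < thresholds[c],
  );
  return { passed: failing.length === 0, failing };
}
```

Returning the list of failing criteria, not just a boolean, mirrors the requirement that the generator receive detailed feedback on what went wrong.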

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what “done” looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.

Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.

Running the harness

For the first version of this harness, I used Claude Opus 4.5, running user prompts against both the full harness and a single-agent system for comparison. I used Opus 4.5 since this was our best coding model when I began these experiments.

I wrote the following prompt to generate a retro video game maker:

Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.

The table below shows the harness type, length it ran for, and the total cost.

Harness	Duration	Cost
Solo	20 min	$9
Full harness	6 hr	$200

The harness was over 20x more expensive, but the difference in output quality was immediately apparent.

I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run’s output, and the initial application seemed in line with those expectations.

As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.

After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the specific implementation details for the sprint, and the testable behaviors that would be tested to verify completion.

The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I’d seen in the solo run did remain—the workflow still didn’t make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model’s product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.

Working through the editors, the new run’s advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.

Because I’d asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.

The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI’s game level construction. There was a large wall that I wasn’t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.

Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract’s test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator’s findings were specific enough to act on without extra investigation. The table below shows several examples of issues our evaluator identified:

Contract criterion	Evaluator finding
Rectangle fill tool allows click-drag to fill a rectangular area with selected tile	FAIL — Tool only places tiles at drag start/end points instead of filling the region. `fillRectangle` function exists but isn’t triggered properly on mouseUp.
User can select and delete placed entity spawn points	FAIL — Delete key handler at `LevelEditor.tsx:892` requires both `selection` and `selectedEntityId` to be set, but clicking an entity only sets `selectedEntityId`. Condition should be `selection
User can reorder animation frames via API	FAIL — `PUT /frames/reorder` route defined after `/{frame_id}` routes. FastAPI matches ‘reorder’ as a frame_id integer and returns 422: “unable to parse string as an integer.”
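
The third finding in the table is a route-ordering problem, and it can be reproduced without a web framework. The sketch below is a toy model of first-match routing, not FastAPI's actual implementation; `make_router`, `update_frame`, and `reorder_frames` are hypothetical names invented for illustration.

```python
import re

def make_router(routes):
    """Return a dispatcher that tries routes in declaration order (first match wins)."""
    compiled = [(method, re.compile(pattern), handler) for method, pattern, handler in routes]

    def dispatch(method, path):
        for route_method, regex, handler in compiled:
            match = regex.fullmatch(path)
            if route_method == method and match:
                return handler(**match.groupdict())
        return 404, "not found"

    return dispatch

def update_frame(frame_id):
    # The path parameter is declared as an integer, so non-numeric text is rejected.
    try:
        frame_id = int(frame_id)
    except ValueError:
        return 422, "unable to parse string as an integer"
    return 200, f"updated frame {frame_id}"

def reorder_frames():
    return 200, "frames reordered"

PARAM_ROUTE = ("PUT", r"/frames/(?P<frame_id>[^/]+)", update_frame)
STATIC_ROUTE = ("PUT", r"/frames/reorder", reorder_frames)

buggy = make_router([PARAM_ROUTE, STATIC_ROUTE])  # parameterized route declared first
fixed = make_router([STATIC_ROUTE, PARAM_ROUTE])  # static route declared first

print(buggy("PUT", "/frames/reorder"))  # (422, 'unable to parse string as an integer')
print(fixed("PUT", "/frames/reorder"))  # (200, 'frames reordered')
print(fixed("PUT", "/frames/7"))        # (200, 'updated frame 7')
```

Declaring the static path before the parameterized one is enough to resolve the conflict in this model, which matches the ordering issue the evaluator reported.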

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren’t a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator’s logs, find examples where its judgment diverged from mine, and update the QA agent’s prompt to solve for those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model’s QA capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn’t exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn’t work, the lift was obvious.

Iterating on the harness

The first set of harness results was encouraging, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as “find the simplest solution possible, and only increase complexity when needed,” and it’s a pattern that shows up consistently for anyone maintaining an agent harness.

In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn’t able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.

As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: “[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes.” It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

Removing the sprint construct

I started by removing the sprint construct entirely. The sprint structure had helped to decompose work into chunks for the model to work coherently. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.

I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.

With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Because the model was now much more capable, the evaluator was no longer equally load-bearing on every run: its usefulness depended on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model’s raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator’s check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator’s capabilities, the evaluator continued to give real lift.

The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.

Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app’s own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude’s training data covers it thinly. But with enough tuning, the generator was building agents correctly.

Results from the updated harness

To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:

Build a fully featured DAW in the browser using the Web Audio API.

The run was still lengthy and expensive, at about 4 hours and $124 in token costs. Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

Agent & Phase	Duration	Cost
Planner	4.7 min	$0.46
Build (Round 1)	2 hr 7 min	$71.08
QA (Round 1)	8.8 min	$3.24
Build (Round 2)	1 hr 2 min	$36.89
QA (Round 2)	6.8 min	$3.09
Build (Round 3)	10.9 min	$5.88
QA (Round 3)	9.6 min	$4.06
Total V2 Harness	3 hr 50 min	$124.70
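
The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the per-phase figures from the table (durations converted to minutes) and also computes how the cost splits between build and QA; the variable names are illustrative only.

```python
# (phase, duration in minutes, cost in USD), copied from the table above
phases = [
    ("Planner", 4.7, 0.46),
    ("Build (Round 1)", 127.0, 71.08),
    ("QA (Round 1)", 8.8, 3.24),
    ("Build (Round 2)", 62.0, 36.89),
    ("QA (Round 2)", 6.8, 3.09),
    ("Build (Round 3)", 10.9, 5.88),
    ("QA (Round 3)", 9.6, 4.06),
]

total_minutes = sum(minutes for _, minutes, _ in phases)
total_cost = round(sum(cost for _, _, cost in phases), 2)
build_cost = round(sum(cost for name, _, cost in phases if name.startswith("Build")), 2)
qa_cost = round(sum(cost for name, _, cost in phases if name.startswith("QA")), 2)

hours, minutes = divmod(round(total_minutes), 60)
print(f"{hours} hr {minutes} min, ${total_cost:.2f}")   # 3 hr 50 min, $124.70
print(f"build ${build_cost:.2f}, QA ${qa_cost:.2f}")    # build $113.85, QA $10.39
print(f"build share of cost: {build_cost / total_cost:.0%}")  # 91%
```

On these numbers the three QA passes account for under a tenth of the total cost, with nearly all of the rest going to the build rounds.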

As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.

That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:

This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can’t be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren’t edge cases — they’re the core interactions that make a DAW usable, and the spec explicitly calls for them.

In its second round feedback, it again caught several functionality gaps:

Remaining gaps:

Audio recording is still stub-only (button toggles but no mic capture)

Clip resize by edge drag and clip split not implemented

Effect visualizations are numeric sliders, not graphical (no EQ curve)

The generator was still liable to miss details or stub features when left to its own devices, and the QA still added value in catching those last mile issues for the generator to fix.

Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.

The app is far from a professional music production program, and the agent’s song composition skills could clearly use a lot of work. Additionally, Claude can’t actually hear, which made the QA feedback loop less effective with respect to musical taste.

But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it’s not pitch-perfect yet—but it’s getting there.

What comes next

As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.

With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you’re building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.

From this work, my conviction is that the space of interesting harness combinations doesn’t shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.

Acknowledgements

Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.

Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.

Appendix

Example plan generated by planner agent:

RetroForge - 2D Retro Game Maker

Overview

RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools, enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.

The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.

RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.

Features

Project Dashboard & Management

The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects: creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.

User Stories: As a user, I want to:

Create a new game project with a name and description, so that I can begin designing my game
See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
Open any project to enter the full game editor workspace, so that I can work on my game
Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work

Project Data Model: Each project contains:

Project metadata (name, description, created/modified timestamps)
Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)
Tile size configuration (8x8, 16x16, or 32x32 pixels)
Color palette selection
All associated sprites, tilesets, levels, and entity definitions

长期应用开发的Harness设计

Tue, 24 Mar 2026 00:00:00 +0000

作者：Prithvi Rajasekaran (Anthropic Labs Team)
发布日期：2026年3月24日

在过去几个月里，我一直在研究两个相互关联的问题：让 Claude 产出高质量的前端设计，以及让它在无需人工干预的情况下构建完整的应用程序。这项工作源于我们早期在前端设计技能和长期编码智能体框架上的努力，我和同事们通过提示工程和框架设计将 Claude 的性能提升到远超基线水平——但两者最终都遇到了瓶颈。

为了突破这一瓶颈，我寻找了适用于两个截然不同领域的新型 AI 工程方法，一个由主观品味定义，另一个由可验证的正确性和可用性定义。受生成对抗网络（GANs）的启发，我设计了一个包含生成器和评估器智能体的多智能体结构。构建一个能够可靠地——并且有品味地——评分输出的评估器，意味着首先要开发一套标准，能够将"这个设计好吗？“这样的主观判断转化为具体的、可评分的术语。

然后，我将这些技术应用于长期自主编码，延续了我们早期框架工作中的两个经验：将构建分解为可处理的块，以及使用结构化工件在会话之间传递上下文。最终结果是一个三智能体架构——规划器、生成器和评估器——在多小时的自主编码会话中产出了丰富的全栈应用程序。

为什么简单实现会失败

我们之前已经展示过，框架设计对长期智能体编码的有效性有着重大影响。在早期的实验中，我们使用初始化智能体将产品规格分解为任务列表，以及一个编码智能体逐个功能实现任务，然后传递工件以在会话之间传递上下文。更广泛的开发者社区也趋同于类似的见解，例如使用钩子或脚本让智能体保持持续迭代循环的"Ralph Wiggum"方法。

但一些问题仍然存在。对于更复杂的任务，智能体随着时间推移仍然倾向于偏离轨道。在分解这个问题时，我们观察到智能体执行此类任务时的两种常见失败模式。

首先是模型在冗长任务中随着上下文窗口填满而失去连贯性（参见我们关于上下文工程的文章）。一些模型还表现出"上下文焦虑”，即当它们接近自己认为的上下文限制时，会过早地开始收尾工作。上下文重置——完全清除上下文窗口并启动一个新的智能体，结合传递前一个智能体状态和下一步骤的结构化交接——解决了这两个问题。

这与压缩不同，压缩是将对话的早期部分就地总结，以便同一智能体可以在缩短的历史记录上继续工作。虽然压缩保持了连续性，但它不会给智能体一个干净的起点，这意味着上下文焦虑仍然可能持续存在。重置提供了一个干净的起点，代价是交接工件必须有足够的状态让下一个智能体能够顺利接手工作。在我们早期的测试中，我们发现 Claude Sonnet 4.5 表现出足够强的上下文焦虑，以至于仅靠压缩不足以实现强大的长任务性能，因此上下文重置成为框架设计的关键。这解决了核心问题，但为每次框架运行增加了编排复杂性、令牌开销和延迟。

第二个问题是自我评估，我们之前没有解决过。当被要求评估自己产出的工作时，智能体倾向于自信地赞扬这些工作——即使对人类观察者来说，质量明显平庸。这个问题在设计等主观任务上尤为突出，因为没有类似可验证软件测试的二元检查。布局是否感觉精致或普通是一个判断性问题，而智能体在评分自己的工作时可靠地倾向于积极评价。

然而，即使在确实有可验证结果的任务上，智能体有时仍然表现出糟糕的判断力，这会妨碍它们完成任务时的性能。将执行工作的智能体与评判工作的智能体分离，被证明是解决这个问题的有力杠杆。这种分离本身并不能立即消除那种宽容；评估器仍然是一个倾向于对 LLM 生成的输出慷慨的 LLM。但调整一个独立的评估器使其持怀疑态度，结果证明比让生成器批评自己的工作要容易得多，而一旦存在外部反馈，生成器就有了具体的迭代目标。

前端设计：让主观质量可评分

我从前端设计开始实验，因为自我评估问题在这里最为明显。在没有任何干预的情况下，Claude 通常倾向于安全、可预测的布局，这些布局在技术上是功能性的，但在视觉上并不出众。

两个见解塑造了我为前端设计构建的框架。首先，虽然美学不能完全简化为分数——个人品味总是会有所不同——但可以通过编码设计原则和偏好的评分标准来改进它们。“这个设计漂亮吗？“很难一致地回答，但"这是否遵循我们的良好设计原则？“给了 Claude 一些具体的评分依据。其次，通过将前端生成与前端评分分离，我们可以创建一个反馈循环，推动生成器产出更强的输出。

考虑到这一点，我编写了四个评分标准，并将它们提供给生成器和评估器智能体的提示中：

设计质量： 设计是否感觉像一个连贯的整体，而不是部分的集合？这方面的强大工作意味着颜色、排版、布局、图像和其他细节结合起来创造出独特的氛围和身份。

原创性： 是否有自定义决策的证据，还是这只是模板布局、库默认值和 AI 生成的模式？人类设计师应该能够识别出深思熟虑的创意选择。未修改的库存组件——或 AI 生成的明显迹象，如白色卡片上的紫色渐变——在这里会失败。

工艺： 技术执行：排版层次、间距一致性、色彩和谐、对比度。这是能力检查而不是创造力检查。大多数合理的实现默认情况下在这里表现良好；失败意味着基础被破坏。

功能性： 独立于美学的可用性。用户能否理解界面的功能，找到主要操作，并在不猜测的情况下完成任务？

我强调设计质量和原创性而不是工艺和功能性。Claude 在工艺和功能性上默认得分就很好，因为所需的技术能力往往是模型自然具备的。但在设计和原创性方面，Claude 经常产出充其量只能说是平淡的输出。这些标准明确惩罚高度通用的"AI 垃圾"模式，通过更重视设计和原创性，它推动模型进行更多的美学冒险。

我使用带有详细分数分解的少样本示例来校准评估器。这确保了评估器的判断与我的偏好一致，并减少了迭代之间的分数漂移。

我在 Claude Agent SDK 上构建了这个循环，这使得编排变得简单明了。生成器智能体首先根据用户提示创建 HTML/CSS/JS 前端。我给评估器提供了 Playwright MCP，让它在评分每个标准和撰写详细评论之前直接与实时页面交互。在实践中，评估器会自行浏览页面，在产生评估之前截图并仔细研究实现。该反馈作为下一次迭代的输入流回生成器。我每次生成运行 5 到 15 次迭代，每次迭代通常会随着生成器响应评估器的批评而将其推向更独特的方向。由于评估器是主动浏览页面而不是对静态截图评分，每个周期都需要实际的时钟时间。完整运行最长可达四个小时。我还指示生成器在每次评估后做出战略决策：如果分数趋势良好则完善当前方向，或者如果方法不起作用则完全转向不同的美学方向。

在各次运行中，评估器的评估在迭代中改善，然后趋于平稳，仍有改进空间。一些生成逐步完善。其他生成在迭代之间采取了急剧的美学转变。

标准的措辞以我没有完全预料到的方式引导了生成器。包含"最好的设计是博物馆级别的"这样的短语将设计推向了特定的视觉趋同，表明与标准相关的提示直接塑造了输出的特征。

虽然分数通常在迭代中提高，但模式并不总是清晰的线性。后期的实现往往整体上更好，但我经常看到我更喜欢中间迭代而不是最后一个的情况。实现复杂性也倾向于在各轮中增加，生成器响应评估器的反馈而寻求更雄心勃勃的解决方案。即使在第一次迭代中，输出也明显优于完全没有提示的基线，这表明标准和相关语言本身在任何评估器反馈导致进一步完善之前就将模型引导远离了通用默认值。

在一个值得注意的例子中，我提示模型为一家荷兰艺术博物馆创建一个网站。到第九次迭代时，它为一个虚构的博物馆制作了一个干净的深色主题登陆页面。该页面在视觉上很精致，但基本符合我的预期。然后，在第十个周期，它完全放弃了这种方法，将网站重新想象为一种空间体验：一个用 CSS 透视渲染的带有棋盘地板的 3D 房间，艺术品以自由形式的位置挂在墙上，以及基于门道的画廊房间之间的导航，而不是滚动或点击。这是我以前从未在单次生成中见过的那种创造性飞跃。

扩展到全栈编码

有了这些发现，我将这种受 GAN 启发的模式应用于全栈开发。生成器-评估器循环自然地映射到软件开发生命周期，其中代码审查和 QA 与设计评估器扮演相同的结构角色。

架构

在我们早期的长期运行框架中，我们通过初始化智能体、逐个功能工作的编码智能体以及会话之间的上下文重置来解决连贯的多会话编码问题。上下文重置是一个关键突破：该框架使用 Sonnet 4.5，它表现出前面提到的"上下文焦虑"倾向。创建一个在上下文重置中运行良好的框架是保持模型专注于任务的关键。Opus 4.5 在很大程度上自行消除了这种行为，因此我能够完全从这个框架中删除上下文重置。智能体在整个构建过程中作为一个连续会话运行，Claude Agent SDK 的自动压缩处理了上下文增长。

对于这项工作，我在原始框架的基础上构建了一个三智能体系统，每个智能体都解决了我在之前运行中观察到的特定差距。该系统包含以下智能体角色：

规划器： 我们之前的长期运行框架要求用户预先提供详细的规格。我想自动化这一步骤，所以我创建了一个规划器智能体，它接受一个简单的 1-4 句提示并将其扩展为完整的产品规格。我提示它对范围要有雄心，并专注于产品上下文和高层技术设计，而不是详细的技术实现。这种强调是因为担心如果规划器试图预先指定细粒度的技术细节并出错，规格中的错误会级联到下游实现中。让智能体专注于要产出的交付物并让它们在工作时找出路径似乎更明智。我还要求规划器寻找将 AI 功能融入产品规格的机会。

生成器： 早期框架中的逐个功能方法在范围管理方面效果很好。我在这里应用了类似的模型，指示生成器以冲刺方式工作，从规格中一次选择一个功能。每个冲刺使用 React、Vite、FastAPI 和 SQLite（后来是 PostgreSQL）堆栈实现应用程序，生成器被指示在每个冲刺结束时自我评估其工作，然后交给 QA。它还有 git 用于版本控制。

评估器： 早期框架的应用程序通常看起来令人印象深刻，但当你实际尝试使用它们时仍然有真正的错误。为了捕获这些错误，评估器使用 Playwright MCP 像用户一样点击运行中的应用程序，测试 UI 功能、API 端点和数据库状态。然后，它根据发现的错误和一套标准对每个冲刺进行评分，这套标准以前端实验为模型，在这里适应涵盖产品深度、功能性、视觉设计和代码质量。每个标准都有一个硬阈值，如果任何一个低于它，冲刺就会失败，生成器会得到关于出了什么问题的详细反馈。

在每个冲刺之前，生成器和评估器协商一个冲刺合同：在编写任何代码之前就该工作块的"完成"标准达成一致。这样做是因为产品规格是有意保持高层次的，我想要一个步骤来弥合用户故事和可测试实现之间的差距。生成器提出它将构建什么以及如何验证成功，评估器审查该提案以确保生成器正在构建正确的东西。两者迭代直到达成一致。

通信通过文件处理：一个智能体会写一个文件，另一个智能体会读取它并在该文件内或用前一个智能体将读取的新文件进行响应。然后生成器根据商定的合同进行构建，然后将工作交给 QA。这使工作忠实于规格，而不会过早地过度指定实现。

运行框架

对于这个框架的第一个版本，我使用了 Claude Opus 4.5，针对完整框架和单智能体系统运行用户提示进行比较。我使用 Opus 4.5 是因为这是我开始这些实验时我们最好的编码模型。

我编写了以下提示来生成一个复古视频游戏制作器：

创建一个 2D 复古游戏制作器，功能包括关卡编辑器、精灵编辑器、实体行为和可玩测试模式。

下表显示了框架类型、运行时长和总成本。

框架	时长	成本
单智能体	20 分钟	$9
完整框架	6 小时	$200

框架的成本超过 20 倍，但输出质量的差异立即显现。

我期望的是一个界面，我可以在其中构建关卡及其组成部分（精灵、实体、瓦片布局），然后点击播放来实际玩关卡。我首先打开了单智能体运行的输出，初始应用程序似乎符合这些期望。

然而，当我点击浏览时，问题开始出现。布局浪费空间，固定高度的面板使大部分视口空着。工作流程很僵硬。尝试填充关卡会提示我首先创建精灵和实体，但 UI 中没有任何东西引导我进入该序列。更重要的是，实际的游戏是坏的。我的实体出现在屏幕上，但没有任何东西响应输入。深入代码发现，实体定义和游戏运行时之间的连接是断开的，没有表面迹象表明问题出在哪里。

评估完单智能体运行后，我将注意力转向框架运行。这次运行从相同的一句话提示开始，但规划器步骤将该提示扩展为分布在十个冲刺中的 16 个功能规格。它远远超出了单智能体运行尝试的范围。除了核心编辑器和播放模式外，规格还要求精灵动画系统、行为模板、音效和音乐、AI 辅助的精灵生成器和关卡设计器，以及带有可共享链接的游戏导出。我给了规划器访问我们前端设计技能的权限，它阅读并使用它来为应用程序创建视觉设计语言作为规格的一部分。对于每个冲刺，生成器和评估器协商一个合同，定义冲刺的具体实现细节，以及将被测试以验证完成的可测试行为。

该应用程序立即显示出比单智能体运行更多的精致和流畅性。画布使用了完整的视口，面板大小合理，界面具有与规格中的设计方向一致的一致视觉身份。我在单智能体运行中看到的一些笨拙确实仍然存在——工作流程仍然没有明确表示你应该在尝试填充关卡之前构建精灵和实体，我不得不通过摸索来弄清楚这一点。这被解读为基础模型产品直觉的差距，而不是框架旨在解决的问题，尽管它确实表明了框架内有针对性的迭代可以进一步改善输出质量的地方。

浏览编辑器时，新运行相对于单智能体的优势变得更加明显。精灵编辑器更丰富、功能更全面，具有更清晰的工具调色板、更好的颜色选择器和更可用的缩放控件。

因为我要求规划器将 AI 功能融入其规格中，该应用程序还配备了内置的 Claude 集成，让我可以通过提示生成游戏的不同部分。这大大加快了工作流程。

最大的区别在于播放模式。我实际上能够移动我的实体并玩游戏。物理效果有一些粗糙的边缘——我的角色跳到平台上但最终与它重叠，这在直觉上感觉不对——但核心功能是有效的，而单智能体运行没有做到这一点。移动了一会儿后，我确实遇到了 AI 游戏关卡构建的一些限制。有一堵大墙我无法跳过，所以我被困住了。这表明框架可以处理一些常识性改进和边缘情况以进一步完善应用程序。

阅读日志，很明显评估器使实现与规格保持一致。每个冲刺，它都会遍历冲刺合同的测试标准，并通过 Playwright 执行运行中的应用程序，对任何偏离预期行为的内容提交错误。合同是细粒度的——仅 Sprint 3 就有 27 个涵盖关卡编辑器的标准——评估器的发现足够具体，可以在不进行额外调查的情况下采取行动。下表显示了我们的评估器识别的几个问题示例：

合同标准	评估器发现
矩形填充工具允许点击拖动以用选定的瓦片填充矩形区域	失败 — 工具仅在拖动开始/结束点放置瓦片，而不是填充区域。`fillRectangle` 函数存在但在 mouseUp 时未正确触发。
用户可以选择和删除放置的实体生成点	失败 — `LevelEditor.tsx:892` 的删除键处理程序需要同时设置 `selection` 和 `selectedEntityId`，但点击实体只设置 `selectedEntityId`。条件应该是 `selection
用户可以通过 API 重新排序动画帧	失败 — `PUT /frames/reorder` 路由在 `/{frame_id}` 路由之后定义。FastAPI 将 ‘reorder’ 匹配为 frame_id 整数并返回 422：“无法将字符串解析为整数。”

让评估器达到这个水平需要工作。开箱即用，Claude 是一个糟糕的 QA 智能体。在早期运行中，我看到它识别出合法的问题，然后说服自己决定它们不是什么大问题并批准工作。它还倾向于表面测试，而不是探测边缘情况，因此更微妙的错误经常漏掉。调整循环是阅读评估器的日志，找到其判断与我的判断不同的示例，并更新 QA 的提示以解决这些问题。经过几轮这样的开发循环，评估器才以我认为合理的方式进行评分。即便如此，框架输出显示了模型 QA 能力的局限性：小的布局问题、在某些地方感觉不直观的交互，以及评估器没有彻底执行的更深层嵌套功能中未发现的错误。显然还有更多的验证空间可以通过进一步调整来捕获。但与单智能体运行相比，应用程序的核心功能根本不起作用，提升是显而易见的。

迭代框架

第一组框架结果令人鼓舞，但它也很笨重、缓慢且昂贵。下一个合乎逻辑的步骤是找到简化框架而不降低其性能的方法。这部分是常识，部分是一个更普遍原则的功能：框架中的每个组件都编码了关于模型自身无法做什么的假设，这些假设值得压力测试，既因为它们可能不正确，也因为随着模型的改进它们可能很快过时。我们的博客文章《构建有效的智能体》将基本思想框定为"找到尽可能简单的解决方案，只有在需要时才增加复杂性”，这是任何维护智能体框架的人都会一致看到的模式。

在我第一次尝试简化时，我大幅削减了框架并尝试了一些创造性的新想法，但我无法复制原始框架的性能。也很难判断框架设计的哪些部分实际上是承重的，以及以什么方式。基于这一经验，我转向了一种更有条理的方法，一次删除一个组件并审查它对最终结果的影响。

当我经历这些迭代周期时，我们还发布了 Opus 4.6，这为减少框架复杂性提供了进一步的动力。有充分的理由期望 4.6 需要比 4.5 更少的脚手架。从我们的发布博客："[Opus 4.6] 计划更仔细，更长时间地维持智能体任务，可以在更大的代码库中更可靠地运行，并具有更好的代码审查和调试技能来捕获自己的错误。“它在长上下文检索方面也有了实质性改进。这些都是框架旨在补充的能力。

移除冲刺结构

我首先完全移除了冲刺结构。冲刺结构有助于将工作分解为块，以便模型能够连贯地工作。鉴于 Opus 4.6 的改进，有充分的理由相信模型可以在没有这种分解的情况下原生处理工作。

我保留了规划器和评估器，因为它们都继续增加明显的价值。没有规划器，生成器会缩小范围：给定原始提示，它会在没有首先规划其工作的情况下开始构建，最终创建的应用程序功能不如规划器丰富。

移除冲刺结构后，我将评估器移至运行结束时的单次通过，而不是每个冲刺评分。由于模型的能力大大增强，它改变了评估器对某些运行的承重程度，其有用性取决于任务相对于模型可以单独可靠完成的位置。在 4.5 上，该边界很近：我们的构建处于生成器单独可以做好的边缘，评估器在整个构建中捕获了有意义的问题。在 4.6 上，模型的原始能力增加了，因此边界向外移动。过去需要评估器检查才能连贯实现的任务现在通常在生成器单独处理良好的范围内，对于该边界内的任务，评估器成为不必要的开销。但对于仍处于生成器能力边缘的构建部分，评估器继续提供真正的提升。

实际含义是评估器不是一个固定的是或否决定。当任务超出当前模型单独可靠完成的范围时，它值得付出成本。

除了结构简化之外，我还添加了提示以改进框架如何将 AI 功能构建到每个应用程序中，特别是让生成器构建一个可以通过工具驱动应用程序自身功能的适当智能体。这需要真正的迭代，因为相关知识足够新，以至于 Claude 的训练数据覆盖得很少。但经过足够的调整，生成器正确地构建了智能体。

更新框架的结果

为了测试更新的框架，我使用以下提示生成了一个数字音频工作站（DAW），这是一个用于作曲、录音和混音歌曲的音乐制作程序：

使用 Web Audio API 在浏览器中构建一个功能齐全的 DAW。

运行仍然冗长且昂贵，大约 4 小时和 124 美元的令牌成本。大部分时间都花在了构建器上，它在没有 Opus 4.5 需要的冲刺分解的情况下连贯地运行了两个多小时。

智能体和阶段	时长	成本
规划器	4.7 分钟	$0.46
构建（第 1 轮）	2 小时 7 分钟	$71.08
QA（第 1 轮）	8.8 分钟	$3.24
构建（第 2 轮）	1 小时 2 分钟	$36.89
QA（第 2 轮）	6.8 分钟	$3.09
构建（第 3 轮）	10.9 分钟	$5.88
QA（第 3 轮）	9.6 分钟	$4.06
V2 框架总计	3 小时 50 分钟	$124.70

与之前的框架一样，规划器将一行提示扩展为完整的规格。从日志中，我可以看到生成器模型在规划应用程序和智能体设计、连接智能体以及在交给 QA 之前测试它方面做得很好。

话虽如此，QA 智能体仍然捕获了真正的差距。在其第一轮反馈中，它指出：

这是一个强大的应用程序，具有出色的设计保真度、可靠的 AI 智能体和良好的后端。主要失败点是功能完整性——虽然应用程序看起来令人印象深刻，AI 集成工作良好，但几个核心 DAW 功能只是显示而没有交互深度：片段无法在时间轴上拖动/移动，没有乐器 UI 面板（合成器旋钮、鼓垫），也没有视觉效果编辑器（EQ 曲线、压缩器仪表）。这些不是边缘情况——它们是使 DAW 可用的核心交互，规格明确要求它们。

在其第二轮反馈中，它再次捕获了几个功能差距：

剩余差距：

音频录制仍然只是存根（按钮切换但没有麦克风捕获）

通过边缘拖动调整片段大小和片段分割未实现

效果可视化是数字滑块，而不是图形（没有 EQ 曲线）

生成器在自行处理时仍然容易遗漏细节或存根功能，QA 在捕获这些最后一英里问题以供生成器修复方面仍然增加了价值。

根据提示，我期望的是一个程序，我可以在其中创建旋律、和声和鼓模式，将它们编排成一首歌曲，并在此过程中从集成的智能体获得帮助。下面的视频显示了结果。

该应用程序远非专业的音乐制作程序，智能体的歌曲创作技能显然还需要大量工作。此外，Claude 实际上听不到声音，这使得 QA 反馈循环在音乐品味方面效果较差。

但最终的应用程序具有功能性音乐制作程序的所有核心部分：在浏览器中运行的工作编排视图、混音器和传输。除此之外，我能够完全通过提示组合一个简短的歌曲片段：智能体设置了速度和调性，铺设了旋律，构建了鼓轨道，调整了混音器电平，并添加了混响。歌曲创作的核心原语都存在，智能体可以自主驱动它们，使用工具从头到尾创建一个简单的作品。你可能会说它还不够完美——但它正在接近。

接下来是什么

随着模型的不断改进，我们可以大致预期它们能够工作更长时间，并处理更复杂的任务。在某些情况下，这意味着围绕模型的脚手架随着时间的推移变得不那么重要，开发人员可以等待下一个模型并看到某些问题自行解决。另一方面，模型越好，就有越多的空间来开发能够完成超出模型基线能力的复杂任务的框架。

考虑到这一点，这项工作中有几个值得继续发扬的经验教训。实验你正在构建的模型、阅读其在现实问题上的跟踪并调整其性能以实现你期望的结果始终是良好的实践。在处理更复杂的任务时，有时可以通过分解任务并将专门的智能体应用于问题的每个方面来获得改进空间。当新模型发布时，重新审查框架通常是良好的实践，剥离不再对性能承重的部分，并添加新部分以实现以前可能无法实现的更大能力。

从这项工作中，我的信念是，随着模型的改进，有趣的框架组合空间不会缩小。相反，它会移动，AI 工程师的有趣工作是不断寻找下一个新颖的组合。

致谢

特别感谢 Alex Albert、Erik Schluntz、Mike Krieger 和 Zack Witten 对这项工作的贡献和反馈。

5 Agent Skill Design Patterns Every ADK Developer Should Know

Wed, 18 Mar 2026 00:00:00 +0000

Source: Google Cloud Tech on X
Authors: @Saboo_Shubham_ and @lavinigam

When it comes to SKILL.md, developers tend to fixate on the format—getting the YAML right, structuring directories, and following the spec. But with more than 30 agent tools (like Claude Code, Gemini CLI, and Cursor) standardizing on the same layout, the formatting problem is practically obsolete.

The challenge now is content design. The specification explains how to package a skill, but offers zero guidance on how to structure the logic inside it. For example, a skill that wraps FastAPI conventions operates completely differently from a four-step documentation pipeline, even though their SKILL.md files look identical on the outside.

By studying how skills are built across the ecosystem—from Anthropic’s repositories to Vercel and Google’s internal guidelines—there are five recurring design patterns that can help developers build agents.

This article covers each one with working ADK code:

Tool Wrapper: Make your agent an instant expert on any library
Generator: Produce structured documents from a reusable template
Reviewer: Score code against a checklist by severity
Inversion: The agent interviews you before acting
Pipeline: Enforce a strict multi-step workflow with checkpoints

Pattern 1: The Tool Wrapper

A Tool Wrapper gives your agent on-demand context for a specific library. Instead of hardcoding API conventions into your system prompt, you package them into a skill. Your agent only loads this context when it actually works with that technology.

# skills/api-expert/SKILL.md
---
name: api-expert
description: FastAPI development best practices and conventions. Use when building, reviewing, or debugging FastAPI applications, REST APIs, or Pydantic models.
metadata:
 pattern: tool-wrapper
 domain: fastapi
---

You are an expert in FastAPI development. Apply these conventions to the user's code or question.

## Core Conventions

Load 'references/conventions.md' for the complete list of FastAPI best practices.

## When Reviewing Code
1. Load the conventions reference
2. Check the user's code against each convention
3. For each violation, cite the specific rule and suggest the fix

## When Writing Code
1. Load the conventions reference
2. Follow every convention exactly
3. Add type annotations to all function signatures
4. Use Annotated style for dependency injection
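
To illustrate the "only loads this context when needed" idea, here is a minimal sketch in plain Python. It is not ADK's loading mechanism: `split_skill`, `build_context`, and the keyword-overlap relevance check are hypothetical stand-ins, and in a real agent the model itself typically decides when to load a skill.

```python
import re

# Trimmed copy of the skill above; the nested metadata block is omitted to keep the parser tiny.
SKILL_MD = """---
name: api-expert
description: FastAPI development best practices and conventions. Use when building, reviewing, or debugging FastAPI applications, REST APIs, or Pydantic models.
---

You are an expert in FastAPI development. Apply these conventions to the user's code or question.
"""

def split_skill(text):
    """Split a SKILL.md string into (frontmatter dict, instruction body)."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

def words(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_context(user_message, skill_text):
    """Always expose the short description; add the full body only when relevant."""
    meta, body = split_skill(skill_text)
    context = [f"{meta['name']}: {meta['description']}"]
    # Hypothetical relevance check: a longer description word appears in the message.
    keywords = {w for w in words(meta["description"]) if len(w) >= 6}
    if keywords & words(user_message):
        context.append(body)
    return context

print(len(build_context("Why does my FastAPI route return 422?", SKILL_MD)))  # 2
print(len(build_context("Write a haiku about autumn", SKILL_MD)))             # 1
```

The point of the split is that the one-line description is cheap to keep in every prompt, while the instruction body (and any reference files it points to) is paid for only on relevant requests.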

Pattern 2: The Generator

While the Tool Wrapper applies knowledge, the Generator enforces consistent output. If you struggle with an agent generating different document structures on every run, the Generator solves this by orchestrating a fill-in-the-blank process.

It leverages two optional directories: assets/ holds your output template, and references/ holds your style guide. The instructions act as a project manager. They tell the agent to load the template, read the style guide, ask the user for missing variables, and populate the document. This is practical for generating predictable API documentation, standardizing commit messages, or scaffolding project architectures.

In this technical report generator example, the skill file does not contain the actual layout or the grammar rules. It simply coordinates the retrieval of those assets and forces the agent to execute them step by step:

# skills/report-generator/SKILL.md
---
name: report-generator
description: Generates structured technical reports in Markdown. Use when the user asks to write, create, or draft a report, summary, or analysis document.
metadata:
 pattern: generator
 output-format: markdown
---

You are a technical report generator. Follow these steps exactly:

Step 1: Load 'references/style-guide.md' for tone and formatting rules.

Step 2: Load 'assets/report-template.md' for the required output structure.

Step 3: Ask the user for any missing information needed to fill the template:
- Topic or subject
- Key findings or data points
- Target audience (technical, executive, general)

Step 4: Fill the template following the style guide rules. Every section in the template must be present in the output.

Step 5: Return the completed report as a single Markdown document.
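
The fill-in-the-blank behaviour of Steps 3 and 4 can be sketched in plain Python. This is only an analogy for what the skill asks the agent to do: the template text, `REQUIRED_FIELDS`, `missing_fields`, and `render_report` are all assumptions made for illustration, not contents of the real `assets/report-template.md`.

```python
from string import Template

# Hypothetical stand-in for assets/report-template.md
REPORT_TEMPLATE = Template(
    "# $topic\n\n"
    "## Audience\n$audience\n\n"
    "## Key Findings\n$findings\n"
)
REQUIRED_FIELDS = ("topic", "audience", "findings")

def missing_fields(answers):
    """Step 3: list the variables the user still has to supply."""
    return [field for field in REQUIRED_FIELDS if not answers.get(field)]

def render_report(answers):
    """Step 4: fill the template, refusing to guess at missing values."""
    missing = missing_fields(answers)
    if missing:
        raise ValueError("ask the user for: " + ", ".join(missing))
    return REPORT_TEMPLATE.substitute(answers)

answers = {"topic": "Cache latency"}
print(missing_fields(answers))  # ['audience', 'findings']

answers.update(audience="technical", findings="p99 latency dropped after the change.")
print(render_report(answers))
```

Because the section headings live in the template rather than in the answers, every run produces the same document structure regardless of what the user supplies.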

Pattern 3: The Reviewer

The Reviewer pattern separates what to check from how to check it. Rather than writing a long system prompt detailing every code smell, you store a modular rubric inside a references/review-checklist.md file.

When a user submits code, the agent loads this checklist and methodically scores the submission, grouping its findings by severity. If you swap out a Python style checklist for an OWASP security checklist, you get a completely different, specialized audit using the exact same skill infrastructure. It is a highly effective way to automate PR reviews or catch vulnerabilities before a human looks at the code.

The following code reviewer skill demonstrates this separation. The instructions remain static, but the agent dynamically loads the specific review criteria from an external checklist and forces a structured, severity-based output:

# skills/code-reviewer/SKILL.md
---
name: code-reviewer
description: Reviews Python code for quality, style, and common bugs. Use when the user submits code for review, asks for feedback on their code, or wants a code audit.
metadata:
 pattern: reviewer
 severity-levels: error,warning,info
---

You are a Python code reviewer. Follow this review protocol exactly:

Step 1: Load 'references/review-checklist.md' for the complete review criteria.

Step 2: Read the user's code carefully. Understand its purpose before critiquing.

Step 3: Apply each rule from the checklist to the code. For every violation found:
- Note the line number (or approximate location)
- Classify severity: error (must fix), warning (should fix), info (consider)
- Explain WHY it's a problem, not just WHAT is wrong
- Suggest a specific fix with corrected code

Step 4: Produce a structured review with these sections:
- **Summary**: What the code does, overall quality assessment
- **Findings**: Grouped by severity (errors first, then warnings, then info)
- **Score**: Rate 1-10 with brief justification
- **Top 3 Recommendations**: The most impactful improvements
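
The severity-grouped "Findings" section requested in Step 4 has a simple shape, sketched below in plain Python. The `Finding` class, `format_findings`, and the sample findings are hypothetical; the skill itself leaves the formatting to the agent.

```python
from dataclasses import dataclass

# Order taken from the skill's metadata: errors first, then warnings, then info.
SEVERITY_ORDER = ("error", "warning", "info")

@dataclass
class Finding:
    line: int
    severity: str
    message: str

def format_findings(findings):
    """Group findings by severity, most severe first, sorted by line within a group."""
    lines = []
    for severity in SEVERITY_ORDER:
        group = sorted((f for f in findings if f.severity == severity), key=lambda f: f.line)
        if not group:
            continue
        lines.append(f"{severity.upper()} ({len(group)})")
        lines.extend(f"  line {f.line}: {f.message}" for f in group)
    return "\n".join(lines)

findings = [
    Finding(40, "info", "consider a more descriptive name"),
    Finding(12, "error", "mutable default argument"),
    Finding(27, "warning", "bare except hides failures"),
    Finding(3, "warning", "unused import"),
]
print(format_findings(findings))
```

Swapping the checklist changes which findings are produced, but a fixed output shape like this one stays the same across audits.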

Pattern 4: Inversion

Agents inherently want to guess and generate immediately. The Inversion pattern flips this dynamic. Instead of the user driving the prompt and the agent executing, the agent acts as an interviewer.

Inversion relies on explicit, non-negotiable gating instructions (like “DO NOT start building until all phases are complete”) to force the agent to gather context first. It asks structured questions sequentially and waits for your answers before moving to the next phase. The agent refuses to synthesize a final output until it has a complete picture of your requirements and deployment constraints.

To see this in action, look at this project planner skill. The crucial element here is the strict phasing and the explicit gatekeeping prompt that stops the agent from synthesizing the final plan until all user answers are collected:

# skills/project-planner/SKILL.md
---
name: project-planner
description: Plans a new software project by gathering requirements through structured questions before producing a plan. Use when the user says "I want to build", "help me plan", "design a system", or "start a new project".
metadata:
 pattern: inversion
 interaction: multi-turn
---

You are conducting a structured requirements interview. DO NOT start building or designing until all phases are complete.

## Phase 1 — Problem Discovery (ask one question at a time, wait for each answer)

Ask these questions in order. Do not skip any.

- Q1: "What problem does this project solve for its users?"
- Q2: "Who are the primary users? What is their technical level?"
- Q3: "What is the expected scale? (users per day, data volume, request rate)"

## Phase 2 — Technical Constraints (only after Phase 1 is fully answered)

- Q4: "What deployment environment will you use?"
- Q5: "Do you have any technology stack requirements or preferences?"
- Q6: "What are the non-negotiable requirements? (latency, uptime, compliance, budget)"

## Phase 3 — Synthesis (only after all questions are answered)

1. Load 'assets/plan-template.md' for the output format
2. Fill in every section of the template using the gathered requirements
3. Present the completed plan to the user
4. Ask: "Does this plan accurately capture your requirements? What would you change?"
5. Iterate on feedback until the user confirms
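
The gating logic in this skill amounts to a small state machine: keep asking until every question has an answer, and only then synthesize. The sketch below expresses that in plain Python; `PHASES` mirrors the question IDs above, while `next_action` is a hypothetical helper, since in the skill the gate is enforced by the prompt rather than by code.

```python
# Question IDs follow the skill above; phases must be completed in order.
PHASES = [
    ("Problem Discovery", ["Q1", "Q2", "Q3"]),
    ("Technical Constraints", ["Q4", "Q5", "Q6"]),
]

def next_action(answers):
    """Return ('ask', question_id) until all questions are answered, then ('synthesize', None)."""
    for _phase, questions in PHASES:
        for question_id in questions:
            if question_id not in answers:
                return ("ask", question_id)
    return ("synthesize", None)

answers = {}
print(next_action(answers))  # ('ask', 'Q1')

answers.update(Q1="slow onboarding", Q2="support staff", Q3="about 200 users per day")
print(next_action(answers))  # ('ask', 'Q4')

answers.update(Q4="managed cloud", Q5="Python", Q6="EU data residency")
print(next_action(answers))  # ('synthesize', None)
```

Synthesis is unreachable while any question is open, which is the property the "DO NOT start building" instruction is trying to get from the model.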

Pattern 5: The Pipeline

For complex tasks, you cannot afford skipped steps or ignored instructions. The Pipeline pattern enforces a strict, sequential workflow with hard checkpoints.

The instructions themselves serve as the workflow definition. By implementing explicit diamond gate conditions (such as requiring user approval before moving from docstring generation to final assembly), the Pipeline ensures an agent cannot bypass a complex task and present an unvalidated final result.

This pattern utilizes all optional directories, pulling in different reference files and templates only at the specific step where they are needed, keeping the context window clean.

In this documentation pipeline example, notice the explicit gate conditions. The agent is explicitly forbidden from moving to the assembly phase until the user confirms the generated docstrings in the previous step:

# skills/doc-pipeline/SKILL.md
---
name: doc-pipeline
description: Generates API documentation from Python source code through a multi-step pipeline. Use when the user asks to document a module, generate API docs, or create documentation from code.
metadata:
 pattern: pipeline
 steps: "4"
---

You are running a documentation generation pipeline. Execute each step in order. Do NOT skip steps or proceed if a step fails.

## Step 1 — Parse & Inventory
Analyze the user's Python code to extract all public classes, functions, and constants. Present the inventory as a checklist. Ask: "Is this the complete public API you want documented?"

## Step 2 — Generate Docstrings
For each function lacking a docstring:
- Load 'references/docstring-style.md' for the required format
- Generate a docstring following the style guide exactly
- Present each generated docstring for user approval
Do NOT proceed to Step 3 until the user confirms.

## Step 3 — Assemble Documentation
Load 'assets/api-doc-template.md' for the output structure. Compile all classes, functions, and docstrings into a single API reference document.

## Step 4 — Quality Check
Review against 'references/quality-checklist.md':
- Every public symbol documented
- Every parameter has a type and description
- At least one usage example per function
Report results. Fix issues before presenting the final document.
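
The hard checkpoints in this skill can be pictured as a loop that refuses to advance past a gate without approval. The sketch below is plain Python rather than ADK code. The `STEPS` table, the `run_pipeline` function, and the `approve` callback are hypothetical names made up for this illustration, and treating the Step 1 question as a gate is an assumption.

```python
from typing import Callable, List, Tuple

# (step name, whether the user must approve before the next step runs)
# Treating step 1's confirmation question as a gate is an assumption.
STEPS: List[Tuple[str, bool]] = [
    ("parse_inventory", True),
    ("generate_docstrings", True),
    ("assemble_documentation", False),
    ("quality_check", False),
]


def run_pipeline(approve: Callable[[str], bool]) -> List[str]:
    """Execute steps in order and return the names of the steps that ran.

    `approve` stands in for the user: it receives a step name and
    returns True to let the pipeline continue past that step's gate.
    """
    completed = []
    for name, needs_approval in STEPS:
        completed.append(name)  # the step's real work would happen here
        if needs_approval and not approve(name):
            break  # hard checkpoint: later steps never run
    return completed


# The user confirms the inventory but rejects the docstrings.
ran = run_pipeline(lambda step: step == "parse_inventory")
print(ran)  # ['parse_inventory', 'generate_docstrings']
```

Because assembly sits behind the docstring gate, a rejected docstring review means the final document is never produced.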

Choosing the right agent skill pattern

Each pattern answers a different question. Use this decision tree to find the right one for your use-case:

And finally, patterns compose

These patterns are not mutually exclusive. They compose.

A Pipeline skill can include a Reviewer step at the end to double-check its own work. A Generator can rely on Inversion at the very beginning to gather the necessary variables before filling out its template. Thanks to ADK’s SkillToolset and progressive disclosure, your agent only spends context tokens on the exact patterns it needs at runtime.
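
As a rough illustration of composition, the sketch below chains an Inversion-style question phase into a Generator-style template fill. It is plain Python with hypothetical function names (`gather_answers`, `fill_template`, `plan_project`) and a made-up two-question template. It does not use ADK or its SkillToolset.

```python
def gather_answers(questions, ask):
    """Inversion: collect an answer for every question before generating."""
    return {key: ask(prompt) for key, prompt in questions.items()}


def fill_template(template, values):
    """Generator: produce output from a fixed template."""
    return template.format(**values)


def plan_project(ask):
    """Composition: Inversion runs first and feeds the Generator."""
    questions = {
        "problem": "What problem does this project solve?",
        "users": "Who are the primary users?",
    }
    template = "Problem: {problem}\nUsers: {users}"
    return fill_template(template, gather_answers(questions, ask))


# Canned answers stand in for a real conversation.
canned = {
    "What problem does this project solve?": "Slow report writing",
    "Who are the primary users?": "Data analysts",
}
print(plan_project(canned.__getitem__))
```

The generator never sees a half-filled set of values, because the only path to it runs through the question phase.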

Stop trying to cram complex and fragile instructions into a single system prompt. Break your workflows down, apply the right structural pattern, and build reliable agents.

Get started today

The Agent Skills specification is open-source and natively supported across ADK. You already know how to package the format. Now you know how to design the content. Go build smarter agents with Google Agent Development Kit.

5 Agent Skill Design Patterns Every ADK Developer Should Know

Wed, 18 Mar 2026 00:00:00 +0000

Source: Google Cloud Tech on X
Original authors: @Saboo_Shubham_ and @lavinigam

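When it comes to SKILL.md, developers tend to fixate on the format: getting the YAML right, structuring directories, and following the spec. But with more than 30 agent tools (such as Claude Code, Gemini CLI, and Cursor) now using the same layout, the formatting problem is practically solved.

The challenge now is content design. The specification explains how to package a skill, but offers no guidance on how to structure the logic inside it. For example, a skill that wraps FastAPI conventions operates completely differently from a four-step documentation pipeline, even though their SKILL.md files look identical.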
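
To make the "identical packaging, different logic" point concrete, here is a minimal sketch that reads the frontmatter of two skills. It is plain Python, and `parse_frontmatter` is a hypothetical helper written for this example. It only handles the simple `key: value` layout used in this article and is not a YAML parser. The two skill bodies are shortened stand-ins.

```python
def parse_frontmatter(text):
    """Split a SKILL.md string into (metadata dict, instruction body).

    Supports flat `key: value` pairs plus one nested block such as `metadata:`.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    meta, body_start, parent = {}, None, None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body_start = i + 1
            break
        if not line.strip():
            continue  # ignore blank lines inside the frontmatter
        key, _, value = line.strip().partition(":")
        value = value.strip().strip('"')
        if line.startswith(" ") and parent is not None:
            meta[parent][key] = value  # entry nested under the parent key
        elif value == "":
            parent = key
            meta[key] = {}  # start of a nested block
        else:
            parent = None
            meta[key] = value
    if body_start is None:
        raise ValueError("missing closing '---'")
    return meta, "\n".join(lines[body_start:]).strip()


API_EXPERT = """---
name: api-expert
description: FastAPI best practices.
metadata:
 pattern: tool-wrapper
 domain: fastapi
---
You are an expert in FastAPI development.
"""

DOC_PIPELINE = """---
name: doc-pipeline
description: Generates API docs through a multi-step pipeline.
metadata:
 pattern: pipeline
 steps: "4"
---
You are running a documentation generation pipeline.
"""

for skill in (API_EXPERT, DOC_PIPELINE):
    meta, body = parse_frontmatter(skill)
    print(meta["name"], "->", meta["metadata"]["pattern"])
# api-expert -> tool-wrapper
# doc-pipeline -> pipeline
```

Both files parse to the same top-level keys. Everything that distinguishes the two skills lives in the instruction body.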

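A study of how skills are built across the ecosystem, from Anthropic's repositories to Vercel and Google's internal guidelines, turns up five recurring design patterns that help developers build agents.

This article walks through each one with working ADK code:

Tool Wrapper: make your agent an instant expert on any library
Generator: produce structured documents from a reusable template
Reviewer: score code against a checklist, by severity
Inversion: the agent interviews the user before acting
Pipeline: enforce a strict multi-step workflow with checkpoints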

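Pattern 1: The Tool Wrapper

A Tool Wrapper gives your agent on-demand context for a specific library.

Instead of hardcoding API conventions into your system prompt, you package them into a skill. The agent loads this context only when it actually works with that technology.

It is the simplest pattern to implement. The SKILL.md file listens for specific library keywords in the user's prompt, dynamically loads your internal documentation from the references/ directory, and applies those rules as absolute truth. This is the mechanism for distributing your team's internal coding guidelines or framework-specific best practices directly into your developers' workflows.

Here is an example of a Tool Wrapper that teaches an agent how to write FastAPI code. Notice how the instructions explicitly tell the agent to load the conventions.md file only when it starts reviewing or writing code: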

# skills/api-expert/SKILL.md
---
name: api-expert
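description: FastAPI development best practices and conventions. Use when building, reviewing, or debugging FastAPI applications, REST APIs, or Pydantic models.
metadata:
 pattern: tool-wrapper
 domain: fastapi
---

You are an expert in FastAPI development. Apply these conventions to the user's code or question.

## Core Conventions

Load 'references/conventions.md' for the complete list of FastAPI best practices.

## When Reviewing Code
1. Load the conventions reference
2. Check the user's code against each convention
3. For each violation, cite the specific rule and suggest the fix

## When Writing Code
1. Load the conventions reference
2. Follow every convention exactly
3. Add type annotations to all function signatures
4. Use Annotated style for dependency injection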
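
The on-demand loading behind this pattern can be sketched in a few lines. This is a plain-Python illustration, not how ADK implements it. The `REFERENCES` dict is an in-memory stand-in for the skill's references/ directory, and `TRIGGER_KEYWORDS` and `build_context` are hypothetical names invented for this example.

```python
# In-memory stand-in for the skill's references/ directory (hypothetical content).
REFERENCES = {
    "references/conventions.md": (
        "- Use Annotated for dependency injection\n"
        "- Type-annotate every function signature"
    ),
}

# Keywords that signal the request is about this library (an assumption).
TRIGGER_KEYWORDS = ("fastapi", "pydantic", "rest api")


def build_context(prompt, loaded):
    """Return extra context for the prompt, loading conventions only on demand."""
    if not any(word in prompt.lower() for word in TRIGGER_KEYWORDS):
        return ""  # unrelated request: spend no tokens on the conventions
    path = "references/conventions.md"
    if path not in loaded:
        loaded[path] = REFERENCES[path]  # a real skill would read the file here
    return loaded[path]


cache = {}
print(repr(build_context("Summarize this meeting", cache)))  # ''
print(len(cache))  # 0
build_context("Review my FastAPI endpoint", cache)
print(list(cache))  # ['references/conventions.md']
```

The conventions enter the context only once a matching request arrives, which is the point of packaging them as a skill instead of pasting them into the system prompt.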

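Pattern 2: The Generator

While the Tool Wrapper applies knowledge, the Generator enforces consistent output. If you struggle with an agent that produces a different document structure on every run, the Generator solves the problem by orchestrating a fill-in-the-blank process.

It uses two optional directories: assets/ holds your output template, and references/ holds your style guide. The instructions act as a project manager: they tell the agent to load the template, read the style guide, ask the user for missing variables, and then populate the document. This is practical for generating predictable API documentation, standardizing commit messages, or scaffolding project architectures.

In this technical report generator example, the skill file contains neither the actual layout nor the grammar rules. It simply coordinates the retrieval of those assets and forces the agent to execute step by step: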

# skills/report-generator/SKILL.md
---
name: report-generator
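description: Generates structured technical reports in Markdown. Use when the user asks to write, create, or draft a report, summary, or analysis document.
metadata:
 pattern: generator
 output-format: markdown
---

You are a technical report generator. Follow these steps exactly:

Step 1: Load 'references/style-guide.md' for tone and formatting rules.

Step 2: Load 'assets/report-template.md' for the required output structure.

Step 3: Ask the user for any missing information needed to fill the template:
- Topic or subject
- Key findings or data points
- Target audience (technical, executive, general)

Step 4: Fill the template following the style guide rules. Every section in the template must be present in the output.

Step 5: Return the completed report as a single Markdown document.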
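
The fill-in-the-blank process in steps 3 to 5 can be sketched with the standard library's `string.Template`. This is an illustration under assumptions: the template text stands in for assets/report-template.md, whose real contents are not shown in this article, and `missing_fields` and `render_report` are hypothetical helper names.

```python
from string import Template

# Stand-in for assets/report-template.md (hypothetical content).
REPORT_TEMPLATE = Template(
    "# $topic\n\n"
    "Audience: $audience\n\n"
    "## Key findings\n$findings\n"
)

REQUIRED = ("topic", "findings", "audience")


def missing_fields(values):
    """Step 3: list the variables the agent still has to ask the user for."""
    return [name for name in REQUIRED if not values.get(name)]


def render_report(values):
    """Steps 4 and 5: fill the template once every variable is known."""
    missing = missing_fields(values)
    if missing:
        raise ValueError(f"ask the user for: {', '.join(missing)}")
    return REPORT_TEMPLATE.substitute(values)


draft = {"topic": "Cache latency", "findings": "- finding one"}
print(missing_fields(draft))  # ['audience']
draft["audience"] = "technical"
print(render_report(draft))
```

Because the structure lives in the template and not in the model's output, every report comes back with the same sections in the same order.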

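Pattern 3: The Reviewer

The Reviewer pattern separates what to check from how to check it. Rather than writing a long system prompt that details every code smell, you store a modular rubric in a references/review-checklist.md file.

When a user submits code, the agent loads this checklist, methodically scores the submission, and groups its findings by severity. Swap the Python style checklist for an OWASP security checklist and you get a specialized audit that runs on exactly the same skill infrastructure. It is an effective way to automate PR reviews or catch vulnerabilities before a human looks at the code.

The following code reviewer skill demonstrates this separation. The instructions stay static, but the agent dynamically loads the specific review criteria from an external checklist and is forced to produce structured, severity-based output: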

# skills/code-reviewer/SKILL.md
---
name: code-reviewer
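description: Reviews Python code for quality, style, and common bugs. Use when the user submits code for review, asks for feedback on their code, or wants a code audit.
metadata:
 pattern: reviewer
 severity-levels: error,warning,info
---

You are a Python code reviewer. Follow this review protocol exactly:

Step 1: Load 'references/review-checklist.md' for the complete review criteria.

Step 2: Read the user's code carefully. Understand its purpose before critiquing.

Step 3: Apply each rule from the checklist to the code. For every violation found:
- Note the line number (or approximate location)
- Classify severity: error (must fix), warning (should fix), info (consider)
- Explain why it is a problem, not just what is wrong
- Suggest a specific fix with corrected code

Step 4: Produce a structured review with these sections:
- **Summary**: What the code does, overall quality assessment
- **Findings**: Grouped by severity (errors first, then warnings, then info)
- **Score**: Rate 1-10 with a brief justification
- **Top 3 Recommendations**: The most impactful improvements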
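
The severity-grouped output requested in step 4 maps onto a small data structure. The sketch below is a plain-Python illustration. The `Finding` class, `group_findings`, and the sample findings are hypothetical, and the severity names are taken from the skill's `severity-levels` metadata.

```python
from dataclasses import dataclass

# Display order for the three levels named in the skill metadata.
SEVERITY_ORDER = ("error", "warning", "info")


@dataclass
class Finding:
    line: int
    severity: str
    message: str


def group_findings(findings):
    """Group findings by severity, errors first, sorted by line within a group."""
    grouped = {level: [] for level in SEVERITY_ORDER}
    for finding in sorted(findings, key=lambda f: f.line):
        grouped[finding.severity].append(finding)  # KeyError on unknown levels
    return grouped


findings = [
    Finding(12, "warning", "mutable default argument"),
    Finding(3, "error", "bare except hides failures"),
    Finding(20, "info", "consider a list comprehension"),
]
for level, items in group_findings(findings).items():
    print(level, [f.line for f in items])
# error [3]
# warning [12]
# info [20]
```

Swapping the checklist changes which findings are produced, while this grouping stays the same.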

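Pattern 4: Inversion

Agents inherently want to guess and generate immediately. The Inversion pattern flips this dynamic. Instead of the user driving the prompt and the agent executing, the agent acts as an interviewer.

Inversion relies on explicit, non-negotiable gating instructions (such as "DO NOT start building until all phases are complete") to force the agent to gather context first. It asks structured questions in sequence and waits for your answer before moving to the next phase. The agent refuses to synthesize a final output until it has a complete picture of your requirements and deployment constraints.

To see this in action, look at this project planner skill. The crucial elements are the strict phasing and the explicit gating prompts that stop the agent from synthesizing the final plan until all user answers are collected: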

# skills/project-planner/SKILL.md
---
name: project-planner
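description: Plans a new software project by gathering requirements through structured questions before producing a plan. Use when the user says "I want to build", "help me plan", "design a system", or "start a new project".
metadata:
 pattern: inversion
 interaction: multi-turn
---

You are conducting a structured requirements interview. DO NOT start building or designing until all phases are complete.

## Phase 1 — Problem Discovery (ask one question at a time, wait for each answer)

Ask these questions in order. Do not skip any.

- Q1: "What problem does this project solve for its users?"
- Q2: "Who are the primary users? What is their technical level?"
- Q3: "What is the expected scale? (users per day, data volume, request rate)"

## Phase 2 — Technical Constraints (only after Phase 1 is fully answered)

- Q4: "What deployment environment will you use?"
- Q5: "Do you have any technology stack requirements or preferences?"
- Q6: "What are the non-negotiable requirements? (latency, uptime, compliance, budget)"

## Phase 3 — Synthesis (only after all questions are answered)

1. Load 'assets/plan-template.md' for the output format
2. Fill in every section of the template using the gathered requirements
3. Present the completed plan to the user
4. Ask: "Does this plan accurately capture your requirements? What would you change?"
5. Iterate on feedback until the user confirms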
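
The phase gating in this skill behaves like a small state machine. The sketch below is a hypothetical illustration in plain Python, not ADK code: the `Interview` class and its method names are assumptions made for this example. It releases the synthesis step only once every question in every phase has an answer.

```python
from dataclasses import dataclass, field

# Questions grouped by phase, mirroring the skill above.
PHASES = {
    "discovery": ["Q1", "Q2", "Q3"],
    "constraints": ["Q4", "Q5", "Q6"],
}


@dataclass
class Interview:
    """Hypothetical helper that tracks answers and enforces question order."""

    answers: dict = field(default_factory=dict)

    def next_question(self):
        # Walk the phases in order and return the first unanswered question.
        for questions in PHASES.values():
            for question in questions:
                if question not in self.answers:
                    return question
        return None  # everything is answered

    def answer(self, question, text):
        # Refuse out-of-order answers so that no question can be skipped.
        expected = self.next_question()
        if question != expected:
            raise ValueError(f"expected {expected}, got {question}")
        self.answers[question] = text

    def can_synthesize(self):
        # The gate: synthesis is allowed only after all phases are complete.
        return self.next_question() is None


interview = Interview()
for q in ["Q1", "Q2", "Q3", "Q4", "Q5"]:
    interview.answer(q, "some answer")
print(interview.can_synthesize())  # False
interview.answer("Q6", "some answer")
print(interview.can_synthesize())  # True
```

In a real skill the model enforces the gate by following the instructions. The code only makes explicit the rule those instructions encode.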

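Pattern 5: The Pipeline

For complex tasks, you cannot afford skipped steps or ignored instructions. The Pipeline pattern enforces a strict, sequential workflow with hard checkpoints.

The instructions themselves serve as the workflow definition. By implementing explicit diamond gate conditions (such as requiring user approval before moving from docstring generation to final assembly), the Pipeline ensures an agent cannot bypass a complex task and present an unvalidated final result.

This pattern uses all the optional directories, pulling in different reference files and templates only at the specific step where they are needed, which keeps the context window clean.

In this documentation pipeline example, notice the explicit gate conditions. The agent is explicitly forbidden from moving to the assembly phase until the user confirms the docstrings generated in the previous step: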

# skills/doc-pipeline/SKILL.md
---
name: doc-pipeline
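description: Generates API documentation from Python source code through a multi-step pipeline. Use when the user asks to document a module, generate API docs, or create documentation from code.
metadata:
 pattern: pipeline
 steps: "4"
---

You are running a documentation generation pipeline. Execute each step in order. Do NOT skip steps or proceed if a step fails.

## Step 1 — Parse & Inventory

Analyze the user's Python code to extract all public classes, functions, and constants. Present the inventory as a checklist. Ask: "Is this the complete public API you want documented?"

## Step 2 — Generate Docstrings

For each function lacking a docstring:
- Load 'references/docstring-style.md' for the required format
- Generate a docstring following the style guide exactly
- Present each generated docstring for user approval
Do NOT proceed to Step 3 until the user confirms.

## Step 3 — Assemble Documentation

Load 'assets/api-doc-template.md' for the output structure. Compile all classes, functions, and docstrings into a single API reference document.

## Step 4 — Quality Check

Review against 'references/quality-checklist.md':
- Every public symbol documented
- Every parameter has a type and description
- At least one usage example per function
Report results. Fix issues before presenting the final document.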

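Choosing the right agent skill pattern

Each pattern answers a different question. Use the table below to find the right one for your use case:

Scenario	Recommended pattern
Give the agent knowledge of a specific library or framework	Tool Wrapper
Need consistent, structured document output	Generator
Code review or content audit	Reviewer
Gather requirements before acting	Inversion
Enforce a strict multi-step workflow	Pipeline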

And finally, patterns compose

These patterns are not mutually exclusive. They compose.

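A Pipeline skill can include a Reviewer step at the end to double-check its own work. A Generator can rely on Inversion at the very beginning to gather the necessary variables before filling out its template. Thanks to ADK's SkillToolset and progressive disclosure, your agent only spends context tokens on the exact patterns it needs at runtime.

Stop trying to cram complex and fragile instructions into a single system prompt. Break your workflows down, apply the right structural pattern, and build reliable agents.

Get started today

The Agent Skills specification is open-source and natively supported across ADK. You already know how to package the format. Now you know how to design the content. Go build smarter agents with Google Agent Development Kit.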