Bringing Agent Skills to Life: A Self-Evolving Skill System in Practice

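An engineering record of how to make an AI Agent's skill system "come alive". It builds on recent research such as SkillRL (arXiv 2602.08234) and is a lightweight approach deployed on a VPS with no GPU.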

TL;DR

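We built a complete lifecycle management system for Agent skills:

56 skills, each with structured principle / when_to_apply / common_mistakes fields
Semantic retrieval: embedding vector matching rather than keyword guessing
Failure loop: wrong skill used → logged automatically → down-weighted automatically → evolution triggered automatically
Reflection system: task ends → automatically decide whether reflection is needed → extract patterns → avoid the same pitfall next time
SkillGraph: DAG path search linked to SkillBank fields, with support for composition recommendations

Core code: ~4500 lines of Python + 160 lines of Bash, all built test-first (TDD), with 50+ tests passing.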

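1. The problem: Agent skills are "dead"

Most Agent frameworks treat their skill system as a read-only library:

User request → match skill → load into prompt → execute → done

There is no feedback loop. When a skill works, nothing records why; when it fails, nothing records why either. Skill quality depends entirely on manual maintenance.

Our goal is to close the loop:

User request → match skill → execute → success/failure → record outcome → update skill quality → better match next time

2. System architecture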

┌─────────────────────────────────────────────────────┐
│                  Hermes Agent                       │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │ Skill    │  │ Skill    │  │ Unified          │   │
│  │ Index    │  │ Graph    │  │ Reflection       │   │
│  │ (search) │  │ (compose)│  │ (reflect)        │   │
│  └─────┬────┘  └────┬─────┘  └────────┬─────────┘   │
│        │            │                 │             │
│        ▼            ▼                 ▼             │
│  ┌─────────────────────────────────────────────┐    │
│  │           skill-index.json                  │    │
│  │ (56 skills, 384d embeddings, quality score) │    │
│  └─────────────────────────────────────────────┘    │
│        │            │                 │             │
│        ▼            ▼                 ▼             │
│  ┌───────────┐ ┌──────────┐  ┌──────────────────┐   │
│  │.usage.json│ │failure_  │  │ compound.sh      │   │
│  │(usage     │ │log.jsonl │  │ (post-task       │   │
│  │ stats)    │ │          │  │  reflection)     │   │
│  └───────────┘ └──────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────┘

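Core components

Component	File	Responsibility
Skill Index	`skill_index.py` (887 lines)	Embedding-based semantic search + quality scoring + index building
Skill Graph	`skill_graph.py` (421 lines)	DAG path search + SkillBank field linkage
Skill Evolution	`skill_evolution.py` (925 lines)	Skill self-evolution toolkit (7 tools)
Skill Ranking	`skill_ranking.py` (198 lines)	Thompson Sampling skill selection
Skill Discovery	`skill_discovery.py` (655 lines)	N-gram trajectory analysis + automatic SKILL.md generation
Skill Usage	`skill_usage.py` (900 lines)	Usage statistics + record_outcome()
Unified Reflection	`unified_reflection.py` (582 lines)	Unified reflection module (event logging + pattern extraction + suggestion retrieval)
Compound System	`compound.sh` (161 lines)	Shell entry point for post-task reflection + auto-evolve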
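
The table names Thompson Sampling as the selection strategy in skill_ranking.py without showing it. As a rough illustration only, assuming each skill keeps success and failure counts and starts from a uniform Beta(1, 1) prior, selection could look like the sketch below. The function name `thompson_select` and the `stats` layout are hypothetical and not taken from the project.

```python
import random


def thompson_select(stats, rng=random):
    """Pick one skill by Thompson Sampling over Beta posteriors.

    stats maps skill name -> (successes, failures). This layout is an
    assumption for illustration, not the project's actual schema.
    """
    best_name, best_draw = None, -1.0
    for name, (successes, failures) in stats.items():
        # Beta(1 + s, 1 + f) is the posterior under a uniform prior.
        draw = rng.betavariate(1 + successes, 1 + failures)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name


# Made-up counts: one skill that mostly succeeds, one that mostly fails.
stats = {"code-review": (7, 3), "debugging-toolkit": (1, 9)}
rng = random.Random(0)
picks = [thompson_select(stats, rng) for _ in range(1000)]
share = picks.count("code-review") / len(picks)
```

Because each call draws from the posterior instead of taking the best observed rate, the weaker skill is still tried occasionally, which lets its estimate recover if it improves.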

3. Structuring SkillBank: from free text to structured fields

Inspired by the SkillRL paper, we added three structured fields to each skill:

{
  "name": "code-review",
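  "description": "Code review skill",
  "principle": "Understand the intent before reviewing the implementation; focus on correctness and maintainability",
  "when_to_apply": "When reviewing code, writing PRs, or checking quality",
  "common_mistakes": [
    "Checking syntax but not logic",
    "Ignoring edge cases",
    "Not checking test coverage"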
  ]
}

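These three fields are extracted automatically from the body of each SKILL.md:

def _extract_principle(body: str) -> str:
    """Extract the core method from Steps / How to / the first body paragraph."""
    # Prefer the first item under Steps
    # Fall back to the How to section
    # Finally, take the first paragraph of the body

def _extract_when_to_apply(body: str) -> str:
    """Extract from When to use / Triggers / When to load."""

def _extract_common_mistakes(body: str) -> list[str]:
    """Extract from Notes / Pitfalls / Caveats / Warnings."""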
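
The signatures above omit their bodies. One plausible way to implement heading-based extraction is sketched here, assuming SKILL.md uses markdown `#` headings and `-` or `*` bullets. The helpers `extract_section` and `extract_common_mistakes` are illustrative names, and the regex is an assumption, not the project's actual logic.

```python
import re


def extract_section(body, headings):
    """Return the text under the first heading in `headings` found in body, or ""."""
    for heading in headings:
        # Capture everything after the heading line up to the next heading or EOF.
        pattern = rf"^#+[ \t]*{re.escape(heading)}[ \t]*\n(.*?)(?=^#+[ \t]|\Z)"
        match = re.search(pattern, body, flags=re.MULTILINE | re.DOTALL | re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return ""


def extract_common_mistakes(body):
    """Collect bullet items from a Notes / Pitfalls / Caveats / Warnings section."""
    section = extract_section(body, ["Notes", "Pitfalls", "Caveats", "Warnings"])
    return [
        line.lstrip("-* ").strip()
        for line in section.splitlines()
        if line.strip().startswith(("-", "*"))
    ]


# A made-up SKILL.md body used only to exercise the helpers.
body = """# Code Review

## When to use
When reviewing code or writing PRs.

## Pitfalls
- Checking syntax but not logic
- Ignoring edge cases
"""
when_to_apply = extract_section(body, ["When to use", "Triggers", "When to load"])
mistakes = extract_common_mistakes(body)
```

A skill whose SKILL.md has none of the expected headings yields an empty result, which is consistent with extraction coverage being below 100%.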

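Extraction coverage: of the 56 skills, 44 have a principle (79%), 31 have common_mistakes (55%), and 23 have when_to_apply (41%).

Why this matters

SkillGraph linkage: suggest_composition() now matches on when_to_apply as well as provides
Richer recommendations: find_paths() results include each skill's principle, with common_mistakes surfaced as warnings
Foundation for RL training: these structured fields are the basis for reward signals in future GRPO training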
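
To make the path search and warning annotation concrete, here is a minimal sketch assuming the graph is an adjacency dict of skill names and the skill records carry the SkillBank fields. `dfs_paths` and `annotate` are hypothetical helpers, not the project's find_paths(), and the edges in the example are invented.

```python
def dfs_paths(graph, start, goal, path=None):
    """Enumerate every path from start to goal in a DAG via depth-first search."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        paths.extend(dfs_paths(graph, nxt, goal, path))
    return paths


def annotate(path, skills):
    """Attach each skill's principle, and its common_mistakes as warnings."""
    return [
        {
            "skill": name,
            "principle": skills.get(name, {}).get("principle", ""),
            "warnings": skills.get(name, {}).get("common_mistakes", []),
        }
        for name in path
    ]


# Invented edges: "A -> B" means skill B can follow skill A.
graph = {
    "debugging-toolkit": ["code-review", "server-operations"],
    "code-review": ["server-operations"],
}
skills = {"code-review": {"principle": "Understand intent first",
                          "common_mistakes": ["Ignoring edge cases"]}}
paths = dfs_paths(graph, "debugging-toolkit", "server-operations")
steps = annotate(paths[0], skills)
```

Plain recursion terminates here only because the graph is acyclic; a graph with cycles would need a visited set.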

4. Semantic retrieval: embeddings, not guesswork

skill_index.py uses paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions) for semantic search:

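# At index build time: generate one embedding per skill
text = f"{name} {description} {tags} {triggers} {body_preview}"
embedding = model.encode(text, normalize_embeddings=True)

# At search time: score the query embedding against all skill embeddings
scores = cosine_similarity(query_embedding, all_embeddings)
# Rank skill indices (not bare scores) so each hit can be mapped back to its skill;
# assumes scores is a flat sequence with one value per skill
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:limit]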

Fallback strategy: when sentence-transformers is unavailable, the index automatically falls back to a hash-based embedding (MD5 bucketing) so that search keeps working.
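
The fallback is described only as "MD5 bucketing". One way such an embedding could work is sketched below, assuming whitespace tokenization and one bucket per token. `hash_embedding` and `cosine` are hypothetical names, and the project's real fallback may differ.

```python
import hashlib
import math


def hash_embedding(text, dim=384):
    """Bucket each token by its MD5 digest, then L2-normalize the counts."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


query = hash_embedding("deploy server")
related = hash_embedding("server deploy checklist")
unrelated = hash_embedding("code review")
```

A vector built this way reflects token overlap only, so "deploy" and "release" look unrelated to it; it keeps search functional but does not replace the semantic model.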

Observed results:

"deploy" → matches server-operations (score 0.82)
"debug" → matches debugging-toolkit (score 0.79)
"review" → matches code-review (score 0.85)

5. Failure loop: log every misuse, fix what gets logged

5.1 Usage tracking

Every time a skill is used, _update_skill_usage() updates .usage.json:

{
  "code-review": {
    "use_count": 12,
    "total_outcomes": 10,
    "success_rate": 0.7
  }
}
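
As an illustration of how one outcome might be folded into an entry like the one above, here is a sketch that works on the in-memory dict and leaves persistence out. `apply_outcome` is a hypothetical helper, not the project's record_outcome(). It leaves use_count untouched, on the assumption that uses and outcomes are counted separately, as the sample entry (12 uses, 10 outcomes) suggests.

```python
def apply_outcome(usage, skill_name, success):
    """Fold one success/failure into a skill's usage entry and return it."""
    entry = usage.setdefault(
        skill_name, {"use_count": 0, "total_outcomes": 0, "success_rate": 0.0}
    )
    # Recover the success count from the stored rate, then add the new outcome.
    successes = round(entry["success_rate"] * entry["total_outcomes"])
    successes += 1 if success else 0
    entry["total_outcomes"] += 1
    entry["success_rate"] = successes / entry["total_outcomes"]
    return entry


usage = {"code-review": {"use_count": 12, "total_outcomes": 10, "success_rate": 0.7}}
after_failure = apply_outcome(usage, "code-review", success=False)
```

Starting from 7 successes out of 10, one more failure gives 7 out of 11. Storing the raw success count instead of the rate would avoid the rounding step.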

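5.2 Failure-driven evolution

def _trigger_skill_evolution(event):
    """Sustained low success rate → flag the skill for review"""
    # skill_name, entry, success_rate, total and error_msg are not defined in this
    # excerpt; presumably they are derived from `event` and .usage.json
    if success_rate < 0.3 and total > 5:
        _flag_for_review(skill_name, entry)
    if "not found" in error_msg:
        _log_discovery_trigger(event)  # trigger skill discovery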
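
The two conditions above can be exercised in isolation. This self-contained sketch reuses the same thresholds (success rate below 0.3, more than 5 recorded outcomes) and returns action labels instead of calling the project's helpers. `evolution_actions` and the label strings are hypothetical.

```python
def evolution_actions(entry, error_msg="", min_outcomes=5, min_rate=0.3):
    """Return the follow-up actions implied by a usage entry and an error message."""
    actions = []
    # Require enough outcomes so one early failure does not flag a new skill.
    if entry["total_outcomes"] > min_outcomes and entry["success_rate"] < min_rate:
        actions.append("flag_for_review")
    # A missing skill points to a gap in the library, not a bad skill.
    if "not found" in error_msg:
        actions.append("discovery_trigger")
    return actions


struggling = evolution_actions({"total_outcomes": 10, "success_rate": 0.2})
too_new = evolution_actions({"total_outcomes": 1, "success_rate": 0.0})
missing = evolution_actions({"total_outcomes": 0, "success_rate": 0.0},
                            error_msg="skill not found: k8s-rollout")
```

The outcome-count guard is what stops a skill with a single failed use (success rate 0.0) from being flagged right away.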

5.3 Auto-Evolve (new)

compound.sh reflect now automatically detects skill-related tasks:

# Extract skill names from the files argument
skills_from_path=$(echo "$files" | grep -oP 'skills/\K[^/]+' | sort -u)
skills_from_md=$(echo "$files" | grep -oP '([^/]+)/SKILL\.md' | ...)

# Call evolve automatically
for skill_name in $all_skills; do
    python3 unified_reflection.py evolve "$skill_name" "$outcome"
done

Effect: compound.sh reflect error_recovered high 0 0 "skills/code-review/SKILL.md" automatically marks code-review as successfully used.
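
For readers less familiar with grep's `\K`, the name extraction can be expressed in Python roughly as follows. This is an equivalent sketch, not part of compound.sh. `skills_from_files` is a hypothetical name, and excluding whitespace from names is an added assumption so that a space-separated file list works.

```python
import re


def skills_from_files(files):
    """Extract unique skill names from paths such as skills/<name>/SKILL.md."""
    # Same idea as grep -oP 'skills/\K[^/]+': keep the segment after "skills/".
    names = set(re.findall(r"skills/([^/\s]+)", files))
    # Same idea as the second grep: the directory that contains SKILL.md.
    names.update(re.findall(r"([^/\s]+)/SKILL\.md", files))
    return sorted(names)


single = skills_from_files("skills/code-review/SKILL.md")
several = skills_from_files("skills/code-review/SKILL.md docs/debugging-toolkit/SKILL.md")
```

Merging both patterns into one set plays the role of `sort -u`: a skill matched by both is evolved only once.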