Training Agent Skills Like Neural Networks: SkillOpt Integration in Practice
A complete walkthrough of integrating Microsoft SkillOpt with Hermes Agent
What Is SkillOpt?
Most people think of "training an AI" as adjusting model weights. SkillOpt flips this: it treats the agent's skill document (system prompt) as a trainable parameter, and optimizes it the way you'd train a neural network — but entirely in text space.
The training loop mirrors SGD:
┌──────────────────────────────────────────────┐
│ SkillOpt Training Loop │
│ │
│ for step in steps: │
│ 1. Rollout — Agent answers batch of Qs │
│ 2. Reflect — LLM analyzes failures │
│ 3. Aggregate— Merge improvement patches │
│ 4. Select — Rank + clip (learning rate) │
│ 5. Update — Apply patches to skill doc │
│ 6. Gate — Evaluate; accept/reject │
└──────────────────────────────────────────────┘
The result: a system prompt that has been systematically improved through iterative feedback, not hand-tuned guesswork.
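The six phases can be sketched as a toy loop. This is a minimal illustration, not SkillOpt's implementation: the skill is modelled as a set of short rules, the scoring function and the reflector are deterministic stand-ins for the agent and the LLM analyst, and every name (`rollout`, `reflect`, `aggregate`, `select`, `train`, `Patch`) is hypothetical.

```python
"""Toy SkillOpt-style loop. Every name here is hypothetical, not SkillOpt's API."""
from dataclasses import dataclass

# Assumption: a skill is a set of short rules, and these three are the useful ones.
USEFUL_RULES = frozenset({"cite the passage", "answer in one phrase", "check numeric units"})

# Assumption: fixed reflector confidence per candidate rule (not tied to usefulness).
CANDIDATE_PRIORITY = {
    "write long essays": 0.9,
    "cite the passage": 0.8,
    "ignore the question": 0.7,
    "answer in one phrase": 0.6,
    "check numeric units": 0.5,
}


@dataclass(frozen=True)
class Patch:
    rule: str
    priority: float


def rollout(skill: frozenset) -> float:
    """Stand-in for scoring a batch: useful rules help, other rules hurt."""
    return (len(skill & USEFUL_RULES) - len(skill - USEFUL_RULES)) / len(USEFUL_RULES)


def reflect(skill: frozenset, rejected: set) -> list:
    """Stand-in for the LLM analyst: propose every rule not yet tried."""
    return [Patch(rule, prio) for rule, prio in CANDIDATE_PRIORITY.items()
            if rule not in skill and rule not in rejected]


def aggregate(patches: list) -> list:
    """Merge duplicates, keeping the highest priority per rule."""
    best = {}
    for patch in patches:
        if patch.rule not in best or patch.priority > best[patch.rule].priority:
            best[patch.rule] = patch
    return list(best.values())


def select(patches: list, max_patches: int) -> list:
    """Rank by priority and clip; the clip size plays the role of a learning rate."""
    return sorted(patches, key=lambda p: p.priority, reverse=True)[:max_patches]


def train(steps: int = 10, max_patches: int = 1):
    skill, rejected, log = frozenset(), set(), []
    for step in range(1, steps + 1):
        current_score = rollout(skill)                    # 1. Rollout
        proposals = reflect(skill, rejected)              # 2. Reflect
        merged = aggregate(proposals)                     # 3. Aggregate
        chosen = select(merged, max_patches)              # 4. Select
        if not chosen:
            break                                         # nothing left to try
        candidate = skill | {p.rule for p in chosen}      # 5. Update
        accepted = rollout(candidate) > current_score     # 6. Gate
        if accepted:
            skill = candidate
        else:
            rejected.update(p.rule for p in chosen)
        log.append((step, sorted(p.rule for p in chosen), accepted))
    return skill, rollout(skill), log


final_skill, final_score, history = train()
for entry in history:
    print(entry)
```

With `max_patches=1` the gate sees one edit at a time, so a harmful rule is rejected without dragging a useful one down with it. A larger clip applies more edits per step but makes each accept/reject decision coarser.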
What We Built
Three phases, each with a concrete deliverable.
Phase 1: SearchQA PoC (+4.43 pp Accuracy)
Setup:
- Model: DeepSeek v4 Flash
- Benchmark: SearchQA (200 validation items)
- Training budget: 80 steps, 1 epoch
Training Curve (selection_hard accuracy by step):
Accuracy
^
| ★ step 36 (0.8250)
| / \
| ___/ \________ plateau (steps 37-80)
| /
| / step 0 (baseline 0.7807)
|/________________________________> step
0 10 20 30 40 50 60 70 80
| Metric | Value |
|---|---|
| Baseline (initial skill) | 0.7807 |
| Best (step 36) | 0.8250 |
| Final (step 80) | 0.8250 |
| Improvement | +4.43 pp (+5.7%) |
| Accept/Reject ratio | 6 accepts / 49 evals |
| Training tokens | 31.0M |
| Wall time | 5,776s (~1.6h) |
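The improvement row distinguishes percentage points from relative percent. A short check of that arithmetic, using only the baseline and best scores from the table (the function name `improvement` is just for illustration):

```python
def improvement(baseline: float, best: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_gain_pp = (best - baseline) * 100
    rel_gain_pct = (best - baseline) / baseline * 100
    return abs_gain_pp, rel_gain_pct


abs_pp, rel_pct = improvement(baseline=0.7807, best=0.8250)
print(f"+{abs_pp:.2f} pp, +{rel_pct:.1f}% relative")

# Wall time from the table, converted from seconds to hours.
wall_hours = 5776 / 3600
print(f"{wall_hours:.1f} h")
```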
Skill evolution: The initial skill was a 3-line empty template. After training, it grew into a 30-line document with reading strategies, answer format rules, and special-case handling (Jeopardy entries, tributes, numeric constraints). See outputs/poc_run/best_skill.md.
Phase 2: Decoupled Hermes Backend (Zero Core Invasion)
The problem: SkillOpt has its own backend system for LLM calls. We needed Hermes Agent — not the OpenAI API — to be the LLM behind training.
The wrong way: Modify SkillOpt's skillopt/model/ internals to inject Hermes support. Tempting, but creates a maintenance nightmare on every SkillOpt upgrade.
The right way: A decoupled plugin following SkillOpt's existing CliBackend pattern.
# skillopt_sleep/hermes_backend.py (~70 lines)
"""Hermes Agent backend: no modifications to skillopt/ core."""
import subprocess

# CliBackend comes from SkillOpt; its import is omitted in this excerpt.
class HermesCliBackend(CliBackend):
    name = "hermes"

    def _call(self, prompt: str, *, max_tokens: int = 1024) -> str:
        cmd = [
            self.hermes_path,  # default: "hermes"
            "chat",
            "--query", prompt,
            "--model", self.model,
            "--quiet",
        ]
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=self.timeout)
        if proc.returncode != 0:
            return ""
        # Strip "session_id: <id>" prefix line
        lines = (proc.stdout or "").strip().split("\n", 1)
        return lines[1].strip() if len(lines) > 1 else lines[0].strip()
Architecture comparison:
Intrusive (rejected):                      Decoupled (adopted):
skillopt/                                  skillopt/ (untouched)
├── model/                                 ├── model/
│   ├── hermes_backend.py ← NEW (touch)    │   └── ... (no changes)
│   ├── __init__.py       ← MODIFIED       │
│   ├── backend_config.py ← MODIFIED       skillopt_sleep/ ← NEW
│   └── common.py         ← MODIFIED       ├── hermes_backend.py
                                           └── __init__.py
                                           plugins/hermes/ ← NEW
The plugin lives entirely outside skillopt/. Upstream updates merge cleanly. All 9 unit tests pass without any core changes.
Usage:
from skillopt_sleep.hermes_backend import HermesCliBackend

backend = HermesCliBackend(model="mimo-v2.5")
response = backend.attempt(
    task="What is the capital of France?",
    skill="Answer succinctly.",
    memory=[],  # conversation history
)
Phase 3: compound-system Integration (Auto-Sync + Daily Cron)
Once training finishes, the optimized skill goes stale if it sits in a training output directory. We bridge it to Hermes Agent's knowledge system via compound-system.
SkillOpt Training → best_skill.md
        ↓
sync-to-compound.sh (auto-sync script)
        ↓
compound-system knowledge base
├── solutions/knowledge/skillopt-qa-skill-2026-06-20.md
└── references/skillopt/best_skill.md
        ↓
Daily cron: skillopt-auto-evolve (@3am)
The sync script extracts training metadata (best score, step, tokens) and generates a structured solution file that Hermes Agent can search:
$ bash scripts/search.sh "skillopt"
[INFO] Found 1 solution(s) for: skillopt
[1] "SkillOpt QA Skill Optimization (Score: 0.825)"
File: solutions/knowledge/skillopt-qa-skill-2026-06-20.md
Tags: [skill-optimization, question-answering, skillopt, poc]
Key Technical Decisions
Why Decoupled Architecture?
Three reasons:
- Upstream safety. skillopt/ is Microsoft's code. Every local modification risks a merge conflict on git pull. Zero modifications = zero conflicts.
- Separation of concerns. The Hermes CLI backend does things SkillOpt's native backends don't: tool calling, multi-turn chat, return_message. Squeezing these into SkillOpt's module-level function interface (chat_target(system, user)) would fight the framework. The CliBackend abstraction is the right seam.
- Plugins are testable independently. HermesCliBackend has its own test suite that doesn't need SkillOpt's training infrastructure.
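To illustrate the third point, here is a rough sketch of testing a CLI-style backend without SkillOpt or a real `hermes` binary. `MiniCliBackend` is a hypothetical stand-in that mirrors the `_call` excerpt shown earlier; it does not inherit from SkillOpt's `CliBackend`, and the actual test suite may be structured quite differently. The sketch relies only on the standard library's `unittest.mock.patch` and `subprocess.CompletedProcess`.

```python
import subprocess
from unittest import mock


class MiniCliBackend:
    """Hypothetical stand-in for HermesCliBackend (no SkillOpt dependency)."""

    def __init__(self, model="mimo-v2.5", hermes_path="hermes", timeout=180):
        self.model = model
        self.hermes_path = hermes_path
        self.timeout = timeout

    def _call(self, prompt: str) -> str:
        cmd = [self.hermes_path, "chat", "--query", prompt,
               "--model", self.model, "--quiet"]
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=self.timeout)
        if proc.returncode != 0:
            return ""
        # Drop the first line (expected to be "session_id: <id>").
        lines = (proc.stdout or "").strip().split("\n", 1)
        return lines[1].strip() if len(lines) > 1 else lines[0].strip()


def fake_run(stdout: str, returncode: int = 0):
    """Build a replacement for subprocess.run that never spawns a process."""
    def _run(cmd, **kwargs):
        return subprocess.CompletedProcess(cmd, returncode, stdout=stdout, stderr="")
    return _run


backend = MiniCliBackend()

# Successful call: the session line is stripped, the answer is returned.
with mock.patch("subprocess.run", new=fake_run("session_id: abc123\nParis\n")):
    ok_answer = backend._call("What is the capital of France?")

# Failing call: a non-zero exit code yields an empty string.
with mock.patch("subprocess.run", new=fake_run("", returncode=1)):
    failed_answer = backend._call("anything")

print(repr(ok_answer), repr(failed_answer))
```

Because the subprocess call is the only external dependency, patching it is enough to exercise both the success path and the failure path in milliseconds.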
How HermesCliBackend Works
It's a thin shell around:
hermes chat --query "<system>\n\n<user>" --model <model> --quiet
The --quiet flag suppresses Hermes' session metadata, except for the first line (session_id: ...), which we strip in _parse_output. The model can be set per-call or defaulted from $HERMES_MODEL.
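One possible way to write the stripping step is sketched below. It is a hypothetical variant, not the plugin's actual `_parse_output`: it assumes the metadata line always starts with the literal prefix `session_id:` and only drops the first line when that prefix is present, so a multi-line answer without the prefix is returned intact.

```python
def strip_session_prefix(stdout: str) -> str:
    """Drop a leading 'session_id: ...' line only if it is actually present."""
    text = (stdout or "").strip()
    first_line, _, rest = text.partition("\n")
    if first_line.startswith("session_id:"):
        return rest.strip()
    return text


samples = [
    "session_id: abc123\nParis\n",   # normal --quiet output
    "Paris",                         # no metadata line at all
    "Line one\nLine two",            # multi-line answer, no metadata
    "session_id: abc123",            # metadata only, empty answer
]
for sample in samples:
    print(repr(strip_session_prefix(sample)))
```

The trade-off is that this version depends on the exact prefix text, whereas dropping the first line unconditionally depends on the line always being there.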
# Factory function
import os

def create_hermes_backend(**kwargs):
    return HermesCliBackend(
        model=kwargs.get("model", os.environ.get("HERMES_MODEL", "mimo-v2.5")),
        hermes_path=kwargs.get("hermes_path", os.environ.get("HERMES_PATH", "hermes")),
        timeout=kwargs.get("timeout", int(os.environ.get("HERMES_TIMEOUT", "180"))),
    )
Early Stopping: < 50 Lines, 50% Cost Reduction
Our POC training ran all 80 steps even though the best score was reached at step 36 and never improved afterwards. That's 44 wasted steps, about 55% of the steps and, roughly, of the total cost.
Implementation — a simple patience counter injected after the evaluate phase:
# In the training loop, after evaluate step:
if best_score > prev_best_score + threshold:
    steps_without_improvement = 0
    prev_best_score = best_score
else:
    steps_without_improvement += 1
if steps_without_improvement >= patience:
    print(f"[EARLY STOP] No improvement for {patience} steps")
    break
Impact:
| Metric | Without Early Stopping | With Early Stopping |
|---|---|---|
| Steps executed | 80 | ~36-40 |
| Tokens consumed | 31.0M | ~15.5M |
| API cost (est.) | ~$3.10 | ~$1.55 |
| Wall time | 1.6h | ~0.8h |
| Best score preserved | ✅ | ✅ |
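With patience=20 and threshold=0.001, training stops 20 steps after the last improvement. Since our POC hit its best at step 36 with no improvement through step 80, it would have stopped at step 56, i.e. after 70% of the steps, saving roughly 30% of the cost with zero quality loss. Stopping at the ~36-40 steps shown in the table would need a patience of about 4 or less, which risks halting during the plateau at steps 6-15.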
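The patience rule can be replayed offline against a best-score curve. In the sketch below the curve is synthetic: only the baseline (0.7807), the 0.805 level reached by step 5, and the final best at step 36 (0.8250) come from this post; the intermediate improvement at step 20 is a placeholder, and the helper names are hypothetical.

```python
def best_score_curve(total_steps: int, improvements: dict, baseline: float) -> list:
    """Expand {step: new_best} into a running-best list for steps 1..total_steps."""
    curve, best = [], baseline
    for step in range(1, total_steps + 1):
        best = improvements.get(step, best)
        curve.append(best)
    return curve


def early_stop_step(curve: list, baseline: float, patience: int,
                    threshold: float = 0.001):
    """Return the 1-based step at which the patience rule fires, or None."""
    prev_best, stale = baseline, 0
    for step, best in enumerate(curve, start=1):
        if best > prev_best + threshold:
            stale = 0
            prev_best = best
        else:
            stale += 1
        if stale >= patience:
            return step
    return None


# Synthetic curve: step 20 is a placeholder, not a measured value.
curve = best_score_curve(80, {5: 0.805, 20: 0.815, 36: 0.8250}, baseline=0.7807)

stop = early_stop_step(curve, baseline=0.7807, patience=20)
print("patience=20 stops at step", stop)
print("fraction of steps run:", stop / len(curve))
```

Replaying recorded curves like this is a cheap way to pick `patience` before spending tokens: a value smaller than the longest plateau between two improvements would cut the run off before the later gains.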
Lessons Learned
1. The YAML auth_mode Gotcha
Symptom: Training started and immediately failed with 401 errors. Direct API calls worked fine.
Root cause: The YAML config had:
azure_openai_auth_mode: openai_compatible   # ✅ set globally
optimizer_azure_openai_auth_mode: ""        # ❌ empty, falls back to default
target_azure_openai_auth_mode: ""           # ❌ empty, falls back to default
When optimizer_ / target_ overrides are empty, SkillOpt falls back to Azure's default auth (which requires an Entra ID token), not the global azure_openai_auth_mode. The fix:
azure_openai_auth_mode: openai_compatible
optimizer_azure_openai_auth_mode: openai_compatible # ✅ explicit
target_azure_openai_auth_mode: openai_compatible # ✅ explicit
Then clear any stale cache: rm -f outputs/**/results.jsonl.
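A small pre-flight check could catch this before a run starts. The sketch below is hypothetical and not part of SkillOpt: it takes a plain dict standing in for the parsed YAML and reports any role-specific auth key that is missing or blank. The key names are the ones shown above; the function name is an assumption.

```python
AUTH_KEY = "azure_openai_auth_mode"
ROLE_PREFIXES = ("optimizer_", "target_")


def empty_auth_overrides(config: dict) -> list:
    """Return the role-specific auth keys that are missing or blank."""
    problems = []
    for prefix in ROLE_PREFIXES:
        key = prefix + AUTH_KEY
        value = config.get(key)
        if not str(value or "").strip():
            problems.append(key)
    return problems


broken = {
    "azure_openai_auth_mode": "openai_compatible",
    "optimizer_azure_openai_auth_mode": "",
    "target_azure_openai_auth_mode": "",
}
fixed = {key: "openai_compatible" for key in broken}

print(empty_auth_overrides(broken))
print(empty_auth_overrides(fixed))
```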
2. Decoupling vs. Intrusion: The Decision Process
Originally we implemented Hermes support by modifying skillopt/model/ directly — adding a hermes_backend.py and registering it in __init__.py. It worked. But:
- 3 files touched in SkillOpt's core (__init__.py, backend_config.py, common.py)
- Every git pull risked conflicts
- The integration was tightly coupled to SkillOpt's internal API
The user pushed back: "Decouple, don't invade." We stepped back, read SkillOpt's plugin docs, and found the CliBackend class — purpose-built for this. The rewrite took 2 hours, eliminated all core modifications, and produced a cleaner API.
Rule of thumb: If your integration modifies >1 file in a dependency's core, look for a plugin/hook/extension point before committing to the intrusive path.
3. Training Efficiency: Step 36 Is the Sweet Spot
The training curve tells a clear story:
| Steps 1-5 | Steps 6-15 | Steps 16-36 | Steps 37-80 |
|---|---|---|---|
| Rapid ascent (0.775 → 0.805) | Plateau at 0.805 | Second ascent to 0.825 | Flat plateau |
Most effective edits happened in the first 36 steps. After that, the optimizer kept proposing changes but the gate correctly rejected them. This suggests:
- Diminishing returns after ~40 steps on a 200-sample validation set with a single training epoch.
- More diverse training data (larger dataset, data augmentation) might unlock further gains.
- LR scheduling could help: smaller edits later in training to fine-tune rather than thrash.
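The last idea could look something like the sketch below, assuming the "clip" in the Select phase is a cap on how many patches are applied per step. That is an assumption about SkillOpt's internals; the function name, the default cap values, and the linear decay are all hypothetical.

```python
def edit_budget(step: int, total_steps: int, start: int = 4, end: int = 1) -> int:
    """Linearly decay the per-step patch cap from `start` down to `end`."""
    if total_steps <= 1:
        return end
    frac = (step - 1) / (total_steps - 1)   # 0.0 at the first step, 1.0 at the last
    return max(end, round(start - frac * (start - end)))


budgets = [edit_budget(step, 80) for step in range(1, 81)]
print(budgets[0], budgets[39], budgets[-1])
```

Early steps get room for broad rewrites, while late steps are limited to a single small edit, which makes each gate decision easier to attribute.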
Cost Analysis
Per-Training-Run Breakdown (SearchQA, 80 steps)
| Component | Tokens | Est. Cost (DeepSeek v4 Flash) |
|---|---|---|
| Rollout (5 samples × 200 eval items) | ~25.5M | ~$2.55 |
| Analyst/Reflect (patches) | ~3.5M | ~$0.35 |
| Merge/Aggregate | ~1.0M | ~$0.10 |
| Overhead (scheduler, gate) | ~1.0M | ~$0.10 |
| Total | ~31.0M | ~$3.10 |
With early stopping (halving to 40 steps): **$1.55 per run.**
Cost Projections for Other Benchmarks
| Benchmark | Est. Steps | Est. Tokens | Est. Cost | Est. Time |
|---|---|---|---|---|
| SearchQA | 40 (w/ ES) | 15.5M | ~$1.55 | ~0.8h |
| LiveMathematicianBench | 40 (w/ ES) | ~20M | ~$2.00 | ~1.0h |
| SpreadsheetBench (30-turn) | 60 (w/ ES) | ~80M | ~$8.00 | ~3.0h |
All costs use DeepSeek v4 Flash pricing (~$0.10/M tokens). Switching to a pricier model (GPT-4o, Claude) would increase costs ~5-10x.
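The cost figures follow directly from token counts times the per-million price. A quick check using the breakdown table's numbers (the flat ~$0.10/M rate is the approximation stated above; the helper name is just for illustration):

```python
PRICE_PER_M_TOKENS = 0.10  # USD per million tokens, approximate rate from the text

# Token counts in millions, taken from the per-run breakdown table.
TOKENS_M = {
    "rollout": 25.5,
    "reflect": 3.5,
    "aggregate": 1.0,
    "overhead": 1.0,
}


def run_cost(tokens_m: float, price_per_m: float = PRICE_PER_M_TOKENS) -> float:
    """Estimated cost in USD for a given number of million tokens."""
    return round(tokens_m * price_per_m, 2)


total_tokens_m = sum(TOKENS_M.values())
print(f"total: {total_tokens_m:.1f}M tokens, ${run_cost(total_tokens_m):.2f}")
print(f"half run: ${run_cost(total_tokens_m / 2):.2f}")

# A model priced 5-10x higher scales the same run linearly.
for multiplier in (5, 10):
    print(f"{multiplier}x price: ${run_cost(total_tokens_m, PRICE_PER_M_TOKENS * multiplier):.2f}")
```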
What's Next
LiveMathematicianBench (Math Reasoning)
Moving from single-turn QA to multi-step mathematical reasoning. The skill learns to decompose problems, define variables, and verify answers. Config prepared in configs/hermes/deepseek-livemath.yaml.
SpreadsheetBench (Multi-Turn Code Gen)
30-turn spreadsheet manipulation with code generation, execution feedback, and error recovery. This tests SkillOpt's ability to optimize long-horizon planning — much harder than QA.
Cross-Model Transfer
Does a skill optimized on DeepSeek transfer to GPT-4o or Claude? Initial theory: ≥ 80% of original accuracy should transfer, since the skill is written in natural language instructions, not model-specific API patterns.
Quick Start
To run SkillOpt training with Hermes Agent today:
# 1. Clone SkillOpt
git clone https://github.com/microsoft/SkillOpt
cd SkillOpt
# 2. Install the Hermes backend plugin
pip install -e skillopt_sleep/ # our plugin package
cp -r plugins/hermes/ ~/.hermes/plugins/ # optional
# 3. Configure
export HERMES_MODEL="mimo-v2.5"
export HERMES_PATH="hermes"
# In your YAML config:
# azure_openai_auth_mode: openai_compatible
# optimizer_azure_openai_auth_mode: openai_compatible
# target_azure_openai_auth_mode: openai_compatible
# 4. Train
python scripts/train.py \
--config configs/hermes/deepseek.yaml \
--out_root outputs/my_run
# 5. Sync to compound-system
bash scripts/sync-to-compound.sh
Related resources:
- SkillOpt Paper
- SkillOpt GitHub
- Hermes Agent Docs
- compound-system
- Our plugin code: skillopt_sleep/hermes_backend.py
Skill optimization in text space is a new paradigm. It's not a replacement for weight-space training — it's a complement, operating at a different level of the stack. For agent developers who've been hand-tuning system prompts, it's the difference between guessing and gradient descent.