Golden-Data-Driven Automatic Evaluation Pipeline

> Automatic evaluation is the foundation of optimization: without it, optimizing is like blind men feeling an elephant.


1. Driven by golden data

data/eval/
├─ tickets_eval.csv          # tickets used for evaluation (query + metadata)
├─ golden_expectations.csv   # expected results (intent + risk + severity)
└─ golden_sag.csv            # expected retrieval results (query → relevant_doc_ids)
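
As a rough illustration of how the files above might be paired up, the sketch below joins each evaluation ticket with its expected labels. The column names (`ticket_id`, `query`, `intent`, `risk`, `severity`) and both helper functions are assumptions made for this example; the actual CSV schema is not specified here.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def join_golden(tickets, expectations, key="ticket_id"):
    """Attach the expected labels to each ticket.

    `key` is a hypothetical join column. A ticket without a golden row
    raises instead of being silently skipped, so gaps in the golden set
    surface before any metric is computed.
    """
    by_id = {row[key]: row for row in expectations}
    missing = [t[key] for t in tickets if t[key] not in by_id]
    if missing:
        raise ValueError(f"tickets without golden expectations: {missing}")
    return [{**t, "expected": by_id[t[key]]} for t in tickets]
```

Under these assumptions, `join_golden(read_rows("data/eval/tickets_eval.csv"), read_rows("data/eval/golden_expectations.csv"))` would yield one evaluation case per ticket.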

2. Evaluation dimensions

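Dimension	Metrics	Description
Intent	accuracy, F1 per class	Classification accuracy over 8 intent classes
Severity	accuracy, macro-F1	Severity grading (LOW/MEDIUM/HIGH)
Risk	F1, precision, recall	Detection of 6 risk labels
SAG	recall@k, MRR	Recall of relevant documents in retrieval
No-auto-send	accuracy	Identifying tickets that must not be sent automatically
Composite	weighted score	Overall score (0-1)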
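
The metrics in the table follow their standard definitions, which can be sketched in plain Python as below. This is a minimal sketch, not the pipeline's own code: the function names are illustrative, and the composite weights are hypothetical because the weighting is not given here.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = set(gold) | set(pred)
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def recall_at_k(relevant, ranked, k):
    """Fraction of the relevant doc ids that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def mrr(relevant_lists, ranked_lists):
    """Mean reciprocal rank of the first relevant hit; a query with no hit adds 0."""
    total = 0.0
    for relevant, ranked in zip(relevant_lists, ranked_lists):
        wanted = set(relevant)
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in wanted:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def composite(scores, weights):
    """Weighted average of per-dimension scores; stays in 0-1 if every score does."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight
```

Normalizing by the total weight keeps the composite in the 0-1 range regardless of how the individual weights are scaled.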

3. Evaluation → optimization closed loop

Golden Data
    │
    ▼
Evaluate (intent + severity + risk + SAG)
    │
    ▼
Diagnose (which cases fail? what pattern do they share?)
    │
    ├─→ Rule Fix (keyword/threshold tuning)
    ├─→ ML Fix (FastText retraining)
    ├─→ LLM Fix (suggested rule mutations)
    │
    ▼
Verify (full regression test)
    │
    ├─→ Pass → Apply
    └─→ Fail → Rollback
    │
    ▼
Record (persist evaluation results to JSON)
    │
    └─→ next optimization round (resume from checkpoint with --resume)
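
The loop above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `optimize`, the `(name, fix)` pairs, the single-number `evaluate` score, and the JSON layout are all hypothetical, and the `resume` argument only mimics what a `--resume` flag might do.

```python
import json
from pathlib import Path


def optimize(state, evaluate, fixes, log_path, resume=False):
    """Try each candidate fix, keep it only if the full evaluation improves.

    state    : JSON-serializable config (e.g. keywords, thresholds)
    evaluate : callable(state) -> score, run over the whole golden set
    fixes    : list of (name, callable(state) -> new_state)
    """
    log_path = Path(log_path)
    records = []
    if resume and log_path.exists():
        # Pick up the last persisted state and skip fixes already tried.
        saved = json.loads(log_path.read_text(encoding="utf-8"))
        state, records = saved["state"], saved["records"]
    done = {r["fix"] for r in records}
    best = evaluate(state)
    for name, fix in fixes:
        if name in done:
            continue
        candidate = fix(dict(state))      # copy, so rollback is just "discard"
        score = evaluate(candidate)       # Verify: full regression run
        accepted = score > best
        if accepted:                      # Pass -> Apply
            state, best = candidate, score
        # Fail -> Rollback: `state` is left untouched.
        records.append({"fix": name, "score": score, "accepted": accepted})
        log_path.write_text(
            json.dumps({"state": state, "records": records}, indent=2),
            encoding="utf-8",
        )
    return state, records
```

Writing the log after every fix, rather than once at the end, is what makes an interrupted run resumable in this sketch.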

4. End-to-end data flow

┌──────────────────────────────────────────────────────────────┐
│                    User request                              │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  SAG Retrieval (RAG)                                         │
│  ├─ Structural Recall (entity matching + expansion)          │
│  ├─ Vector Recall (embedding cosine)                         │
│  ├─ Fusion (both_boost)                                      │
│  └─ Thompson Sampling ranking                                │
└──────────────────────┬───────────────────────────────────────┘
                       │ top-k skills
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  Skill Injection (SAG)                                       │
│  ├─ principle → behavioral guidance                          │
│  ├─ common_mistakes → negative guidance                      │
│  └─ when_to_apply → context matching                         │
└──────────────────────┬───────────────────────────────────────┘
                       │ augmented prompt
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  LLM Generation + Task Execution                            │
└──────────────────────┬───────────────────────────────────────┘
                       │ outcome
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  Feedback Loop                                                │
│  ├─ .usage.json (success_rate update)                        │
│  ├─ failure_log.jsonl (failure records)                      │
│  ├─ compound.sh reflect (reflective judgment)                │
│  └─ skill evolution (failure-driven evolution)               │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  Auto-Evaluation Pipeline                                     │
│  ├─ Golden data matching                                     │
│  ├─ Intent/Severity/Risk/SAG evaluation                      │
│  ├─ Diagnose → Fix → Verify → Apply                          │
│  └─ Optimizers (Rules → FastText → NSGA-II → LLM)            │
└──────────────────────────────────────────────────────────────┘
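
To make the retrieval stage of the diagram more concrete, here is a minimal sketch of score fusion followed by Thompson Sampling ranking. Several parts are assumptions: treating `both_boost` as a multiplier for skills found by both recall paths, storing usage as `(successes, failures)` counts, and multiplying relevance by the sampled success rate. The real fusion formula and the `.usage.json` layout are not described here.

```python
import random


def fuse(structural, vector, both_boost=1.2):
    """Merge two {skill: score} maps; skills found by both paths get a boost.

    Taking the max of the two scores and the 1.2 multiplier are illustrative.
    """
    fused = {}
    for skill in structural.keys() | vector.keys():
        score = max(structural.get(skill, 0.0), vector.get(skill, 0.0))
        if skill in structural and skill in vector:
            score *= both_boost
        fused[skill] = score
    return fused


def thompson_rank(fused, usage, k=3, rng=None):
    """Rank skills by relevance times a sampled success rate.

    Each skill's success rate is drawn from Beta(successes + 1, failures + 1),
    so rarely used skills have wide posteriors and still get explored.
    """
    rng = rng or random

    def sampled_score(skill):
        successes, failures = usage.get(skill, (0, 0))
        return fused[skill] * rng.betavariate(successes + 1, failures + 1)

    return sorted(fused, key=sampled_score, reverse=True)[:k]
```

Because the ranking is sampled, repeated calls can return different orders; passing a seeded `random.Random` as `rng` makes a run reproducible.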