
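Part 4 · DSL Compilation Layer

A DSL (domain-specific language) is the intermediate layer between the LLM and SQL. This chapter covers why the LLM should not write SQL directly, how the DSL is designed, how the automatic behaviors are implemented, and how error self-recovery works.

4.1 Why not let the LLM write SQL directly

Letting the LLM write SQL looks like the simplest option, since it needs no intermediate layer. In practice it does not hold up.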

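Problem 1: Cost

A complex SQL statement (JOINs, subqueries, CASE WHEN) costs the LLM 800-1500 tokens on average;
the DSL JSON expressing the same intent takes 200-500 tokens;
every failed SQL attempt costs up to another 1500 tokens, so N retries multiply the bill.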

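Problem 2: Correctness

SQL is free-form: the LLM has to spell out joins, grouping, and nesting itself, so it can go wrong in countless ways: wrong JOIN relationships, columns missing from GROUP BY, badly nested CASE WHEN, misused aggregate functions, misspelled field names;
the DSL is declarative and structured, so the ways to get it wrong are limited, and the compiler can catch them and report a precise error.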

Measured on cost-query:

Approach	First-attempt accuracy	Average retries
LLM writes SQL directly	~50%	2.3
LLM writes DSL JSON	~75%	1.4
LLM uses second-level command templates	~92%	0.4
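To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It combines the upper token estimates from Problem 1 with the retry counts in the table, and it assumes (a simplification not stated in the source) that every retry costs as much as the first attempt.

```python
def expected_tokens(tokens_per_attempt: int, avg_retries: float) -> float:
    """Expected output tokens per question: the first attempt plus the average retries."""
    return tokens_per_attempt * (1 + avg_retries)

# Upper estimates from Problem 1, retry counts from the table above.
sql_cost = expected_tokens(1500, 2.3)
dsl_cost = expected_tokens(500, 1.4)

print(f"SQL ~{sql_cost:.0f} tokens, DSL ~{dsl_cost:.0f} tokens, "
      f"ratio {sql_cost / dsl_cost:.1f}x")
```

Under these assumptions the direct-SQL path costs roughly 4950 tokens per question against roughly 1200 for the DSL path, about a 4x difference.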

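Problem 3: Security

SQL injection risk: special characters in the user's question (quotes, semicolons, `--`) get concatenated into the SQL;
the compiler can bind values as parameters when translating DSL → SQL, which closes off injection through filter values.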
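The following is a minimal sketch of what parameterization buys, not cost-query's actual code. `compile_equals` and the `project` table are hypothetical, and sqlite3 (with its `?` placeholder) stands in for MySQL so the example runs without a server.

```python
import sqlite3

def compile_equals(column: str, value: str) -> tuple[str, list]:
    """Compile an equals filter into a SQL fragment plus bound parameters."""
    # Column names should come from the schema, never from user text.
    if not column.isidentifier():
        raise ValueError(f"illegal column name: {column!r}")
    return f"{column} = ?", [value]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (name TEXT, province TEXT)")
conn.executemany("INSERT INTO project VALUES (?, ?)",
                 [("A", "Guangdong"), ("B", "Hunan")])

# A value that would rewrite the WHERE clause if it were concatenated.
hostile = "Guangdong' OR '1'='1"
where, params = compile_equals("province", hostile)
rows = conn.execute(f"SELECT name FROM project WHERE {where}", params).fetchall()
print(rows)  # the hostile string is compared as a plain value, so nothing matches
```

Because the value travels separately from the SQL text, the quote characters never reach the parser as syntax.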

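Problem 4: Auditability

SQL is an end product; it does not show why this particular SQL was produced;
the DSL expresses intent, so the logs show the full chain: what the user asked → what DSL the LLM generated → what SQL it compiled to;
when reviewing bad cases, this chain is indispensable.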

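Problem 5: Cost of evolution

New capabilities added in the compiler (automatic JOINs, default-filter fallbacks, error self-recovery suggestions) benefit every LLM caller transparently;
if the LLM writes SQL directly, every improvement requires a prompt update, and the prompt keeps growing until it becomes unmanageable.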

4.2 The three DSL verbs

cost-query's DSL provides three verbs that cover all query scenarios:

`find` (row-level query)

{
  "cube": "<primary Cube>",
  "dimensions": ["<Cube>.<field>", ...],   // required, at least 1
  "filters": [...],                         // optional
  "order": [...],                           // optional
  "limit": 20,                              // optional, default 20, max 500
  "offset": 0
}

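measures / rankBy / having are not allowed;
measure fields may be listed in dimensions (row-level SELECT, no aggregation);
use: detail queries ("list the projects in Guangdong province", "list the bill items of a given project").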
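The `find` rules above can be checked before any SQL is built. The sketch below is illustrative only: `normalize_find` is a hypothetical name, and rejecting a limit above 500 (rather than clamping it) is an assumption, since the source only states the cap.

```python
FORBIDDEN_IN_FIND = {"measures", "rankBy", "having"}

def normalize_find(dsl: dict) -> dict:
    """Validate a find request and fill in the documented defaults."""
    bad = dsl.keys() & FORBIDDEN_IN_FIND
    if bad:
        raise ValueError(f"find does not allow: {sorted(bad)}")
    if not dsl.get("dimensions"):
        raise ValueError("find requires at least 1 dimension")
    limit = dsl.get("limit", 20)          # default 20
    if not 1 <= limit <= 500:             # assumption: out-of-range is an error
        raise ValueError("limit must be between 1 and 500")
    return {**dsl, "limit": limit, "offset": dsl.get("offset", 0)}

request = normalize_find({"cube": "Project", "dimensions": ["Project.name"]})
print(request["limit"], request["offset"])
```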

`aggregate` (aggregation query, GROUP BY)

{
  "cube": "<primary Cube>",
  "dimensions": [
    "<Cube>.<field>",
    {"member": "<Cube>.<timeField>", "granularity": "month", "alias": "<optional>"}
  ],
  "measures": [
    {"member": "<Cube>.<field>", "agg": "sum|avg|min|max|count|distinctCount", "alias": "<optional>"},
    {"member": "<Cube>.<field>", "agg": "avg", "filterIf": [...]},
    {"alias": "<x>", "calc": "${alias1} / ${alias2}"}
  ],
  "filters": [...],
  "having": [
    {"member": "<declared alias>", "operator": "gte|between|...", "values": [...]}
  ],
  "order": [...],
  "limit": 20,
  "params": {"indicatorType": 1}   // required by routed measures
}

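measures is required (at least 1);
supports extended features: HAVING / time bucketing / calculated measures / filterIf;
use: aggregation queries ("average of an indicator by city", "monthly trend of a project's average transaction price").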
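One way a compiler might expand a calculated measure such as `"${alias1} / ${alias2}"` is plain substitution of each alias by its aggregate expression. This is a sketch under that assumption; `expand_calc` and the column names are hypothetical and not taken from cost-query.

```python
import re

def expand_calc(calc: str, alias_sql: dict[str, str]) -> str:
    """Replace every ${alias} in a calc expression with its SQL aggregate."""
    def substitute(match: re.Match) -> str:
        alias = match.group(1)
        if alias not in alias_sql:
            raise KeyError(f"calc references undeclared alias: {alias}")
        # Parenthesize so operator precedence in the calc is preserved.
        return f"({alias_sql[alias]})"
    return re.sub(r"\$\{(\w+)\}", substitute, calc)

declared = {"total": "SUM(amount)", "area": "SUM(build_area)"}
print(expand_calc("${total} / ${area}", declared))
```

Rejecting undeclared aliases at this point gives the same kind of early, precise error that `having` gets from requiring a declared alias.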

`rank` (Top-N ranking)

{
  "cube": "<primary Cube>",
  "rankBy": {"member": "<Cube>.<field>", "dir": "desc"},
  "limit": 10,
  "dimensions": [...],
  "measures": [...],
  "filters": [...]
}

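rankBy / limit are required;
use: Top-N queries.

cost-query has in fact retired rank (replaced uniformly by aggregate + order + limit), because:

rank and aggregate overlap heavily in semantics;
the LLM often picks the wrong one when choosing between rank and aggregate;
dropping one verb shortens the model's decision tree by one level.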

Advice for your domain: in the MVP stage, implement only the find and aggregate verbs and drop rank. Add it back only when Top-N scenarios become too numerous for aggregate to cover.
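Folding rank into aggregate is mechanical, as the sketch below suggests. The shape of an `order` entry is not spelled out in this chapter, so the `{"member", "dir"}` form used here is an assumption borrowed from `rankBy`, and `rank_to_aggregate` is a hypothetical helper.

```python
def rank_to_aggregate(rank_dsl: dict) -> dict:
    """Rewrite a rank request as aggregate + order + limit."""
    rank_by = rank_dsl["rankBy"]
    dsl = {k: v for k, v in rank_dsl.items() if k != "rankBy"}
    # Assumed order-entry shape: same keys as rankBy.
    dsl["order"] = [{"member": rank_by["member"], "dir": rank_by.get("dir", "desc")}]
    return dsl

top10 = rank_to_aggregate({
    "cube": "Project",
    "rankBy": {"member": "Project.totalCost", "dir": "desc"},
    "limit": 10,
    "dimensions": ["Project.name"],
    "measures": [{"member": "Project.totalCost", "agg": "sum", "alias": "total"}],
})
print(top10["order"], top10["limit"])
```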

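4.3 Top-level DSL field conventions

The DSL top level accepts only 10 valid fields:

Field	Applies to	Description
`cube`	all verbs	primary cube
`dimensions`	find / aggregate / rank	grouping dimensions (or the SELECT columns for find)
`measures`	aggregate / rank	aggregated measures
`filters`	all verbs	WHERE conditions
`order`	all verbs	ORDER BY
`having`	aggregate	HAVING (must reference an alias declared in measures)
`limit`	all verbs	LIMIT
`offset`	all verbs	OFFSET (for pagination)
`params`	aggregate	routing parameters for routed measures
`rankBy`	rank	core field of rank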

Strict whitelist: an unknown key is rejected immediately, with a difflib-based spelling suggestion (threshold 0.7). For example, groupBy → "use dimensions", where → "use filters".
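A minimal sketch of such a whitelist check follows. `difflib.get_close_matches` with `cutoff=0.7` catches near-misses like `dimension`; since `groupBy` and `where` are not textually close to their targets, the sketch assumes a separate synonym table for SQL-style keys. That table and the function name are hypothetical.

```python
import difflib

ALLOWED = ["cube", "dimensions", "measures", "filters", "order",
           "having", "limit", "offset", "params", "rankBy"]
# Hypothetical synonym table for SQL-flavored keys that difflib cannot match.
SYNONYMS = {"groupBy": "dimensions", "where": "filters"}

def check_top_level(dsl: dict) -> list[str]:
    """Return one error message per unknown top-level key."""
    errors = []
    for key in dsl:
        if key in ALLOWED:
            continue
        hint = SYNONYMS.get(key)
        if hint is None:
            close = difflib.get_close_matches(key, ALLOWED, n=1, cutoff=0.7)
            hint = close[0] if close else None
        suffix = f'; use "{hint}"' if hint else "; valid keys: " + ", ".join(ALLOWED)
        errors.append(f'unknown key "{key}"{suffix}')
    return errors

for line in check_top_level({"cube": "Project", "dimension": [], "groupBy": []}):
    print(line)
```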

Filter structure

{"member": "<Cube>.<field>", "operator": "<op>", "values": [...]}

Operator set:

op	Semantics
`equals` / `in`	equal to / in set
`notEquals` / `notIn`	not equal to / not in set
`contains` / `notContains`	contains / does not contain
`startsWith` / `endsWith`	prefix / suffix
`gt` / `gte` / `lt` / `lte`	greater than / greater than or equal / less than / less than or equal
`between`	range (values has length 2)
`set` / `notSet`	not null / null
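The operator table maps fairly directly onto parameterized SQL fragments. The sketch below is one possible mapping, not the cost-query compiler: it assumes the column name has already been resolved from the schema, uses the `%s` placeholder style of common MySQL Python drivers, and does not escape `%` or `_` inside LIKE values.

```python
COMPARE = {"gt": ">", "gte": ">=", "lt": "<", "lte": "<="}
LIKE = {"contains": "%{}%", "startsWith": "{}%", "endsWith": "%{}"}

def compile_filter(column: str, operator: str, values: list) -> tuple[str, list]:
    """Translate one DSL filter into (sql_fragment, bound_params)."""
    if operator in ("equals", "in", "notEquals", "notIn"):
        marks = ", ".join(["%s"] * len(values))
        negate = "NOT " if operator.startswith("not") else ""
        return f"{column} {negate}IN ({marks})", list(values)
    if operator in LIKE or operator == "notContains":
        pattern = LIKE["contains" if operator == "notContains" else operator]
        parts = " OR ".join([f"{column} LIKE %s"] * len(values))
        sql = f"NOT ({parts})" if operator == "notContains" else f"({parts})"
        return sql, [pattern.format(v) for v in values]
    if operator in COMPARE:
        return f"{column} {COMPARE[operator]} %s", [values[0]]
    if operator == "between":
        if len(values) != 2:
            raise ValueError("between needs exactly 2 values")
        return f"{column} BETWEEN %s AND %s", list(values)
    if operator == "set":
        return f"{column} IS NOT NULL", []
    if operator == "notSet":
        return f"{column} IS NULL", []
    raise ValueError(f"unknown operator: {operator}")

print(compile_filter("area_name", "contains", ["深圳"]))
print(compile_filter("build_area", "between", [1000, 5000]))
```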

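4.4 Automatic behaviors: making the DSL painless to write

Pain points of writing raw DSL: the user must spell out the full Cube.field every time, add isEndCost=1 by hand, and express JOIN relationships in the filters.

cost-query introduces three kinds of automatic behavior in the compiler, so that direct DSL queries behave the same as second-level commands.

4.4.1 `inferredFilters`: inferring filters by rule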

ProjectIndicator:
  inferredFilters:
    - field: isJianAn
      type: rule
      rule_name: _infer_isJianAn
      hint: "Inferred isJianAn={value}"

The implementation of _infer_isJianAn in _builder.py:

def _infer_isJianAn(filters, dimensions, order_by) -> int | None:
    """
    productTypeName referenced anywhere → isJianAn=1 (product-type level)
    not referenced at all → isJianAn=0 (project level)
    isJianAn passed explicitly → skip inference (the user's value wins)
    """
    explicit = _find_explicit_filter(filters, "isJianAn")
    if explicit is not None:
        return None  # do not override the user's explicit value
    refs = _collect_refs(filters, dimensions, order_by)
    if "productTypeName" in refs:
        return 1
    return 0

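User experience:

# The user does not need to write isJianAn
# ("住宅" = residential; "建面单方" = cost per square meter of floor area)
cost-query agg-project-indicator --params '{
  "projectName": "X",
  "productTypeName": "住宅",
  "measureGroups": [{"indicatorType": "建面单方"}]
}'
# Compile-time hint:
# 💡 Inferred isJianAn=1 (filters reference productTypeName)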

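4.4.2 `defaultFilters`: default filtering with rule-based release

The semantics are not a blind fallback but a default filter that a rule can release:

Default: even when not specified, statistics are computed at leaf level (isEndCost=1), which keeps parent and child items from being double-counted.
Release: once the user expands "by cost item" (a field listed in unless appears in filters or dimensions), this default automatically stops applying. At that point the user wants to see every level of the cost-item hierarchy, and pinning the query to leaf level would answer the wrong question.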

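ProjectIndicator:
  defaultFilters:
    - field: isEndCost
      value: 1
      unless: [bzItemName, bzItemCode]   # referenced by a filter or dimension → release the leaf-level default
      hint: "Default isEndCost=1 (avoids mixing parent and child cost items)"

Implementation key: the match check must go through _collect_refs (the same helper as in §4.4.1). A dimension reference counts as a reference too; looking only at filters is not enough, otherwise the default is still forced when grouping by cost item, which means it was never released.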

def apply_default_filters(filters, dimensions, order_by, cube_schema):
    # filter / dimension / order entries all count as "references"
    refs = _collect_refs(filters, dimensions, order_by)
    for default in cube_schema.get("defaultFilters", []):
        field = default["field"]
        unless = default.get("unless", [])
        if field in refs:
            continue  # the user already queries this field explicitly; do not override
        if any(u in refs for u in unless):
            continue  # expanded by cost item or similar → release the leaf-level default
        filters.append({"member": field, "operator": "equals", "values": [default["value"]]})
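To see the default and its release side by side, the sketch below repeats the function together with a compact stand-in for `_collect_refs` (assumed to collect bare field names, as in the §4.4.1 discussion) and runs it on two requests. The stand-in helper and the sample requests are illustrative assumptions.

```python
def _collect_refs(filters, dimensions, order_by) -> set[str]:
    """Stand-in helper: bare field names referenced anywhere in the request."""
    def field_of(item):
        member = item.get("member", "") if isinstance(item, dict) else item
        return member.split(".")[-1]
    return {field_of(i) for g in (filters, dimensions, order_by) for i in (g or [])}

def apply_default_filters(filters, dimensions, order_by, cube_schema):
    refs = _collect_refs(filters, dimensions, order_by)
    for default in cube_schema.get("defaultFilters", []):
        field = default["field"]
        if field in refs or any(u in refs for u in default.get("unless", [])):
            continue
        filters.append({"member": field, "operator": "equals",
                        "values": [default["value"]]})

schema = {"defaultFilters": [
    {"field": "isEndCost", "value": 1, "unless": ["bzItemName", "bzItemCode"]}]}

plain = []      # grouped by project: the leaf-level default applies
apply_default_filters(plain, ["ProjectIndicator.projectName"], [], schema)

by_item = []    # grouped by cost item: the default is released
apply_default_filters(by_item, ["ProjectIndicator.bzItemName"], [], schema)

print(plain)
print(by_item)
```

The second call leaves the filter list empty precisely because the dimension reference is counted; a filters-only check would have appended isEndCost=1 here.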

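4.4.3 `friendlyAlias`: business-friendly short names

BqUnitPrice:
  friendlyAlias:
    cityName: "City.areaName"       # automatic JOIN to City
    buildArea: "Project.buildArea"  # automatic JOIN to Project
    keyword: "{self}.keywords"      # multi-column OR contains

The user can write (the city filter value 深圳 is Shenzhen):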

{
  "cube": "BqUnitPrice",
  "filters": [
    {"member": "cityName", "operator": "contains", "values": ["深圳"]},
    {"member": "buildArea", "operator": "gte", "values": [100000]}
  ],
  "measures": [...]
}

The compiler automatically expands this to:

{
  "cube": "BqUnitPrice",
  "filters": [
    {"member": "City.areaName", "operator": "contains", "values": ["深圳"]},
    {"member": "Project.buildArea", "operator": "gte", "values": [100000]}
  ],
  ...
}
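# plus automatic LEFT JOIN City and LEFT JOIN Project

Implementation details:

friendlyAlias resolution triggers when member contains no `.`;
a member containing `.` is already in Cube.field form and is used as is;
each fact cube defines its own friendlyAlias, independent of the others;
after materialization (in 2026-05, cityName/provinceName were materialized directly into the fact tables), a friendlyAlias can be simplified to reference the cube itself, saving one JOIN.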
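The expansion rules above can be sketched in a few lines. This is an illustration, not the cost-query implementation: `expand_friendly_alias` is a hypothetical name, the alias table is inlined instead of loaded from YAML, and the `{self}` multi-column form used by `keyword` is left out.

```python
FRIENDLY_ALIAS = {
    "BqUnitPrice": {"cityName": "City.areaName", "buildArea": "Project.buildArea"},
}

def expand_friendly_alias(dsl: dict) -> tuple[dict, list[str]]:
    """Expand bare filter members and report which cubes need a JOIN."""
    cube = dsl["cube"]
    aliases = FRIENDLY_ALIAS.get(cube, {})
    joins, filters = [], []
    for f in dsl.get("filters", []):
        member = f["member"]
        if "." not in member and member in aliases:   # only bare names are resolved
            member = aliases[member]
            target = member.split(".")[0]
            if target != cube and target not in joins:
                joins.append(target)
        filters.append({**f, "member": member})
    return {**dsl, "filters": filters}, joins

expanded, joins = expand_friendly_alias({
    "cube": "BqUnitPrice",
    "filters": [
        {"member": "cityName", "operator": "contains", "values": ["深圳"]},
        {"member": "buildArea", "operator": "gte", "values": [100000]},
    ],
})
print([f["member"] for f in expanded["filters"]], joins)
```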

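4.4.4 Design philosophy of automatic behaviors

Why not teach the LLM in SKILL.md to "remember to add isEndCost=1"?

Prompt length is limited: each extra rule adds 100-200 tokens to the prompt, which is unsustainable in the long run;
LLMs forget: with a long context, a rephrased request, or pushback from the user, the LLM skips these "rule reminders";
business rules should be deterministic: anything code can decide should not be left to a probabilistic model.

Boundaries of automatic behaviors:

what can be inferred at compile time (from existing filter / dimension references) → use inferredFilters;
static fixed defaults ("query leaf-level cost items by default") → use defaultFilters;
simple field-name mappings (cityName → City.areaName) → use friendlyAlias;
complex business rules (requiring a database lookup to decide) → do not make them automatic; document them explicitly in SKILL.md.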

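4.5 Error self-recovery

Three error levels

1. Syntax errors (invalid top-level DSL field / malformed JSON)
   → the error message lists the 10 valid top-level fields plus a spelling suggestion

2. Semantic errors (cube does not exist / field does not exist / type mismatch / verb conflicts with fields)
   → the error message lists the fields available on that cube plus hints for same-named fields on other cubes

3. Runtime errors (SQL execution failure / database connection error)
   → the error message is passed through verbatim plus a summary of the current SQL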

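Core elements of a self-recoverable error

Every error must:

point clearly at the fault: which field, which value, which layer;
include a candidate list: let the LLM pick its next attempt from the list instead of guessing;
include a suggested action: tell the LLM explicitly what to do next.

cost-query example (unknown-field error):

DSL error: unknown field "BqUnitPrice.cityName"

  ✗ BqUnitPrice has no cityName field
  💡 You may have meant:
     - City.cityName (via join; a field on the City cube itself)
     - or the bare field name cityName (friendlyAlias expands it to City.areaName)
  💡 Dimensions available on BqUnitPrice itself:
     name, bqName, code, projectCostOwnerNames, businessStageEnum, ...
  💡 Suggestion: write cityName directly (no cube prefix) and let friendlyAlias resolve it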

After seeing this error, the LLM writes cityName instead of BqUnitPrice.cityName in its next DSL.
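An error of this shape can be assembled from the schema itself. The builder below is a hypothetical sketch of the three required elements (clear pointer, candidate list, suggested action); the function name, the schema layout, and the exact wording are assumptions, not cost-query's output format.

```python
def unknown_field_error(cube: str, field: str, schema: dict[str, list[str]],
                        friendly_alias: dict[str, str]) -> str:
    """Build a self-recoverable error for an unknown Cube.field reference."""
    lines = [f'DSL error: unknown field "{cube}.{field}"', "",
             f"  ✗ {cube} has no {field} field"]
    # Candidates: same-named fields on other cubes, then the friendlyAlias route.
    candidates = [f"{other}.{field}" for other, fields in schema.items()
                  if other != cube and field in fields]
    if field in friendly_alias:
        candidates.append(
            f"bare name {field} (friendlyAlias expands it to {friendly_alias[field]})")
    if candidates:
        lines.append("  💡 You may have meant:")
        lines += [f"     - {c}" for c in candidates]
    lines.append(f"  💡 Dimensions available on {cube}: " + ", ".join(schema[cube]))
    if field in friendly_alias:
        lines.append(f"  💡 Suggestion: write {field} without the cube prefix")
    return "\n".join(lines)

schema = {"BqUnitPrice": ["name", "bqName", "code"],
          "City": ["cityName", "areaName"]}
message = unknown_field_error("BqUnitPrice", "cityName", schema,
                              {"cityName": "City.areaName"})
print(message)
```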

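Hard constraints on self-recovery

In dsl-spec §7, cost-query sets hard constraints on how the LLM recovers:

When a DSL error triggers self-recovery, these three steps are mandatory:

Read in full every line of the error text that starts with [number] or ✗; each line is a separate error, and none may be skipped;

Open the matching subsection of dsl-spec §7.x, routed by the keyword in the first line of the error, and read every lookup table in that subsection;

Fix all errors in one pass before retrying; fixing only the most conspicuous one is not allowed.

Self-recovery allows only 1 retry. If the retry still fails, stop probing and ask the user about the data or definition obstacle.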

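Why so strict: given 5 reported errors, an LLM's instinct is to "fix one and try again", which leads to 4-5 serial trial-and-error retries, wastes time, and easily ends in a loop. Forcing a single complete fix lets one retry settle the query.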
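The one-retry rule is easy to enforce in a harness around the compiler. The sketch below is hypothetical throughout (`DslError`, `NeedUserInput`, `run_with_self_recovery`, and the toy compiler are all illustrative names); it contrasts a fixer that repairs every reported problem with one that repairs only the first.

```python
class DslError(Exception):
    def __init__(self, problems: list[str]):
        super().__init__("; ".join(problems))
        self.problems = problems

class NeedUserInput(Exception):
    """Raised when the single retry is used up; the caller should ask the user."""

def run_with_self_recovery(dsl, compile_fn, fix_fn, max_retries=1):
    for attempt in range(max_retries + 1):
        try:
            return compile_fn(dsl)
        except DslError as err:
            if attempt == max_retries:
                raise NeedUserInput(str(err)) from err
            dsl = fix_fn(dsl, err.problems)   # expected to fix ALL problems at once

RENAMES = {"groupBy": "dimensions", "where": "filters"}

def toy_compile(dsl):
    problems = [f"unknown key {k}" for k in dsl if k in RENAMES]
    if problems:
        raise DslError(problems)
    return "compiled"

def fix_all(dsl, problems):
    return {RENAMES.get(k, k): v for k, v in dsl.items()}

def fix_first_only(dsl, problems):
    first = next(k for k in dsl if k in RENAMES)
    return {(RENAMES[k] if k == first else k): v for k, v in dsl.items()}

broken = {"cube": "Project", "groupBy": [], "where": []}
print(run_with_self_recovery(broken, toy_compile, fix_all))
```

With `fix_first_only` the same call raises `NeedUserInput`, because the second error is still present when the only retry runs.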

4.6 Core architecture of query.py

Entry point (argparse subcommands)
  ├─ Level 1: find / aggregate / rank
  │   └─ load_dsl_input → _compile_and_execute
  │
  ├─ Level 2: N <verb>-<entity> subcommands
  │   └─ execute_template
  │       ├─ load YAML (_registry)
  │       ├─ build_dsl (_builder: filters / measureGroups / automatic-behavior expansion)
  │       └─ _compile_and_execute
  │
  ├─ batch: cost-query batch → runs each subquery through the two paths above
  │
  └─ Debugging: list / <cmd> --info / <cmd> --dry-run / <cmd> --explain / --debug

The _compile_and_execute flow:

parse DSL
  → load schema
  → verb dispatch (FindCompiler / AggregateCompiler / RankCompiler)
  → validate (field exists / type matches / verb consistent with fields)
  → applyInferredFilters
  → applyDefaultFilters
  → expandFriendlyAlias
  → buildSQL
  → cli.execute
  → format_output（Markdown / JSON）

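Separating compiler and executor

query.py (compiler) → cli.py (executor)

query.py only compiles DSL → SQL and knows nothing about MySQL connections;
cli.py takes the SQL, connects to the database, executes it, and normalizes the results;
benefits of decoupling the two:
- unit-testing the compiler does not require a running MySQL;
- switching database engines (MySQL → ClickHouse → Doris) only touches cli.py;
- cost-cli "SELECT ..." can bypass the compiler for emergency direct queries.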
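The testing benefit can be shown with a small sketch. Everything here is hypothetical: `compile_find` is a toy compiler that uses the cube name as the table name, and `RecordingExecutor` is a test double standing in for cli.py, so no database is involved.

```python
from typing import Protocol

class Executor(Protocol):
    def execute(self, sql: str, params: list) -> list[tuple]: ...

def compile_find(dsl: dict) -> tuple[str, list]:
    """Toy compiler: pure function from DSL to (sql, params), no I/O."""
    columns = ", ".join(d.split(".")[-1] for d in dsl["dimensions"])
    # Assumption for the sketch: cube name doubles as table name.
    return f"SELECT {columns} FROM {dsl['cube']} LIMIT %s", [dsl.get("limit", 20)]

class RecordingExecutor:
    """Test double: records what it was asked to run instead of hitting a database."""
    def __init__(self):
        self.calls = []
    def execute(self, sql: str, params: list) -> list[tuple]:
        self.calls.append((sql, params))
        return []

def run(dsl: dict, executor: Executor) -> list[tuple]:
    sql, params = compile_find(dsl)
    return executor.execute(sql, params)

fake = RecordingExecutor()
run({"cube": "Project", "dimensions": ["Project.name", "Project.buildArea"]}, fake)
print(fake.calls)
```

Swapping the engine then means providing another object with the same `execute` method, while the compiler and its tests stay unchanged.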

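4.7 Concrete steps to fork cost-query

Following the "reference implementation" strategy recommended in this guide, fork the cost-query code as is, then change the schema and commands.

Step 1: Copy the directory

# inside your project repo
cp -r path/to/ontology-model/.claude/skills/cost-query .claude/skills/<your-skill>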

Step 2: Remove cost-query-specific content

cd .claude/skills/<your-skill>

# delete the business-specific schema / commands / references
rm scripts/schema.yaml
rm scripts/commands/*.yaml
rm -rf scripts/data      # cost-query's card data
rm references/schema/*
rm references/cost.ttl   # your domain has its own ontology

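Step 3: Keep the generic code

# Python code to keep (the core compiler)
ls scripts/
# query.py          ← keep
# _builder.py       ← keep (adjust the automatic-behavior rules as needed)
# _registry.py      ← keep
# cli.py            ← change the database connection string
# fetch_schema.py   ← keep
# commands/_ext/    ← keep (batch.py / find_dimension.py)

# documents to keep (same structure, rewritten content)
ls references/
# dsl-spec.md       ← keep (the DSL protocol is the same across domains)
# query-guide.md    ← rewrite (the pattern handbook is specific to your domain)
# cli-mapping.md    ← rewrite
# terminology.md    ← rewrite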

Step 4: Write schema.yaml

Start from scaffold/scripts-template/schema.yaml.template and define the cubes for your domain.

Step 5: Run fetch-schema to generate the quick-reference version

python scripts/fetch_schema.py
ls references/schema/   # should show the generated _index.yaml plus one file per fact cube

Step 6: Write the first second-level command

Use scaffold/scripts-template/commands/find-entity.yaml.template as a reference.

Step 7: Run hello world

cost-query <your-skill>-cli aggregate --dsl '{
  "cube": "<your-fact-cube>",
  "dimensions": ["<some-field>"],
  "measures": [{"member": "<some-measure>", "agg": "count", "alias": "cnt"}],
  "limit": 10
}'

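Success means: the DSL compiles, the database connection works, and results come back.

4.8 What to read next

To see how second-level commands are packaged → read Part 5, Second-Level Command System;
to see the full field table of the DSL protocol → read dsl-spec.md in the ontology-model repo;
to go straight to the scaffold → jump to Part 9, Scaffold and Starter Templates.

End of Part 4. After reading it you should be able to answer: why the LLM should not write SQL directly, what each of the three DSL verbs is for, how the automatic behaviors are implemented, how error self-recovery is designed, and what the concrete steps for forking cost-query are.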