## Grammar: constrained tool_call generation

The Metnos PLANNER picks which executor to call at each step.
It’s a local LLM (Gemma 4 26B). When the query is complex — like “propose
3 morning slots next week for an appointment with Silvia, after I pick one, email
me the choice” — the LLM sometimes enters a reasoning loop and ends
up emitting prose instead of a structured tool_call. The system
doesn’t know what to run. Game over.
This is called a "thinking loop". Symptoms observed:

- `{"name":"get_inputs","arguments":{"kind":"choice","from_step":1,...}}`: the name of one tool with the arguments of another.
- `request_new_executor` (i.e. "synthesize me a new verb") even when there's an obvious canonical executor already.
- `request_location_from_user` for the query "create folder /tmp/x" (no, I don't need your GPS location).

Bench before this decision: 50–70% convergence on 10 representative queries. The propose+notify pipeline from ADR 0129 was failing. Unacceptable for a personal assistant.
llama.cpp (the engine that runs Gemma) supports a feature called GBNF (GGML BNF, its dialect of Backus-Naur Form). It's a language for writing formal grammars, like:
```
root ::= "{" ws "\"name\":" name ws "}"
name ::= "\"get_now\"" | "\"find_files\"" | "\"send_messages\""
ws   ::= [ \t\n]*
```
When you pass a grammar to the server, the LLM physically cannot emit tokens that violate it. Every candidate token is filtered against the grammar before being chosen. It’s a hard rail, not a suggestion.
The ADR 0133 idea: at each PLANNER step, we generate a specific grammar from the step’s tools pool, and we pass it to the server. The model is free to pick which tool to call and which arguments to pass, but ONLY within what the schemas declare valid.
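As a concrete sketch of that flow: the snippet below generates the grammar for the current pool and passes it to llama-server (`grammar` is a native llama.cpp `/completion` parameter; the exact `generate_tool_grammar` signature, prompt handling, and server URL here are illustrative assumptions, and the real call path goes through `LlamaCppProvider.chat_with_tools`):

```python
# Sketch only: signature of generate_tool_grammar, prompt layout, and server URL
# are assumptions; the real wiring lives in runtime/llm_provider.py.
import json
import requests

from runtime.tool_grammar import generate_tool_grammar

def plan_step(prompt: str, step_pool: list[dict]) -> dict:
    # 1. Build a GBNF grammar restricted to this step's tool pool.
    grammar = generate_tool_grammar(step_pool)

    # 2. Ask llama-server for a completion constrained by that grammar:
    #    tokens that violate the grammar are masked out at sampling time.
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={"prompt": prompt, "grammar": grammar, "n_predict": 512},
        timeout=120,
    )
    resp.raise_for_status()

    # 3. The constrained output always parses as a {"name":..., "arguments":...} object.
    return json.loads(resp.json()["content"])
```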
### Grammar generator (`runtime/tool_grammar.py::generate_tool_grammar`)

Takes the pool of tools and returns a GBNF string. The grammar core is a discriminated union:
```
root ::= "{" ws "\"name\":" (pairGetInputs | pairSendMessages | ...) ws "}"
pairGetInputs    ::= "\"get_inputs\"" sep "\"arguments\":" colon (argsGetInputs)
pairSendMessages ::= "\"send_messages\"" sep "\"arguments\":" colon (argsSendMessages)
...
```
The key word is *discriminated*: if the LLM picks the `pairGetInputs` branch, the args must conform to `argsGetInputs`. It cannot mix the name of one tool with the args of another, which was exactly the live bug that blocked us for half a day.
For args, the generator reads the JSON Schema declared in the executor’s TOML manifest and translates it into GBNF rules. For nested cases (array of object, object inside object) the generator is recursive:
```
argsGetInputs ::= "{" ws propGetInputsTitle sep propGetInputsDialog
                  (sep (propGetInputsFromStep | ...))* ws "}"
propGetInputsDialog ::= "\"dialog\":" colon (
    "[" ws (GetInputsObjD1I0 (sep GetInputsObjD1I0)*)? ws "]"
)
GetInputsObjD1I0 ::= "{" ws propGetInputsObjD1I0Var sep propGetInputsObjD1I0Prompt sep
                     propGetInputsObjD1I0Schema (sep (...))* ws "}"
propGetInputsObjD1I0Schema ::= "\"schema\":" colon (GetInputsObjD3I2)
GetInputsObjD3I2 ::= "{" ws propGetInputsObjD3I2Kind (sep (...))* ws "}"
propGetInputsObjD3I2Kind ::= "\"kind\":" colon (
    "\"text\"" | "\"credentials\"" | "\"yes_no\"" | "\"choice\"" |
    "\"choice_with_preview\"" | "\"multi_choice\"" | "\"number\"" |
    "\"date\"" | "\"file_path\"" | "\"location\""
)
```
Real example for `get_inputs`: the schema declares `dialog` as an array of objects, each `{var, prompt, schema}`, where `schema.kind` is a 10-value enum. All of it ends up in the grammar: the model cannot emit `kind: "banana"` and cannot forget `var`.
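For reference, this is roughly the shape the manifest declares for `dialog`, written here as a Python dict (a sketch reconstructed from the description above; the surrounding manifest fields and exact layout are assumptions):

```python
# Sketch of the dialog schema as the generator would see it (reconstructed from
# the description above; the rest of the get_inputs manifest is omitted).
DIALOG_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["var", "prompt", "schema"],
        "properties": {
            "var": {"type": "string"},     # name the answer is bound to
            "prompt": {"type": "string"},  # question shown to the user
            "schema": {
                "type": "object",
                "required": ["kind"],
                "properties": {
                    # the 10-value enum that becomes a GBNF alternation
                    "kind": {"enum": [
                        "text", "credentials", "yes_no", "choice",
                        "choice_with_preview", "multi_choice", "number",
                        "date", "file_path", "location",
                    ]},
                },
            },
        },
    },
}
```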
Limit: if a schema declares `items: {type: object}` without `properties`, the grammar falls back to a generic `jsonObject` and the constraint is lost. So when you add an executor with nested structures, you must declare them in the manifest: the grammar only protects you if the schema is complete.
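A minimal sketch of that recursion, assuming `ws`, `sep`, `string`, `number`, `boolean` and `jsonObject` exist as primitive rules, and glossing over required/optional ordering and the exact rule-naming scheme used by the real generator:

```python
# Illustrative sketch of the schema -> GBNF translation; not the real generator.
_MAX_RECURSION_DEPTH = 4  # same cap named below

def _camel(name: str, key: str) -> str:
    # GBNF rule names must not contain underscores (llama.cpp quirk, see below).
    return name + "".join(w.capitalize() for w in key.split("_") if w)

def schema_to_gbnf(schema: dict, name: str, rules: dict, depth: int = 0) -> str:
    """Return a GBNF expression for `schema`, adding named sub-rules to `rules`."""
    if depth > _MAX_RECURSION_DEPTH:
        return "jsonObject"  # past the cap the constraint is dropped

    if "enum" in schema:
        # enum -> alternation of quoted literals, e.g. "\"text\"" | "\"date\""
        return " | ".join(f'"\\"{v}\\""' for v in schema["enum"])

    t = schema.get("type")
    if t == "object" and schema.get("properties"):
        parts = []
        for key, sub in schema["properties"].items():
            expr = schema_to_gbnf(sub, _camel(name, key), rules, depth + 1)
            parts.append(f'"\\"{key}\\":" ws ({expr})')
        rule = name + "Obj"
        rules[rule] = '"{" ws ' + " sep ".join(parts) + ' ws "}"'
        return rule
    if t == "array":
        item = schema_to_gbnf(schema.get("items", {}), name + "Item", rules, depth + 1)
        return f'"[" ws ({item} (sep {item})*)? ws "]"'
    if t == "string":
        return "string"
    if t in ("number", "integer"):
        return "number"
    if t == "boolean":
        return "boolean"
    return "jsonObject"  # no usable type info: generic fallback (constraint lost)
```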
### Pool filter (`runtime/tool_grammar.py::filter_pool_for_grammar`)

Even with a perfect grammar there was still a problem: some tools are escape hatches, always injected into the pool for special use, and with the grammar active the model picks them inappropriately.
| Tool | When to exclude |
|---|---|
| `request_new_executor` | if there are ≥3 canonical tools (it's a fallback, not a default) |
| `request_location_from_user` | if the query does NOT contain proximity markers ("near me", "here", "nearby") |
| `undo_last_turn` | if the query does NOT contain undo markers ("annulla", "rollback") |
| `*_google_workspace` | if the query does NOT mention "google", "drive", "gmail"... |
The filter uses a word-boundary regex (`\bmarker\b`) to avoid false positives like "qua" matching inside "qualcosa", which cost us another bench round. The logic is a pure function: 12 unit tests, deterministic (§7.9).
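A sketch of the mechanism (the marker table here is a truncated, illustrative copy of the rules in the table above; the real one lives in `runtime/tool_grammar.py`):

```python
# Sketch of the pool filter: only the mechanism is shown, the real marker
# tables and tool names come from runtime/tool_grammar.py.
import re

_QUERY_MARKERS = {
    "request_location_from_user": ("near me", "here", "nearby"),
    "undo_last_turn": ("annulla", "rollback"),
}

def filter_pool_for_grammar(pool: list[dict], query: str) -> list[dict]:
    """Drop escape-hatch tools whose trigger markers don't appear in the query."""
    q = query.lower()
    kept = []
    for tool in pool:
        markers = _QUERY_MARKERS.get(tool["name"])
        if markers is not None:
            # Word-boundary match: avoids false positives like "qua" in "qualcosa".
            if not any(re.search(rf"\b{re.escape(m)}\b", q) for m in markers):
                continue
        kept.append(tool)

    # request_new_executor is a fallback, not a default: drop it when there are
    # already >= 3 canonical tools to choose from.
    canonical = [t for t in kept if t["name"] != "request_new_executor"]
    if len(canonical) >= 3:
        kept = canonical
    return kept
```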
### Post-decode validator (`runtime/tool_grammar.py::validate_tool_call`)

The grammar is tight, but not perfect: some constraints (e.g. mutual exclusivity between two properties) don't express naturally in GBNF. For those there's a post-decode validator:
```python
tool_call = {"name": "get_inputs", "arguments": {"display_template": "x"}}
ok, err = validate_tool_call(tool_call, pool)
# ok=False, err="missing required args ['title', 'dialog'] for tool 'get_inputs'"
```
If the validator says no, the executor is NOT called. The error is injected into the LLM's history as a `role=tool` message, and on the next step the model sees "you're missing title and dialog" and corrects. A `consecutive_blocked` counter prevents infinite loops.
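Roughly how the validator sits in the loop (message shapes, the provider interface, and the blocked threshold are assumptions; only `validate_tool_call` is the real function):

```python
# Sketch of the validate-then-retry loop; history handling and the threshold
# value are illustrative, not the real PLANNER implementation.
from runtime.tool_grammar import validate_tool_call

_MAX_CONSECUTIVE_BLOCKED = 3  # assumed value

def run_step(llm, history: list[dict], pool: list[dict]) -> dict:
    consecutive_blocked = 0
    while True:
        tool_call = llm.next_tool_call(history, pool)  # grammar-constrained decode
        ok, err = validate_tool_call(tool_call, pool)
        if ok:
            return tool_call                           # safe to dispatch the executor
        consecutive_blocked += 1
        if consecutive_blocked >= _MAX_CONSECUTIVE_BLOCKED:
            raise RuntimeError(f"planner blocked: {err}")
        # Feed the error back so the model can self-correct on the next step.
        history.append({"role": "tool", "content": err})
```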
What the grammar does and doesn't guarantee:

- The tool `name` and `schema.kind` are grammar-locked: `kind` has 10 values, the model picks ONE of those 10.
- `title: ""`: it's a string as far as the grammar is concerned. The executor checks the rest.
- `_MAX_RECURSION_DEPTH = 4`: beyond that, fallback to the generic `jsonObject`. Rare cases, to monitor.

### llama.cpp quirks

llama.cpp is live software, and its GBNF implementation has a couple of quirks. All were discovered during convergence tests in the session.
- Build `b540-5755a100c` silently ignores rule names that contain underscores, so `args_get_now ::= ...` doesn't apply. Workaround: rule names are always camelCase (see the sketch after this list). Test: `test_no_underscore_in_rule_names`.
- Primitive rules are tracked through a dependency map (`_PRIMITIVE_DEPS`) so the generator emits ONLY the primitives actually referenced. Test: `test_primitives_only_referenced_emitted`.
- Sending `grammar` + `tools` together fails with HTTP 400: llama-server rejects payloads with both fields. Workaround: in grammar mode we bypass `tools` and directly generate the JSON `{"name":..., "arguments":...}`. Gemma's chat template that emits `<|tool_call>` markers is also bypassed.
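For the underscore quirk referenced in the first item, the workaround is just a naming convention; here is a sketch of the kind of helper that enforces it (the helper name is illustrative; the covering test is `test_no_underscore_in_rule_names`):

```python
# Sketch of the rule-name workaround for the underscore quirk; the real
# implementation lives in runtime/tool_grammar.py.
def rule_name(*parts: str) -> str:
    """Join name fragments into a camelCase GBNF rule name with no underscores."""
    words = [w for p in parts for w in p.split("_") if w]
    head, *tail = words
    return head.lower() + "".join(w.capitalize() for w in tail)

# rule_name("args", "get_now")               -> "argsGetNow"
# rule_name("prop", "get_inputs", "dialog")  -> "propGetInputsDialog"
```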
These are stable workarounds, but they ARE workarounds. If llama.cpp solves them, we can simplify. They’re covered by tests.
Bench n=10 representative queries (simple/medium/complex/ambiguous), n=1 run per query:
| Config | Convergence | first_tool_match | Mean latency |
|---|---|---|---|
| baseline (no grammar) | 50% | n/a | 70s |
| grammar v1 (B2 base) | 80% | 89% | 25s |
| grammar v4 (discriminated) | 80% | 100% | 35s |
| grammar v6 (pool filter) | 100% | 100% | 42s |
| grammar v7 (extracted refactor) | 100% | 100% | 41s |
Three things worth noting:

- Queries now end on `final_answer` instead of picking an escape-hatch at random.

Grammar generation is deterministic and fast: for a typical pool of 8 tools it produces ~80 rules in ~2 ms. The grammar "costs" the server a bit during generation (filtering candidate tokens), but the net gain is huge given what it eliminates.
Adding a new provider qualifier (e.g. `_notion` or `_telegram_bot`) costs one line:
```python
_PROVIDER_SUFFIX_MARKERS = {
    "_google_workspace": ("google", "drive", "gmail", ...),
    "_notion": ("notion", "notion.so", "page"),  # ← added
}
```
Adding a new executor with complex nested schema costs zero: the generator reads the JSON Schema declared in the manifest and produces the grammar automatically, recursively, up to the depth cap.
43 unit tests in `runtime/tests/test_tool_grammar.py`, all green. Coverage includes:

- complexity scoring (`args_complexity`, `is_complex`)
- nested-schema generation (`get_inputs.dialog` emits a sub-rule)

End-to-end bench: `runtime/bench_grammar.py`:
```bash
METNOS_GRAMMAR=1 python3 runtime/bench_grammar.py --label grammar_v8
# 10 queries × 1 run, ~7 minutes total
# output: /tmp/bench_grammar_grammar_v8_<ts>.json
# prints aggregate table + per-complexity + row-by-row detail
```
To enable grammar mode in dev:
```bash
sudo mkdir -p /etc/systemd/system/metnos-http.service.d
printf '[Service]\nEnvironment=METNOS_GRAMMAR=1\n' | sudo tee /etc/systemd/system/metnos-http.service.d/grammar.conf
sudo systemctl daemon-reload && sudo systemctl restart metnos-http.service
```
Constrained generation isn’t a Metnos invention. Here’s a quick map of who does what in the LLM/agent ecosystem — it helps locate the ADR 0133 choice and what we gained (or gave up) vs alternatives.
| Platform | Mechanism | Hard / soft | Schema source | Notes |
|---|---|---|---|---|
| OpenAI function calling / tools | Model fine-tuned on tool_call. Server emits JSON conforming to the declared function's JSON Schema. `tool_choice="required"` forces a tool call. | Mixed: hard that a tool IS called; soft on internal args structure (model-trained, not hard-rail). | JSON Schema (OpenAPI subset) | Excellent fluency, but black-box: no control if the model invents a field not declared. Network latency + $ cost. |
| Anthropic Claude tool use | Same pattern as OpenAI, with tools defined in input. Sonnet/Opus follow schema tightly, but no formal guarantee. | Soft (model-trained) | JSON Schema | Superior robustness vs local baseline, but same non-locality and cost concerns. Sonnet 4.6 ~$0.30/turn vs local Gemma ~$0.001. |
| LangChain / LlamaIndex | Orchestrator: adapts function calling from various providers (OpenAI, Anthropic, ...) under one interface. Pydantic schema input. | Inherits from underlying provider (soft) | Pydantic / JSON Schema | Abstraction layer, doesn’t add robustness. Useful for portability; doesn’t solve thinking-loop. |
| llama.cpp GBNF (Metnos choice) | Formal Backus-Naur grammar. Filters candidate tokens at every decoding step. Native in the server. | Hard (sampler-level) | Hand-written or generated from JSON Schema | Maximum control, zero external dependencies, deterministic. Server-bug quirks (see §5). Local, no $ costs. |
| Outlines / Instructor | Python library that builds an FSM (finite state machine) from JSON Schema/Pydantic and applies it at decoding (token-level mask). | Hard (sampler-level) | Pydantic / JSON Schema | Same paradigm as GBNF but library-side. New dependency, limited model support (HuggingFace + vllm). For Metnos = extra weight, same result. |
| xgrammar (vllm) | C++ library that compiles JSON Schema into token masks. Integrated in vllm, coming to other serving stacks. | Hard (sampler-level) | JSON Schema | Equivalent to GBNF, superior performance in some scenarios. Not yet in llama.cpp stable. |
| JSON Mode (OpenAI/Anthropic) | Guarantees syntactically valid JSON, no schema constraint. | Hard on JSON syntax, soft on schema | None | Useful as fallback, insufficient for structured tool_call. |
The ADR 0133 GBNF choice is aligned with the technical frontier: modern serving stacks are all converging on sampler-level constraints (Outlines, xgrammar, llama.cpp GBNF). It’s the state of the art pattern for local tool calling.
The difference vs OpenAI/Anthropic is that we write the grammar instead of delegating to a fine-tuned model. The trade-off is the one summarized in the table: hard, local, zero-cost guarantees on one side; grammar-generator maintenance and llama-server quirks on the other.
Triggers for revisiting this decision:

| Trigger | Direction |
|---|---|
| llama.cpp introduces native xgrammar | Switch to xgrammar (same semantics, better perf) |
| Multi-model cross-provider (e.g. switch Gemma ↔ Sonnet) | Add Outlines adapter for non-llama.cpp providers |
| Query corpus convergence drops below 95% | Investigate if it’s the model or filter; possibly local fine-tune |
| Unsolvable llama-server bugs | Fork generator to 2 dialects (GBNF + JSON Schema FSM) |
Files:

- `runtime/tool_grammar.py`
- `runtime/llm_provider.py::LlamaCppProvider.chat_with_tools`
- `runtime/bench_grammar.py`
- `runtime/tests/test_tool_grammar.py`