## Grammar: constrained tool_call generation

The Metnos PLANNER picks which executor to call at each step.
It’s a local LLM (Gemma 4 26B). When the query is complex — like “propose
3 morning slots next week for an appointment with Silvia, after I pick one, email
me the choice” — the LLM sometimes enters a reasoning loop and ends
up emitting prose instead of a structured tool_call. The system
doesn’t know what to run. Game over.
This is called a "thinking loop". Symptoms observed:

- `{"name":"get_inputs","arguments":{"kind":"choice","from_step":1,...}}`: the name of one tool with the arguments of another.
- `request_new_executor` (i.e. "synthesize me a new verb") even when there's an obvious canonical executor already.
- `request_location_from_user` for the query "create folder /tmp/x" (no, I don't need your GPS location).

Bench before this decision: 50–70% convergence on 10 representative queries. The propose+notify pipeline from ADR 0129 was failing. Unacceptable for a personal assistant.
llama.cpp (the engine that runs Gemma) supports a feature called GBNF (GGML BNF, its dialect of Backus-Naur Form). It's a language for writing formal grammars, like:
```
root ::= "{" ws "\"name\":" name ws "}"
name ::= "\"get_now\"" | "\"find_files\"" | "\"send_messages\""
ws   ::= [ \t\n]*
```
When you pass a grammar to the server, the LLM physically cannot emit tokens that violate it. Every candidate token is filtered against the grammar before being chosen. It’s a hard rail, not a suggestion.
The ADR 0133 idea: at each PLANNER step, we generate a specific grammar from the step’s tools pool, and we pass it to the server. The model is free to pick which tool to call and which arguments to pass, but ONLY within what the schemas declare valid.
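As a concrete sketch of that flow: the snippet below generates the grammar for the current pool and passes it to llama-server (`grammar` is a native llama.cpp `/completion` parameter; the exact `generate_tool_grammar` signature, prompt handling, and server URL here are illustrative assumptions, and the real call path goes through `LlamaCppProvider.chat_with_tools`):

```python
# Sketch only: signature of generate_tool_grammar, prompt layout, and server URL
# are assumptions; the real wiring lives in runtime/llm_provider.py.
import json
import requests

from runtime.tool_grammar import generate_tool_grammar

def plan_step(prompt: str, step_pool: list[dict]) -> dict:
    # 1. Build a GBNF grammar restricted to this step's tool pool.
    grammar = generate_tool_grammar(step_pool)

    # 2. Ask llama-server for a completion constrained by that grammar:
    #    tokens that violate the grammar are masked out at sampling time.
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={"prompt": prompt, "grammar": grammar, "n_predict": 512},
        timeout=120,
    )
    resp.raise_for_status()

    # 3. The constrained output always parses as a {"name":..., "arguments":...} object.
    return json.loads(resp.json()["content"])
```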
### Grammar generator (`runtime/tool_grammar.py::generate_tool_grammar`)

Takes the pool of tools and returns a GBNF string. The grammar core is a discriminated union:
```
root ::= "{" ws "\"name\":" (pairGetInputs | pairSendMessages | ...) ws "}"
pairGetInputs    ::= "\"get_inputs\"" sep "\"arguments\":" colon (argsGetInputs)
pairSendMessages ::= "\"send_messages\"" sep "\"arguments\":" colon (argsSendMessages)
...
```
The key word is *discriminated*: if the LLM picks the `pairGetInputs` branch, the args must conform to `argsGetInputs`. It cannot mix the name of one tool with the args of another, which was exactly the live bug that blocked us for half a day.
For args, the generator reads the JSON Schema declared in the executor’s TOML manifest and translates it into GBNF rules. For nested cases (array of object, object inside object) the generator is recursive:
```
argsGetInputs ::= "{" ws propGetInputsTitle sep propGetInputsDialog
                  (sep (propGetInputsFromStep | ...))* ws "}"
propGetInputsDialog ::= "\"dialog\":" colon (
    "[" ws (GetInputsObjD1I0 (sep GetInputsObjD1I0)*)? ws "]"
)
GetInputsObjD1I0 ::= "{" ws propGetInputsObjD1I0Var sep propGetInputsObjD1I0Prompt sep
                     propGetInputsObjD1I0Schema (sep (...))* ws "}"
propGetInputsObjD1I0Schema ::= "\"schema\":" colon (GetInputsObjD3I2)
GetInputsObjD3I2 ::= "{" ws propGetInputsObjD3I2Kind (sep (...))* ws "}"
propGetInputsObjD3I2Kind ::= "\"kind\":" colon (
    "\"text\"" | "\"credentials\"" | "\"yes_no\"" | "\"choice\"" |
    "\"choice_with_preview\"" | "\"multi_choice\"" | "\"number\"" |
    "\"date\"" | "\"file_path\"" | "\"location\""
)
```
Real example for `get_inputs`: the schema declares `dialog` as an array of objects, each `{var, prompt, schema}`, where `schema.kind` is a 10-value enum. All of it ends up in the grammar: the model cannot emit `kind: "banana"` and cannot forget `var`.
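For reference, this is roughly the shape the manifest declares for `dialog`, written here as a Python dict (a sketch reconstructed from the description above; the surrounding manifest fields and exact layout are assumptions):

```python
# Sketch of the dialog schema as the generator would see it (reconstructed from
# the description above; the rest of the get_inputs manifest is omitted).
DIALOG_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["var", "prompt", "schema"],
        "properties": {
            "var": {"type": "string"},     # name the answer is bound to
            "prompt": {"type": "string"},  # question shown to the user
            "schema": {
                "type": "object",
                "required": ["kind"],
                "properties": {
                    # the 10-value enum that becomes a GBNF alternation
                    "kind": {"enum": [
                        "text", "credentials", "yes_no", "choice",
                        "choice_with_preview", "multi_choice", "number",
                        "date", "file_path", "location",
                    ]},
                },
            },
        },
    },
}
```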
Limit: if a schema declares `items: {type: object}` without `properties`, the grammar falls back to a generic `jsonObject` and the constraint is lost. So when you add an executor with nested structures, you must declare them in the manifest: the grammar only protects you if the schema is complete.
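A minimal sketch of that recursion, assuming `ws`, `sep`, `string`, `number`, `boolean` and `jsonObject` exist as primitive rules, and glossing over required/optional ordering and the exact rule-naming scheme used by the real generator:

```python
# Illustrative sketch of the schema -> GBNF translation; not the real generator.
_MAX_RECURSION_DEPTH = 4  # same cap named below

def _camel(name: str, key: str) -> str:
    # GBNF rule names must not contain underscores (llama.cpp quirk, see below).
    return name + "".join(w.capitalize() for w in key.split("_") if w)

def schema_to_gbnf(schema: dict, name: str, rules: dict, depth: int = 0) -> str:
    """Return a GBNF expression for `schema`, adding named sub-rules to `rules`."""
    if depth > _MAX_RECURSION_DEPTH:
        return "jsonObject"  # past the cap the constraint is dropped

    if "enum" in schema:
        # enum -> alternation of quoted literals, e.g. "\"text\"" | "\"date\""
        return " | ".join(f'"\\"{v}\\""' for v in schema["enum"])

    t = schema.get("type")
    if t == "object" and schema.get("properties"):
        parts = []
        for key, sub in schema["properties"].items():
            expr = schema_to_gbnf(sub, _camel(name, key), rules, depth + 1)
            parts.append(f'"\\"{key}\\":" ws ({expr})')
        rule = name + "Obj"
        rules[rule] = '"{" ws ' + " sep ".join(parts) + ' ws "}"'
        return rule
    if t == "array":
        item = schema_to_gbnf(schema.get("items", {}), name + "Item", rules, depth + 1)
        return f'"[" ws ({item} (sep {item})*)? ws "]"'
    if t == "string":
        return "string"
    if t in ("number", "integer"):
        return "number"
    if t == "boolean":
        return "boolean"
    return "jsonObject"  # no usable type info: generic fallback (constraint lost)
```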
### Pool filter (`runtime/tool_grammar.py::filter_pool_for_grammar`)

Even with a perfect grammar there was still a problem: some tools are escape hatches, always injected into the pool for special use, and with the grammar active the model picks them inappropriately.
| Tool | When to exclude |
|---|---|
| `request_new_executor` | if there are ≥3 canonical tools (it's a fallback, not a default) |
| `request_location_from_user` | if the query does NOT contain proximity markers ("near me", "here", "nearby") |
| `undo_last_turn` | if the query does NOT contain undo markers ("annulla", "rollback") |
| `*_google_workspace` | if the query does NOT mention "google", "drive", "gmail"... |
The filter uses a word-boundary regex (`\bmarker\b`) to avoid false positives like "qua" matching inside "qualcosa", which cost us another bench round. The logic is a pure function: 12 unit tests, deterministic (§7.9).
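A sketch of the mechanism (the marker table here is a truncated, illustrative copy of the rules in the table above; the real one lives in `runtime/tool_grammar.py`):

```python
# Sketch of the pool filter: only the mechanism is shown, the real marker
# tables and tool names come from runtime/tool_grammar.py.
import re

_QUERY_MARKERS = {
    "request_location_from_user": ("near me", "here", "nearby"),
    "undo_last_turn": ("annulla", "rollback"),
}

def filter_pool_for_grammar(pool: list[dict], query: str) -> list[dict]:
    """Drop escape-hatch tools whose trigger markers don't appear in the query."""
    q = query.lower()
    kept = []
    for tool in pool:
        markers = _QUERY_MARKERS.get(tool["name"])
        if markers is not None:
            # Word-boundary match: avoids false positives like "qua" in "qualcosa".
            if not any(re.search(rf"\b{re.escape(m)}\b", q) for m in markers):
                continue
        kept.append(tool)

    # request_new_executor is a fallback, not a default: drop it when there are
    # already >= 3 canonical tools to choose from.
    canonical = [t for t in kept if t["name"] != "request_new_executor"]
    if len(canonical) >= 3:
        kept = canonical
    return kept
```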
### Post-decode validator (`runtime/tool_grammar.py::validate_tool_call`)

The grammar is tight, but not perfect: some constraints (e.g. mutual exclusivity between two properties) don't express naturally in GBNF. For those there's a post-decode validator:
```python
tool_call = {"name": "get_inputs", "arguments": {"display_template": "x"}}
ok, err = validate_tool_call(tool_call, pool)
# ok=False, err="missing required args ['title', 'dialog'] for tool 'get_inputs'"
```
If the validator says no, the executor is NOT called. The error is injected into the LLM's history as a `role=tool` message, and on the next step the model sees "you're missing title and dialog" and corrects. A `consecutive_blocked` counter prevents infinite loops.
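Roughly how the validator sits in the loop (message shapes, the provider interface, and the blocked threshold are assumptions; only `validate_tool_call` is the real function):

```python
# Sketch of the validate-then-retry loop; history handling and the threshold
# value are illustrative, not the real PLANNER implementation.
from runtime.tool_grammar import validate_tool_call

_MAX_CONSECUTIVE_BLOCKED = 3  # assumed value

def run_step(llm, history: list[dict], pool: list[dict]) -> dict:
    consecutive_blocked = 0
    while True:
        tool_call = llm.next_tool_call(history, pool)  # grammar-constrained decode
        ok, err = validate_tool_call(tool_call, pool)
        if ok:
            return tool_call                           # safe to dispatch the executor
        consecutive_blocked += 1
        if consecutive_blocked >= _MAX_CONSECUTIVE_BLOCKED:
            raise RuntimeError(f"planner blocked: {err}")
        # Feed the error back so the model can self-correct on the next step.
        history.append({"role": "tool", "content": err})
```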
What the grammar does and doesn't guarantee:

- The tool `name` and `schema.kind` are grammar-locked: `kind` has 10 values, the model picks ONE of those 10.
- `title: ""`: it's a string as far as the grammar is concerned. The executor checks the rest.
- `_MAX_RECURSION_DEPTH = 4`: beyond that, fallback to the generic `jsonObject`. Rare cases, to monitor.

### llama.cpp quirks

llama.cpp is live software, and its GBNF implementation has a couple of quirks. All were discovered during convergence tests in the session.
- Build `b540-5755a100c` silently ignores rule names that contain underscores, so `args_get_now ::= ...` doesn't apply. Workaround: rule names are always camelCase (see the sketch after this list). Test: `test_no_underscore_in_rule_names`.
- Primitive rules are tracked through a dependency map (`_PRIMITIVE_DEPS`) so the generator emits ONLY the primitives actually referenced. Test: `test_primitives_only_referenced_emitted`.
- Sending `grammar` + `tools` together fails with HTTP 400: llama-server rejects payloads with both fields. Workaround: in grammar mode we bypass `tools` and directly generate the JSON `{"name":..., "arguments":...}`. Gemma's chat template that emits `<|tool_call>` markers is also bypassed.
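For the underscore quirk referenced in the first item, the workaround is just a naming convention; here is a sketch of the kind of helper that enforces it (the helper name is illustrative; the covering test is `test_no_underscore_in_rule_names`):

```python
# Sketch of the rule-name workaround for the underscore quirk; the real
# implementation lives in runtime/tool_grammar.py.
def rule_name(*parts: str) -> str:
    """Join name fragments into a camelCase GBNF rule name with no underscores."""
    words = [w for p in parts for w in p.split("_") if w]
    head, *tail = words
    return head.lower() + "".join(w.capitalize() for w in tail)

# rule_name("args", "get_now")               -> "argsGetNow"
# rule_name("prop", "get_inputs", "dialog")  -> "propGetInputsDialog"
```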
These are stable workarounds, but they ARE workarounds. If llama.cpp solves them, we can simplify. They’re covered by tests.
Bench n=10 representative queries (simple/medium/complex/ambiguous), n=1 run per query:
| Config | Convergence | first_tool_match | Mean latency |
|---|---|---|---|
| baseline (no grammar) | 50% | n/a | 70s |
| grammar v1 (B2 base) | 80% | 89% | 25s |
| grammar v4 (discriminated) | 80% | 100% | 35s |
| grammar v6 (pool filter) | 100% | 100% | 42s |
| grammar v7 (extracted refactor) | 100% | 100% | 41s |
Three things worth noting:

- Queries now end on `final_answer` instead of picking an escape-hatch at random.

Grammar generation is deterministic and fast: for a typical pool of 8 tools it produces ~80 rules in ~2 ms. The grammar "costs" the server a bit during generation (filtering candidate tokens), but the net gain is huge given what it eliminates.
Adding a new provider qualifier (e.g. `_notion` or `_telegram_bot`) costs one line:
```python
_PROVIDER_SUFFIX_MARKERS = {
    "_google_workspace": ("google", "drive", "gmail", ...),
    "_notion": ("notion", "notion.so", "page"),  # ← added
}
```
Adding a new executor with complex nested schema costs zero: the generator reads the JSON Schema declared in the manifest and produces the grammar automatically, recursively, up to the depth cap.
43 unit tests in `runtime/tests/test_tool_grammar.py`, all green. Coverage includes:

- complexity scoring (`args_complexity`, `is_complex`)
- nested-schema generation (`get_inputs.dialog` emits a sub-rule)

End-to-end bench: `runtime/bench_grammar.py`:
```bash
METNOS_GRAMMAR=1 python3 runtime/bench_grammar.py --label grammar_v8
# 10 queries × 1 run, ~7 minutes total
# output: /tmp/bench_grammar_grammar_v8_<ts>.json
# prints aggregate table + per-complexity + row-by-row detail
```
To enable grammar mode in dev:
```bash
sudo mkdir -p /etc/systemd/system/metnos-http.service.d
printf '[Service]\nEnvironment=METNOS_GRAMMAR=1\n' | sudo tee /etc/systemd/system/metnos-http.service.d/grammar.conf
sudo systemctl daemon-reload && sudo systemctl restart metnos-http.service
```
Constrained generation isn’t a Metnos invention. Here’s a quick map of who does what in the LLM/agent ecosystem — it helps locate the ADR 0133 choice and what we gained (or gave up) vs alternatives.
| Platform | Mechanism | Hard / soft | Schema source | Notes |
|---|---|---|---|---|
| OpenAI function calling / tools | Model fine-tuned on tool_call. Server emits JSON conforming to the declared function's JSON Schema. `tool_choice="required"` forces a tool call. | Mixed: hard that a tool IS called; soft on internal args structure (model-trained, not hard-rail). | JSON Schema (OpenAPI subset) | Excellent fluency, but black-box: no control if the model invents a field not declared. Network latency + $ cost. |
| Anthropic Claude tool use | Same pattern as OpenAI, with tools defined in input. Sonnet/Opus follow schema tightly, but no formal guarantee. | Soft (model-trained) | JSON Schema | Superior robustness vs local baseline, but same non-locality and cost concerns. Sonnet 4.6 ~$0.30/turn vs local Gemma ~$0.001. |
| LangChain / LlamaIndex | Orchestrator: adapts function calling from various providers (OpenAI, Anthropic, ...) under one interface. Pydantic schema input. | Inherits from underlying provider (soft) | Pydantic / JSON Schema | Abstraction layer, doesn’t add robustness. Useful for portability; doesn’t solve thinking-loop. |
| llama.cpp GBNF (Metnos choice) | Formal Backus-Naur grammar. Filters candidate tokens at every decoding step. Native in the server. | Hard (sampler-level) | Hand-written or generated from JSON Schema | Maximum control, zero external dependencies, deterministic. Server-bug quirks (see §5). Local, no $ costs. |
| Outlines / Instructor | Python library that builds an FSM (finite state machine) from JSON Schema/Pydantic and applies it at decoding (token-level mask). | Hard (sampler-level) | Pydantic / JSON Schema | Same paradigm as GBNF but library-side. New dependency, limited model support (HuggingFace + vllm). For Metnos = extra weight, same result. |
| xgrammar (vllm) | C++ library that compiles JSON Schema into token masks. Integrated in vllm, coming to other serving stacks. | Hard (sampler-level) | JSON Schema | Equivalent to GBNF, superior performance in some scenarios. Not yet in llama.cpp stable. |
| JSON Mode (OpenAI/Anthropic) | Guarantees syntactically valid JSON, no schema constraint. | Hard on JSON syntax, soft on schema | None | Useful as fallback, insufficient for structured tool_call. |
The ADR 0133 GBNF choice is aligned with the technical frontier: modern serving stacks are all converging on sampler-level constraints (Outlines, xgrammar, llama.cpp GBNF). It’s the state of the art pattern for local tool calling.
The difference vs OpenAI/Anthropic is that we write the grammar instead of delegating to a fine-tuned model. The trade-off is the one summarized in the table: hard, local, zero-cost guarantees on one side; grammar-generator maintenance and llama-server quirks on the other.
Triggers for revisiting this decision:

| Trigger | Direction |
|---|---|
| llama.cpp introduces native xgrammar | Switch to xgrammar (same semantics, better perf) |
| Multi-model cross-provider (e.g. switch Gemma ↔ Sonnet) | Add Outlines adapter for non-llama.cpp providers |
| Query corpus convergence drops below 95% | Investigate if it’s the model or filter; possibly local fine-tune |
| Unsolvable llama-server bugs | Fork generator to 2 dialects (GBNF + JSON Schema FSM) |
Files:

- `runtime/tool_grammar.py`
- `runtime/llm_provider.py::LlamaCppProvider.chat_with_tools`
- `runtime/bench_grammar.py`
- `runtime/tests/test_tool_grammar.py`