
Metnos

grammar — constrained tool_call generation
Microdesign v1.0 — 14 May 2026
Aligned with ADR 0133 and with the modules runtime/tool_grammar.py, runtime/llm_provider.py, and runtime/agent_runtime.py.

Audience: anyone who wants to understand why Metnos doesn’t “ramble” during planning,
and what happens between the LLM and the executor.
Read time: 12 minutes, zero math.

Contents

  1. The problem: the LLM “thinks about it” and gets lost
  2. The idea: grammar as a guardrail
  3. Three pieces: generator, filter, validator
  4. What it guarantees and what it doesn’t
  5. Why three “llama-server quirks” cost us half a day
  6. Robustness, speed, extensibility
  7. How to test
  8. Comparison with other platforms
  9. References

1. The problem: the LLM “thinks about it” and gets lost

The Metnos PLANNER picks which executor to call at each step. It’s a local LLM (Gemma 4 26B). When the query is complex — like “propose 3 morning slots next week for an appointment with Silvia, after I pick one, email me the choice” — the LLM sometimes enters a reasoning loop and ends up emitting prose instead of a structured tool_call. The system doesn’t know what to run. Game over.

This failure mode is called a “thinking loop”. Symptoms observed:

Bench before this decision: 50–70% convergence on 10 representative queries. The propose+notify pipeline from ADR 0129 was failing. Unacceptable for a personal assistant.

2. The idea: grammar as a guardrail

llama.cpp (the engine that runs Gemma) supports a feature called GBNF (GGML BNF). It’s a language for writing formal grammars, like:

root      ::= "{" ws "\"name\":" name ws "}"
name      ::= "\"get_now\"" | "\"find_files\"" | "\"send_messages\""
ws        ::= [ \t\n]*

When you pass a grammar to the server, the LLM physically cannot emit tokens that violate it. Every candidate token is filtered against the grammar before being chosen. It’s a hard rail, not a suggestion.

Analogy. A soft constraint via prompt is like saying “please drive in the right lane”. A grammar is like installing a guardrail. The first can be ignored, the second cannot.

The ADR 0133 idea: at each PLANNER step, we generate a specific grammar from the step’s tools pool, and we pass it to the server. The model is free to pick which tool to call and which arguments to pass, but ONLY within what the schemas declare valid.
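
To make this concrete before diving into §3, here is a minimal sketch of one grammar-constrained PLANNER step, assuming a local llama-server on its default port 8080. The grammar and n_predict fields are llama-server’s documented /completion parameters; plan_step itself and the prompt handling are illustrative.

import json

import requests  # any HTTP client works

from runtime.tool_grammar import generate_tool_grammar  # described in §3.1

def plan_step(prompt: str, pool: list[dict]) -> dict:
    """Ask the PLANNER for one tool_call, constrained by a per-step grammar."""
    grammar = generate_tool_grammar(pool)  # GBNF string built from this step's pool
    resp = requests.post(
        "http://localhost:8080/completion",  # llama-server endpoint (default port assumed)
        json={
            "prompt": prompt,
            "grammar": grammar,   # hard rail: tokens that violate it are filtered out
            "n_predict": 512,
        },
        timeout=120,
    )
    resp.raise_for_status()
    # In grammar mode the raw completion IS the tool_call JSON (see §5, quirk 3).
    return json.loads(resp.json()["content"])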

3. Three pieces: generator, filter, validator

3.1 Generator (runtime/tool_grammar.py::generate_tool_grammar)

Takes the pool of tools and returns a GBNF string. The grammar core is a discriminated union:

root ::= "{" ws "\"name\":" (pairGetInputs | pairSendMessages | ...) ws "}"
pairGetInputs ::= "\"get_inputs\"" sep "\"arguments\":" colon (argsGetInputs)
pairSendMessages ::= "\"send_messages\"" sep "\"arguments\":" colon (argsSendMessages)
...

The key word is discriminated: if the LLM picks the pairGetInputs branch, the args must conform to argsGetInputs. It cannot mix the name of one tool with the args of another, which was exactly the live bug that blocked us for half a day.

For args, the generator reads the JSON Schema declared in the executor’s TOML manifest and translates it into GBNF rules. For nested cases (array of object, object inside object) the generator is recursive:

argsGetInputs ::= "{" ws propGetInputsTitle sep propGetInputsDialog
                  (sep (propGetInputsFromStep | ...))* ws "}"
propGetInputsDialog ::= "\"dialog\":" colon (
    "[" ws (GetInputsObjD1I0 (sep GetInputsObjD1I0)*)? ws "]"
)
GetInputsObjD1I0 ::= "{" ws propGetInputsObjD1I0Var sep propGetInputsObjD1I0Prompt sep
                          propGetInputsObjD1I0Schema (sep (...))* ws "}"
propGetInputsObjD1I0Schema ::= "\"schema\":" colon (GetInputsObjD3I2)
GetInputsObjD3I2 ::= "{" ws propGetInputsObjD3I2Kind (sep (...))* ws "}"
propGetInputsObjD3I2Kind ::= "\"kind\":" colon (
    "\"text\"" | "\"credentials\"" | "\"yes_no\"" | "\"choice\"" |
    "\"choice_with_preview\"" | "\"multi_choice\"" | "\"number\"" |
    "\"date\"" | "\"file_path\"" | "\"location\""
)

Real example for get_inputs: the schema declares dialog as an array of object, each {var, prompt, schema}, where schema.kind is a 10-value enum. All of it ends up in the grammar: the model cannot emit kind: "banana", and cannot forget var.
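
Concretely, the only shape the grammar lets through for this tool looks like the dict below (the values are illustrative; title and dialog are the required args, as the validator error in §3.3 shows):

tool_call = {
    "name": "get_inputs",
    "arguments": {
        "title": "Appointment with Silvia",
        "dialog": [
            {
                "var": "slot",
                "prompt": "Which of the 3 slots do you prefer?",
                "schema": {"kind": "choice"},  # kind must be one of the 10 enum values
            }
        ],
    },
}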

Schema = source of truth. If the manifest declares items = {type: object} without properties, the grammar falls back to a generic jsonObject, and the constraint is lost. So when you add an executor with nested structures, you must declare them in the manifest — the grammar only protects you if the schema is complete.
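
To make the difference concrete, here is a sketch of the two cases as the generator sees them (the manifest’s TOML shown as the equivalent Python dicts; property names are illustrative):

# Incomplete: items has no "properties", so the generator can only fall back
# to a generic jsonObject rule -- anything goes inside each element.
incomplete = {"type": "array", "items": {"type": "object"}}

# Complete: the nested structure is declared, so the generator emits
# per-property rules and the enum constraint survives into the grammar.
complete = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["var", "prompt", "schema"],
        "properties": {
            "var": {"type": "string"},
            "prompt": {"type": "string"},
            "schema": {
                "type": "object",
                "properties": {"kind": {"enum": ["text", "choice", "date"]}},
            },
        },
    },
}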

3.2 Pool filter (runtime/tool_grammar.py::filter_pool_for_grammar)

Even with a perfect grammar, there was still a problem: some tools are escape hatches, always injected into the pool for special cases, and with the grammar active the model would pick them inappropriately:

Tool                          When to exclude
request_new_executor          if there are ≥3 canonical tools (it’s a fallback, not a default)
request_location_from_user    if the query does NOT contain proximity markers (“near me”, “here”, “nearby”)
undo_last_turn                if the query does NOT contain undo markers (“annulla”, “rollback”)
*_google_workspace            if the query does NOT mention “google”, “drive”, “gmail”...

The filter uses word-boundary regexes (\bmarker\b) to avoid false positives like “qua” (Italian for “here”) matching inside “qualcosa” (“something”), which cost us another bench round. The logic is a pure function: 12 unit tests, deterministic (§7.9).
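
A sketch of the marker check, assuming the pure-function shape described above (the function name and case handling are illustrative):

import re

def query_mentions(query: str, markers: tuple[str, ...]) -> bool:
    """True if any marker appears as a whole word in the query."""
    return any(
        re.search(rf"\b{re.escape(m)}\b", query, re.IGNORECASE)
        for m in markers
    )

# "qua" no longer matches inside "qualcosa" thanks to the word boundaries:
assert query_mentions("vieni qua", ("qua",)) is True
assert query_mentions("qualcosa non va", ("qua",)) is False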

3.3 Validator (runtime/tool_grammar.py::validate_tool_call)

The grammar is tight, but not perfect: some constraints (e.g. mutual exclusivity between two properties) cannot be expressed naturally in GBNF. For those there’s a post-decode validator:

tool_call = {"name": "get_inputs", "arguments": {"display_template": "x"}}
ok, err = validate_tool_call(tool_call, pool)
# ok=False, err="missing required args ['title', 'dialog'] for tool 'get_inputs'"

If the validator says no, the executor is NOT called. The error is injected into the LLM’s history as a role=tool message, and on the next step the model sees “you’re missing title and dialog” and corrects. A consecutive_blocked counter prevents infinite loops.
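
Roughly, the control flow around the validator looks like this. It is a sketch: validate_tool_call and the consecutive_blocked idea come from the runtime, while decode and execute are stand-ins for the real decode/execute functions, and the cap value is illustrative.

from runtime.tool_grammar import validate_tool_call

MAX_BLOCKED = 3  # illustrative cap on consecutive blocked calls

def step_with_validation(history, pool, decode, execute, blocked=0):
    """One PLANNER step: decode under grammar, validate, then execute or retry.

    decode(history, pool) -> tool_call dict; execute(tool_call) -> result str.
    Both are stand-ins for the runtime's real functions.
    """
    tool_call = decode(history, pool)
    ok, err = validate_tool_call(tool_call, pool)
    if ok:
        return execute(tool_call)  # the executor is only reached when valid
    if blocked + 1 >= MAX_BLOCKED:
        raise RuntimeError("PLANNER stuck: repeated invalid tool_calls")
    # Inject the error into the history as a role=tool message so the
    # model sees it on the next step and can self-correct.
    history.append({"role": "tool", "content": err})
    return step_with_validation(history, pool, decode, execute, blocked + 1)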

4. What it guarantees and what it doesn’t

Guarantees

  - Output that is syntactically valid JSON in the {"name": ..., "arguments": ...} shape.
  - A tool name drawn from the step’s pool, never an invented one.
  - Enum values, property names and nesting that follow the declared schema, as long as the manifest is complete (§3.1).

Doesn’t guarantee

  - Semantic correctness: the model can still pick a valid-but-wrong tool, or pass plausible-but-wrong values.
  - Constraints GBNF can’t express naturally (e.g. mutual exclusivity between properties); those are the post-decode validator’s job (§3.3).
  - Anything about schema portions left undeclared in the manifest: there the grammar falls back to generic jsonObject (§3.1).

5. Why three “llama-server quirks” cost us half a day

llama.cpp is a fast-moving project, and its GBNF implementation has a few quirks. All three were discovered during the convergence tests in the same session.

1. Rule names with underscore — ignored. Version b540-5755a100c silently ignores rule names that contain underscores. So args_get_now ::= ... doesn’t apply. Workaround: rule names always camelCase. Test: test_no_underscore_in_rule_names.
2. Unused rules interfere. Even if a rule is referenced by no other rule, its presence can alter matching. Workaround: dependency closure (_PRIMITIVE_DEPS), emit ONLY the primitives actually referenced. Test: test_primitives_only_referenced_emitted.
3. grammar + tools together — HTTP 400. llama-server rejects payloads with both fields. Workaround: in grammar-mode we bypass tools and directly generate the JSON {"name":..., "arguments":...}. Gemma’s chat-template that emits <|tool_call> markers is also bypassed.

These are stable workarounds, but they ARE workarounds. If llama.cpp fixes the underlying issues, we can simplify. All three are covered by tests.
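
The tests named above can be as simple as a regex over the emitted grammar. A sketch of the first one, assuming generate_tool_grammar and a sample pool fixture (the fixture name is illustrative):

import re

from runtime.tool_grammar import generate_tool_grammar

def test_no_underscore_in_rule_names(sample_pool):  # fixture assumed
    grammar = generate_tool_grammar(sample_pool)
    # Rule names are the identifiers left of "::="; none may contain "_",
    # or llama-server b540-5755a100c silently ignores the rule (quirk 1).
    for name in re.findall(r"^(\S+)\s*::=", grammar, re.MULTILINE):
        assert "_" not in name, f"rule {name!r} would be ignored by llama-server"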

6. Robustness, speed, extensibility

Robustness

Bench: 10 representative queries (simple/medium/complex/ambiguous), one run per query:

Config                           Convergence   first_tool_match   Mean latency
baseline (no grammar)            50%           n/a                70s
grammar v1 (B2 base)             80%           89%                25s
grammar v4 (discriminated)       80%           100%               35s
grammar v6 (pool filter)         100%          100%               42s
grammar v7 (extracted refactor)  100%          100%               41s

Three things worth noting:

Speed

Grammar generation is deterministic and fast: for a typical pool of 8 tools it produces ~80 rules in ~2 ms. The grammar “costs” the server a bit during generation (filtering candidate tokens), but the net gain is huge because we eliminate:

Extensibility

Adding a new provider qualifier (e.g. _notion or _telegram_bot) costs one line:

_PROVIDER_SUFFIX_MARKERS = {
    "_google_workspace": ("google", "drive", "gmail", ...),
    "_notion": ("notion", "notion.so", "page"),  # ← added
}

Adding a new executor with a complex nested schema costs nothing: the generator reads the JSON Schema declared in the manifest and produces the grammar automatically, recursively, up to the depth cap.
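
For intuition, a much-simplified sketch of that recursive translation. Names like jsonString and jsonValue are assumed primitive rules; the real generator also handles rule naming, required vs optional properties, and the dependency closure of §5, and only enum/string/array/object are covered here.

MAX_DEPTH = 4  # illustrative depth cap

def schema_to_gbnf(schema: dict, depth: int = 0) -> str:
    """Translate a (subset of) JSON Schema into a GBNF expression."""
    if depth >= MAX_DEPTH or not schema:
        return "jsonValue"  # generic fallback rule past the cap
    if "enum" in schema:
        return "(" + " | ".join(f'"\\"{v}\\""' for v in schema["enum"]) + ")"
    t = schema.get("type")
    if t == "string":
        return "jsonString"
    if t == "array":
        item = schema_to_gbnf(schema.get("items", {}), depth + 1)
        return f'"[" ws ({item} (sep {item})*)? ws "]"'
    if t == "object" and schema.get("properties"):
        props = [
            f'"\\"{k}\\":" colon {schema_to_gbnf(v, depth + 1)}'
            for k, v in schema["properties"].items()
        ]
        return '"{" ws ' + " sep ".join(props) + ' ws "}"'
    return "jsonObject"  # incomplete schema: fall back, constraint is lost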

7. How to test

43 unit tests in runtime/tests/test_tool_grammar.py, all green. Coverage:

End-to-end bench: runtime/bench_grammar.py:

METNOS_GRAMMAR=1 python3 runtime/bench_grammar.py --label grammar_v8
# 10 queries × 1 run, ~7 minutes total
# output: /tmp/bench_grammar_grammar_v8_<ts>.json
# prints aggregate table + per-complexity + row-by-row detail

To enable grammar mode in dev:

echo '[Service]
Environment=METNOS_GRAMMAR=1' | sudo tee /etc/systemd/system/metnos-http.service.d/grammar.conf
sudo systemctl daemon-reload && sudo systemctl restart metnos-http.service

8. Comparison with other platforms

Constrained generation isn’t a Metnos invention. Here’s a quick map of who does what in the LLM/agent ecosystem — it helps locate the ADR 0133 choice and what we gained (or gave up) vs alternatives.

OpenAI function calling / tools
  Mechanism: model fine-tuned on tool_call; the server emits JSON conforming to the declared function’s JSON Schema. tool_choice="required" forces a tool call.
  Hard/soft: mixed. Hard that a tool IS called; soft on the internal args structure (model-trained, not a hard rail).
  Schema source: JSON Schema (OpenAPI subset).
  Notes: excellent fluency, but black-box: no control if the model invents a field that isn’t declared. Network latency + $ cost.

Anthropic Claude tool use
  Mechanism: same pattern as OpenAI, with tools defined in the input. Sonnet/Opus follow the schema tightly, but with no formal guarantee.
  Hard/soft: soft (model-trained).
  Schema source: JSON Schema.
  Notes: superior robustness vs the local baseline, but the same non-locality and cost concerns. Sonnet 4.6 ~$0.30/turn vs local Gemma ~$0.001.

LangChain / LlamaIndex
  Mechanism: orchestrators that adapt function calling from various providers (OpenAI, Anthropic, ...) under one interface. Pydantic schema input.
  Hard/soft: inherits from the underlying provider (soft).
  Schema source: Pydantic / JSON Schema.
  Notes: an abstraction layer; doesn’t add robustness. Useful for portability, doesn’t solve the thinking loop.

llama.cpp GBNF (the Metnos choice)
  Mechanism: formal Backus-Naur grammar; filters candidate tokens at every decoding step. Native in the server.
  Hard/soft: hard (sampler-level).
  Schema source: hand-written or generated from JSON Schema.
  Notes: maximum control, zero external dependencies, deterministic. Server-bug quirks (see §5). Local, no $ cost.

Outlines / Instructor
  Mechanism: Python libraries that build an FSM (finite state machine) from JSON Schema/Pydantic and apply it at decoding time (token-level mask).
  Hard/soft: hard (sampler-level).
  Schema source: Pydantic / JSON Schema.
  Notes: same paradigm as GBNF but library-side. A new dependency, limited model support (HuggingFace + vLLM). For Metnos: extra weight, same result.

xgrammar (vLLM)
  Mechanism: C++ library that compiles JSON Schema into token masks. Integrated in vLLM, coming to other serving stacks.
  Hard/soft: hard (sampler-level).
  Schema source: JSON Schema.
  Notes: equivalent to GBNF, superior performance in some scenarios. Not yet in llama.cpp stable.

JSON Mode (OpenAI/Anthropic)
  Mechanism: guarantees syntactically valid JSON, with no schema constraint.
  Hard/soft: hard on JSON syntax, soft on schema.
  Schema source: none.
  Notes: useful as a fallback, insufficient for structured tool_call.

Where Metnos sits

The ADR 0133 GBNF choice is aligned with the technical frontier: modern serving stacks are all converging on sampler-level constraints (Outlines, xgrammar, llama.cpp GBNF). It’s the state-of-the-art pattern for local tool calling.

The difference vs OpenAI/Anthropic is that we write the grammar instead of delegating to a fine-tuned model. Advantages:

  - The guarantee is hard, at the sampler level: invalid tokens cannot be emitted, rather than “usually aren’t”.
  - Everything is local and deterministic: no network latency, no per-turn cost (~$0.001 vs ~$0.30 for Sonnet, see the table above), zero external dependencies.

Disadvantages:

  - We own the generator: every schema feature we want enforced must be translated into GBNF ourselves, and what GBNF can’t express falls to the post-decode validator (§3.3).
  - Exposure to llama-server quirks and their workarounds (§5).

When it would make sense to change

Trigger                                                     Direction
llama.cpp introduces native xgrammar                        switch to xgrammar (same semantics, better perf)
multi-model cross-provider (e.g. switching Gemma ↔ Sonnet)  add an Outlines adapter for non-llama.cpp providers
query-corpus convergence drops below 95%                    investigate whether it’s the model or the filter; possibly a local fine-tune
unsolvable llama-server bugs                                fork the generator into two dialects (GBNF + JSON Schema FSM)

9. References