Status: TESTED. Microdesign v1.1, tested on April 26, 2026. The component has been built and tested at module and cluster level in POC v1.1 (test framework: cluster agent_runtime 109/109 green, system total 135/135). Implementation reference: /opt/myclaw/runtime/agent_runtime.py.
Status sequence of the microdesign: under approval → approved → tested → implemented (when the whole system is ready in production). Replaces version 1.0 of April 24, 2026, now obsolete.


agent_runtime — the turn-by-turn planner
Microdesign v1.1 — status TESTED (April 26, 2026)
Validated by POC v1.1 with the 3-level SQLite test framework
Audience: readers who want to understand how Metnos works internally, and those who implement or extend a runtime component.

Reading time: 18 minutes.

Index

  1. What it does: the story of one request
  2. The loop of one turn, step by step
  3. Modes: local, online, hybrid
  4. Three tiers of LLM: fast, middle, wise
  5. The pre-filter: choosing the sub-catalog
  6. Native tool-use (no JSON parsing)
  7. Data piping between steps: {{stepN.field}}
  8. Scratchpad for large observations
  9. Vaglio of the plan
  10. Safety caps and runtime guards
  11. What it writes to logs
  12. When things go wrong
  13. What is deferred to v1.2+

1. What it does: the story of one request

Imagine writing to Metnos: «read the file ~/notes/diary.md and tell me the last three lines». From that moment on, a turn begins. The planner — the module described here — receives the sentence, decides what to do, sets up the steps, and produces the answer.

To understand what "deciding what to do" means, let us follow a concrete example. The request is the one above. The planner has no hardcoded logic that says "if the user says 'read a file' call fs_read"; the idea is different:

  1. The planner looks at the catalog of executors available (small signed programs that know how to do one thing only: read a file, write one, make an HTTP call, read the clock, etc.).
  2. It selects a subset relevant to the request (here the candidates will be fs_read, fs_write, web_fetch).
  3. It passes those candidates to an LLM (a language model) as "tools" it can use.
  4. The LLM reads the request and proposes the next step: "call fs_read with path=~/notes/diary.md and tail_bytes=N (N large enough to contain 3 lines)".
  5. The planner validates the proposal (sandbox, args, constitutional vaglio), invokes the executor, collects the output.
  6. It hands back to the LLM with: the original request + the output of the previous step.
  7. The LLM says: "I have everything, the answer is: here are the last three lines of your diary".
  8. The planner delivers that answer to the user, writes the log, ends the turn.

The key point: useful behaviour emerges from composition, not from hardcoded rules. If tomorrow the user asks "fetch a page, save it to a file, and tell me how many bytes you wrote", the planner will compose web_fetch + fs_write + final_answer with no need for any special case. The same pipeline, a different request, a different sequence of steps.

What the planner does not do. It never carries out the user's request itself. It does not read files, does not call URLs, does not write anything. All of that is done by the executors (specialised programs). The planner only decides whom to call and with what arguments, then collects the results.

2. The loop of one turn, step by step

Now let us see the same path more precisely. A turn always has this structure:

User query → Loader (catalog) → Mode + Tier → Pre-filter (top-K) → LLM (native tool-use) → resolve refs + validate → sandbox + vaglio + guard → execute (subprocess) → observation (→ scratchpad if >4 KB) → history update → loop until final_answer. Exit paths: final_answer (text for the user), or cap reached / error (final_kind = error / cap_*). Every turn ends in the JSONL log ~/.local/share/metnos/turns/YYYY-MM-DD.jsonl.
Flow of a turn: from the user query to the final log, with the multistep loop at the centre.

Phase 0 — preparation

The loader builds the signed catalog of executors; mode (ch. 3) and LLM tier (ch. 4) are resolved from configuration.

Phase 1 — pre-filter

The pre-filter ranks the catalog by relevance to the user query and selects a subset: the top-K. Without a pre-filter, we would have to pass to the LLM all executors (potentially dozens or hundreds), and the quality of the choice would degrade together with the latency. The pre-filter solves the problem in a trivial yet effective way (see ch. 5).

Phase 2 — step loop

At this point the turn enters a loop. For each step:

  1. The LLM is called passing: the user query, the sub-catalog (as native "tools"), and the history of observations from previous steps.
  2. The LLM replies in one of two ways: with a tool_call (it wants to invoke an executor), or with plain text and no tool call, which the planner treats as the final_answer (see ch. 6).
  3. If it is a tool_call, the planner runs a series of checks:
    1. Resolve references: if the args contain {{stepN.field}}, it substitutes them with the real value (see ch. 7).
    2. Validate: do the args respect the JSON Schema declared in the executor's manifest?
    3. Sandbox check: is the requested path/host within the scope declared in the executor's hint?
    4. Vaglio: does the constitutional evaluator give the green light?
    5. Guard duplicate read: are we re-reading the same path/url as a previous step? If so, the planner intercepts the call and suggests that the LLM formulate the final_answer.
  4. If all checks pass, the executor is run as a subprocess. The observation (a JSON with {ok, content?, metadata?, error?}) returns to the planner.
  5. If the observation is larger than a threshold (4 KB of JSON), it is saved to scratchpad and the LLM history gets a synthetic version with id + summary (see ch. 8).
  6. The turn history grows by one step. The loop resumes from point 1.
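
To fix ideas, here is the same loop compressed into a Python sketch. Every name (run_turn, top_k, resolve_refs, ...) is illustrative, not the real API of agent_runtime.py; the llm and catalog objects are assumed to expose the methods used below.

import json

def run_turn(query, llm, catalog, cap_steps=5, scratchpad_threshold=4096):
    # Hypothetical condensation of phases 1-2; not the real implementation.
    history = []
    for step_no in range(1, cap_steps + 1):
        reply = llm.chat(query=query, tools=catalog.top_k(query), history=history)
        call = reply.get("tool_call")
        if call is None:  # no tool call: the text IS the final_answer
            return {"final_kind": "answer", "text": reply["text"], "history": history}
        args = catalog.resolve_refs(call["arguments"], history)  # {{stepN.field}}
        catalog.check(call["name"], args)          # schema + sandbox + vaglio + guard
        obs = catalog.execute(call["name"], args)  # subprocess -> {ok, content?, ...}
        if len(json.dumps(obs)) > scratchpad_threshold:
            obs = catalog.to_scratchpad(obs)       # replaced by id + summary in history
        history.append({"step": step_no, "call": call, "observation": obs})
    return {"final_kind": "cap_steps", "history": history}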

Phase 3 — turn closure

The turn ends in one of the following ways: with a final_answer (text delivered to the user), with final_kind = cap_* when a safety cap is reached (ch. 10), or with final_kind = error on an unrecoverable failure (ch. 12).

In every case, the planner writes a complete JSONL record of the turn (see ch. 11) and returns (log, final_message).

Real example from the POC

Query: «fetch https://httpbin.org/uuid and tell me only the UUID».

Step 1 — the LLM proposes web_fetch(url=https://httpbin.org/uuid). Execution: returns JSON {"uuid":"7c089d54-..."}.

Step 2 — the LLM reads the history, understands it has what it needs, produces final_answer: "7c089d54-...".

Total latency: 3.3 seconds (Qwen3:8b local). No hardcoded logic for "extract the uuid": the LLM reasons over the JSON and extracts the field on its own.

3. Modes: local, online, hybrid

The planner works in three modes, chosen by configuration. The main difference among the three is how many round-trips to the LLM are needed to complete a turn and where the LLM runs.

Mode | Loop | Typical LLM | When it makes sense
local | multistep ReAct (one round-trip per step) | local (Ollama) | Default. Maximum privacy, zero cost, decent latency.
online | single-shot (one round-trip for the whole plan) | frontier (Anthropic, OpenAI) | Complex tasks worth the cost. A frontier model succeeds in one call where a local one would have needed five.
hybrid | local by default; escalation to online for critical tasks | mixed | Balance of cost and quality. For domestic Metnos.

The principle: the shape of the loop follows from the per-call cost. A free local LLM can afford to iterate; an expensive frontier LLM compresses everything into one call. Same planner, behaviour adapted to cost.

In v1.1 we tested the local mode; online is wired as scaffolding (Anthropic provider stub) but not exercised; hybrid has the ModeRouter in place, which in the POC always returns "local". Enabling the other modes will require configuring an online provider and switching on the routing rules in [runtime.hybrid].
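
A purely hypothetical sketch of what such a [runtime.hybrid] section could look like, in the TOML style of ch. 4. None of these keys exist in v1.1, and the real rule vocabulary is deliberately still open (see ch. 13):

[runtime.hybrid]
# hypothetical keys, for illustration only
default_mode          = "local"
critical_capabilities = ["fs_write", "net_post"]  # requests touching these escalate
escalate_to           = "online"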

4. Three tiers of LLM: fast, middle, wise

Independently of the mode (which says where the LLM runs), the runtime exposes three tiers: three pointers to LLMs with different characteristics. Each runtime component picks the tier it needs.

Tier | Characteristic | Candidate
fast & furious | Small, fast, always LOCAL, always available. | qwen3:8b with think=false
middle & trustable | Intermediate reasoning capacity. | gemma3:12b local, or claude-haiku-4-5 online.
slow & wise | Maximum reasoning capacity, accepts latency/cost. | claude-sonnet-4-6 online, or qwen3:32b with think=true on powerful local hardware.

The standard planner uses fast. The constitutional vaglio (when real) will use middle. The synthesiser of new executors (synt) will use wise. When a section is missing from the configuration, the pointer aliases to the lower tier: middle → fast, wise → middle → fast. Never a crash, never an exception: the presence of fast alone guarantees operation.
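
The aliasing logic is small enough to sketch in a few lines of Python (illustrative names, not the real code):

def resolve_tiers(llm_cfg):
    # llm_cfg maps tier name -> provider config, as parsed from [runtime.llm.*]
    fast = llm_cfg["fast"]                 # fast must always exist
    middle = llm_cfg.get("middle", fast)   # middle -> fast
    wise = llm_cfg.get("wise", middle)     # wise -> middle -> fast
    return {"fast": fast, "middle": middle, "wise": wise}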

Three tiers, the same runtime:
fast & furious: qwen3:8b · think=false, ~700 ms/call, always LOCAL, always available. Used by: planner, scratchpad summariser.
middle & trustable: gemma3:12b or haiku-4-5, ~2-5 s/call, local or online, intermediate reasoning. Used by: real vaglio, synt.compose.
slow & wise: claude-sonnet or qwen3:32b+think, 5-30 s/call, accepts latency/$, deep reflection. Used by: synt.generate, delicate decisions.
When a tier is not configured, it aliases to the lower tier (wise → middle → fast). Worst case: all point to fast.
The three tiers coexist in the runtime; each component picks the one it needs.
Specialisation even with a single model. If hardware allows only one local LLM, the three tiers point to the same one (e.g. all to qwen3:8b). Same model but the system prompt changes per tier: fast asks for direct action, middle asks to consider alternatives, wise activates think=true and asks for deep reflection. Same model, different roles, different outputs. The user obtains three points of view even with a single brain.

Minimal config schema:

[runtime.llm.fast]
provider = "ollama"
model    = "qwen3:8b"
think    = false

[runtime.llm.middle]
provider = "ollama"
model    = "gemma3:12b"

[runtime.llm.wise]
provider = "anthropic"
model    = "claude-sonnet-4-6"

In v1.1 the structure is implemented but full differentiation is not there yet: in the POC the three pointers are identical (qwen3:8b) and there is a single prompt. Per-tier prompt differentiation is deferred until we have a second backend or a use case that requires it.

5. The pre-filter: choosing the sub-catalog

With a catalog of 30+ executors, passing them all as "tools" to the LLM blows up the prompt and dilutes the attention. The pre-filter ranks by relevance and passes to the LLM only the most promising ones.

How it ranks

In v1.1 the ranking is bag-of-words: it tokenises the user query, tokenises the affinity declared in each manifest, sums the matches (matches on the affinity count double those on the description). Extracting the tokens means: lowercase, alphanumeric words, no accents.
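
In Python terms, the whole ranking fits in a few lines. This sketch uses hypothetical names and assumes the manifest exposes affinity (a list of keywords) and description (a string):

import re
import unicodedata

def tokens(text):
    # lowercase, strip accents, keep alphanumeric words
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    return set(re.findall(r"[a-z0-9]+", text))

def score(query, manifest):
    q = tokens(query)
    affinity = tokens(" ".join(manifest["affinity"]))
    description = tokens(manifest["description"])
    # matches on the affinity count double those on the description
    return 2 * len(q & affinity) + len(q & description)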

Example

Query: «fetch https://httpbin.org/get». Tokens: {fetch, httpbin, org, get, https}.

Match against affinity of web_fetch: web, http, url, fetch, scarica, leggi, pagina, api, rest. Match: fetch. Score: 2.

Match against affinity of fs_read: fs, read, leggi, lettura, file, ... No match. Score: 0.

web_fetch wins with high confidence.

Adaptive K

K (number of executors to pass to the LLM) is not fixed: it depends on the confidence with which the pre-filter distinguishes the top-1 from the others.
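
The exact heuristic is internal; the sketch below only illustrates the idea under stated assumptions (a wide gap between the top-1 score and the runner-up means the pre-filter is confident, so K can shrink; the bounds are chosen to match the K=20-40 sweet spot measured below):

def adaptive_k(scores, k_min=5, k_max=40):
    # scores: relevance scores of all executors, e.g. from score() above
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] == 0:
        return k_max                       # no signal: pass the wide sub-catalog
    gap = (ranked[0] - ranked[1]) / ranked[0]
    return k_min if gap >= 0.5 else k_max  # confident -> narrow sub-catalog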

Measured in the POC: pre-filter sub-millisecond up to 300+ executors; LLM Qwen3:8b chooses correctly up to K=200; latency grows linearly with K above 40. Practical sweet spot: K=20-40 (~3 seconds of LLM latency).

The variant with semantic embeddings (local MiniLM model, ~100 MB) is deferred: bag-of-words proved sufficient for the real cases of the POC.

6. Native tool-use (no JSON parsing)

Modern LLMs (Anthropic, OpenAI from the start; Ollama for Qwen 2.5/3, Llama 3.1+, Mistral, Gemma) support a native tool-calling protocol. The runtime declares the available tools in the API, each with its JSON Schema taken from the manifest. When the LLM decides to call a tool, it returns a structured field:

{
  "tool_calls": [
    {
      "id": "call_abc123",
      "function": {
        "name": "fs_read",
        "arguments": {"path": "/tmp/note.txt", "tail_bytes": 100}
      }
    }
  ]
}

It arrives already parsed as a Python dict from the HTTP layer. No regex on text, no markdown blocks to extract, no edge cases of "the LLM forgot to close the brackets". When the LLM is ready for the final answer, it simply does not call any tool and produces only text: the planner recognises this as the final_answer.

Consequence for manifests. The manifest of an executor declares the args in JSON Schema (section [args]) and that JSON Schema is passed directly to the provider as parameters of the tool. A single source of truth on the shape of the arguments, zero translation. See executor.html for the manifest details.
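
As a concrete sketch, this is roughly what one round-trip against Ollama's /api/chat looks like with one tool declared. The schema content is illustrative; the endpoint and the tools/tool_calls fields are the standard Ollama chat API:

import requests

tools = [{
    "type": "function",
    "function": {
        "name": "fs_read",
        "description": "Read a file within the sandbox",
        "parameters": {  # JSON Schema, taken verbatim from the manifest
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "tail_bytes": {"type": "integer"},
            },
            "required": ["path"],
        },
    },
}]

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "read ~/notes/diary.md"}],
    "tools": tools,
    "stream": False,
}).json()

# Present only when the LLM calls a tool; absent means final_answer.
calls = resp["message"].get("tool_calls")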

7. Data piping between steps: {{stepN.field}}

In multistep, the LLM at step N+1 needs to refer to the output of step N. Classic example: "fetch X and save it in Y" — at step 2, fs_write must receive as content the body returned by web_fetch at step 1.

The syntax is {{stepN.field}}: a reference to the field field of the observation produced at step N (e.g. {{step1.content}} for the content returned by step 1).

The runtime intercepts the args before invoking the executor, discovers the placeholders, retrieves the real value from the turn history, and substitutes it.

Example: fetch and save
// Step 1
{ "tool": "web_fetch", "args": { "url": "https://httpbin.org/get" } }
// observation: {ok: true, content: "", metadata: {...}}

// Step 2 (proposed by the LLM)
{ "tool": "fs_write", "args": {
    "path": "/tmp/out.txt",
    "content": "{{step1.content}}"
} }

// The runtime substitutes and invokes:
{ "tool": "fs_write", "args": {
    "path": "/tmp/out.txt",
    "content": ""
} }
The syntax holds ONLY in the args. In the text of the final_answer write the actual values (e.g. "I wrote 173 bytes"); NEVER {{step1.bytes_written}}. The planner's prompt explicitly instructs the LLM about this limit. A violation of this rule was one of the bugs caught by the POC.

The reference must be the sole value of the arg, not interpolated within longer strings (limit of v1.1; a future extension may allow "prefix {{step1.content}} suffix").

Where this syntax comes from

It is the result of an architectural choice. When the POC started without any data-piping mechanism, the LLM tried to invent a shell-style syntax ($(web_fetch(...)['content'])) that the runtime could not interpret: the file ended up containing that literal string instead of the data. It was an open architectural tension: a mechanism was needed, and five alternatives were on the table (verbatim copy, named variables, pipe primitives, deferring multistep entirely, template references). The {{stepN.field}} syntax is the simplest one that works: 30 lines of template substitution in the runtime, a brief instruction in the prompt, and Qwen3:8b follows it zero-shot.
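
Those 30 lines boil down to something like the following sketch (hypothetical names; history entries are assumed to carry the step's observation as a dict, as in the loop sketch of ch. 2). The anchored regex encodes the v1.1 limit that the reference must be the arg's sole value:

import re

_REF = re.compile(r"^\{\{step(\d+)\.(\w+)\}\}$")  # whole-value references only

def resolve_refs(args, history):
    resolved = {}
    for key, value in args.items():
        m = _REF.match(value) if isinstance(value, str) else None
        if m:
            step_no, field = int(m.group(1)), m.group(2)
            resolved[key] = history[step_no - 1]["observation"][field]
        else:
            resolved[key] = value
    return resolved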

8. Scratchpad for large observations

When an executor returns a lot of content (a 100 KB file, a long HTML page, a verbose API body), passing it whole into the LLM history blows up the context. Cutting at 1500 characters loses useful information.

The v1.1 solution: scratchpad. When an observation exceeds the threshold (4 KB of serialised JSON), the runtime saves it to a local SQLite (~/.local/share/metnos/scratchpad.db) and puts in the history a synthetic observation:

{
  "ok": true,
  "scratchpad_id": "eae04122bd704636",
  "size_bytes": 14144,
  "kind": "text",
  "summary": "hello this is a test note\n\n[... 13900 characters omitted ...]\n\nINFO 2026-04-26 23:59:59 LAST_CRITICAL_EVENT\n",
  "metadata": {"path": "/tmp/big_log.txt", "bytes": 14144, ...}
}

The summary is a smart truncation: the first 500 characters + a placeholder with the number of characters omitted + the last 500. This way the LLM sees the start and end of the content.
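
As a sketch (sizes from the paragraph above; the function name is hypothetical):

def summarize(content, head=500, tail=500):
    # smart truncation: keep the start and the end, count what is omitted
    if len(content) <= head + tail:
        return content
    omitted = len(content) - head - tail
    return (content[:head]
            + f"\n\n[... {omitted} characters omitted ...]\n\n"
            + content[-tail:])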

The LLM at the next step, seeing the summary, decides: either the summary already contains what it needs and it proceeds to the final_answer, or it calls scratchpad_read with the scratchpad_id to retrieve the portions it is missing.

scratchpad_read is a builtin: it lives in the runtime, has no manifest on disk, is added to the tool catalog dynamically when active scratchpad entries exist in the current turn.

Full details in the dedicated doc: scratchpad.html.

9. Vaglio of the plan

The vaglio is the constitutional evaluator: before a tool_call becomes action, it decides whether the action is lawful. In multistep it runs between one step and the next; in single-shot it runs post-hoc on the entire plan. Since April 27 the vaglio is real (no longer a stub) and works in two distinct phases, as per ch. 11 of the Architecture.

9.1 Guardia (binary)

Blocks violations of the 4 Laws through a fixed list of rules encoded in v1.1.

The list does not relax with the autonomy level: it is the "non-negotiable core" of ch. 5. If the guardia blocks, no score is computed: the verdict carries blocked_by="guard".

9.2 Giudice (graduated)

If the guardia lets it through, the giudice measures the alignment of the action to the user's telos in [0, 1]. Below the threshold METNOS_JUDGE_THRESHOLD (default 0.30, configurable via env) the action is denied with blocked_by="judge". Above, it is approved.

In v1.1 the giudice is rule-based: local heuristics, microseconds, zero cost. Base score 0.7, a bonus if the intent mentions the executor name (a signal of explicit intent), a penalty for .. in a path (possible path traversal), a penalty for arg keys with non-alphanumeric characters (an anomaly). The LLM giudice (middle tier, with a context separate from the proposer's to avoid self-confirmation) is deferred to v1.2: it requires a configured middle tier plus an explicit budget. The deontology/teleology split is already in the right place, so the giudice's implementation can evolve without touching the guardia.
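
A sketch of the rule-based giudice. The base score, the threshold, and the three heuristics come from the text above; the individual bonus/penalty weights and the names are illustrative:

import os
import re

JUDGE_THRESHOLD = float(os.environ.get("METNOS_JUDGE_THRESHOLD", "0.30"))

def judge(intent, executor_name, args):
    score = 0.7                        # base score
    if executor_name in intent:
        score += 0.2                   # explicit-intent bonus (illustrative weight)
    if ".." in str(args.get("path", "")):
        score -= 0.3                   # possible path traversal (illustrative weight)
    if any(not re.fullmatch(r"\w+", k) for k in args):
        score -= 0.2                   # anomalous arg keys (illustrative weight)
    score = max(0.0, min(1.0, score))
    approved = score >= JUDGE_THRESHOLD
    return approved, score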

The Verdict exposed by the vaglio module contains {approved, reason, score, blocked_by, judge_kind, ts}. The JSONL log on ~/.local/share/metnos/vaglio/YYYY-MM.jsonl records only the keys of args (not the values), for privacy.

10. Safety caps and runtime guards

The planner has three safety mechanisms against loops and ill-posed actions. They appeared in the POC as a response to real LLM behaviour, not out of abstract concern.

Mechanism | What it does | Default
cap_steps | Maximum number of steps per turn. | 5
cap_same_executor | Limit of calls to the same executor within the turn. | 2
guard duplicate read | If the LLM re-invokes fs_read/fs_write/web_fetch with the same path/url as a previous step, the runtime does not re-execute: it returns an observation that says "you already have this data at step X, formulate the final_answer". | active

The guard duplicate read emerged from the POC: without it, small LLMs (Qwen3:8b in particular) tended to re-read the same file with slightly different args hoping for a better result, ending up hitting cap_same_executor. The guard intercepts upstream and unblocks the formulation of the answer.

Exception: scratchpad_read is not subject to the guard, because calling it more than once with different mode/range on the same scratchpad_id is the normal use case.
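
A sketch of the guard (hypothetical names; history entries as in the loop sketch of ch. 2):

GUARDED = {"fs_read", "fs_write", "web_fetch"}  # scratchpad_read is exempt

def duplicate_read_guard(name, args, history):
    if name not in GUARDED:
        return None
    target = args.get("path") or args.get("url")
    for entry in history:
        prev = entry["call"]
        prev_target = prev["arguments"].get("path") or prev["arguments"].get("url")
        if prev["name"] == name and prev_target == target:
            return {"ok": True, "content":
                    f"you already have this data at step {entry['step']}, "
                    "formulate the final_answer"}
    return None  # no duplicate: proceed with execution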

11. What it writes to logs

For each turn the planner writes a JSONL line to ~/.local/share/metnos/turns/YYYY-MM-DD.jsonl with the complete record of the turn.

JSONL append-only, one file per day. No automatic rotation in v1.1 (with normal use ~3 MB/month, negligible).

11.1 Writing into the mnestome

In parallel with the JSONL log, the planner updates the mnestome (SQLite, a single file) through two simple hooks, activated only when piping between steps is observed.

Without piping, no mnest: the isolated invocation of a single executor does not represent a «passage between A and B» in the sense of ch. 2 of mnest. The write is fail-safe: an error on the mnestome is logged (verbosely) but does not interrupt the turn.

11.1.1 Synt-on-the-fly: immediate suggestion to the planner

When a proto-mnest has just been registered (the case above: a nonexistent tool with piping from a previous step), the planner immediately attempts a reactive compose-only synthesis by calling Synt.react(req) with router=None (the generate mode stays reserved for the nightly scheduler and the introspective cycles of ch. 11.2). If positive, the outcome of this call is added to the observation as a synt field.

Cost: one call to Composer.find_chain() (BFS on the mnestome, milliseconds) and a possible lock-check. No LLM, no budget. The call is fail-safe: any synt exception does not propagate and the observation remains the default one ("nonexistent executor: X"). See runtime/agent_runtime.py:_try_synt_compose.

11.2 Builtin scheduler

The scheduler is a builtin of the runtime, not an executor: it has no capability, is not signed, has no sandbox. It is a cron-style loop that runs recurring system tasks without user input. Three schedule formats are supported; the built-in tasks below use the daily@HH:MM form.

State persisted in workspace/.scheduler/state.sqlite (table tasks with last_run_at + last_status, table runs append-only). The due-time check is idempotent: two ticks in the same slot do not run the task twice.
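
For the daily@HH:MM format, an idempotent due-check can be as simple as the sketch below (hypothetical names; the slot is the calendar day, so a second tick inside the same slot is a no-op):

from datetime import datetime

def is_due(schedule, last_run_at, now=None):
    # schedule: e.g. "daily@04:00"; last_run_at: datetime of the last run, or None
    now = now or datetime.now()
    hh, mm = map(int, schedule.split("@")[1].split(":"))
    past_slot_time = (now.hour, now.minute) >= (hh, mm)
    already_ran_today = last_run_at is not None and last_run_at.date() == now.date()
    return past_slot_time and not already_ran_today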

Built-in tasks registered by default:

Name | Schedule | What it does
apply_ager | daily@04:00 | Calls Mnestoma.apply_ager(): decay + demote + proto purge on the mnestome.
synt_suggest | daily@04:30 | For each recurring proto-mnest (uses ≥ 3, weight ≥ 0.30) calls Synt.react() in compose-only mode and logs the outcome.

The daemon loop (scheduler daemon) is a single process that runs tick() every 60 s and then goes idle. No concurrency, no inter-process locking: the scheduler in v1.1 is a singleton on metnos-server. The error policy is "never let the loop die": every task exception is caught, recorded as last_status='error' with the traceback in last_output, and the tick continues with the following tasks.

The design of the scheduler as a builtin is consistent with the decision in the Dialogue on executors: system maintenance (decay, nightly synthesis) is part of the runtime, not an executor the system "decides to call". See also the memory builtin executors proposals for the future builtin triad (scheduler, ager, snapshot).

12. When things go wrong

The POC verified five typical failure modes. They became permanent test cases of the test framework.

What breaks | What happens
Ollama unreachable | The provider raises ProviderError; the planner ends the turn with final_kind=error and a clear message.
Executor crashes (uncaught Python exception) | The subprocess exits with stderr; the runtime returns {ok: false, error: "non-JSON output: …; stderr: …"}.
Executor code modified after signing | The loader rejects it at load time (digest mismatch). The executor never enters the catalog.
Executor returns non-JSON stdout | The runtime detects it and returns {ok: false, error: "non-JSON output"}.
Empty catalog | The turn ends cleanly with final_kind=error, message "(empty catalog)".

The principle: never let the system fall over; always return a structured response, even when it is an error. The user understands what happened, the LLM at the next step (in multistep) can correct course, and downstream executors never see malformed input.

13. What is deferred to v1.2+

Feature | When it lands
Real online mode | When an Anthropic provider with an API key is configured.
hybrid mode with real escalation rules | When the vocabulary of critical_capabilities is decided and auto-routing is desired.
Probabilistic vaglio (LLM-judge) | When constitution.html v1.1 exists as a reference.
Auto-escalation of tier (fast → middle → wise if steps fail) | When at least two distinct tiers are configured and a use case motivates it.
Per-tier prompt differentiation (fast/middle/wise) even with the same model | When the use case requires multiple points of view on the same situation.
Parallel tool calls within a turn | When a real case shows a significant latency win. Binding constraint for v1.2.
Async approval (deferred execution with durable per_target grants) | When channel + scheduler require asynchronous interaction.
History-driven mnestome boost in the pre-filter | When the operational mnestome exists.
Embeddings (local MiniLM) in the pre-filter | When bag-of-words shows practical limits (not surfaced in the POC).
Automatic replay of orphan turns after a crash | When executors are guaranteed idempotent.
Automatic JSONL rotation | When the volume becomes relevant.

Closing notes

This document describes what the planner does today, validated by tests. The "deferred to v1.2" sections describe what we expect, not what we promise. When an extension lands, this doc will be updated exactly as POC v1.1 updated the previous version: rather than speculating on how things will be, we write down what works.