
Metnos

introvertive fast-path — how the assistant learns to answer instantly
Microarchitecture v1.0 — 19 May 2026
Aligned with ADR 0094 (literal fast_path), 0149 (canonical_matcher),
0150 (multi-tool memoization).

Audience: anyone who wants to understand why Metnos sometimes answers
in half a second, and sometimes takes ten.
Reading time: 12 minutes.

Table of contents

  1. The idea, in two lines
  2. Why a fast-path is necessary
  3. Three layers, from the simplest to the richest
  4. L0 — literal patterns (fast_path.py)
  5. L1 — a single executor (canonical_matcher)
  6. L2 — multi-step sequences (multi_tool_paths)
  7. A TTL that measures use, not time
  8. Composing fast-paths: two hold hands
  9. Chaining the fast-path to the planner
  10. The arguments extractor
  11. Persistent configuration
  12. Promotion to a synthesised executor
  13. Open questions

1. The idea, in two lines

When a request reaches Metnos, before bothering the LLM planner we try to recognise it. If we have already seen it and we have already decided well how to answer, we replay it and return to the user in half a second instead of ten. No language models on the critical path, just memory.

2. Why a fast-path is necessary

The Metnos planner is a local Gemma 4 26B: it thinks well but takes about twelve seconds to decide the first step of a turn. For many requests this wait is disproportionate. “What time is it?” does not need an LLM: it needs get_now. Even “download this page and describe it in two lines” does not need the planner if it is a request we make often and whose sequence we already know: get_urls followed by describe_entries.

Hence the question: how do we recognise that a request is already known? And when is "almost known" close enough? The answer is the introvertive fast-path, organised in three layers. Each captures a different degree of similarity, and all converge on the same outcome: the assistant executes the right sequence without thinking twice.

3. Three layers, from the simplest to the richest

| Layer | What it recognises | How | Cost per hit |
|-------|--------------------|-----|--------------|
| L0 | Well-known literal phrases (“what time is it”, “where am I”) | Closed table of normalised patterns | microseconds |
| L1 | A request semantically equivalent to one already seen, with ONE executor | BGE-M3 cosine over the canonical query log | ~30 ms |
| L2 | A multi-step pipeline already executed before, with the same intent | BGE-M3 cosine over the sequences log + placeholder resolution | ~30 ms + N invocations |

The flow is strictly sequential: first L0 is attempted, then L1 if L0 missed, and only if L1 also misses is L2 tried. If all three miss, the LLM planner (L3) takes over as always.

Note. The planner does not disappear: the fast-path is an additional route, not a substitute one. When the request is new, ambiguous, or below the threshold of any of the searches, the planner takes the wheel as if the fast-path were not there.
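
The sequential cascade described above can be sketched as a loop over layers. This is a minimal illustration, not the Metnos code: the function names `answer`, `l0`, `l1`, `l2` and `planner` are hypothetical stand-ins, and each layer is assumed to return `None` on a miss.

```python
from typing import Callable, List, Optional

# A layer takes the raw request and returns an answer, or None on a miss.
Layer = Callable[[str], Optional[str]]


def answer(request: str, layers: List[Layer], planner: Callable[[str], str]) -> str:
    """Try each fast-path layer in order; fall back to the planner if all miss."""
    for layer in layers:
        result = layer(request)
        if result is not None:
            return result
    return planner(request)


# Hypothetical stand-ins for the three layers and the LLM planner.
def l0(request: str) -> Optional[str]:
    return "get_now" if request == "what time is it" else None


def l1(request: str) -> Optional[str]:
    return None  # always misses in this toy example


def l2(request: str) -> Optional[str]:
    return None  # always misses in this toy example


def planner(request: str) -> str:
    return "planned:" + request
```

With these stand-ins, `answer("what time is it", [l0, l1, l2], planner)` is served by the first layer, while any other request falls through to the planner.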

4. L0 — literal patterns

The lowest layer lives in runtime/fast_path.py and contains a handful of hand-coded text patterns. They are the requests that recur for everyone, said in a thousand interchangeable ways: the time, the date, one's own location, undo. For each we have listed the Italian and English variants we realistically expect (“what time is it”, “what's the time”, “tell me the time”, “che ora è”, “che ore sono”...) and the canonical executor to invoke.

When a request comes in, we normalise it (lowercase, curly apostrophes converted to straight ones, punctuation stripped) and look for an exact match in the dictionary. If there is one, the executor starts immediately, and the answer is formatted by a small Italian/English template. No language model, no network, no perceived delay.
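
A minimal sketch of the normalise-then-lookup step follows. The names `normalise`, `L0_PATTERNS` and `l0_lookup` are hypothetical, and the sketch assumes that apostrophes survive normalisation (so that "what's the time" remains a usable key) while all other punctuation is dropped.

```python
import re
from typing import Optional


def normalise(text: str) -> str:
    """Lowercase, straighten apostrophes, drop other punctuation, squeeze spaces."""
    text = text.lower().replace("\u2019", "'").replace("\u2018", "'")
    # Keep word characters (accented letters included), whitespace and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


# Closed table: normalised phrase -> canonical executor (phrases from the text above).
L0_PATTERNS = {
    "what time is it": "get_now",
    "what's the time": "get_now",
    "tell me the time": "get_now",
    "che ora è": "get_now",
    "che ore sono": "get_now",
}


def l0_lookup(request: str) -> Optional[str]:
    """Return the executor for an exact normalised match, or None on a miss."""
    return L0_PATTERNS.get(normalise(request))
```

Because the match is exact, a request such as "what time is it in Tokyo" misses the table and moves on to the next layer, which is the conservative behaviour the closed table is meant to guarantee.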

It is a closed table: it is extended only by adding new patterns, never with complicated regular expressions. The golden rule: when in doubt whether it fits, don't add it. Conservative by construction.

5. L1 — a single executor (canonical_matcher)

The second layer takes care of requests that resolve to ONE first-choice executor: “find the baptism photos”, “show me today's appointments”, “read this page”. These are cases where we cannot enumerate every variant by hand, because users vary topic, word order and vocabulary from one request to the next. In practice the planner would always call the same executor: we want to reach the same conclusion without the long detour.

To achieve this, every time the planner decides the first step of a turn, Metnos records a row in the canonical_query_log table of the mnestoma with three things: the request in lemma form (a canonical, lowercase, condensed version), the chosen executor, and the argument values it actually used. From that moment on, that row is a candidate.

When a new request comes in, we transform it into its semantic vector with BGE-M3 (the same multilingual model we use for affinities) and compute the cosine against all candidates with at least three observations. If the closest candidate exceeds the 0.95 cosine threshold, we accept it: the new request is served with the executor from the matched row, and the arguments are extracted by the deterministic extractor (see §10) or reused from the values stored at the first run.

Why a minimum of three observations, not one? Because a single observation might be noise: the user typed a strange phrase, clicked by mistake, or asked something that will not come back. Three passes consolidate the pattern. Both the minimum number of observations (canonical_query_min_uses) and the cosine threshold (canonical_query_threshold) are configurable in ~/.config/metnos/runtime.toml.
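
The selection rule can be sketched with plain Python and toy vectors. This is an illustration under stated assumptions: the `Candidate` class, the `match_l1` function and the executor name in the usage note are hypothetical, real vectors would come from BGE-M3, and the comparison is assumed to be "greater than or equal to" the threshold.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    lemma: str           # canonical form of the logged request
    executor: str        # executor the planner chose
    uses: int            # number of observations so far
    vector: List[float]  # embedding of the lemma (toy values here)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_l1(query_vec: List[float], candidates: List[Candidate],
             threshold: float = 0.95, min_uses: int = 3) -> Optional[Candidate]:
    """Return the closest eligible candidate if it clears the threshold."""
    eligible = [c for c in candidates if c.uses >= min_uses]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: cosine(query_vec, c.vector))
    return best if cosine(query_vec, best.vector) >= threshold else None
```

A candidate with fewer than `min_uses` observations is never returned, even when its vector is identical to the query: the filter runs before the similarity search.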

6. L2 — multi-step sequences (multi_tool_paths)

Some requests require more than one executor in sequence, with the result of the first feeding the second. “Download that page and tell me what it says” is two steps: get_urls followed by describe_entries. Here too the pattern repeats: for frequent pipelines the planner always ends up emitting the same sequence, but with twelve seconds of latency each time.

The third layer memoises the sequences. At the end of every successful turn with two or more steps, Metnos records into multi_tool_paths.sqlite the pair (canonical query, executor sequence + argument placeholders). After three identical passes, the pipeline becomes a candidate. From that moment on, a new request similar enough (cosine ≥ 0.88) is recognised and replayed in full: no planner, no interruption.

The 0.88 threshold is more permissive than 0.95 because multi-step pipelines have more lexical variants of the same intent (“download and describe”, “read and summarise”, “give me the gist” are the same thing from the execution point of view). The protection against false positives is not the threshold, but placeholder resolution: if the new request does not carry the data the pipeline needs (for example there is no URL in the text but the pipeline requires one), the match fails and we fall back to the planner. No harm done.
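
Placeholder resolution as a guard against false positives can be sketched as follows. The representation of a stored pipeline (a list of executor names with the request-derived arguments each one requires) and the names `PIPELINE`, `URL_RE` and `resolve_placeholders` are assumptions made for illustration only.

```python
import re
from typing import Dict, List, Optional, Tuple

URL_RE = re.compile(r"https?://\S+")

# Hypothetical memoised pipeline: (executor, arguments that must come from the request).
PIPELINE: List[Tuple[str, List[str]]] = [
    ("get_urls", ["url"]),      # needs a URL found in the request text
    ("describe_entries", []),   # consumes the previous step's output
]


def resolve_placeholders(pipeline: List[Tuple[str, List[str]]],
                         request: str) -> Optional[List[Tuple[str, Dict[str, str]]]]:
    """Fill every placeholder from the request, or return None if one is missing."""
    available: Dict[str, str] = {}
    match = URL_RE.search(request)
    if match:
        available["url"] = match.group(0)
    steps = []
    for executor, required in pipeline:
        if any(name not in available for name in required):
            return None  # the match fails; the caller falls back to the planner
        steps.append((executor, {name: available[name] for name in required}))
    return steps
```

A request that is semantically close but carries no URL yields `None`, so even a permissive cosine threshold cannot trigger a replay with missing data.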

Even L2 needs a starting point. On day one the table is empty: the system accumulates observations while the user does normal things, and after a week of daily use it starts recognising recurring pipelines. For this reason L2 ships with the switch off by default (enabled = false): it is turned on once there are enough rows to be useful.

7. A TTL that measures use, not time

What happens to a memoisation that goes unused for months? If Roberto leaves for two months on holiday, his pipelines should not be deleted: they will serve him again on his return. For this reason the rows of multi_tool_paths do not expire on the calendar, but in days of effective activity.

The system_active_days table records a row for each day on which Metnos saw at least one turn. Each row carries a day_rank, a monotonically increasing counter: the first active day is 1, the second is 2, and so on, skipping empty days. Every memoisation stores the day_rank of the last day it was useful. When the difference between the current day_rank and that stored value exceeds N (ttl_active_days, 30 by default), the row is removed.

The interesting consequence: two months of absence don't count. The counter advances only on days of use. A TTL measured in “days of the assistant's real life”, not in “Gregorian calendar days”.
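
The arithmetic is small enough to sketch directly. The helper names `day_ranks` and `is_expired` are hypothetical; the rule they implement (expire when the day_rank difference exceeds the TTL) is the one stated above.

```python
from datetime import date
from typing import Dict, Iterable


def day_ranks(active_dates: Iterable[date]) -> Dict[date, int]:
    """Map each active date to its day_rank (1, 2, 3, ...); idle days get no rank."""
    return {d: rank for rank, d in enumerate(sorted(set(active_dates)), start=1)}


def is_expired(last_useful_rank: int, current_rank: int,
               ttl_active_days: int = 30) -> bool:
    """A row expires once more than ttl_active_days active days have passed."""
    return current_rank - last_useful_rank > ttl_active_days


# Three days of use in June, then a two-month absence, then one day in August.
active = [date(2026, 6, 1), date(2026, 6, 2), date(2026, 6, 3), date(2026, 8, 4)]
ranks = day_ranks(active)
```

Here the day in August has rank 4, so a pipeline last useful on 3 June (rank 3) is only one active day old on the user's return, although two calendar months have gone by.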

8. Composing fast-paths: two hold hands

There is an interesting cascading effect. When a turn contains steps already served by the fast-path (because some sub-pieces of the request are known), Metnos does not discard them from L2 recording. The complete sequence — including steps handled by L0 or L1 — is memoised for what it is: a pipeline that worked. After three passes, that combination too becomes a multi-step fast-path.

This means the system learns to compose: two adjacent fast-paths become a longer fast-path, recursively. Since the historical decision has already been approved for each step, composing several of them adds no risk, only determinism.

9. Chaining the fast-path to the planner

An optional variant (off by default today) allows the L2 fast-path to hand over to the planner after executing its sequence. It is the case where the request has multiple verbs: “download me this page and then send me the summary by email”. The prefix “download and describe” is a known L2; but “send me by email” is residual intent.

If the chain_to_planner switch is on and the continuation detector (multi-verb, conjunctions) finds further actions in the request, Metnos executes the prefix with the fast-path, records it in the turn's working memory, and yields to the planner for the remaining steps. The planner sees an already populated history and decides only what is missing, starting from the next step.

10. The arguments extractor

Recognising the request is half the work. The other half is reconstructing the concrete argument values: which paths, which URLs, which date, which threshold. For this Metnos has a small deterministic extractor (args_extractor.py) that works by rules, in three steps:

  1. Memory. If the row matched in the canonical_query_log has values observed on the first run (column args_observed), we reuse them as-is.
  2. Rules. Otherwise we apply closed regular expressions for known types: URL (https://...), path (~/... or /..., with the “home” shortcut becoming ~/), email, numbers, file extensions (“PDF file” becomes *.pdf), dates (today/yesterday/tomorrow/day-after-tomorrow in Italian and English, mapped to ISO format), time windows (“this week”, “last 24 hours”, “last 7 days”).
  3. LLM, only if needed. If mandatory arguments remain uncovered, and the canonical_query_args_llm switch is on, the Gemma fast tier is called with a constrained schema and a minimal prompt. The result goes into an on-disk cache so the call is not repeated. Off by default: experience says the rules cover about 90% of real requests.
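
The first two steps can be sketched as follows: reuse stored values when present, otherwise apply closed rules. This is not the content of args_extractor.py. The function name `extract_args`, the argument keys (`url`, `glob`, `date`), the list of extensions and the exact regular expressions are all assumptions chosen for illustration, and only a few of the types listed above are covered.

```python
import re
from datetime import date, timedelta
from typing import Dict, Optional

URL_RE = re.compile(r"https?://[^\s,;]+")
# Closed list of extensions, so that ordinary words are never read as a file type.
EXT_RE = re.compile(r"\b(pdf|txt|csv|jpg|png|docx) files?\b", re.IGNORECASE)
# Relative day words in English and Italian, as offsets from today.
DAY_OFFSETS = {"today": 0, "oggi": 0, "yesterday": -1, "ieri": -1,
               "tomorrow": 1, "domani": 1}


def extract_args(text: str, today: date,
                 args_observed: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Step 1: reuse stored values. Step 2: apply closed rules to the text."""
    if args_observed:
        return dict(args_observed)
    args: Dict[str, str] = {}
    match = URL_RE.search(text)
    if match:
        args["url"] = match.group(0)
    match = EXT_RE.search(text)
    if match:
        args["glob"] = "*." + match.group(1).lower()
    for word, offset in DAY_OFFSETS.items():
        if re.search(rf"\b{word}\b", text, re.IGNORECASE):
            args["date"] = (today + timedelta(days=offset)).isoformat()
            break
    return args
```

Passing `today` explicitly keeps the function deterministic, so the same request always produces the same arguments for a given day.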

11. Persistent configuration

All fast-path parameters live in a TOML file at ~/.config/metnos/runtime.toml, generated at first boot by the setup. A METNOS_* environment variable always takes precedence, for debugging and benchmarks; the TOML file establishes the persistent value across restarts, and the default hardcoded in the module is the last safety net.

[fast_path]
canonical_query_enabled    = true
canonical_query_min_uses   = 3
canonical_query_threshold  = 0.95
canonical_query_args_llm   = false

[multi_tool_fast_path]
enabled            = false
chain_to_planner   = false
threshold          = 0.88
min_uses           = 3
ttl_active_days    = 30
k_synth            = 50

12. Promotion to a synthesised executor

There is one last step that closes the circle. When an L2 pipeline is used often enough to deserve different treatment (50 uses, the threshold is configurable), a nightly job (multi_tool_promote) promotes it to a candidate for full synthesis. It creates a proto-mnest in the mnestoma, with the desired signature (executor sequence, placeholders, canonical query), and from that moment on the synthesis engine can produce ONE new unified executor that wraps the whole pipeline. The promotion is idempotent: a row already promoted is ignored on subsequent runs.

The bridge gives us two coexisting tiers. L2 captures low volume: sequences seen between 3 and 50 times, replayed deterministically via N executor invocations. The synthesised executor takes over when the pattern is stably recurring and worth the investment of building a monolithic executor that does everything in one shot. The two do not exclude each other: they are complementary.
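
The idempotence of the promotion job can be sketched with an in-memory SQLite table. The schema (`paths` with `query`, `uses`, `promoted` columns) and the functions `open_demo_db` and `promote` are hypothetical, and the sketch assumes that the k_synth setting is the 50-use threshold mentioned above. It only flags rows and does not create a proto-mnest.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create a toy table with one heavily used pipeline and one rarely used."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE paths (query TEXT PRIMARY KEY, uses INTEGER, "
        "promoted INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO paths (query, uses) VALUES (?, ?)",
        [("download and describe", 60), ("read and mail", 10)],
    )
    conn.commit()
    return conn


def promote(conn: sqlite3.Connection, k_synth: int = 50) -> int:
    """Flag rows that reached k_synth uses; return how many were newly promoted."""
    cur = conn.execute(
        "UPDATE paths SET promoted = 1 WHERE uses >= ? AND promoted = 0",
        (k_synth,),
    )
    conn.commit()
    return cur.rowcount
```

Running `promote` twice on the same database promotes one row the first time and none the second, because the `promoted = 0` condition skips rows that were already handled.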

13. Open questions

Some points are deliberately left for the future, because they become interesting only with a bit of volume.

© 2026 Roberto Brunialti · Metnos documentation · Reference ADRs: 0094 / 0149 / 0150