When a request reaches Metnos, before bothering the LLM planner we try to recognise it. If we have already seen it and have already decided well how to answer, we replay it and return to the user in half a second instead of ten. No language models on the critical path, just memory.
The Metnos planner is a local Gemma 4 26B: it thinks well but takes
about twelve seconds to decide the first step of a turn. For many
requests this wait is disproportionate. “What time is it?”
does not need an LLM: it needs get_now. Even
“download this page and describe it in two lines” does
not need the planner if it is a request we make often and whose
sequence we already know: get_urls followed by
describe_entries.
Hence the question: how do we recognise that a request is already known? And how do we accept that almost known is enough? The answer is the introvertive fast-path, organised in three layers — each captures a different degree of similarity, all converging to the same outcome: the assistant executes the right sequence without thinking twice.
| Layer | What it recognises | How | Cost per hit |
|---|---|---|---|
| L0 | Well-known literal phrases (“what time is it”, “where am I”) | Closed table of normalised patterns | microseconds |
| L1 | A request semantically equivalent to one already seen, with ONE executor | BGE-M3 cosine over the canonical query log | ~30 ms |
| L2 | A multi-step pipeline already executed before, with the same intent | BGE-M3 cosine over the sequences log + placeholder resolution | ~30 ms + N invocations |
The flow is strictly sequential: first L0 is attempted, then L1 if L0 missed, and only if L1 also misses is L2 tried. If all three miss, the LLM planner (L3) takes over as always.
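The cascade can be pictured with a minimal sketch. The names `route` and `Layer`, and the convention that a layer returns `None` on a miss, are assumptions made for illustration and are not taken from the Metnos code.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical convention: a layer returns a result on a hit, None on a miss.
Layer = Callable[[str], Optional[str]]


def route(
    request: str,
    layers: Sequence[Tuple[str, Layer]],
    planner: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each fast-path layer in order; fall back to the planner."""
    for name, layer in layers:
        result = layer(request)
        if result is not None:
            # First hit wins: later layers are never consulted.
            return name, result
    # All layers missed: the LLM planner takes over as always.
    return "planner", planner(request)
```

Keeping the order explicit in a list makes the "strictly sequential" rule visible: the cheapest layer is always tried first.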
The lowest layer lives in runtime/fast_path.py and
contains a handful of hand-coded text patterns. They are the requests
that recur for everyone, said in a thousand interchangeable ways: the
time, the date, one's own location, undo. For each we have listed the
Italian and English variants we realistically expect (“what time
is it”, “what's the time”, “tell me the
time”, “che ora è”, “che ore
sono”...) and the canonical executor to invoke.
When a request comes in, we normalise it (lowercase, straight apostrophes, punctuation stripped) and look for an exact match in the dictionary. If there is one, the executor starts immediately; the answer is formatted by a small Italian/English template. No language model, no network, no perceived delay.
It is a closed table: it is extended only by adding new patterns, never with complicated regular expressions. The golden rule: when in doubt whether it fits, don't add it. Conservative by construction.
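A minimal sketch of the normalise-then-lookup idea follows. It is not the content of runtime/fast_path.py: the table entries, the function names, and the choice to keep apostrophes while stripping other punctuation are assumptions; only get_now comes from the text.

```python
import re
from typing import Optional

# Hypothetical closed table: normalised phrase -> canonical executor.
L0_TABLE = {
    "what time is it": "get_now",
    "what's the time": "get_now",
    "tell me the time": "get_now",
    "che ora è": "get_now",
    "che ore sono": "get_now",
}


def normalise(text: str) -> str:
    """Lowercase, straighten apostrophes, strip other punctuation."""
    text = text.lower().replace("\u2019", "'").replace("\u2018", "'")
    # Assumption: the apostrophe survives, every other punctuation mark goes.
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def l0_lookup(text: str) -> Optional[str]:
    """Exact match only: anything not in the table is a miss."""
    return L0_TABLE.get(normalise(text))
```

Because the match is exact, a near variant such as "what time is it in Tokyo" misses and falls through to the next layer, which is the conservative behaviour described above.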
The second layer (canonical_matcher) takes care of requests with ONE single executor of first choice: “find the baptism photos”, “show me today's appointments”, “read this page”. These are cases where we cannot enumerate by hand all possible variants, because in the meantime the user changes topics, order, lexicon. The planner in practice would always call the same executor: we want to reach the same conclusion without the long detour.
To achieve this, every time the planner decides the first step of a
turn, Metnos records into the canonical_query_log table
in the mnestoma a row with three things: the request in lemma form
(canonical, lowercase, condensed version), the chosen executor, and
the values of the arguments it actually used. From that moment on,
that row is a candidate.
When a new request comes in, we transform it into its semantic vector with BGE-M3 (the same multilingual model we use for affinities) and compute the cosine against all candidates with at least three observations. If the closest candidate exceeds the 0.95 cosine threshold, we accept it: the new request is served with the executor from the matched row, and the arguments are extracted by the deterministic extractor (see §10) or reused from the values stored at the first run.
Both the threshold and the minimum number of observations are configurable in ~/.config/metnos/runtime.toml.
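The matching step of this second layer can be sketched as follows. The embeddings are taken as given (in Metnos they would come from BGE-M3; here they are plain lists of floats), and the `Candidate` type, the function names, and the inclusive comparison against the threshold are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    executor: str            # executor chosen by the planner in the past
    vector: Sequence[float]  # embedding of the canonical request
    uses: int                # number of observations of this row


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def l1_match(
    query_vec: Sequence[float],
    candidates: List[Candidate],
    threshold: float = 0.95,
    min_uses: int = 3,
) -> Optional[Candidate]:
    """Return the closest well-observed candidate, or None on a miss."""
    eligible = [c for c in candidates if c.uses >= min_uses]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: cosine(query_vec, c.vector))
    # Assumption: a score equal to the threshold counts as a hit.
    return best if cosine(query_vec, best.vector) >= threshold else None
```

Filtering on the number of observations before ranking means a row seen only once can never win, however close its vector is.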
Some requests require more than one executor in sequence, with the
result of the first feeding the second. “Download that page
and tell me what it says” is two steps: get_urls
followed by describe_entries. Here too the pattern
repeats: for frequent pipelines the planner always ends up emitting
the same sequence, but with twelve seconds of latency each time.
The third layer memoises the sequences. At the end of every
successful turn with two or more steps, Metnos records into
multi_tool_paths.sqlite the pair (canonical query,
executor sequence + argument placeholders). After three identical
passes, the pipeline becomes a candidate. From that moment on, a new
request similar enough (cosine ≥ 0.88) is recognised and replayed
in full: no planner, no interruption.
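The "three identical passes" rule can be illustrated with an in-memory counter. The real store is multi_tool_paths.sqlite; the `SequenceMemo` class below and its method names are hypothetical and only show the counting logic.

```python
from collections import Counter
from typing import List, Sequence, Tuple


class SequenceMemo:
    """In-memory stand-in for the sequence log (illustrative only)."""

    def __init__(self, min_uses: int = 3) -> None:
        self.min_uses = min_uses
        self.counts: Counter = Counter()

    def record_turn(
        self, canonical_query: str, steps: Sequence[str], succeeded: bool = True
    ) -> None:
        # Only successful turns with two or more steps are memoised.
        if not succeeded or len(steps) < 2:
            return
        self.counts[(canonical_query, tuple(steps))] += 1

    def candidates(self) -> List[Tuple[str, Tuple[str, ...]]]:
        """Pairs seen at least min_uses times are eligible for replay."""
        return [key for key, n in self.counts.items() if n >= self.min_uses]
```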
The 0.88 threshold is more permissive than 0.95 because multi-step pipelines have more lexical variants of the same intent (“download and describe”, “read and summarise”, “give me the gist” are the same thing from the execution point of view). The protection against false positives is not the threshold, but placeholder resolution: if the new request does not carry the data the pipeline needs (for example there is no URL in the text but the pipeline requires one), the match fails and we fall back to the planner. No harm done.
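A sketch of placeholder resolution as the guard against false positives is shown below. The pipeline layout (a dict with `steps` and `placeholders`), the resolver table, and the function names are assumptions; only the executor names come from the text.

```python
import re
from typing import Dict, List, Optional, Tuple

URL_RE = re.compile(r"https?://\S+")


def find_url(text: str) -> Optional[str]:
    match = URL_RE.search(text)
    return match.group(0).rstrip(".,;)") if match else None


# Hypothetical resolver table: placeholder name -> extraction function.
RESOLVERS = {"url": find_url}


def resolve_placeholders(
    request: str, placeholders: List[str]
) -> Optional[Dict[str, str]]:
    values: Dict[str, str] = {}
    for name in placeholders:
        resolver = RESOLVERS.get(name)
        value = resolver(request) if resolver else None
        if value is None:
            # Required data is missing: the match fails as a whole.
            return None
        values[name] = value
    return values


def l2_replay_plan(
    request: str, pipeline: dict
) -> Optional[List[Tuple[str, Dict[str, str]]]]:
    """Return the steps to replay, or None to fall back to the planner."""
    values = resolve_placeholders(request, pipeline["placeholders"])
    if values is None:
        return None
    return [(step, values) for step in pipeline["steps"]]
```

A request that is semantically close but carries no URL returns `None`, so a permissive cosine threshold alone never triggers a replay.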
This layer is off by default (enabled = false): it gets turned on when the
rows are sufficient to be useful.
What happens to a memoisation that goes unused for months? If Roberto
leaves for two months on holiday, his pipelines should not be deleted:
they will serve him again on his return. For this reason the rows of
multi_tool_paths do not expire on the calendar, but in
days of effective activity.
The system_active_days table records a row for each day
in which Metnos saw at least one turn. Each row carries a
day_rank, a monotonically increasing counter: the first
useful day is 1, the second is 2, and so on, skipping empty days.
Every memoisation stores the day_rank of the last time
it was useful. When the difference between the current day_rank and
that value exceeds N (30 by default), the row is removed.
The interesting consequence: two months of absence don't count. The counter advances only on days of use. A TTL measured in “days of the assistant's real life”, not in “Gregorian calendar days”.
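The active-day arithmetic can be sketched in a few lines. A plain dict stands in for the system_active_days table, the function names are hypothetical, and days are assumed to be recorded in chronological order.

```python
from typing import Dict


def record_active_day(active_days: Dict[str, int], day: str) -> int:
    """Return the day_rank for `day`, assigning the next rank on first sight.

    `active_days` maps an ISO date to its day_rank; empty days never
    appear in it, so they never advance the counter.
    """
    if day not in active_days:
        active_days[day] = len(active_days) + 1
    return active_days[day]


def is_expired(
    last_useful_rank: int, current_rank: int, ttl_active_days: int = 30
) -> bool:
    """A row expires once more than ttl_active_days active days have passed."""
    return current_rank - last_useful_rank > ttl_active_days
```

With this scheme, a turn on 10 January followed by the next turn on 15 March moves the rank from 1 to 2, so the two-month gap costs a memoisation a single day of its lifetime.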
There is an interesting cascading effect. When a turn contains steps already served by the fast-path (because some sub-pieces of the request are known), Metnos does not discard them from L2 recording. The complete sequence — including steps handled by L0 or L1 — is memoised for what it is: a pipeline that worked. After three passes, that combination too becomes a multi-step fast-path.
It means the system learns to compose: two adjacent fast-paths become a longer fast-path, recursively. The historical decision is already approved for each step, so composing several together adds no risk, only determinism.
An optional variant (off by default today) allows the L2 fast-path to hand over to the planner after executing its sequence. This covers requests with multiple verbs: “download this page for me and then send me the summary by email”. The prefix “download and describe” is a known L2, but “send me by email” is residual intent.
If the chain_to_planner switch is on and the continuation
detector (multi-verb, conjunctions) finds further actions in the
request, Metnos executes the prefix with the fast-path, records it in
the turn's working memory, and yields to the planner for the remaining
steps. The planner sees an already populated history and decides only
what is missing, starting from the next step.
Recognising the request is half the work. The other half is
reconstructing the concrete argument values: which paths, which URLs,
which date, which threshold. For this Metnos has a small deterministic
extractor (args_extractor.py) that works by rules, in
three steps:
1. If the matched row in canonical_query_log has values observed on the first run (column args_observed), we reuse them as-is.
2. Otherwise, deterministic rules pick the values out of the text: URLs (https://...), paths (~/... or /..., with the “home” shortcut becoming ~/), emails, numbers, file extensions (“PDF file” becomes *.pdf), dates (today/yesterday/tomorrow/day-after-tomorrow in Italian and English, mapped to ISO format), and time windows (“this week”, “last 24 hours”, “last 7 days”).
3. If the rules are not enough and the canonical_query_args_llm switch is on, the Gemma fast tier is called with a constrained schema and a minimal prompt. The result goes into an on-disk cache so the call is not repeated. Off by default: experience says the rules cover about 90% of real requests.
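The rule-based step can be illustrated with a small subset of the rules. This is not the content of args_extractor.py: the regular expressions, the argument names (`url`, `email`, `pattern`, `date`), and the short list of extensions and date words are assumptions chosen for the example.

```python
import re
from datetime import date, timedelta
from typing import Dict

URL_RE = re.compile(r"https?://[^\s)]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Illustrative subset of extensions: "PDF file" -> "*.pdf".
EXT_RE = re.compile(r"\b(pdf|txt|csv|jpg|png)\s+files?\b", re.IGNORECASE)
# Relative dates in English and Italian, as day offsets from today.
RELATIVE_DAYS = {
    "today": 0, "oggi": 0,
    "yesterday": -1, "ieri": -1,
    "tomorrow": 1, "domani": 1,
}


def extract_args(text: str, today: date) -> Dict[str, str]:
    """Extract argument values by rule; no language model involved."""
    args: Dict[str, str] = {}
    url = URL_RE.search(text)
    if url:
        args["url"] = url.group(0)
    email = EMAIL_RE.search(text)
    if email:
        args["email"] = email.group(0)
    ext = EXT_RE.search(text)
    if ext:
        args["pattern"] = "*." + ext.group(1).lower()
    lowered = text.lower()
    for word, offset in RELATIVE_DAYS.items():
        if re.search(rf"\b{word}\b", lowered):
            # Dates are mapped to ISO format (YYYY-MM-DD).
            args["date"] = (today + timedelta(days=offset)).isoformat()
            break
    return args
```

Passing `today` explicitly keeps the function deterministic, which also makes it easy to test.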
All fast-path parameters live in a TOML file at
~/.config/metnos/runtime.toml, generated at first boot
by the setup. A METNOS_* environment variable always
takes precedence, for debugging and benchmarks; the TOML file
establishes the persistent value across restarts, and the default
hardcoded in the module is the last safety net.
    [fast_path]
    canonical_query_enabled = true
    canonical_query_min_uses = 3
    canonical_query_threshold = 0.95
    canonical_query_args_llm = false

    [multi_tool_fast_path]
    enabled = false
    chain_to_planner = false
    threshold = 0.88
    min_uses = 3
    ttl_active_days = 30
    k_synth = 50
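The precedence order (environment variable, then TOML, then hardcoded default) can be sketched as follows. The source only says the variables start with METNOS_, so the exact naming scheme `METNOS_<SECTION>_<KEY>` is an assumption, as is the `get_param` helper. The parsed TOML is passed in as a dict, which in Python 3.11+ could come from `tomllib.load`; the sketch handles numeric parameters only.

```python
import os
from typing import Mapping, Optional

# Last safety net: defaults hardcoded in the module (two shown here).
DEFAULTS = {
    ("fast_path", "canonical_query_threshold"): 0.95,
    ("multi_tool_fast_path", "threshold"): 0.88,
}


def get_param(
    section: str,
    key: str,
    toml_data: Mapping[str, Mapping[str, float]],
    environ: Optional[Mapping[str, str]] = None,
) -> float:
    """Resolve a numeric parameter: env var, then TOML, then default."""
    environ = os.environ if environ is None else environ
    # Hypothetical naming scheme for the METNOS_* override.
    env_name = f"METNOS_{section}_{key}".upper()
    if env_name in environ:
        return float(environ[env_name])
    if key in toml_data.get(section, {}):
        return toml_data[section][key]
    return DEFAULTS[(section, key)]
```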
There is one last step that closes the circle. When an L2 pipeline is
used often enough to deserve different treatment (50 uses, the
threshold is configurable), a nightly job
(multi_tool_promote) promotes it to a candidate for full
synthesis. It creates a proto-mnest in the mnestoma, with the desired
signature (executor sequence, placeholders, canonical query), and
from that moment on the synthesis engine can produce ONE new unified
executor that wraps the whole pipeline. The promotion is idempotent:
a row already promoted is ignored on subsequent runs.
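The idempotence of the promotion can be sketched with an in-memory SQLite database. The schema below (a `promoted` flag and a `proto_mnest` table keyed by `path_id`) is invented for illustration and is not the real mnestoma layout; treating `k_synth` as the promotion threshold is also an assumption.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical, simplified schema for the example.
    conn.executescript(
        """
        CREATE TABLE multi_tool_paths (
            id INTEGER PRIMARY KEY,
            canonical_query TEXT,
            uses INTEGER,
            promoted INTEGER DEFAULT 0
        );
        CREATE TABLE proto_mnest (path_id INTEGER UNIQUE);
        """
    )


def promote(conn: sqlite3.Connection, k_synth: int = 50) -> int:
    """Promote pipelines with enough uses; return how many were promoted."""
    rows = conn.execute(
        "SELECT id FROM multi_tool_paths WHERE uses >= ? AND promoted = 0",
        (k_synth,),
    ).fetchall()
    for (row_id,) in rows:
        conn.execute("INSERT INTO proto_mnest (path_id) VALUES (?)", (row_id,))
        # Marking the row makes later runs skip it: the job is idempotent.
        conn.execute(
            "UPDATE multi_tool_paths SET promoted = 1 WHERE id = ?", (row_id,)
        )
    conn.commit()
    return len(rows)
```

Running `promote` twice in a row promotes each qualifying row once; the second run finds nothing left to do.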
The bridge gives us two coexisting tiers: L2 captures low volume (sequences seen 3 to 50 times, deterministic replay via N executor invocations), while L3 takes over when the pattern is stably recurring and worth the investment of synthesising a monolithic executor that does everything in one shot. They do not exclude each other: they complement each other.
Some points are deliberately left for the future, because they become interesting only with a bit of volume.
introvertiva: we could generate automatic promotion proposals, identify
clusters of similar pipelines, and suggest unified executors. For now
the introvertiva module does not read the memoisations; it will in a
later phase.

© 2026 Roberto Brunialti · Metnos documentation · Reference ADRs: 0094 / 0149 / 0150