vaglio — the constitutional evaluator in two phases
Between the moment when the LLM says «I want to call shell_exec with these arguments» and the moment the runtime executes the process, there is a mandatory checkpoint: the vaglio. It decides whether that proposal can become action. It is not an opinion, it is a filter: if the vaglio denies, the plan does not proceed.
The vaglio applies two distinct phases, and this distinction is the first thing worth grasping. The first phase is a binary guard: there is a core of violations that are not up for debate, such as touching ~/.ssh, executing rm -rf / or writing to /etc/passwd. If one of these patterns shows up, the action is denied, full stop. No nuance, no score, no «but the context». The second phase is a graded judge: it measures how aligned the action is with the user's telos in [0, 1]. Below threshold, denied. Above, approved.
Why two separate phases? Because otherwise a specific phenomenon happens, described in ch. 11.1 of the Architecture as model self-confirmation (self-enhancement bias): if you put deontology (what is permissible) and teleology (what serves the user) into a single score, a teleologically convinced judge will tend to justify even what should be blocked upstream. «Yes, I am deleting /etc, but deep down the user wanted cleanup.» With the two kept separate, the guard shuts the door before the judge can rationalise.
The separation is also a promise of architectural stability: the guard today is a list of hard-coded regexes; tomorrow, when the judge becomes an LLM call (v1.2), the guard will stay the same. The non-negotiable core does not pass through a model.
Verdict
Every call to the vaglio returns a Verdict object. It is a small dataclass (runtime/vaglio.py:74-81):
@dataclass
class Verdict:
    approved: bool
    reason: str
    ts: float = field(default_factory=time.time)
    judge_kind: str = "rule-based-v1"
    score: float = 1.0  # relevant only if the guard let it through
    blocked_by: str | None = None  # "guard" | "judge" | None
| Field | Type | Meaning |
|---|---|---|
| approved | bool | Final outcome: True if the action may proceed, False otherwise. |
| reason | str | Readable explanation. For a guard block: "guard: forbidden path violated: ...". For a judge block: "judge: score 0.20 < threshold 0.30 (...)". |
| ts | float | Unix timestamp of the decision. |
| judge_kind | str | Family of judge used. "rule-based-v1" (default) or "llm-v1" (opt-in via env METNOS_JUDGE_KIND=llm-v1, since the evening of Apr 27). |
| score | float | Alignment score in [0, 1]. Meaningful only if the guard let it through. For guard blocks, value is 0.0. |
| blocked_by | str \| None | Which phase denied: "guard", "judge", or None if approved. |
The orchestrator judge(intent, executor_name, args, context) (runtime/vaglio.py:186-214) is the only public signature. It receives the user's intent (the initial query of the turn), the name of the proposed executor, the validated args, and an optional context dictionary (mode, capability, step number). It returns a single Verdict.
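The two-phase flow behind this signature can be sketched as follows. This is a minimal illustration, not the code in runtime/vaglio.py: the name judge_sketch, the injected guard and scorer callables, and the two toy functions are assumptions made so that the example runs on its own. Only the Verdict fields, the default threshold and the reason formats come from the text above.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

JUDGE_THRESHOLD = 0.30  # documented default of METNOS_JUDGE_THRESHOLD


@dataclass
class Verdict:
    approved: bool
    reason: str
    ts: float = field(default_factory=time.time)
    judge_kind: str = "rule-based-v1"
    score: float = 1.0
    blocked_by: Optional[str] = None  # "guard" | "judge" | None


def judge_sketch(
    intent: str,
    executor_name: str,
    args: dict[str, Any],
    context: Optional[dict[str, Any]] = None,
    *,
    guard: Callable[..., Optional[str]],   # hypothetical: returns a violation or None
    scorer: Callable[..., tuple],          # hypothetical: returns (score, rationale)
    threshold: float = JUDGE_THRESHOLD,
) -> Verdict:
    context = context or {}
    # Phase 1: binary guard. A violation ends the evaluation immediately.
    violation = guard(executor_name, args, context)
    if violation is not None:
        return Verdict(False, f"guard: {violation}", score=0.0, blocked_by="guard")
    # Phase 2: graded judge, reached only when the guard found nothing.
    score, rationale = scorer(intent, executor_name, args, context)
    if score < threshold:
        reason = f"judge: score {score:.2f} < threshold {threshold:.2f} ({rationale})"
        return Verdict(False, reason, score=score, blocked_by="judge")
    return Verdict(True, f"approved: score {score:.2f} ({rationale})", score=score)


# Toy stand-ins for the two phases (assumptions, far simpler than the real ones).
def toy_guard(executor_name, args, context):
    return "forbidden path violated" if ".ssh" in str(args.get("path", "")) else None


def toy_scorer(intent, executor_name, args, context):
    return 0.7, "base score"


blocked = judge_sketch("read the key", "fs_read", {"path": "~/.ssh/id_rsa"},
                       guard=toy_guard, scorer=toy_scorer)
allowed = judge_sketch("read my notes", "fs_read", {"path": "/tmp/n.txt"},
                       guard=toy_guard, scorer=toy_scorer)
strict = judge_sketch("read my notes", "fs_read", {"path": "/tmp/n.txt"},
                      guard=toy_guard, scorer=toy_scorer, threshold=0.99)
print(blocked.blocked_by, allowed.blocked_by, strict.blocked_by)
```

The scorer is never called for the first request: the guard's answer is final, which is the property the two-phase design is meant to guarantee.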
The guard has no nuances. It performs two checks in sequence, and stops at the first violation (runtime/vaglio.py:104-126).
The first check covers forbidden paths. The guard extracts all string values from args, even nested ones (_flatten_str_values, runtime/vaglio.py:86-97), expands the tilde to the user's home directory (_expand_user, runtime/vaglio.py:100-101), and compares them with a list of hard-coded regex patterns (runtime/vaglio.py:45-58):
_FORBIDDEN_PATH_PATTERNS = [
    re.compile(r"(^|/)\.ssh(/|$)"),
    re.compile(r"^/etc/(passwd|shadow|sudoers)"),
    re.compile(r"^/etc/ssh(/|$)"),
    re.compile(r"^/root(/|$)"),
    re.compile(r"^/boot(/|$)"),
    re.compile(r"^/sys(/|$)"),
    re.compile(r"^/proc(/[0-9]|$)"),
    re.compile(r"^/dev/(sd|nvme|mmcblk|loop)"),
    re.compile(r"\.aws/credentials"),
    re.compile(r"\.config/[^/]+/credentials\.env"),
    re.compile(r"\.gnupg(/|$)"),
]
The subtle point: the guard does not distinguish between «I am reading ~/.ssh/id_rsa» and «I am just passing the string ~/.ssh/id_rsa in some parameter or other». If the pattern shows up, it blocks. This is deliberate: it means that even an executor that mentions a forbidden path in args (for example in a "glob" or "exclude" field) gets stopped. Conservative? Yes. But the cost of a false positive is a reformulated request; the cost of a false negative is a leaked SSH key.
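The path check described above can be sketched as follows. The function bodies are assumptions: the text only names _flatten_str_values and _expand_user, so the recursion over dicts and sequences and the use of os.path.expanduser are one plausible reading, and the names flatten_str_values and forbidden_path_hit are hypothetical. Only two of the eleven documented patterns are copied here.

```python
import os
import re
from typing import Any, Iterator, Optional

# Two of the documented patterns; the real list has eleven.
FORBIDDEN = [
    re.compile(r"(^|/)\.ssh(/|$)"),
    re.compile(r"^/etc/(passwd|shadow|sudoers)"),
]


def flatten_str_values(value: Any) -> Iterator[str]:
    """Yield every string found in a nested structure of dicts and sequences."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from flatten_str_values(item)
    elif isinstance(value, (list, tuple, set)):
        for item in value:
            yield from flatten_str_values(item)


def forbidden_path_hit(args: dict) -> Optional[str]:
    """Return the source of the first matching pattern, or None if args are clean."""
    for raw in flatten_str_values(args):
        candidate = os.path.expanduser(raw)  # "~/x" becomes "<home>/x"
        for pattern in FORBIDDEN:
            if pattern.search(candidate):
                return pattern.pattern
    return None


# A forbidden path hidden in a nested "exclude" field is still caught.
print(forbidden_path_hit({"path": "/tmp/x"}))
print(forbidden_path_hit({"path": "/tmp/x", "opts": {"exclude": ["~/.ssh/id_rsa"]}}))
```

Because the check looks only at string values, it cannot tell a read from a mere mention, which is exactly the conservative behaviour the paragraph above describes.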
The second check covers dangerous shell commands. It applies only if executor_name == "shell_exec" or if the context declares capability == "code:exec" (runtime/vaglio.py:117-124). The value checked is the command field (or alternatively cmd); if it is a list, it gets joined with spaces before matching. The pattern list (runtime/vaglio.py:62-69):
_DANGEROUS_SHELL_PATTERNS = [
    re.compile(r"\brm\s+-rf?\s+/(\s|$)"),  # rm -rf /
    re.compile(r"\brm\s+-rf?\s+~(\s|$|/)"),  # rm -rf ~
    re.compile(r"\bmkfs\b"),
    re.compile(r"\bdd\s+.*\bof=/dev/"),
    re.compile(r":\s*\(\s*\)\s*\{\s*:\|:&\s*\}"),  # fork bomb
    re.compile(r"\bchmod\s+-?R?\s*0?77[0-9]\s+/"),
]
The motivation is Law 1 of the constitution (see constitution): «no irrecoverable state». An rm -rf / cannot be undone; neither can a mkfs on the system drive; a fork bomb saturates the process table. These are pathologies distinct from any ordinary file deletion: they take the system out of a state in which it can still respond.
Here too the match is textual. The guard does not emulate the shell: it looks for literal patterns in the command. This means that variants cleverly built with escape characters can pass; the defence in depth is the sandbox with filesystem namespacing for shell_exec, of which the guard is only the first layer.
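A minimal sketch of the shell check, assuming the scoping and field rules stated above. The function name dangerous_shell_hit is hypothetical and only two of the six documented patterns are copied. The last call illustrates the limit of textual matching: a command rewritten with shell quoting is not recognised.

```python
import re
from typing import Any, Optional

# Two of the documented patterns; the real list has six.
DANGEROUS = [
    re.compile(r"\brm\s+-rf?\s+/(\s|$)"),
    re.compile(r"\bmkfs\b"),
]


def dangerous_shell_hit(executor_name: str, args: dict, context: dict) -> Optional[str]:
    """Return the source of the first matching pattern, or None."""
    # Scope: shell_exec, or any executor whose context declares code:exec.
    if executor_name != "shell_exec" and context.get("capability") != "code:exec":
        return None
    command: Any = args.get("command") or args.get("cmd") or ""
    if isinstance(command, (list, tuple)):
        # An argv-style list is joined with spaces before matching.
        command = " ".join(str(part) for part in command)
    if not isinstance(command, str):
        return None
    for pattern in DANGEROUS:
        if pattern.search(command):
            return pattern.pattern
    return None


print(dangerous_shell_hit("shell_exec", {"command": "rm -rf /"}, {}))
print(dangerous_shell_hit("shell_exec", {"cmd": ["rm", "-rf", "/"]}, {}))
print(dangerous_shell_hit("shell_exec", {"command": "ls -la /tmp"}, {}))
# Textual matching only: this quoting trick is not recognised by the pattern.
print(dangerous_shell_hit("shell_exec", {"command": "rm -r''f /"}, {}))
```

The unrecognised variant in the last call is the reason the text treats the guard as a first layer and relies on the sandbox for defence in depth.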
If the guard lets it through, the judge starts. In v1.1 the judge is rule-based: no LLM call, microsecond latency, zero cost. The function is judge_score(intent, executor_name, args, context) (runtime/vaglio.py:131-171) and returns (score, rationale).
It starts from a base score of 0.7: the v1.1 regime is benevolent by construction (see ch. 9). From there the judge adds or subtracts small adjustments, then clamps to [0, 1].
| Signal | Effect on score | Rationale |
|---|---|---|
| The user's intent mentions the executor name (even just one of its tokens, length ≥ 3). | +0.10 | Sign of explicit intent. If the user wrote «read the file» and the executor is fs_read, the match on read is a weak but real signal that the action matches the request. |
| Any string value in args contains .. and / (possible path traversal). | −0.20 | Coarse but effective heuristic: many filesystem attacks use .. to escape the expected folder. The penalty is the largest because it indicates an attempt to cross a boundary. |
| An args key does not respect Python identifier rules (non-alphanumeric / underscore characters). | −0.10 | Shape anomaly. A healthy executor declares args with normal keys; an odd key suggests either injection or a planning error. |
| The context declares critical: false (non-critical executor, read-only). | +0.05 | Modest bonus: read-only executors cannot produce irrecoverable state, so the bar can be slightly lower. |
The score is clamped to [0, 1] and compared against JUDGE_THRESHOLD (runtime/vaglio.py:40). Default: 0.30, configurable via env METNOS_JUDGE_THRESHOLD.
Intent: "read my notes in /tmp/n.txt". Executor: fs_read. Args: {"path": "/tmp/n.txt"}.
Starting score: 0.7. The judge splits fs_read on separators and gets the tokens fs and read; fs is too short to count (length < 3), while read appears in the intent, so the intent bonus applies (+0.10). The path is harmless and the keys are clean.
Final score: 0.80. Above threshold. approved=True, reason="approved: score 0.80 (intent matches executor)".
Args: {"path": "/tmp/../etc/foo"}. Suspicious path traversal: .. and / present.
Score: 0.7 − 0.2 = 0.5. Still above 0.30: approved nonetheless, but with the note "possible path traversal ('..' in path)" in the reason and in the log.
To actually block the action, the judge would need other cumulative negative signals, or a higher configured threshold. In v1.1 the judge marks; it does not block on its own except in edge cases.
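The scoring rules and the two worked examples can be reproduced with the sketch below. It is an approximation under stated assumptions: the name judge_score_sketch, the tokenisation regex, the substring match against the lowercased intent, the use of str.isidentifier for the key check and the wording of two of the notes are all guesses. The base score, the four adjustments and the clamp come from the table above. The intent of the second call is invented, chosen so that the intent bonus does not apply.

```python
import re
from typing import Any, Iterator

BASE_SCORE = 0.7


def _strings(value: Any) -> Iterator[str]:
    """Yield every string value, including nested ones."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _strings(item)


def judge_score_sketch(intent, executor_name, args, context=None):
    context = context or {}
    score = BASE_SCORE
    notes = []
    lowered = intent.lower()

    # +0.10: the intent mentions the executor name or one of its tokens (length >= 3).
    tokens = [t for t in re.split(r"[^a-z0-9]+", executor_name.lower()) if len(t) >= 3]
    if executor_name.lower() in lowered or any(t in lowered for t in tokens):
        score += 0.10
        notes.append("intent matches executor")

    # -0.20: a string value contains both ".." and "/".
    if any(".." in s and "/" in s for s in _strings(args)):
        score -= 0.20
        notes.append("possible path traversal ('..' in path)")

    # -0.10: an args key is not a valid Python identifier.
    if any(not str(key).isidentifier() for key in args):
        score -= 0.10
        notes.append("anomalous args key")

    # +0.05: the context explicitly declares a non-critical executor.
    if context.get("critical") is False:
        score += 0.05
        notes.append("non-critical executor")

    return max(0.0, min(1.0, score)), "; ".join(notes) or "no signals"


s1, r1 = judge_score_sketch("read my notes in /tmp/n.txt", "fs_read", {"path": "/tmp/n.txt"})
s2, r2 = judge_score_sketch("show me that file", "fs_read", {"path": "/tmp/../etc/foo"})
print(f"{s1:.2f} ({r1})")
print(f"{s2:.2f} ({r2})")
```

Formatted to two decimals, the two calls give 0.80 and 0.50, matching the examples. Both stay above the 0.30 threshold, which is why the traversal case is marked but not blocked.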
Since the evening of Apr 27 (phase 5 kickoff) the vaglio can use an LLM-based judge instead of the rule-based one. It is enabled via the env variable METNOS_JUDGE_KIND=llm-v1 (default rule-based-v1): opt-in because it costs, since every judge decision pays one call to the middle tier. Implementation in runtime/vaglio.py:_judge_score_llm.
- The judge runs in a separate session (_LLM_JUDGE_SYSTEM), with no memory shared with the proposer. Explicit defence against model self-confirmation: if the proposer could listen to its own justifications as judge input, it would optimise them backwards. Separate sessions break the loop.
- The expected output is JSON: {"score": float [0,1], "reason": "max 25 words"}. Robust regex-based parser: if the LLM produces invalid JSON, fallback to 0.5 with rationale "llm-judge fallback (parse fail)".
- If the LLMRouter is not configured or the middle tier does not respond, the judge returns score 0.5 with explicit rationale "llm-judge fallback (router missing|chat failed)". The sequence is not blocked; the log reveals that the analysis was degraded.
- With JUDGE_KIND=llm-v1 active, the binary guard runs first: a forbidden path or dangerous shell command is blocked without consulting the LLM at all. Guarantee: no LLM call wasted on cases already decidable by local rules (test judge_llm_guardia_blocca_prima_di_chiamare_llm).
Rule-based is the default and is enough for most turns: the guard filters danger, the rule-based judge flags outliers. The LLM judge is for cases that need a finer assessment of telos alignment (e.g. proactive actions under bother budget, choice between equivalent options). Enable with:
METNOS_JUDGE_KIND=llm-v1 python3 -m channels.daemon
Tradeoff: every judge call pays a latency of ~800 ms (local qwen3:8b in middle-tier mode) or ~150 ms (Anthropic Sonnet). For a 3-step turn, that is 3 vaglio calls. A future cost policy will be able to automatically degrade to rule-based above a calls-per-day threshold.
METNOS_JUDGE_THRESHOLD is calibrated for the rule-based judge (default 0.30); the LLM judge will produce scores with a different distribution and will need its own separate threshold.
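The parse-and-fallback behaviour of the LLM judge described above could look like the sketch below. It is not the code of _judge_score_llm: the function names, the regex, and the chat callable standing in for the LLMRouter are assumptions. Only the reply shape, the fallback score of 0.5 and the three fallback rationales come from the text.

```python
import json
import math
import re
from typing import Callable, Optional, Tuple

FALLBACK_SCORE = 0.5
# Grab the first flat {...} object, tolerating chatter around it.
_JSON_OBJECT = re.compile(r"\{.*?\}", re.DOTALL)


def parse_judge_reply(reply: str) -> Tuple[float, str]:
    """Extract (score, reason) from an LLM reply, degrading to 0.5 on any parse error."""
    match = _JSON_OBJECT.search(reply or "")
    if match is None:
        return FALLBACK_SCORE, "llm-judge fallback (parse fail)"
    try:
        payload = json.loads(match.group(0))
        score = float(payload["score"])
        reason = str(payload.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        return FALLBACK_SCORE, "llm-judge fallback (parse fail)"
    if math.isnan(score):
        return FALLBACK_SCORE, "llm-judge fallback (parse fail)"
    return max(0.0, min(1.0, score)), reason


def llm_judge_sketch(chat: Optional[Callable[[str], str]], prompt: str) -> Tuple[float, str]:
    """`chat` is a hypothetical stand-in for a call to the middle tier."""
    if chat is None:
        return FALLBACK_SCORE, "llm-judge fallback (router missing)"
    try:
        reply = chat(prompt)
    except Exception:
        return FALLBACK_SCORE, "llm-judge fallback (chat failed)"
    return parse_judge_reply(reply)


def failing_chat(prompt: str) -> str:
    raise RuntimeError("middle tier unreachable")


print(llm_judge_sketch(lambda p: 'Sure. {"score": 0.9, "reason": "matches intent"}', "..."))
print(llm_judge_sketch(lambda p: "I cannot answer in JSON", "..."))
print(llm_judge_sketch(None, "..."))
print(llm_judge_sketch(failing_chat, "..."))
```

Every failure path returns a usable score with a rationale that names the cause, so the turn continues and the degradation remains visible in the log.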
Every call to the vaglio writes a JSONL line in ~/.local/share/metnos/vaglio/YYYY-MM.jsonl (monthly file, see runtime/vaglio.py:33 and 176-183). The write is fail-safe: if the filesystem refuses, the vaglio does not block the decision — it simply does not write the log.
{
  "approved": false,
  "reason": "guard: forbidden path violated: pattern '(^|/)\\.ssh(/|$)' in args",
  "ts": 1714225863.214,
  "judge_kind": "rule-based-v1",
  "score": 0.0,
  "blocked_by": "guard",
  "intent": "read the key",
  "executor": "fs_read",
  "args_keys": ["path"],
  "context_keys": ["mode", "step"]
}
The log stores only the keys of args, not the values. This is explicit (runtime/vaglio.py:197 and 212): the log records the key names ("path", "command", ...) but not the values (the actual path, the actual command). Reason: privacy. Values often contain sensitive data (user paths, file contents, payloads), and a log that recorded them would itself become an asset to protect. The keys are enough for statistical analysis (which executors get blocked most often?); the live value lives in the turn events of the agent_runtime, with different retention rules.
Rotation is monthly per file. No compression, no encryption: the logs stay under the user's ~/.local. If long-term archiving is needed, it is the job of external processes (rsync, backup, etc.).
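The logging rules above (monthly file, keys only, fail-safe write) can be sketched like this. The function name log_verdict, its parameters and the use of local time for the file name are assumptions; the record fields follow the JSON example.

```python
import json
import time
from pathlib import Path
from typing import Optional


def log_verdict(log_dir: Path, verdict: dict, intent: str, executor: str,
                args: dict, context: Optional[dict] = None) -> bool:
    """Append one JSONL record; return False instead of raising if the write fails."""
    record = dict(verdict)
    record.update(
        intent=intent,
        executor=executor,
        # Privacy: only key names are recorded, never the values.
        args_keys=[str(k) for k in args],
        context_keys=[str(k) for k in (context or {})],
    )
    try:
        log_dir.mkdir(parents=True, exist_ok=True)
        path = log_dir / time.strftime("%Y-%m.jsonl")  # one file per month
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        return True
    except OSError:
        # Fail-safe: a refused write must not change the verdict.
        return False
```

A caller would typically ignore the return value: it exists only so that a test can observe the fail-safe path without the decision being affected.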
The call site lives in the planner, in runtime/agent_runtime.py:512-526. The filter sequence is strictly ordered:
# runtime/agent_runtime.py:512-526 (excerpt)
# Validation, sandbox, vaglio
validation = validate_args(args, executor.args_schema)
step.validation_failures = validation
if validation:
    obs = {"ok": False, "error": f"validation failed: {validation}"}
else:
    scope_violation = check_hints(args, executor.capabilities)
    step.scope_violation = scope_violation
    if scope_violation:
        obs = {"ok": False, "error": scope_violation}
    else:
        verdict = judge(user_query, chosen_name, args,
                        {"mode": chosen_mode, "step": step_num})
        step.vaglio_approved = verdict.approved
        if not verdict.approved:
            obs = {"ok": False, "error": f"vaglio rejects: {verdict.reason}"}
        else:
            ...
            obs = invoke_executor(executor, args)
The order is not random:
1. Validation (validate_args): do the args conform to the executor's args_schema?
2. Scope (check_hints): is the executor stepping out of its declared perimeter? (Example: a read-only executor that writes.)
3. Vaglio (judge(...)): only if the two previous checks pass. It receives the user's original intent (user_query), the chosen executor name, the validated args, and a context with mode and step_num.
For multistep plans, the vaglio runs between one step and the next: every tool_call proposed by the LLM passes through the vaglio before execution, so that a 5-step plan produces 5 distinct verdicts. For single-shot (a plan proposed entirely in one go, without intermediate steps), the vaglio runs post-hoc: the plan is already formulated, but the vaglio can still block the execution of each tool_call.
The flag step.vaglio_approved ends up in the turn log (agent_runtime ch. 11), and from there in the spectator: the user can see after the fact which steps went through the vaglio and with what outcome.
The vaglio cluster is at 13/13 green. The cases are declared in runtime/testing/populate_cases.py (section --- vaglio ---) and cover guard, judge, log and privacy.
| Case | Category | What it verifies |
|---|---|---|
| approva_path_innocuo | happy | A normal path (/tmp/x) passes, judge_kind is "rule-based-v1", score in [0, 1]. |
| guard_blocca_ssh | security | ~/.ssh/id_rsa blocked by the guard with blocked_by="guard" and reason "forbidden path". |
| guard_blocca_etc_passwd | security | /etc/passwd blocked. |
| guard_blocca_credentials_user | security | ~/.config/metnos/credentials.env blocked (check of the credentials.env pattern nested in user path). |
| guard_blocca_rm_rf_root | security | shell_exec with rm -rf / blocked as irrecoverable. |
| guard_blocca_fork_bomb | security | shell_exec with :(){ :\|:& };: blocked. |
| guard_lascia_passare_shell_innocua | happy | shell_exec with ls -la /tmp passes the guard. |
| judge_intent_menziona_executor_alza_score | happy | Score with intent that mentions the executor > score with generic intent. |
| judge_path_traversal_abbassa_score | edge | Path with .. receives lower score than the harmless path. |
| judge_sotto_soglia_blocca | failure | With METNOS_JUDGE_THRESHOLD=0.99, any action gets denied with blocked_by="judge". |
| log_jsonl_viene_scritto | happy | After judge(...), the file YYYY-MM.jsonl exists and contains the intent marker. |
| args_keys_loggate_non_values | security | A sensitive value in args ("PASSWORD_segreto_...") does NOT appear in the log; only the key name ("path") is present. |
On top of these 13, integration in the agent_runtime cluster (172/172 green) exercises the vaglio at its real call site (validation + scope + vaglio + invoke), on end-to-end scenarios.
| Limit | Explanation | When it lifts |
|---|---|---|
| The vaglio does not read the mnestoma. | Separation-of-concerns decision: the v1.1 vaglio evaluates only the declared intent + executor + args, it does not look at history. Meaning: it cannot see «this same action was denied 5 minutes ago» or «the user usually does not operate this way». | v1.2: the LLM judge will receive context from the mnestoma (similar relevance, prior outcomes) to weigh better. |
| The approval_ux (card + buttons in the channel) is renderable, the callback dispatcher is not. | The card the user sees in the channels (Telegram, Slack, etc.) is already generated via runtime/channels/approval.py. But if the user presses «Allow this once», the callback does not yet re-enter the live plan. | v1.2: callback dispatcher with re-injection into the current turn. |
| The rule-based judge is very benevolent. | Default threshold 0.30 and base score 0.7: even with several moderate penalties, the score stays above the threshold. This is deliberate: in v1.1 the real filter is the guard, the judge only marks outliers in the log. | v1.2 with LLM judge: the threshold will mean more, and the score distribution will be wider. |
| No structured explanation of refusal to the user. | When the vaglio denies, the planner shows the textual reason but does not build a card «why I said no and what you can do». The refusal UX is minimal. | v1.2: integration with approval_ux for dialogic refusals. |
| No interactive vaglio. | The v1.1 vaglio is a pure function: input args, output Verdict. It cannot ask «are you sure?» or request confirmation. | v1.2: integration with the dialog manager for authorisation requests (see approval_ux). |
The configuration of the v1.1 vaglio is minimal, and this minimality is deliberate.
| What | Where | Default | Notes |
|---|---|---|---|
| METNOS_JUDGE_THRESHOLD | Environment variable. | 0.30 | Below this, the judge denies. Read at module import time (runtime/vaglio.py:40): to change it at runtime use importlib.reload(vaglio). |
| VAGLIO_LOG_DIR | Module constant (not env). | ~/.local/share/metnos/vaglio/ | Changing the log destination would require a code edit. Deliberately not a parameter: the log is a system invariant, not a user preference. |
| _FORBIDDEN_PATH_PATTERNS | Module constant, hard-coded. | 11 regexes (see ch. 3). | Not configurable. Modifying them requires a code edit and a new deploy. Deliberate: ch. 5 of the Architecture, «non-negotiable core». |
| _DANGEROUS_SHELL_PATTERNS | Module constant, hard-coded. | 6 regexes (see ch. 3). | Not configurable. Same reasoning: the list of shell quasi-irrecoverable commands does not loosen. |
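The import-time read of the threshold noted in the table can be illustrated with a short sketch. The helper read_threshold and its handling of malformed values are assumptions; the text only states the variable name, the default, and that the value is read once at import.

```python
import os
from typing import Mapping


def read_threshold(environ: Mapping[str, str] = os.environ, default: float = 0.30) -> float:
    """Read METNOS_JUDGE_THRESHOLD, falling back to the default if unset or malformed."""
    raw = environ.get("METNOS_JUDGE_THRESHOLD")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        # Assumption: a malformed value is ignored rather than crashing the import.
        return default


# Evaluated once, when the module is imported. Later changes to the environment
# are not seen until the module is reloaded (importlib.reload).
JUDGE_THRESHOLD = read_threshold()
```

Binding the value to a module constant keeps the hot path free of environment lookups, at the price of needing a reload to change it, which is the tradeoff the table entry points out.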
The vaglio is a small component (~210 lines of Python) but architecturally central. The two-phase shape is the price of robustness: a binary guard that is not up for debate, a graded judge that evolves. Today's guard will still protect against tomorrow's judge, because it stays the same across versions of the judge.
The microdesign of this module was validated by the code before the document: the module went live on April 27, 2026, passed 13 own tests and 172 integration tests, and this doc describes what runs. Not the other way around.