
Metnos

model virtualization — LLM · embedding · VLM

How to change the model without touching the code
Audience: anyone who wants to understand how Metnos picks and swaps its models.

Reading time: 10 minutes.

The problem: the model hard-wired in the code
The solution: ask for a role, not a model
The configuration files
Segregation: a single point to go through
Embedding autonomy
The lineage: a lightweight subset
Going deeper
What this means for you, in practice

1. The problem: the model hard-wired in the code

Metnos uses three families of models: an LLM that reasons and plans, an embedder that turns text and images into vectors for semantic search, and a VLM that looks at photos and describes them. The question this document answers is simple: when I want to swap one of these models, what do I have to touch?

Until recently the answer depended on the model. For the LLM a good solution already existed: a small layer (llm_router) read from a configuration file which model to serve, so changing it meant editing that file. For the embedder and the VLM, no such layer existed: the various parts of the system imported the concrete class directly (bge_embedding here, clip_embedding there), and the VLM's address was written by hand in the code.

The flaw is not cosmetic; it is practical. Swapping the embedder meant hunting down every spot in the code that named it and changing them one by one: the semantic routing, the indexing, two nightly processes, the image executors. Ten different places, ten chances to miss one. And pointing the embedder at an external server? It simply was not possible without writing new code.

In one sentence. The model was hard-wired: part of the implementation, not of the configuration. Changing it was a coding job, not a setting. The virt layer erases that difference: it brings the embedder and the VLM up to the same level of virtualization the LLM already had.

2. The solution: ask for a role, not a model

The idea that unlocks everything is a change of question. Instead of asking «give me BGE-M3», the code asks «give me the embedder for the role ‘text’». The caller does not know — and does not care — which model answers: it only knows what it needs it for. Translating the role into the concrete model is the job of a single layer, the runtime/virt/ package, which offers three entry points (in jargon, three facades):

Facade	Ask for a role…	…and you get back
`virt.get_llm(role)`	`"fast"` / `"middle"` / `"wise"` / `"frontier"`	the language model for that tier (delegates to `llm_router`)
`virt.get_embedder(role)`	`"text"` or `"image"`	the embedder for that modality (BGE-M3 for text, SigLIP for images)
`virt.get_vlm(role)`	`"default"`	the VLM's configuration (provider, model, address, limits)

Notice the nature of the roles. For the LLM they are capability tiers (fast for quick answers, wise for hard reasoning, frontier for the paid model used only when needed). For the embedder they are modalities (text or image), because there the useful distinction is not how powerful the model is, but what kind of data it transforms. The role is an abstraction that adapts to the domain.

from virt import get_embedder, get_llm, get_vlm

get_embedder("text").embed_texts([...])     # text vectors, role "text"
get_llm("middle").chat(system, user).text   # LLM answer, tier "middle"
get_vlm()                                    # VLM spec (role "default")

The principle. The caller states a need (a role), not a choice (a model). The choice lives in a single place. It is the same pact that has long held for the LLM: virt extends it to all three families.
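The role-to-model translation can be pictured as a small lookup. The sketch below is illustrative only and is not Metnos's actual code: the `FakeBge` and `FakeSiglip` stand-in classes, the `_PROVIDERS` table and the inline `CONFIG` dictionary are all assumptions, introduced so the example runs on its own.

```python
from typing import Callable

# Stand-ins for the real embedder classes (hypothetical, for illustration only).
class FakeBge:
    dim = 1024
    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [[0.0] * self.dim for _ in texts]

class FakeSiglip:
    dim = 768
    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [[0.0] * self.dim for _ in texts]

# Role -> settings, standing in for what would be read from the TOML file.
CONFIG = {"text": {"provider": "bge"}, "image": {"provider": "siglip"}}

# Provider name -> constructor: the only place that names concrete classes.
_PROVIDERS: dict[str, Callable[[], object]] = {"bge": FakeBge, "siglip": FakeSiglip}

def get_embedder(role: str):
    """Translate a role into a concrete embedder instance."""
    try:
        provider = CONFIG[role]["provider"]
    except KeyError:
        raise ValueError(f"unknown embedder role: {role!r}") from None
    return _PROVIDERS[provider]()

# The caller names a role and never sees which class answers.
vectors = get_embedder("text").embed_texts(["hello", "world"])
print(len(vectors), len(vectors[0]))
```

Changing the entry under `CONFIG["text"]` would change what every caller receives, without any caller being edited.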

Figure 1 — The three facades. On the left, consumers ask for a role; in the center, virt's facades translate that role by reading the TOML configuration files; on the right, the concrete models. No consumer names a model any more: the configuration does, in a single place.

3. The configuration files

The role-to-model translation lives in three TOML files, one per family, in the user's configuration directory:

~/.config/metnos/llm_tiers.toml — the LLM tiers;
~/.config/metnos/embedding_tiers.toml — the embedder modalities;
~/.config/metnos/vlm_tiers.toml — the VLM configuration.

Each file has flat sections, one per role. Here is what the embedding file looks like: two settings are enough to say «text needs BGE, images need SigLIP».

[text]
provider = "bge"        # BGE-M3, 1024 dimensions
# To point at a REMOTE embedder:
#   provider = "http"
#   base_url = "http://host:port"

[image]
provider = "siglip"     # SigLIP, 768 dimensions, text+image

The VLM one is just as small — a single [default] section with the provider, the model, the server's address and the image limits:

[default]
provider   = "llamacpp"               # multimodal OpenAI-compatible server
model      = "qwen3vl-2b"
base_url   = "http://127.0.0.1:8081"
timeout_s  = 60
max_edge   = 1024                     # resize the image's long edge
max_tokens = 512
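The max_edge setting above caps the long edge of the image sent to the VLM. As a worked example of that arithmetic, here is a minimal sketch. The function name `fit_long_edge` is hypothetical, and the choice to leave smaller images untouched (rather than upscale them) is an assumption, not something this document specifies.

```python
def fit_long_edge(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:      # assumption: small images are left untouched
        return width, height
    scale = max_edge / long_edge   # same factor on both sides keeps the aspect ratio
    return max(1, round(width * scale)), max(1, round(height * scale))

# A 4032x3024 photo is scaled by 1024/4032, so 3024 becomes 768.
print(fit_long_edge(4032, 3024))
```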

The guiding principle: change a model = edit the TOML, not the code. Want a different embedder for text? Change provider in [text]. Want to move the VLM to another machine? Change base_url in [default]. No line of Python to touch, no executor to re-sign.

The files are optional. If they are missing, Metnos uses defaults built into virt that reproduce today's setup exactly: BGE for text, SigLIP for images, Qwen3-VL on :8081. The TOML files are only needed when you want to deviate from those defaults. Each key you set overrides the corresponding default; any key you leave out keeps its default value.

4. Segregation: a single point to go through

The real gain is not just the convenience of the TOML: it is segregation. Before the change, ten spots in the code imported the concrete embedder. After it, no consumer imports bge_embedding or clip_embedding directly any more: they all go through virt.get_embedder. The knowledge of «which model» has been gathered into a single funnel.

	Before	After
Who names the model	every consumer (routing, indices, nightly jobs, image executors)	only the configuration, read by `virt`
To change a model	find and edit N spots in the code	edit one line of TOML
Remote endpoint	not possible (would need new code)	`provider = "http"` + `base_url`

From this follows a freedom that did not exist before: since everyone asks for the embedder at the same counter, you can swap the implementation out from under their feet without them noticing. In particular, the text embedder can become a remote service (a server behind an HTTP address) just by writing provider = "http" and the address in the TOML. It is the embedding counterpart of what the LLM could already do: point a tier at an endpoint.

The only new piece of code. The existing local classes (BGE, SigLIP) already knew how to answer the right questions, so virt returns them as they are. The only added implementation is HttpEmbedder: the remote embedder, which talks to a server compatible with the OpenAI API. Everything else is routing.
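To make the idea of a remote embedder concrete, here is a minimal sketch of a client for an OpenAI-compatible `/v1/embeddings` endpoint, using only the standard library. It is not the actual HttpEmbedder: the constructor parameters, the default model name `"bge-m3"` and the `_post` helper are assumptions made for illustration. The request and response shapes (`model` and `input` in, a `data` list of `index` and `embedding` out) follow the OpenAI embeddings API.

```python
import json
import urllib.request

class HttpEmbedder:
    """Remote embedder sketch for an OpenAI-compatible /v1/embeddings endpoint."""

    def __init__(self, base_url: str, model: str = "bge-m3", timeout_s: float = 30.0):
        self.url = base_url.rstrip("/") + "/v1/embeddings"
        self.model = model          # assumed model name, adjust to the server
        self.timeout_s = timeout_s

    def _post(self, payload: dict) -> dict:
        # Isolated so that tests can replace the network call.
        request = urllib.request.Request(
            self.url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=self.timeout_s) as response:
            return json.load(response)

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []
        body = self._post({"model": self.model, "input": texts})
        # Sort by "index" so the vectors line up with the inputs.
        rows = sorted(body["data"], key=lambda row: row["index"])
        return [row["embedding"] for row in rows]

    def embed_query(self, text: str) -> list[float]:
        return self.embed_texts([text])[0]
```

Because it exposes the same `embed_texts` and `embed_query` methods as a local embedder, a class like this can be handed to callers that never learn the vectors come from another machine.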

5. Embedding autonomy

There is a second, deeper reason why this tidy-up matters. Historically the text embedding went — at least conceptually — through a shared external structure. Bringing it back inside virt made it clear (and explicit) that the embedders actually already run inside the Metnos process: they are ONNX models loaded in-process, with no dependency on any external structure.

Family	Model	Where it runs
text embedding	BGE-M3 (1024 dim.)	ONNX in-process — no server, no external dependency
image embedding	SigLIP (768 dim.)	ONNX in-process — the same model vectorizes text and image
LLM (text)	Qwen	`llama-server` endpoint on `:8080`
VLM (images)	Qwen3-VL	server on `:8081`, started on demand during indexing

The distinction matters. Embedding — the heart of semantic search, the part that runs on every request — is now autonomous: it lives in the process, calling nothing outside. The LLM and the VLM, by contrast, remain separate servers (with the VLM started only when it is really needed, during photo indexing), but for them too «which model» and «at what address» is now a configuration entry, not a constant in the code.

Why it matters to you. Embedding autonomy means that, with the default configuration, the part of Metnos you use most often (understanding what you mean and searching your data) depends on no external service and no network connection: it runs on your machine, inside the program itself.

6. The lineage: a lightweight subset

Where does virt's shape come from? It is modeled on a pattern already proven elsewhere: declaring the contracts (what an embedder must be able to do, what an LLM must be able to do) as Protocols. In Python, a Protocol is an interface that describes the expected methods without imposing inheritance. But virt takes only the bones, deliberately staying a lightweight subset.

The richer starting pattern includes a registry and dependency injection: an infrastructure that builds and hands out objects on demand. virt throws all of that away. It keeps just two things:

the Protocols as contract and typing (what an embedder must expose: embed_texts, embed_query);
a factory that reads the configuration and returns the right object — exactly the shape llm_router already had for the LLM.

Why can it afford so much simplicity? For a precise reason: the local classes that already exist — BGE's, SigLIP's, the llama provider's — already satisfy those Protocols without changes. They expose the expected methods as they are. No adapter needs to be written: the factory returns them directly. A registry, here, would be dead weight.
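The point about classes satisfying a Protocol "without changes" can be shown with `typing.Protocol`. The sketch below is an illustration, not virt's real definitions: only the method names embed_texts and embed_query come from this document, while the signatures, the `Embedder` name, the `LocalEmbedder` class and the `index` function are assumptions.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Embedder(Protocol):
    """Contract: what any embedder must expose (signatures are assumed)."""
    def embed_texts(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...

class LocalEmbedder:
    """Hypothetical local class. Note that it does not inherit from Embedder."""
    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]   # toy vectors: the text length
    def embed_query(self, text: str) -> list[float]:
        return self.embed_texts([text])[0]

def index(embedder: Embedder, docs: list[str]) -> list[list[float]]:
    # A consumer typed against the contract, not against a concrete class.
    return embedder.embed_texts(docs)

print(isinstance(LocalEmbedder(), Embedder))
print(index(LocalEmbedder(), ["ab", "abc"]))
```

`LocalEmbedder` passes the check purely because it has the right methods, which is why a factory can return an existing class directly and no adapter is needed.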

The house rule. «Deterministic, simple, linear code first of all». A registry with dependency injection would solve a problem Metnos does not have (dozens of interchangeable implementations chosen at runtime). Three facades and a function that reads a file do the same job with a fraction of the moving parts — and it is the same design, already in production, as the LLM router.

7. Going deeper

To understand…	Read
the three LLM tiers, the aliases and the opt-in frontier	multilang (tier section)
where embedding enters tool selection (semantic nearness)	fastpath and autopath
the VLM at work: how photos are described during indexing	executor
the skill-versus-backend boundary (another axis of swappability)	skills & backends

8. What this means for you, in practice

Almost always: nothing. The defaults cover normal use and Metnos picks the right models on its own. This machinery shows up in a single case, when you want to change a model: trying a newer embedder, moving the VLM to another machine, or leaning on an external server. At that moment the difference is stark: you open a text file, change one line, save. There is no code to modify, nothing to recompile, nothing to re-sign.

The upside. Metnos is built to stay as local as possible: the brain you use every day runs on your own machine. Semantic search — the embedding — is now fully autonomous, inside the process, with no dependencies. And when you decide to do otherwise — a more powerful model, a remote server, a bespoke solution — the door is open: one configuration entry is enough. Control stays yours.
