LLM · embedding · VLM

Metnos uses three families of models: an LLM that reasons and plans, an embedder that turns text and images into vectors for semantic search, and a VLM that looks at photos and describes them. The question this document answers is simple: when I want to swap one of these models, what do I have to touch?
Until recently the answer depended on the model. For the LLM a good solution
already existed: a small layer (llm_router) read from a configuration
file which model to serve, so changing it meant editing that file. For the
embedder and the VLM, however, nothing comparable existed: the
various parts of the system imported the concrete class
directly (bge_embedding here, clip_embedding
there), and the VLM's address was hard-coded.
The flaw is not cosmetic, it is practical. Swapping the embedder meant hunting down every spot in the code that named it and changing them one by one: the semantic routing, the indexing, two nightly processes, the image executors. Ten different places, ten chances to miss one. And pointing the embedding at an external server? It simply was not possible: it would have required new code.
virt erases that difference — it brings embedding and
VLM up to the same level of virtualization the LLM already had.
The idea that unlocks everything is a change of question. Instead of asking
«give me BGE-M3», the code asks «give me the embedder
for the role ‘text’». The caller does not
know — and does not care — which model answers: it only knows
what it needs it for. Translating the role into the concrete model is the
job of a single layer, the runtime/virt/ package, which offers three
entry points (in jargon, three facades):
| Facade | Ask for a role… | …and you get back |
|---|---|---|
| virt.get_llm(role) | "fast" / "middle" / "wise" / "frontier" | the language model for that tier (delegates to llm_router) |
| virt.get_embedder(role) | "text" or "image" | the embedder for that modality (BGE-M3 for text, SigLIP for images) |
| virt.get_vlm(role) | "default" | the VLM's configuration (provider, model, address, limits) |
Notice the nature of the roles. For the LLM they are capability
tiers (fast for quick answers, wise for hard
reasoning, frontier for the paid model used only when needed). For
the embedder they are modalities (text or image), because there
the useful distinction is not how powerful the model is, but what kind of
data it transforms. The role is an abstraction that adapts to the domain.
```python
from virt import get_embedder, get_llm, get_vlm

get_embedder("text").embed_texts([...])     # text vectors, role "text"
get_llm("middle").chat(system, user).text   # LLM answer, tier "middle"
get_vlm()                                   # VLM spec (role "default")
```
The LLM already worked this way through llm_router; virt extends it to all
three families.
virt's facades translate the requested role into a concrete model by reading the TOML configuration files. No consumer names a model any more: the configuration does, in a single place.

The role-to-model translation lives in three TOML files, one per family, in the user's configuration directory:
- ~/.config/metnos/llm_tiers.toml: the LLM tiers;
- ~/.config/metnos/embedding_tiers.toml: the embedder modalities;
- ~/.config/metnos/vlm_tiers.toml: the VLM configuration.

Each file has flat sections, one per role. Here is what the embedding one really looks like: two lines to say «text needs BGE, images need SigLIP».
```toml
[text]
provider = "bge"        # BGE-M3, 1024 dimensions
# To point at a REMOTE embedder:
# provider = "http"
# base_url = "http://host:port"

[image]
provider = "siglip"     # SigLIP, 768 dimensions, text+image
```
The VLM one is just as small — a single [default] section with
the provider, the model, the server's address and the image limits:
```toml
[default]
provider = "llamacpp"   # multimodal OpenAI-compatible server
model = "qwen3vl-2b"
base_url = "http://127.0.0.1:8081"
timeout_s = 60
max_edge = 1024         # resize the image's long edge
max_tokens = 512
```
Want to change the text embedder? Change provider in
[text]. Want to move the VLM to another machine? Change
base_url in [default]. No line of Python to touch, no
executor to re-sign.
None of the three files is mandatory. There are built-in defaults (inside virt) that
reproduce today's reality exactly: BGE for text, SigLIP for images, Qwen3-VL on
:8081. The TOML files are only needed when you want to deviate
from the default. Editing them overrides the starting value, line by line; what
you don't touch stays as it was.
The real gain is not just the convenience of the TOML: it is
segregation. Before the change, ten spots in the code imported
the concrete embedder. After it, no consumer imports
bge_embedding or clip_embedding directly any more: they
all go through virt.get_embedder. The knowledge of «which
model» has been gathered into a single funnel.
| | Before | After |
|---|---|---|
| Who names the model | every consumer (routing, indices, nightly jobs, image executors) | only the configuration, read by virt |
| To change a model | find and edit N spots in the code | edit one line of TOML |
| Remote endpoint | not possible (would need new code) | provider = "http" + base_url |
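The "single funnel" can be sketched as follows. This is not the real virt code: the stand-in classes, the PROVIDERS table, the in-memory CONFIG and the "tiny" provider are all hypothetical, introduced only to show that a consumer which asks for a role keeps working unchanged when the configuration points that role at another model. The vector sizes 1024 and 768 are the ones stated in the text.

```python
from typing import Callable


class BgeStandIn:
    """Stand-in for the local BGE-M3 class (1024 dimensions)."""
    def embed_texts(self, texts):
        return [[0.0] * 1024 for _ in texts]


class SiglipStandIn:
    """Stand-in for the local SigLIP class (768 dimensions)."""
    def embed_texts(self, texts):
        return [[0.0] * 768 for _ in texts]


class TinyStandIn:
    """Hypothetical replacement model, used only to demonstrate a swap."""
    def embed_texts(self, texts):
        return [[0.0] * 4 for _ in texts]


# provider name -> constructor; the only place that knows concrete classes.
PROVIDERS: dict[str, Callable[[], object]] = {
    "bge": BgeStandIn,
    "siglip": SiglipStandIn,
    "tiny": TinyStandIn,
}

# In this sketch the configuration is a dict; in the text it comes from TOML.
CONFIG = {"text": {"provider": "bge"}, "image": {"provider": "siglip"}}


def get_embedder(role: str):
    """Translate a role into a concrete embedder via the configuration."""
    try:
        section = CONFIG[role]
    except KeyError:
        raise ValueError(f"unknown embedder role: {role!r}") from None
    return PROVIDERS[section["provider"]]()


def vector_size(text: str) -> int:
    """A consumer: it names the role 'text', never a model."""
    return len(get_embedder("text").embed_texts([text])[0])


size_before = vector_size("hello")
CONFIG["text"]["provider"] = "tiny"   # the swap: configuration only
size_after = vector_size("hello")     # same consumer code, different model
print(size_before, size_after)
```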
From this follows a freedom that did not exist before: since everyone asks for the
embedder at the same counter, you can swap the implementation out from
under their feet without them noticing. In particular, the text embedder
can become a remote service — a server behind an HTTP
address — just by writing provider = "http" and the address in
the TOML. It is the counterpart, for embedding, of that «point a tier at an
endpoint» the LLM already did.
The existing local implementations are not wrapped: virt
returns them as they are. The only added implementation is
HttpEmbedder: the remote embedder, which talks to a server compatible
with the OpenAI API. Everything else is routing.
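To make the remote case concrete, here is a minimal sketch of what an embedder speaking the OpenAI embeddings format could look like. It is not the actual HttpEmbedder: the constructor parameters, the default model name "bge-m3", the choice to append /v1/embeddings to base_url, and the injectable transport function are all assumptions. The request body (model, input) and the response shape (a data list whose rows carry index and embedding) follow the OpenAI embeddings API.

```python
import json
import urllib.request
from typing import Callable, Sequence

# (url, payload, timeout in seconds) -> decoded JSON response
Transport = Callable[[str, dict, float], dict]


def _post_json(url: str, payload: dict, timeout_s: float) -> dict:
    """Default transport: POST the payload as JSON and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


class HttpEmbedderSketch:
    """Illustrative remote embedder for an OpenAI-compatible server."""

    def __init__(self, base_url: str, model: str = "bge-m3",
                 timeout_s: float = 30.0, transport: Transport = _post_json):
        # Assumption: base_url is the server root, without the /v1 prefix.
        self.url = base_url.rstrip("/") + "/v1/embeddings"
        self.model = model
        self.timeout_s = timeout_s
        self._transport = transport

    def embed_texts(self, texts: Sequence[str]) -> list[list[float]]:
        payload = {"model": self.model, "input": list(texts)}
        body = self._transport(self.url, payload, self.timeout_s)
        # Sort by index so the output order matches the input order.
        rows = sorted(body["data"], key=lambda row: row["index"])
        return [row["embedding"] for row in rows]

    def embed_query(self, text: str) -> list[float]:
        return self.embed_texts([text])[0]
```

Passing the transport as a parameter is a design choice of this sketch: it lets the class be exercised with a fake function instead of a live server.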
There is a second, deeper reason why this tidy-up matters. Historically the text
embedding went — at least conceptually — through a shared external
structure. Bringing it back inside virt made it clear (and explicit)
that the embedders actually already run inside the Metnos process:
they are ONNX models loaded in-process, with no dependency on any external
structure.
| Family | Model | Where it runs |
|---|---|---|
| text embedding | BGE-M3 (1024 dim.) | ONNX in-process — no server, no external dependency |
| image embedding | SigLIP (768 dim.) | ONNX in-process — the same model vectorizes text and image |
| LLM (text) | Qwen | llama-server endpoint on :8080 |
| VLM (images) | Qwen3-VL | server on :8081, started on demand during indexing |
The distinction matters. Embedding — the heart of semantic search, the part that runs on every request — is now autonomous: it lives in the process, calling nothing outside. The LLM and the VLM, by contrast, remain separate servers (with the VLM started only when it is really needed, during photo indexing), but for them too «which model» and «at what address» is now a configuration entry, not a constant in the code.
Where does virt's shape come from? It is modeled on a pattern already
proven elsewhere: declaring the contracts — what an embedder
must be able to do, what an LLM must — as Protocols (in Python, an
interface that describes the expected methods without imposing inheritance). But
virt takes only the bones, deliberately staying a
lightweight subset.
The richer starting pattern includes a registry and dependency
injection: an infrastructure that builds and hands out objects on demand.
virt throws all of that away. It keeps just two things:
- the Protocols, that is the contracts themselves (for the embedder: embed_texts, embed_query);
- a configuration-driven factory, the same mechanism llm_router already had for the LLM.

Why can it afford so much simplicity? For a precise reason: the local classes that already exist (BGE's, SigLIP's, the llama provider's) already satisfy those Protocols without changes. They expose the expected methods as they are. No adapter needs to be written: the factory returns them directly. A registry, here, would be dead weight.
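The point about Protocols can be shown in a few lines. In this sketch only the method names embed_texts and embed_query come from the text; the parameter and return types, and the class names, are assumptions. It shows that a class written with no knowledge of the Protocol satisfies it purely by having the right methods, which is why a factory can return it without an adapter.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Contract: what an embedder must be able to do (signatures assumed)."""

    def embed_texts(self, texts: Sequence[str]) -> list[list[float]]: ...

    def embed_query(self, text: str) -> list[float]: ...


class ExistingLocalEmbedder:
    """Stand-in for a pre-existing class; note: no inheritance from Embedder."""

    def embed_texts(self, texts):
        return [[float(len(t))] for t in texts]

    def embed_query(self, text):
        return self.embed_texts([text])[0]


class NotAnEmbedder:
    """Lacks embed_query, so it does not satisfy the contract."""

    def embed_texts(self, texts):
        return []


def make_embedder() -> Embedder:
    """Factory: returns the existing class directly, with no wrapper."""
    return ExistingLocalEmbedder()


# runtime_checkable makes isinstance check that the methods are present.
conforms = isinstance(make_embedder(), Embedder)
rejected = isinstance(NotAnEmbedder(), Embedder)
print(conforms, rejected)
```

The runtime check only verifies that the methods exist, not their signatures; a static type checker is what compares the signatures against the Protocol.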
| To understand… | Read |
|---|---|
| the three LLM tiers, the aliases and the opt-in frontier | multilang (tier section) |
| where embedding enters tool selection (semantic nearness) | fastpath and autopath |
| the VLM at work: how photos are described during indexing | executor |
| the skill-versus-backend boundary (another axis of swappability) | skills & backends |
Metnos — model virtualization, ask for a role, not a model