File: src/lib/embedder.ts
The embedding system converts text into high-dimensional vectors for semantic search. It supports local ONNX models via @huggingface/transformers as well as remote HTTP proxies.
Xenova/bge-m3 — the default embedding model (docs, knowledge, tasks, skills, files):
- 1024 dimensions
- Multilingual (100+ languages)
- 8K token context
- ~560 MB download size
- Pooling: cls
- Normalization: L2-normalized (cosine similarity = dot product)
jinaai/jina-embeddings-v2-base-code — the default code graph model:
- 768 dimensions
- Trained on code + natural language pairs
- 8K token context
- Pooling: mean
- Normalization: L2-normalized
The code graph uses a separate model inheritance chain (codeModel) so it can use a code-optimized model by default while other graphs use BGE-M3.
Two-level cache with deduplication and lazy loading:
- `_pipes: Map<name, Pipeline | ModelConfig>` — named models (e.g. `"my-app:docs"`, `"my-app:code"`)
- `_modelCache: Map<modelString, Pipeline>` — deduplicates by model config string
loadModel() only registers the model configuration in _pipes — it does not create the ONNX pipeline. The actual pipeline is created lazily on the first call to embed(), embedBatch(), or embedQuery() for that model. This reduces peak memory by deferring model loads until each model is actually needed.
The same physical model is loaded only once in memory, even if used by multiple graphs or projects.
When creating a pipeline, the following ONNX Runtime session options are applied to reduce memory footprint:
```typescript
session_options: {
  enableCpuMemArena: false,    // disable pre-allocated CPU memory arena
  enableMemPattern: false,     // disable memory pattern optimization
  executionMode: 'sequential', // single-threaded execution
}
```

These options trade a small amount of throughput for significantly lower memory usage, which is important when multiple models may be loaded simultaneously.
| Function | Description |
|---|---|
| `loadModel(model, embedding, modelsDir, name)` | Register model config (lazy — pipeline created on first use) |
| `embed(title, content, modelName?)` | Single embedding: `"title\ncontent"` → `number[]` (triggers lazy load if needed) |
| `embedQuery(query, modelName?)` | Query embedding with `queryPrefix` prepended (triggers lazy load if needed) |
| `embedBatch(inputs, modelName?)` | Batch embedding: multiple items in one forward pass (triggers lazy load if needed) |
| `cosineSimilarity(a, b)` | Dot product (vectors are L2-normalized) |
| `disposeModel(name)` | Dispose a single named pipeline and free its resources |
| `disposeAllModels()` | Dispose all loaded pipelines and clear the model cache |
Configuration is split into two separate objects: model (what model to use) and embedding (how to use it).
The model config is taken as a whole object from the first level that defines it (no field-by-field merge):
graph.model → project.model → server.model → defaults (all graphs except code)
graph.model → project.codeModel → server.codeModel → code defaults (code graph)
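The whole-object, first-defined-wins resolution described above can be sketched as follows (the function name `resolveModel` is illustrative):

```typescript
type ModelConfig = { name: string; pooling: string; normalize: boolean };

// Walk the chain graph → project → server; the first level that defines
// `model` wins as a complete object — fields are never merged across levels.
function resolveModel(
  levels: Array<ModelConfig | undefined>,
  defaults: ModelConfig,
): ModelConfig {
  for (const level of levels) if (level) return level;
  return defaults;
}
```

Note the consequence: a graph that sets only `model.name` would also need to restate `pooling` and `normalize`, because nothing is inherited field-by-field.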
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | `Xenova/bge-m3` | HuggingFace model ID |
| `pooling` | string | `cls` | Pooling strategy: `mean` or `cls` |
| `normalize` | boolean | `true` | L2-normalize output vectors |
| `dtype` | string | `q8` | Quantization: `fp32`, `fp16`, `q8`, `q4` |
| `queryPrefix` | string | `""` | Prefix prepended to search queries |
| `documentPrefix` | string | `""` | Prefix prepended to documents during indexing |
Each field individually inherits up the chain (field-by-field merge):
graph.embedding → project.embedding → server.embedding → defaults
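In contrast to the model config, embedding fields fall back individually. A sketch of this field-by-field merge (the name `mergeEmbedding` is illustrative):

```typescript
type EmbeddingConfig = { batchSize: number; maxChars: number; cacheSize: number };

// Each field inherits independently: graph → project → server → defaults.
function mergeEmbedding(
  levels: Array<Partial<EmbeddingConfig> | undefined>,
  defaults: EmbeddingConfig,
): EmbeddingConfig {
  const out: EmbeddingConfig = { ...defaults };
  // Apply lowest priority first so higher levels overwrite per field.
  for (const level of [...levels].reverse()) {
    if (!level) continue;
    for (const [k, v] of Object.entries(level)) {
      if (v !== undefined) (out as Record<string, unknown>)[k] = v;
    }
  }
  return out;
}
```

So a graph that sets only `batchSize` still inherits `maxChars` and `cacheSize` from the levels above it.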
| Field | Type | Default | Description |
|---|---|---|---|
| `batchSize` | number | `1` | Texts per ONNX forward pass |
| `maxChars` | number | `24000` | Max characters fed to the embedder per node |
| `cacheSize` | number | `10000` | Embedding cache size (`0` = disabled) |
| `remote` | string | — | Remote embedding API URL |
| `remoteApiKey` | string | — | API key for the remote endpoint |
| `remoteModel` | string | — | Which model to request: `"default"` or `"code"` (auto-set to `"code"` for the code graph) |
```yaml
model:
  name: "Xenova/bge-m3"
  pooling: "cls"
  normalize: true
```

```yaml
model:
  name: "Xenova/bge-base-en-v1.5"
  pooling: "cls"
  normalize: true
  queryPrefix: "Represent this sentence for searching relevant passages: "
```

```yaml
model:
  name: "Xenova/bge-small-en-v1.5"
  pooling: "cls"
  normalize: true
  queryPrefix: "Represent this sentence for searching relevant passages: "
```

```yaml
model:
  name: "Xenova/all-MiniLM-L6-v2"
  pooling: "mean"
  normalize: true
```

```yaml
model:
  name: "nomic-ai/nomic-embed-text-v1.5"
  pooling: "mean"
  normalize: true
  queryPrefix: "search_query: "
  documentPrefix: "search_document: "
```

```yaml
model:
  name: "Xenova/bge-m3"
  pooling: "cls"
  normalize: true
  dtype: "q8" # fp32, fp16, q8, q4
```

Instead of loading a local ONNX model, a project can delegate embedding to a remote server:
```yaml
server:
  embedding:
    remote: "http://gpu-server:3000/api/embed"
    remoteApiKey: "emb-secret-key"
```

When `remote` is set:

- `loadModel()` registers an HTTP proxy instead of downloading an ONNX model
- `embed()` and `embedBatch()` forward requests to the remote endpoint via HTTP POST
- URL validation enforces `http:` or `https:` protocols only (SSRF protection)
This is useful for delegating embedding to a GPU machine running the Embedding API endpoint (see below).
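The protocol check behind the SSRF protection can be sketched as follows (a minimal illustration, not the actual validation code):

```typescript
// Accept only http: and https: remote URLs; this rejects file:, ftp:,
// data:, and other schemes that could be abused for SSRF.
function validateRemoteUrl(url: string): URL {
  const parsed = new URL(url); // throws TypeError on malformed input
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`unsupported protocol: ${parsed.protocol}`);
  }
  return parsed;
}
```

Using the `URL` parser rather than string matching avoids bypasses like uppercase schemes or embedded whitespace.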
The Embedding API also accepts format: "base64" in the POST body to return embeddings as Base64-encoded Float32 arrays instead of JSON number arrays.
File: src/lib/embedding-codec.ts
Embedding vectors are stored in graph JSON files using Base64 encoding for compact serialization. Float32 arrays are encoded as Base64 strings, saving ~3x space vs JSON number arrays.
- Save: `compressEmbeddings(exported)` converts `number[]` → Base64 string for the fields `embedding` and `fileEmbedding`
- Load: `decompressEmbeddings(exported)` converts Base64 string → `number[]`
- Backwards compatible: detects the old format (`number[]`) on load and passes it through unchanged
Buffer alignment is handled explicitly — an optimized Buffer.from(base64, 'base64') approach is used with aligned copy before constructing a Float32Array view.
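A self-contained Node.js sketch of the round trip, including the aligned copy (the function names here are illustrative; they mirror but are not the exported `compressEmbeddings`/`decompressEmbeddings`):

```typescript
// Encode a Float32 embedding as Base64 (~3x smaller than a JSON number array).
function encodeEmbedding(vec: number[]): string {
  const f32 = Float32Array.from(vec);
  return Buffer.from(f32.buffer, f32.byteOffset, f32.byteLength).toString("base64");
}

// Decode Base64 back into numbers, copying into a fresh buffer first.
function decodeEmbedding(b64: string): number[] {
  const buf = Buffer.from(b64, "base64");
  // Buffer.from may return a view into Node's shared pool at an arbitrary
  // byte offset; a Float32Array view requires 4-byte alignment, so copy.
  const aligned = new Uint8Array(buf.length);
  aligned.set(buf);
  return Array.from(new Float32Array(aligned.buffer, 0, aligned.length / 4));
}
```

Note the round trip is lossy for arbitrary doubles (values pass through 32-bit floats), which is fine here because the embeddings are Float32 to begin with.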
The server can expose its local embedding model as a REST endpoint:
```yaml
server:
  embeddingApi:
    enabled: true
    apiKey: "emb-secret-key" # optional, separate from user apiKeys
    maxTexts: 100            # max texts per request (default 100)
    maxTextChars: 10000      # max chars per text (default 10000)
```

POST /api/embed
```jsonc
// Request (default model)
{ "texts": ["hello world", "another text"] }

// Request (code model)
{ "texts": ["function login() { ... }"], "model": "code" }

// Response
{ "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]] }
```

The `model` parameter selects which embedding model to use: `"default"` (general, BGE-M3) or `"code"` (code-optimized, jina-code). Both models are loaded when the embedding API is enabled.
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable the `/api/embed` endpoint |
| `apiKey` | string | — | Optional API key (separate from user `apiKeys`) |
| `maxTexts` | number | `100` | Maximum number of texts per request |
| `maxTextChars` | number | `10000` | Maximum characters per individual text |
Validation enforces maxTexts and maxTextChars limits on every request.
When apiKey is set, requests must include Authorization: Bearer <apiKey>. The embedding API key is separate from user authentication keys.
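The combined auth and limit checks can be sketched as one validation step (illustrative names and status-code mapping; not the actual handler):

```typescript
type EmbeddingApiConfig = { apiKey?: string; maxTexts: number; maxTextChars: number };

// Hypothetical per-request validation for POST /api/embed.
function validateEmbedRequest(
  cfg: EmbeddingApiConfig,
  authHeader: string | undefined,
  texts: string[],
): { status: number; error?: string } {
  // Bearer auth is only enforced when an embedding API key is configured.
  if (cfg.apiKey && authHeader !== `Bearer ${cfg.apiKey}`) {
    return { status: 401, error: "invalid or missing API key" };
  }
  if (texts.length > cfg.maxTexts) {
    return { status: 400, error: `too many texts (max ${cfg.maxTexts})` };
  }
  for (const t of texts) {
    if (t.length > cfg.maxTextChars) {
      return { status: 400, error: `text too long (max ${cfg.maxTextChars} chars)` };
    }
  }
  return { status: 200 };
}
```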
Files larger than maxFileSize (default 1 MB / 1048576 bytes) are skipped during indexing. This prevents embedding excessively large files. The limit can be set at server, workspace, or project level:
```yaml
server:
  maxFileSize: 1048576 # 1 MB (default)

projects:
  my-app:
    maxFileSize: 2097152 # 2 MB override for this project
```

The server applies per-IP rate limits (requests per minute) to protect the embedding API and other endpoints:
```yaml
server:
  rateLimit:
    global: 600 # all endpoints (default 600/min)
    search: 120 # search endpoints (default 120/min)
    auth: 10    # login/token endpoints (default 10/min)
```

Different graphs can use different embedding models. Model config is taken as a whole object (first-defined-wins), so each graph that defines `model` gets it entirely:
```yaml
projects:
  my-app:
    projectDir: "/path/to/my-app"
    model:
      name: "Xenova/bge-m3" # default for most graphs
      pooling: "cls"
      normalize: true
    graphs:
      files:
        model:
          name: "Xenova/bge-small-en-v1.5" # smaller model for file paths
          pooling: "cls"
          normalize: true
      code:
        model:
          name: "Xenova/bge-base-en-v1.5" # different model for code
          pooling: "cls"
          normalize: true
```

Each persisted graph JSON stores two fields that trigger re-indexing:
- `version` — a data schema version (`GRAPH_DATA_VERSION`). Bumped on changes to what gets embedded, path normalization, or stored format.
- `embeddingModel` — a fingerprint of the embedding config (model name + pooling + normalize + documentPrefix + dtype).
On load, if either field doesn't match the current values, the graph is automatically discarded and re-indexed from scratch.
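The load-time check can be sketched as follows (the fingerprint format, the `needsReindex` helper, and the `GRAPH_DATA_VERSION` value are all illustrative):

```typescript
type EmbedFingerprint = {
  name: string;
  pooling: string;
  normalize: boolean;
  documentPrefix: string;
  dtype: string;
};

const GRAPH_DATA_VERSION = 3; // illustrative value, not the real constant

// Fingerprint the embedding config identically on save and on load.
function fingerprint(cfg: EmbedFingerprint): string {
  return [cfg.name, cfg.pooling, cfg.normalize, cfg.documentPrefix, cfg.dtype].join("|");
}

// Any mismatch means the stored vectors were produced under different
// assumptions and cannot be compared to new ones — discard and re-index.
function needsReindex(
  stored: { version: number; embeddingModel: string },
  current: EmbedFingerprint,
): boolean {
  return stored.version !== GRAPH_DATA_VERSION || stored.embeddingModel !== fingerprint(current);
}
```

Including `dtype` in the fingerprint matters: switching quantization (e.g. `q8` → `fp16`) changes the vectors enough that mixing old and new embeddings would degrade search quality.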
Models are cached locally at ~/.graph-memory/models/ (configurable via server.modelsDir). First startup downloads the model; subsequent starts use the cache.
For Docker, mount a named volume to /data/models/ to persist the cache across container restarts.