Support configurable embedding vector dimension for OpenAI-compatible providers #330
I am currently testing the changes in this PR with our stack at faktenforum/ai-chat-interface#17 and it seems to be working well.
Pull request overview
This PR enables using OpenAI-compatible embedding providers that return non-1536 dimensional vectors by making the SQLite documents_vec virtual table’s embedding dimension configurable at runtime via config.embeddings.vectorDimension.
Changes:
- Add validation to require embeddings.vectorDimension >= 1.
- Add migration 012 to drop documents_vec so it can be recreated at runtime instead of being fixed at 1536 in migrations.
- Add DocumentStore.ensureVectorTable() to create/reconcile documents_vec using the configured dimension and backfill from documents.embedding; update migration tests accordingly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/utils/config.ts | Adds a minimum constraint for embeddings.vectorDimension. |
| src/store/applyMigrations.test.ts | Updates migration expectations and creates a test-only vec table for vector-search tests. |
| src/store/DocumentStore.ts | Creates/reconciles documents_vec at startup using configurable dimension and performs backfill. |
| db/migrations/012-drop-documents-vec-for-runtime.sql | Drops documents_vec during migrations to defer creation to runtime. |
    FROM documents d
    JOIN pages p ON d.page_id = p.id
    JOIN versions v ON p.version_id = v.id
    WHERE d.embedding IS NOT NULL;
Copilot
AI
Feb 7, 2026
ensureVectorTable() backfills documents_vec directly from documents.embedding without ensuring the JSON array length matches the configured dimension. If a user changes embeddings.vectorDimension (e.g., from 1536 -> 3584), existing stored embeddings will typically still be 1536-long and inserting them into FLOAT[3584] is likely to fail, preventing startup. Consider filtering the backfill to only rows whose embedding length matches dim, or explicitly padding/normalizing (or clearing) existing embeddings when recreating the table so initialization can succeed after a dimension change.
Suggested change:
    - WHERE d.embedding IS NOT NULL;
    + WHERE d.embedding IS NOT NULL
    +   AND json_array_length(d.embedding) = ${dim};
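The same length check can also be done on the application side, for example to log or re-embed rows that no longer fit. The following is only a sketch: the `EmbeddingRow` shape and the `partitionByDimension` helper are hypothetical names, and it assumes `documents.embedding` holds a JSON array serialized as text, as the backfill query implies.

```typescript
// Hypothetical row shape: `embedding` is the JSON text stored in documents.embedding.
interface EmbeddingRow {
  id: number;
  embedding: string | null;
}

// Returns the length of a JSON-encoded array, or null if the text is not a JSON array.
function jsonArrayLength(text: string): number | null {
  try {
    const parsed: unknown = JSON.parse(text);
    return Array.isArray(parsed) ? parsed.length : null;
  } catch {
    return null;
  }
}

// Splits rows into those that fit FLOAT[dim] and those that would need re-embedding.
function partitionByDimension(rows: EmbeddingRow[], dim: number) {
  const compatible: EmbeddingRow[] = [];
  const stale: EmbeddingRow[] = [];
  for (const row of rows) {
    if (row.embedding === null) continue; // mirrors WHERE d.embedding IS NOT NULL
    (jsonArrayLength(row.embedding) === dim ? compatible : stale).push(row);
  }
  return { compatible, stale };
}

// Toy data: with dim = 3, row 1 is compatible, row 2 is stale, row 3 is skipped.
const rows: EmbeddingRow[] = [
  { id: 1, embedding: JSON.stringify([0.1, 0.2, 0.3]) },
  { id: 2, embedding: JSON.stringify([0.1, 0.2]) },
  { id: 3, embedding: null },
];
const { compatible, stale } = partitionByDimension(rows, 3);
console.log(compatible.map((r) => r.id), stale.map((r) => r.id));
```

Rows in `stale` would stay out of documents_vec until they are re-embedded with the new model, so vector search would not return them in the meantime.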
    const existingSql = this.db
      .prepare(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'documents_vec';",
      )
      .get() as { sql: string } | undefined;

    if (existingSql) {
      const match = existingSql.sql.match(/embedding\s+FLOAT\s*\[\s*(\d+)\s*]/i);
      const existingDim = match ? Number(match[1]) : null;
      if (existingDim === dim) {
        return;
      }
      this.db.exec("DROP TABLE documents_vec;");
    }

    this.db.exec(`
      CREATE VIRTUAL TABLE documents_vec USING vec0(
        library_id INTEGER NOT NULL,
        version_id INTEGER NOT NULL,
        embedding FLOAT[${dim}]
      );
    `);
    this.db.exec(`
      INSERT OR REPLACE INTO documents_vec (rowid, library_id, version_id, embedding)
      SELECT d.id, v.library_id, v.id, json_extract(d.embedding, '$')
      FROM documents d
      JOIN pages p ON d.page_id = p.id
      JOIN versions v ON p.version_id = v.id
      WHERE d.embedding IS NOT NULL;
    `);
Copilot
AI
Feb 7, 2026
ensureVectorTable() uses a check-then-create pattern (SELECT sqlite_master then CREATE VIRTUAL TABLE) without IF NOT EXISTS / DROP ... IF EXISTS or a transaction. If two processes/instances initialize against the same DB concurrently, one can race and throw "table already exists" (or drop conflicts). Making the DDL idempotent (e.g., CREATE VIRTUAL TABLE IF NOT EXISTS, DROP TABLE IF EXISTS) and/or wrapping the reconcile step in a transaction would make startup more reliable under concurrent access.
Suggested change:
    // Ensure that inspection, potential drop, creation, and backfill are atomic
    this.db.exec("BEGIN IMMEDIATE;");
    try {
      const existingSql = this.db
        .prepare(
          "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'documents_vec';",
        )
        .get() as { sql: string } | undefined;
      if (existingSql) {
        const match = existingSql.sql.match(/embedding\s+FLOAT\s*\[\s*(\d+)\s*]/i);
        const existingDim = match ? Number(match[1]) : null;
        if (existingDim === dim) {
          this.db.exec("COMMIT;");
          return;
        }
        this.db.exec("DROP TABLE IF EXISTS documents_vec;");
      }
      this.db.exec(`
        CREATE VIRTUAL TABLE IF NOT EXISTS documents_vec USING vec0(
          library_id INTEGER NOT NULL,
          version_id INTEGER NOT NULL,
          embedding FLOAT[${dim}]
        );
      `);
      this.db.exec(`
        INSERT OR REPLACE INTO documents_vec (rowid, library_id, version_id, embedding)
        SELECT d.id, v.library_id, v.id, json_extract(d.embedding, '$')
        FROM documents d
        JOIN pages p ON d.page_id = p.id
        JOIN versions v ON p.version_id = v.id
        WHERE d.embedding IS NOT NULL;
      `);
      this.db.exec("COMMIT;");
    } catch (error) {
      this.db.exec("ROLLBACK;");
      throw error;
    }
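The begin/commit/rollback bookkeeping in the suggestion could be factored into a small wrapper, so that an early return inside the reconcile step does not need its own COMMIT. This is a minimal sketch, not the project's code: `SqlExecutor` and `withImmediateTransaction` are hypothetical names, and the recording fake stands in for a real database handle purely for illustration.

```typescript
// Minimal structural type (an assumption): anything exposing exec(sql) fits.
interface SqlExecutor {
  exec(sql: string): unknown;
}

// Runs `work` inside BEGIN IMMEDIATE ... COMMIT and rolls back if it throws.
function withImmediateTransaction<T>(db: SqlExecutor, work: () => T): T {
  db.exec("BEGIN IMMEDIATE;");
  try {
    const result = work();
    db.exec("COMMIT;");
    return result;
  } catch (error) {
    db.exec("ROLLBACK;");
    throw error;
  }
}

// Recording fake used only to show which statements are issued.
const log: string[] = [];
const fakeDb: SqlExecutor = { exec: (sql) => log.push(sql) };

// Successful path: the callback may return early without committing by hand.
const outcome = withImmediateTransaction(fakeDb, () => "unchanged");

// Failing path: the error propagates after ROLLBACK.
let failed = false;
try {
  withImmediateTransaction(fakeDb, () => {
    throw new Error("boom");
  });
} catch {
  failed = true;
}
console.log(outcome, failed, log);
```

With a wrapper like this, the body of ensureVectorTable() could keep its plain `return` in the matching-dimension branch.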
    -- Migration: Drop documents_vec so the app can recreate it at runtime with configurable embedding dimension.
    -- Supports different embedding providers (e.g. 1536 vs 3584 dimensions). The table is recreated in
    -- DocumentStore.ensureVectorTable() using config.embeddings.vectorDimension (env: DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION).
    DROP TABLE IF EXISTS documents_vec;
Copilot
AI
Feb 7, 2026
Dropping documents_vec unconditionally forces a full vector index rebuild on next startup for every existing installation (even when staying on 1536d). For large stores this can be a costly operation and increase downtime. If runtime reconciliation already handles mismatched dimensions, consider avoiding the unconditional drop here (or limiting it to cases where it’s actually needed) to preserve existing indexes when the configured dimension hasn’t changed.
    async initialize(): Promise<void> {
      try {
        // 1. Load extensions first (moved before migrations)
        sqliteVec.load(this.db);

        // 2. Apply migrations (after extensions are loaded)
        await applyMigrations(this.db, {
          maxRetries: this.config.db.migrationMaxRetries,
          retryDelayMs: this.config.db.migrationRetryDelayMs,
        });

        // 3. Create vector table at runtime with configurable dimension (migration 012 drops it for us to recreate)
        this.ensureVectorTable();

        // 4. Initialize prepared statements
        this.prepareStatements();

        // 5. Initialize embeddings client (await to catch errors)
        await this.initializeEmbeddings();
      } catch (error) {
        // Re-throw StoreError, ModelConfigurationError, and UnsupportedProviderError directly
        if (
          error instanceof StoreError ||
          error instanceof ModelConfigurationError ||
          error instanceof UnsupportedProviderError
        ) {
          throw error;
        }
        throw new ConnectionError("Failed to initialize database connection", error);
      }
    }

    /**
     * Gracefully closes database connections
     */
    async shutdown(): Promise<void> {
      this.db.close();
    }

    /**
     * Creates or reconciles the documents_vec virtual table with configurable dimension.
     * Called after migrations; migration 012 drops the table so we recreate it from config here.
     * Idempotent: if the table already exists with the same dimension, no-op; if dimension
     * changed in config, drops and recreates so any embedding provider (e.g. 1536 or 3584) works.
     */
    private ensureVectorTable(): void {
      const dim = this.config.embeddings.vectorDimension;
      if (typeof dim !== "number" || !Number.isInteger(dim) || dim < 1) {
        throw new StoreError(
          `Invalid embeddings.vectorDimension: ${dim}. Must be a positive integer.`,
        );
      }
Copilot
AI
Feb 7, 2026
New runtime behavior (creating/reconciling documents_vec based on embeddings.vectorDimension) doesn’t appear to have direct test coverage. Adding tests around DocumentStore initialization that assert the created table schema uses the configured dimension, and that changing the configured dimension is handled safely (including existing embeddings) would help prevent regressions.
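One way to get part of this coverage without a live database is to test the schema logic as pure functions. The sketch below assumes the DDL and the regular expression shown in the diff; `buildVecTableSql`, `parseVecDimension`, `planVectorTable`, and `expectEqual` are hypothetical helpers, not functions that exist in the PR.

```typescript
// Builds the DDL for documents_vec (column list copied from the diff).
function buildVecTableSql(dim: number): string {
  return `CREATE VIRTUAL TABLE documents_vec USING vec0(
  library_id INTEGER NOT NULL,
  version_id INTEGER NOT NULL,
  embedding FLOAT[${dim}]
);`;
}

// Same regular expression as ensureVectorTable() in the diff.
function parseVecDimension(sql: string): number | null {
  const match = sql.match(/embedding\s+FLOAT\s*\[\s*(\d+)\s*]/i);
  return match ? Number(match[1]) : null;
}

type Action = "create" | "keep" | "recreate";

// Decides what the reconcile step should do for a given stored schema.
function planVectorTable(existingSql: string | undefined, dim: number): Action {
  if (existingSql === undefined) return "create";
  return parseVecDimension(existingSql) === dim ? "keep" : "recreate";
}

// Tiny assertion helper so the sketch has no test-framework dependency.
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

expectEqual(parseVecDimension(buildVecTableSql(3584)), 3584, "DDL round trip");
expectEqual(planVectorTable(undefined, 1536), "create", "missing table");
expectEqual(planVectorTable(buildVecTableSql(1536), 1536), "keep", "same dimension");
expectEqual(planVectorTable(buildVecTableSql(1536), 3584), "recreate", "changed dimension");
console.log("all schema checks passed");
```

Tests against a real DocumentStore would still be needed to cover the backfill itself, including embeddings stored under a previous dimension.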
We use an OpenAI-compatible embeddings API (Scaleway) with the model bge-multilingual-gemma2, which outputs 3584-dimensional vectors. The current schema fixes the vector table to 1536 dimensions in migrations, which causes a dimension mismatch and prevents using this (and other non-1536) models.

This PR makes the vector dimension configurable:
- Migration 012 drops documents_vec so it is no longer created with a fixed dimension in migrations.
- DocumentStore.ensureVectorTable() creates documents_vec using config.embeddings.vectorDimension (already exposed via DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION) and runs the same backfill as in migration 011. Existing triggers continue to apply.

Users can set DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION=3584 (or the dimension of their model) when using Scaleway or any other provider whose embedding dimension is not 1536. The default remains 1536 for existing setups.
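For illustration, resolving the setting from the environment could look roughly like the following. This is a sketch under assumptions: the project's actual config loader is not shown in this PR, and `resolveVectorDimension` is a hypothetical helper that only mirrors the documented default (1536) and the positive-integer rule.

```typescript
// Default stated in the PR description for existing setups.
const DEFAULT_VECTOR_DIMENSION = 1536;

// Hypothetical helper: reads the dimension from an env-like map and validates it.
function resolveVectorDimension(env: Record<string, string | undefined>): number {
  const raw = env.DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION;
  if (raw === undefined || raw.trim() === "") return DEFAULT_VECTOR_DIMENSION;
  const dim = Number(raw);
  if (!Number.isInteger(dim) || dim < 1) {
    throw new Error(`Invalid embeddings.vectorDimension: ${raw}. Must be a positive integer.`);
  }
  return dim;
}

// Example: a provider with 3584-dimensional embeddings versus an unset variable.
const configured = resolveVectorDimension({ DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION: "3584" });
const fallback = resolveVectorDimension({});
console.log(configured, fallback);
```

The value has to match what the embedding model actually returns; otherwise inserts into documents_vec would be expected to fail with a dimension mismatch.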