This document explains, in plain language, how the system routes a user question to the right database and SQL—using a two-stage search (Catalog then Query Bank) with semantic vectors (KNN) and BM25 keyword scores. It also shows how the fallback LLM is prompted, plus handy debug endpoints and knobs to tune.
Flow (2 stages + fallback):

1. Pick DB (Catalog):
   - Search Redis `idx:tables` (one doc per table) with your question.
   - Semantic KNN over `nl_desc_vector` (high signal).
   - BM25 text over `table` / `columns` / `description` (supporting signal).
   - Small domain keyword/tag boosts.
   - Aggregate scores per `db_name` → shortlist 1–2 DBs.
2. Find template (Query Bank):
   - Search Redis `idx:qcache` for a templated signature similar to your question (vector KNN), filtered to that shortlist.
   - Accept when `similarity ≥ 1 - BANK_ACCEPT`.
3. Fallback LLM:
   - If no bank hit, build a prompt for the configured provider (e.g., Cloudflare) and generate SQL.
   - Include table hints from the catalog when available.
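The three-stage flow above can be sketched as plain control flow. This is a minimal illustration, not the real implementation; every function here is a stub standing in for the actual search/LLM calls:

```python
# Illustrative stubs for the real Catalog search, Query Bank lookup, and LLM call.
def pick_databases(question):
    """Stage 1 (Catalog): return a shortlist of 1-2 db_names."""
    return ["statements"]

def find_template(question, shortlist):
    """Stage 2 (Query Bank): return a template hit or None."""
    return None  # pretend no stored signature is close enough

def generate_sql_with_llm(question, db, hints):
    """Fallback: call the configured provider; stubbed here."""
    return "SELECT 1"

def route_question(question):
    """End-to-end control flow: Catalog -> Query Bank -> LLM fallback."""
    shortlist = pick_databases(question)
    hit = find_template(question, shortlist)
    if hit is not None:
        return {"route": "bank_hit", "db": hit["db"], "sql": hit["sql"]}
    db = shortlist[0]
    sql = generate_sql_with_llm(question, db, hints=[])
    return {"route": "catalog_llm", "db": db, "sql": sql}
```

With the stubs as written, any question falls through to the LLM route; a real bank hit would short-circuit at step 2.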
Each table is a Redis doc with fields like:

- `db_name` (TAG): logical domain (e.g., `statements`, `informatica`)
- `table` (TEXT)
- `columns` (TEXT) – optional, if available
- `description` (TEXT)
- `domain_tags` (TAG)
- `nl_desc_vector` (VECTOR, FLOAT32, COSINE)

Vectors (`nl_desc_vector`) are built from concatenated table text (table + columns + description).
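A FLOAT32 vector field in RediSearch is stored as raw little-endian bytes. A minimal sketch of building the doc text and packing an embedding (the float list is a stand-in for whatever embedding model is configured):

```python
import struct

def catalog_doc_text(table, columns="", description=""):
    """Concatenate table + columns + description; this is the text the
    nl_desc_vector embedding is built from."""
    return " ".join(part for part in (table, columns, description) if part)

def to_float32_bytes(vec):
    """Pack an embedding as little-endian FLOAT32 bytes, the raw layout a
    RediSearch VECTOR FLOAT32 field stores."""
    return struct.pack(f"<{len(vec)}f", *vec)

text = catalog_doc_text("CUSTOMER_TRANSACTIONS", "id amount merchant_id", "Card transactions")
blob = to_float32_bytes([0.1, 0.2, 0.3])  # stand-in for embed(text)
```

The packed blob would be written with the other fields via `HSET`, and the same packing is used for the query vector at search time.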
Each template is a Redis doc with:

- `db_name` (TAG) – same domain concept as above
- `signature` (TEXT) – templated NL signature of the question
- `sql_template` (TEXT)
- `nl_desc_vector` (VECTOR, FLOAT32, COSINE) – embedding of the signature

Note: We compare question↔signature, not question↔SQL text.
`databases.json` holds connection/config metadata for executors. It is not used for matching (no search or vectors).
Goal: choose the most likely `db_name`(s) from table-level hits.

- Semantic KNN on `@nl_desc_vector` using an embedding of the user question.
- BM25 text search over `table`, `columns`, `description`.

For each hit (table doc) we add to its `db_name` bucket:

- `+ semantic_similarity` (computed as `1 - cosine_distance`)
- `+ 0.3 * bm25_score` (BM25 is helpful but down-weighted)
- `+ small keyword/tag boost` if the question contains domain telltales

Then we rank DBs by total score:

- Enforce `CATALOG_MIN_SCORE`; if none pass, keep top-1 anyway.
- If the gap between #1 and #2 is small, `(s1 - s2) < DB_GAP_THRESHOLD`, include both (up to `SHORTLIST_MAX`).
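The bucket aggregation and shortlist rule can be sketched in a few lines. The weights and defaults mirror the text; the hit structure (`db_name`, `cosine_distance`, `bm25_score` per table doc) is an assumption for illustration:

```python
from collections import defaultdict

CATALOG_MIN_SCORE = 1.0   # minimum total score to accept a DB
DB_GAP_THRESHOLD = 0.6    # include #2 when (s1 - s2) is below this
SHORTLIST_MAX = 2         # max DBs returned

def shortlist_dbs(hits, keyword_boosts=None):
    """hits: [{'db_name': str, 'cosine_distance': float, 'bm25_score': float}, ...]"""
    buckets = defaultdict(float)
    for h in hits:
        buckets[h["db_name"]] += 1 - h["cosine_distance"]   # semantic similarity
        buckets[h["db_name"]] += 0.3 * h["bm25_score"]      # down-weighted BM25
        buckets[h["db_name"]] += (keyword_boosts or {}).get(h["db_name"], 0.0)
    ranked = sorted(buckets.items(), key=lambda kv: kv[1], reverse=True)
    # Enforce the minimum score, but keep top-1 even if nothing passes.
    passing = [(db, s) for db, s in ranked if s >= CATALOG_MIN_SCORE] or ranked[:1]
    short = [passing[0][0]]
    if len(passing) > 1 and (passing[0][1] - passing[1][1]) < DB_GAP_THRESHOLD:
        short.append(passing[1][0])  # close race: keep both candidates
    return short[:SHORTLIST_MAX]
```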
- Semantic (KNN) captures meaning and paraphrases; great for long or natural phrases.
- BM25 nails exact tokens (e.g., `CUSTOMER_TRANSACTIONS`, acronyms) and helps very short queries.
- Combining both makes the shortlist reliable in real-world wording.
Goal: find a stored template that matches the meaning of the user question.

1. Build a templated signature from the user question (normalize numbers, dates, etc.).
2. Search `idx:qcache` with KNN over `nl_desc_vector`, filtered to the shortlist DB(s).
3. Convert distance to similarity: `similarity = 1 - dist`.
4. Accept when `similarity ≥ BANK_MIN_SIMILARITY`, where `BANK_MIN_SIMILARITY = 1 - BANK_ACCEPT` (e.g., `BANK_ACCEPT=0.26` → threshold `0.74`).
5. If accepted:
   - Fill the `sql_template` with extracted params.
   - If the DB is local SQLite (e.g., `informatica`), we may execute directly (if enabled).
   - Else return SQL + params for the proper executor (Oracle, etc.).
6. If no template is good enough → fallback LLM.
We compare question↔signature (both embedded). We do not compare to SQL text.
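Signature building and the acceptance check can be sketched as below. The exact normalization rules (which placeholders, which patterns) are assumptions for illustration; the doc only specifies "normalize numbers, dates, etc.":

```python
import re

BANK_ACCEPT = 0.26
BANK_MIN_SIMILARITY = 1 - BANK_ACCEPT  # 0.74

def build_signature(question):
    """Templated signature: lowercase, then replace dates and numbers with
    placeholders so '... for 2025-07 ...' and '... for 2025-08 ...' embed alike.
    Placeholder names and regexes are illustrative assumptions."""
    s = question.lower().strip()
    s = re.sub(r"\d{4}-\d{2}(-\d{2})?", "<date>", s)  # 2025-07 or 2025-07-31
    s = re.sub(r"\d+(\.\d+)?", "<num>", s)            # any remaining numbers
    return s

def accept(dist):
    """Convert KNN cosine distance to similarity and apply the bank threshold."""
    return (1 - dist) >= BANK_MIN_SIMILARITY
```

Both the question signature and the stored `signature` field go through the same normalization before embedding, which is what makes near-1.0 similarities achievable for reworded parameters.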
When there’s no bank hit:
- Use the picked DB and any table hints from Stage 1.
- Build a minimal, strict prompt (“single SELECT, no DDL, limit rows if reasonable”).
- Call the configured provider (e.g., Cloudflare `@cf/defog/sqlcoder-7b-2`).
Prompt example (simplified):

    You are an expert SQL generator for the 'informatica' database.
    Write a SINGLE SQL SELECT statement that answers the question.
    Prefer the tables listed if they make sense. No DDL, no comments.

    Question:
    Which workflows use table CUSTOMER_TRANSACTIONS as source?

    SQL:
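A prompt in that shape is plain string assembly. The helper below is an illustrative sketch, not the actual builder; the optional `tables` line is how table hints from Stage 1 would be injected:

```python
def build_prompt(db, question, tables=None):
    """Assemble a minimal, strict text-to-SQL prompt (illustrative sketch)."""
    lines = [
        f"You are an expert SQL generator for the '{db}' database.",
        "Write a SINGLE SQL SELECT statement that answers the question.",
        "Prefer the tables listed if they make sense. No DDL, no comments.",
    ]
    if tables:  # Stage 1 table hints, when available
        lines.append("Tables: " + ", ".join(tables))
    lines += ["", "Question:", question, "", "SQL:"]
    return "\n".join(lines)
```

Ending the prompt at `SQL:` nudges completion-style models to emit only the statement.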
Q: “statements count by merchant for 2025-07 by status”
- Stage 1 (Catalog): hits tables in `statements` → pick `db_name=statements`.
- Stage 2 (Bank): question signature ≈ a stored template’s signature → similarity `≈ 1.0` ≥ `0.74` → bank hit.
- Result: filled `sql_template` for `statements`, returned.

Inspector checklist:

- `/api/store/catalog/search?q=statements%20count%20merchant&db=statements&knn=1`
- `/api/store/qbank/search?q=statements%20by%20status&db=statements&knn=1&include_sql=1`
Q: “Which workflows use table CUSTOMER_TRANSACTIONS as source?”
- Stage 1 (Catalog): the `CUSTOMER_TRANSACTIONS` token gives strong BM25; semantic also helps → pick `db_name=informatica`.
- Stage 2 (Bank): similarity is below the strict threshold (e.g., `0.803 < 0.90`) → bank miss.
- Fallback LLM: prompt built with DB + (if available) table hints; provider returns SQL.

Inspector checklist:

- `/api/store/catalog/search?q=CUSTOMER_TRANSACTIONS&db=informatica&knn=1`
- `/api/store/qbank/search?q=workflows%20table%20source&db=informatica&knn=1`
Q: “billing status by merchant”
- Very short question; semantic may be weak, so keyword boosts for `billing` nudge `db_name=billing` above the others.
- Stage 2 tries the bank; if similarity doesn’t pass the threshold → LLM.
- Catalog → DB shortlist
  - `CATALOG_MIN_SCORE` (default `1.0`) — minimum total score to accept.
  - `DB_GAP_THRESHOLD` (default `0.6`) — include #2 if the gap is small.
  - `SHORTLIST_MAX` (default `2`) — max DBs returned.
  - `DB_KEYWORD_BOOSTS`, `DB_DOMAIN_TAGS` — tiny nudges; safe to zero out.
- Bank (template match)
  - `BANK_ACCEPT` — lower accept distance → higher similarity required. `BANK_MIN_SIMILARITY = 1 - BANK_ACCEPT` (e.g., `0.26 → 0.74`).
  - `BANK_TOPK` — neighbors to pull from Redis (search breadth).
- Missing params policies
  - `BANK_MISSING_TIME_POLICY` = `default | llm | ask`
  - `BANK_DEFAULT_RANGE_HOURS` — default time window if needed.
  - `BANK_MISSING_ENTITY_POLICY` = `llm | ask`
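One common way to surface knobs like these is environment variables with the documented defaults. A sketch under that assumption (the actual config mechanism may differ; the `BANK_TOPK` default here is illustrative, not documented):

```python
import os

def _env_float(name, default):
    """Read a float knob from the environment, falling back to the default."""
    return float(os.environ.get(name, default))

CATALOG_MIN_SCORE = _env_float("CATALOG_MIN_SCORE", 1.0)
DB_GAP_THRESHOLD = _env_float("DB_GAP_THRESHOLD", 0.6)
SHORTLIST_MAX = int(os.environ.get("SHORTLIST_MAX", 2))
BANK_ACCEPT = _env_float("BANK_ACCEPT", 0.26)
BANK_MIN_SIMILARITY = 1 - BANK_ACCEPT   # derived, not set directly
BANK_TOPK = int(os.environ.get("BANK_TOPK", 10))  # default assumed for illustration
```

Deriving `BANK_MIN_SIMILARITY` from `BANK_ACCEPT` keeps the two from drifting apart when tuning.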
- Store Inspector (read-only):
  - `GET /api/store/qbank/search` — search templates (text or KNN).
    Examples: `?q=statements by status`, `?db=statements`, `?q=workflow failures&knn=1&k=10&include_sql=1`
  - `GET /api/store/catalog/search` — search tables.
    Examples: `?q=CUSTOMER_TRANSACTIONS&db=informatica&knn=1`, `?db=statements`
  - `GET /api/store/databases` — distinct DB tags + config snapshot.
- Router (end-to-end):
  - `POST /api/text2sql` — runs the full flow and returns the route (`bank_hit` vs `catalog_llm`), prompt, SQL, etc.
Q: What exactly is semantic search?
It compares meanings using vectors. We embed your question and the stored text (table descriptions or signatures) into numbers (vectors) and find the nearest neighbors by cosine distance. Great for paraphrases.
Q: What is BM25?
A classic keyword scoring method. It boosts matches that contain your exact words (and balances term frequency vs document length). Great for exact names and short queries.
Q: Why combine both?
Because real questions mix paraphrase and exact tokens. Semantic finds meaning; BM25 catches precise identifiers. Together they’re robust.
Q: How do you convert vector distance to a score?
RediSearch returns cosine distance (0.0 is identical). We use `similarity = 1 - dist` (closer to 1.0 is better).
Q: db=oracle didn’t work in the inspector. Why?
`db` filters the domain tag (like `statements`, `informatica`), not the SQL engine. If you want to filter by engine, add a separate dialect tag.
- Avoid duplicates in keyword boosts, or dedupe with `set(...)` in code.
- RediSearch text queries cannot be `"* <terms>"`; the inspector builds safe queries to avoid this.
- `databases.json` is NOT indexed; it is only used for executors.
- Bank compares question↔signature, not SQL text.
- Too many false bank hits? Raise strictness → lower `BANK_ACCEPT` (e.g., `0.10`) → threshold `0.90`.
- Bank never hits? Lower the threshold (e.g., `BANK_ACCEPT=0.30` → `0.70` similarity).
- Wrong DB shortlisted? Reduce/zero the keyword boosts; increase semantic weight by ensuring catalog vectors include good descriptions/columns.
- LLM choosing weird tables? Make sure `tables_hint` is populated by filtering tables per picked DB in Stage 1.