Static Mapping Table for Cross-Script BM25 Query Expansion: Rescue 51% of Bilingual Queries (Verified)
Problem
The simple FTS tokenizer produces BM25=0 for cross-script queries (EN↔CJK, SC↔TC). For bilingual users, whose memories are mostly English but whose queries are often in Chinese, 51% of real queries lose their BM25 signal entirely, causing a 29-50% fused score penalty in hybrid retrieval.
Real-world scenario (verified with 314 live memories, 33 real queries):
Memory stored: "Kidney disease sodium intake: limit daily sodium to 2000mg" (EN)
Query: "腎臟疾病鈉攝取" (TC)
BM25 result: 0.00 ✗ (no token overlap between scripts)
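The zero score follows from the absence of shared tokens. The sketch below is a minimal illustration, not the LanceDB tokenizer: `simpleTokenize` and `sharedTokenCount` are hypothetical helpers, and the splitting rule (lowercase, split on whitespace and common punctuation, leave an unbroken CJK run as one token) is an assumption about how a "simple" tokenizer behaves. The Chinese query is illustrative.

```typescript
// Assumption: a "simple"-style tokenizer lowercases the text and splits on
// whitespace and common punctuation, so an unbroken CJK run stays one token.
function simpleTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[\s.,:;!?()"']+/)
    .filter((t) => t.length > 0);
}

// Count the query tokens that also occur in the document.
function sharedTokenCount(doc: string, query: string): number {
  const docTokens = simpleTokenize(doc);
  return simpleTokenize(query).filter((t) => docTokens.indexOf(t) !== -1).length;
}

const memory = "Kidney disease sodium intake: limit daily sodium to 2000mg";
console.log(sharedTokenCount(memory, "sodium intake limit")); // 3
console.log(sharedTokenCount(memory, "腎臟疾病鈉攝取")); // 0
```

With no shared token, every BM25 term contribution is zero regardless of how relevant the memory is semantically.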
Query distribution from actual memory_recall usage:
| Script | Count | % | BM25 Status |
|---|---|---|---|
| EN only | 15 | 48% | ✓ Works (EN memories match) |
| CJK only | 11 | 35% | ✗ BM25=0 against EN memories |
| Mixed CJK+EN | 5 | 16% | |
| SC→TC cross-script | 2 | 6% | ✗ BM25=0 against TC memories |
Result: 51% of real queries lose BM25 signal due to script mismatch.
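A distribution like the one above can be produced by bucketing each query by the scripts it contains. This is a minimal sketch under stated assumptions: `classifyScript` and `tally` are hypothetical names, "CJK" is taken to mean the same Unicode ranges as `CJK_RE` in the proposal, and SC versus TC is not distinguished because that needs a character-level dictionary such as OpenCC.

```typescript
type ScriptClass = "en" | "cjk" | "mixed" | "other";

// Assumption: CJK Unified Ideographs plus Extension A, as in CJK_RE.
const HAS_CJK = /[\u4e00-\u9fff\u3400-\u4dbf]/;
const HAS_LATIN = /[A-Za-z]/;

function classifyScript(query: string): ScriptClass {
  const cjk = HAS_CJK.test(query);
  const latin = HAS_LATIN.test(query);
  if (cjk && latin) return "mixed";
  if (cjk) return "cjk";
  if (latin) return "en";
  return "other";
}

// Count how many queries fall into each bucket.
function tally(queries: string[]): Record<ScriptClass, number> {
  const counts: Record<ScriptClass, number> = { en: 0, cjk: 0, mixed: 0, other: 0 };
  for (const q of queries) counts[classifyScript(q)] += 1;
  return counts;
}

console.log(tally(["daily news digest", "每日新聞摘要", "Discord 頻道"]));
```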
Verified Test Results
A. 24-query controlled benchmark (16 bilingual TC+EN memories):
| Direction | Baseline | With Mapping Table | Δ |
|---|---|---|---|
| EN→TC | 8/8 | 8/8 | 0 |
| SC→TC | 0/8 | 7/8 | +7 rescued ✓ |
| TC→TC | 0/8 | 2/8 | +2 (limited by tokenizer) |
| Total BM25 hits | 8/24 | 17/24 | +112% |
B. Live A/B test (10 real memories, 8 EN↔CJK query pairs from session history):
| Metric | Baseline | With Mapping Table | Δ |
|---|---|---|---|
| BM25 > 0 | 6/16 | 11/16 | +83% |
| Queries rescued | - | 5 | from 0.000 to 0.09-0.40 |
The 5 rescued queries:
| Query (CJK) | BM25 before | BM25 after | Matched Memory |
|---|---|---|---|
| 频道发文规则 (SC) | 0.000 | 0.111 | Discord 頻道發文規則… (TC) |
| 每日新聞摘要 | 0.000 | 0.286 | Daily news digest cron… (EN) |
| 腎臟疾病鈉攝取 | 0.000 | 0.364 | Kidney disease sodium… (EN) |
| 汽車電瓶換電瓶 | 0.000 | 0.400 | P2135 throttle body… Battery… (EN) |
| 使用者的程式風格偏好 | 0.000 | 0.095 | User prefers interactive… (EN) |
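The percentage gains in the two benchmark tables are relative improvements over the baseline hit count. A small sketch of the arithmetic (the helper name `relativeGainPercent` is hypothetical):

```typescript
// Relative improvement of newHits over baselineHits, in percent.
function relativeGainPercent(baselineHits: number, newHits: number): number {
  if (baselineHits <= 0) throw new Error("baseline must be positive");
  return ((newHits - baselineHits) / baselineHits) * 100;
}

// Controlled benchmark: 8/24 -> 17/24 hits.
console.log(relativeGainPercent(8, 17).toFixed(1)); // "112.5"
// Live A/B test: 6/16 -> 11/16 queries with BM25 > 0.
console.log(relativeGainPercent(6, 11).toFixed(1)); // "83.3"
```

Both figures are relative to a small baseline, so they are best read alongside the absolute counts (9 and 5 additional hits).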
Proposed Solution: Generalize expandQuery() with Configurable Mapping Table
PR #112 / #292 proved that query expansion works: expandQuery() uses a hardcoded SYNONYM_MAP of ~15 colloquial Chinese synonyms (挂了→crash, 踩坑→troubleshoot).
This proposal generalizes that approach: replace the hardcoded SYNONYM_MAP with a configurable JSON mapping table (49K+ entries from MUSE + OpenCC) that covers the cross-script BM25=0 problem, a bigger issue affecting 51% of bilingual queries.
The existing expandQuery() handles same-language colloquial→technical synonyms. This proposal extends it to cross-script expansion (EN↔TC↔SC) using a static bilingual dictionary. The original query is preserved for embedding, so the vector search is not diluted.
    Query: "频道发文规则" (SC)
      ↓ QueryExpanderFn (static lookup, ~0ms)
    Expanded: "频道发文规则 channel posting rules 頻道 發文 規則"
      ↓
    BM25: matches "channel posting rules" (EN) ✓
    BM25: matches "頻道發文規則" (TC) ✓
    Embedding: uses original "频道发文规则" (no dilution) ✓
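The split shown in the flow above, with expanded text going to BM25 only and the untouched query going to the embedder, can be sketched as follows. `buildSearchInputs`, `SearchInputs`, and `toyExpander` are hypothetical names used for illustration, and the one-entry expander stands in for the real mapping table.

```typescript
type QueryExpanderFn = (query: string) => string[];

interface SearchInputs {
  bm25Queries: string[]; // original query plus expanded variants
  embeddingQuery: string; // always the untouched original
}

function buildSearchInputs(query: string, expander?: QueryExpanderFn): SearchInputs {
  const bm25Queries = [query];
  const variants = expander ? expander(query) : [];
  for (const variant of variants) {
    // Skip blank variants and duplicates of something already queued.
    if (variant.trim() && bm25Queries.indexOf(variant) === -1) {
      bm25Queries.push(variant);
    }
  }
  return { bm25Queries, embeddingQuery: query };
}

// Hypothetical toy expander standing in for the mapping table lookup.
const toyExpander: QueryExpanderFn = (q) =>
  q.indexOf("频道") !== -1 ? [`${q} channel 頻道`] : [];

console.log(buildSearchInputs("频道发文规则", toyExpander));
```

Keeping `embeddingQuery` separate means expansion terms can add lexical recall without shifting the query vector.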
Implementation
1. New QueryExpanderFn type

    // retriever.ts: new type, coexists with the existing QueryTranslatorFn
    export type QueryExpanderFn = (query: string) => string[];

    export interface MemoryRetrieverConfig {
      // ... existing fields ...
      queryTranslator?: QueryTranslatorFn; // existing: async LLM translation
      queryExpander?: QueryExpanderFn;     // NEW: sync static expansion
    }

2. Integration in fusedSearch()

    // retriever.ts fusedSearch(): add before BM25 search
    async fusedSearch(query, limit, scopeFilter, category) {
      const queries = [query];

      // Step 1: Static expansion (sync, ~0ms), NEW
      if (this.config.queryExpander) {
        const expanded = this.config.queryExpander(query);
        for (const e of expanded) {
          if (e.trim() && e !== query && !queries.includes(e)) {
            queries.push(e);
          }
        }
      }

      // Step 2: LLM translation (async, existing)
      if (this.config.queryTranslator) {
        const translated = await this.config.queryTranslator(query);
        for (const t of translated) {
          if (t.trim() && t !== query && !queries.includes(t)) {
            queries.push(t);
          }
        }
      }

      // Run BM25 for all variants (existing parallel logic)
      const allResults = await Promise.all(
        queries.map(q => this.store.bm25Search(q, limit, scopeFilter))
      );
      // ... existing dedup + fusion logic ...
    }

3. Static mapping table expander

    // query-expander.ts: new file
    import * as fs from "node:fs";
    import type { QueryExpanderFn } from "./retriever.js";

    // CJK Unified Ideographs plus Extension A
    const CJK_RE = /[\u4e00-\u9fff\u3400-\u4dbf]+/g;

    interface MappingTable {
      _meta?: Record<string, unknown>;
      lookup: Record<string, string[]>;
    }

    export function createMappingTableExpander(
      tablePath: string,
      options?: { maxExpansions?: number }
    ): QueryExpanderFn {
      const max = options?.maxExpansions ?? 20;
      const raw = JSON.parse(fs.readFileSync(tablePath, "utf-8")) as MappingTable;
      // Accept both { lookup: {...} } and a bare { term: [...] } object
      const lookup = (raw.lookup ?? raw) as Record<string, string[]>;

      // Build a bidirectional index (key -> values, value -> key) for O(1) lookup
      const index = new Map<string, Set<string>>();
      for (const [key, values] of Object.entries(lookup)) {
        if (key === "_meta") continue;
        if (!index.has(key)) index.set(key, new Set());
        for (const v of values) {
          index.get(key)!.add(v);
          if (!index.has(v)) index.set(v, new Set());
          index.get(v)!.add(key);
        }
      }

      return (query: string): string[] => {
        const expansions = new Set<string>();

        // CJK substring matching
        for (const match of query.matchAll(CJK_RE)) {
          const cjk = match[0];
          // Try the full run first, then every shorter substring down to 2 chars
          for (let len = cjk.length; len >= 2; len--) {
            for (let start = 0; start <= cjk.length - len; start++) {
              const sub = cjk.substring(start, start + len);
              const hits = index.get(sub);
              if (hits) {
                for (const h of hits) {
                  if (expansions.size >= max) break;
                  expansions.add(h);
                }
              }
            }
          }
        }

        // EN word matching (lowercased, words of 2+ letters)
        for (const word of query.toLowerCase().match(/[a-z]{2,}/g) ?? []) {
          const hits = index.get(word);
          if (hits) {
            for (const h of hits) {
              if (expansions.size >= max) break;
              expansions.add(h);
            }
          }
        }

        if (expansions.size === 0) return [];
        // Return expanded query as single string (BM25 treats as OR)
        return [`${query} ${[...expansions].join(" ")}`];
      };
    }

4. Plugin config
Adaptive Mapping via Query History (User-Side Enhancement)
A cron script mines memory_recall query history to build a personalized mapping table containing only the terms the user actually uses, not the full 49K dictionary.
Pipeline:

    Session JSONL → jieba word segmentation → MUSE base table lookup
                  → Ollama batch translate (unmatched)
                  → user-mapping.json

Why jieba? Without word segmentation, CJK strings get split into meaningless short fragments (鈉攝取, 臟疾). With jieba: 189 fragments → 54 meaningful words (↓71%), translations needed 150 → 18 (↓88%), zero garbage output.
| Metric | Without jieba | With jieba |
|---|---|---|
| Terms extracted | 189 | 54 (↓71%) |
| Base table hits | 39 | 36 |
| Need translation | 150 | 18 (↓88%) |
| Garbage terms | 鈉攝取→Dao Fa, 臟疾→Disease | 0 ✓ |
| Translation quality | Mixed | 18/18 meaningful ✓ |
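The reason unsegmented extraction yields fragments that straddle word boundaries can be seen with a sliding window. This is a sketch under stated assumptions: `slidingFragments` models a naive fixed-width extractor, and `greedySegment` is a hypothetical longest-match stand-in for a word segmenter. jieba itself combines a dictionary with a statistical model, so its output can differ.

```typescript
// Every contiguous window of `size` characters in a CJK run.
function slidingFragments(run: string, size: number): string[] {
  const out: string[] = [];
  for (let i = 0; i + size <= run.length; i++) {
    out.push(run.substring(i, i + size));
  }
  return out;
}

// Hypothetical stand-in for a segmenter: greedy longest match against a
// small vocabulary; characters that start no known word are skipped.
function greedySegment(run: string, vocab: string[]): string[] {
  const words: string[] = [];
  let i = 0;
  while (i < run.length) {
    let best = "";
    for (const w of vocab) {
      if (w.length > best.length && run.substring(i, i + w.length) === w) best = w;
    }
    if (best) {
      words.push(best);
      i += best.length;
    } else {
      i += 1;
    }
  }
  return words;
}

console.log(slidingFragments("腎臟疾病", 2)); // ["腎臟", "臟疾", "疾病"]
console.log(greedySegment("腎臟疾病", ["腎臟", "疾病"])); // ["腎臟", "疾病"]
```

The middle window 臟疾 spans two words and has no useful translation, which is the kind of term segmentation removes.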
Cron script concept (update-mapping-from-history.py):

    # 1. Scan session JSONL for memory_recall queries
    queries = extract_queries("~/.openclaw/agents/main/sessions/")
    # → Found 31 queries (16 with CJK, 51%)

    # 2. Word segmentation with jieba
    terms = extract_cjk_terms(queries, use_jieba=True)
    # → 54 meaningful CJK terms

    # 3. Look up in MUSE base table (49K EN↔TC↔SC entries)
    matched, unmatched = match_terms(terms, base_table)
    # → 36 matched, 18 unmatched

    # 4. Batch-translate unmatched via Ollama (optional)
    translated = translate_batch(unmatched, model="rinex20/translategemma3:12b")
    # → 18 new translations, all meaningful

    # 5. Merge into user-mapping.json
    merge_tables(base_table, {**matched, **translated}, "user-mapping.json")

Schedule: daily at 2 AM. Each run is incremental: new terms are added and existing entries are preserved.
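The incremental behaviour described above (new terms added, existing ones preserved) is not spelled out in the issue, so the following is only one plausible reading. It is written in TypeScript to match the rest of the proposal rather than the Python cron script, and `mergeLookups` is a hypothetical name, not the script's `merge_tables`.

```typescript
type Lookup = Record<string, string[]>;

// Merge `incoming` into a copy of `existing`. Existing keys and translations
// keep their order; only unseen keys and unseen translations are appended.
function mergeLookups(existing: Lookup, incoming: Lookup): Lookup {
  const merged: Lookup = {};
  for (const key of Object.keys(existing)) {
    merged[key] = existing[key].slice();
  }
  for (const key of Object.keys(incoming)) {
    if (!Object.prototype.hasOwnProperty.call(merged, key)) merged[key] = [];
    for (const value of incoming[key]) {
      if (merged[key].indexOf(value) === -1) merged[key].push(value);
    }
  }
  return merged;
}

const before: Lookup = { "頻道": ["channel"] };
const update: Lookup = { "頻道": ["channel", "频道"], "摘要": ["digest"] };
console.log(mergeLookups(before, update));
```

Because the merge never removes entries, re-running it with the same input leaves the table unchanged.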
Performance Comparison: All Approaches Tested
| Approach | BM25 Hits | Latency | Dependencies | Migration | False Pos | Cost |
|---|---|---|---|---|---|---|
| Baseline (simple) | 8/24 | 0ms | none | - | low | $0 |
| ⭐ Static mapping table | 17/24 | ~0ms | none | none | low | $0 |
| Write-time enrichment | ~18/24 | 0ms read | LLM at write | re-index all | low | LLM per write |
| LLM query translation | ~16/24 | +2-4s | Ollama 7GB | none | low | LLM per query |
| ngram(2,3) tokenizer | ~20/24 | 0ms | upstream fix | re-index | | $0 |
| Mapping + ngram | ~22-24 | ~0ms | upstream | re-index | medium | $0 |
Why This Over Other Approaches?
| Aspect | Mapping Table | Existing QueryTranslatorFn |
|---|---|---|
| Latency | ~0ms (hash lookup) | +2-4s (LLM inference) |
| Dependencies | None (JSON file) | Ollama + 7GB model |
| Offline | Yes | Needs running Ollama |
| Deterministic | Yes | No (LLM varies) |
| Cost per query | $0 | GPU time |
| Coexistence | ✓ Can use both | ✓ Can use both |
The two are complementary: mapping table handles known terms instantly, QueryTranslatorFn handles novel/creative queries via LLM. Users can enable either or both.
Known Limitations
- Cannot fix same-script CJK matching (TC→TC, SC→SC): the simple tokenizer still treats 伺服器備份設定 as one token. Needs an ngram tokenizer (lancedb#1315, lancedb#2329).
- Static vocabulary: new terms need table updates (mitigated by the cron script).
- Expansion can dilute BM25 ranking: more tokens means lower per-token weight. Mitigated by the maxExpansions cap (default 20).
Mapping table and ngram solve orthogonal problems:
- Mapping table → cross-script (SC↔TC, EN↔CJK)
- ngram → same-script CJK partial matching
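To illustrate the same-script case that n-grams address, the sketch below compares whole-token matching with character n-gram overlap. It assumes "ngram(2,3)" means character n-grams of length 2 to 3; `charNgrams` and `sharesNgram` are hypothetical helpers, and the actual LanceDB tokenizer may behave differently.

```typescript
// Character n-grams for every n in [minN, maxN], grouped by n and then
// ordered by start position within each group.
function charNgrams(text: string, minN: number, maxN: number): string[] {
  const grams: string[] = [];
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      grams.push(text.substring(i, i + n));
    }
  }
  return grams;
}

// True when the query and the document share at least one n-gram.
function sharesNgram(doc: string, query: string, minN: number, maxN: number): boolean {
  const docGrams = charNgrams(doc, minN, maxN);
  return charNgrams(query, minN, maxN).some((g) => docGrams.indexOf(g) !== -1);
}

const docToken = "伺服器備份設定";
console.log(docToken === "備份"); // false: the whole-token comparison misses
console.log(sharesNgram(docToken, "備份", 2, 3)); // true: the 2-gram 備份 overlaps
```

A mapping table cannot help here because both strings are already in the same script; the miss comes from token boundaries, not vocabulary.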
Reproduction
| Resource | URL |
|---|---|
| Full reproduction repo | https://github.com/andychu666/memory-lancedb-pro-bm25-cjk-repro |
| Mapping table data | mapping-tables/ in repro repo |
| Benchmark scripts | repro-mapping-table.sh, benchmark-ab-test.py |
| Cron script | update-mapping-from-history.py |
| Live A/B test | benchmark-ab-test.py (10 real memories, 8 query pairs) |
| MUSE dictionaries | https://github.com/facebookresearch/MUSE |
Related
- #292: expandQuery() with hardcoded synonyms (this proposal generalizes it)
- #112: Original expandQuery() PR (merged to main)
- #271: CJK BM25 cross-script issue (LLM translation approach)
- lancedb#1315: Enable choosing tokenizer in LanceDB FTS
- lancedb#2329: Support Chinese/CJK text search in BM25
Summary
| Metric | Value |
|---|---|
| Affected queries | 51% of bilingual user queries |
| BM25 improvement | +83% to +112% (verified) |
| Implementation size | ~150 LOC (new expander) + JSON table |
| Query latency impact | ~0ms |
| External dependencies | None |
| Migration cost | None (works on existing memories) |
| Coexists with QueryTranslatorFn | Yes |
Plugin version: 1.1.0-beta.8
Test date: 2026-03-21
Tester: @andychu666