feat: configurable mapping table for cross-script BM25 query expansion (generalizes #292) #297

## 🚀 Static Mapping Table for Cross-Script BM25 Query Expansion — Rescue 51% of Bilingual Queries (Verified)

### 🎯 Problem

The `simple` FTS tokenizer produces **BM25=0** for cross-script queries (EN↔CJK, SC↔TC). For bilingual users, whose memories are mostly English but whose queries are often in Chinese, **51% of real queries lose some or all of their BM25 signal**, causing a 29-50% fused score penalty in hybrid retrieval.

**Real-world scenario** (verified with 314 live memories, 31 real queries):

```
Memory stored:  "Kidney disease sodium intake: limit daily sodium to 2000mg"  (EN)
Query:          "腎臟疾病鈉攝取"                                                (TC)
BM25 result:    0.00 ❌ (no token overlap between scripts)
```
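The zero score follows directly from tokenization. The sketch below approximates the `simple` tokenizer as lowercasing and splitting on anything that is not an ASCII letter, digit, or CJK ideograph. `approxSimpleTokens` and `sharedTokens` are hypothetical helpers, and the real tokenizer may differ in detail. With no shared tokens between query and memory, no BM25 term can contribute to the score.

```typescript
// Rough approximation of a whitespace/punctuation tokenizer (assumption:
// the real `simple` tokenizer may differ). A CJK run stays one token.
function approxSimpleTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9\u3400-\u4dbf\u4e00-\u9fff]+/)
    .filter((t) => t.length > 0);
}

// Tokens that appear in both the query and the memory.
function sharedTokens(query: string, memory: string): string[] {
  const memoryTokens = new Set(approxSimpleTokens(memory));
  return approxSimpleTokens(query).filter((t) => memoryTokens.has(t));
}

const memory = "Kidney disease sodium intake: limit daily sodium to 2000mg";
console.log(approxSimpleTokens("腎臟疾病鈉攝取")); // one token: the whole CJK run
console.log(sharedTokens("腎臟疾病鈉攝取", memory)); // empty: nothing for BM25 to score
console.log(sharedTokens("kidney sodium", memory)); // both EN tokens overlap
```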

**Query distribution from actual `memory_recall` usage:**

| Script | Count | % | BM25 Status |
|--------|-------|---|-------------|
| EN only | 15 | 48% | ✅ Works (EN memories match) |
| CJK only | 11 | 35% | ❌ BM25=0 against EN memories |
| Mixed CJK+EN | 5 | 16% | ⚠️ Partial (only EN tokens match) |
| SC→TC cross-script | 2 | 6% | ❌ BM25=0 against TC memories |

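**Result: 51% of real queries (the 16 CJK-only and mixed queries out of 31) lose some or all of their BM25 signal** due to script mismatch. The percentages are relative to 31 queries, so the 2 SC→TC queries appear to be counted within the CJK rows as well.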

### 📊 Verified Test Results

**A. 24-query controlled benchmark** (16 bilingual TC+EN memories):

| Direction | Baseline | With Mapping Table | Δ |
|-----------|----------|-------------------|---|
| EN→TC | 8/8 | 8/8 | — |
| SC→TC | 0/8 | **7/8** | **+7 rescued** ✅ |
| TC→TC | 0/8 | 2/8 | +2 (limited by tokenizer) |
| **Total BM25 hits** | **8/24** | **17/24** | **+112%** |

**B. Live A/B test** (10 real memories, 8 EN↔CJK query pairs from session history):

| | Baseline | With Mapping Table | Δ |
|---|---|---|---|
| **BM25 > 0** | 6/16 | **11/16** | **+83%** |
| **Queries rescued** | — | **5** | from 0.000 to 0.09-0.40 |

The 5 rescued queries:

| Query (CJK) | BM25 before | BM25 after | Matched Memory |
|---|---|---|---|
| `频道发文规则` (SC) | 0.000 | 0.111 | Discord 頻道發文規則… (TC) |
| `每日新聞摘要` | 0.000 | 0.286 | Daily news digest cron… (EN) |
| `腎臟疾病鈉攝取` | 0.000 | 0.364 | Kidney disease sodium… (EN) |
| `汽車電瓶換電瓶` | 0.000 | 0.400 | P2135 throttle body… Battery… (EN) |
| `使用者的程式風格偏好` | 0.000 | 0.095 | User prefers interactive… (EN) |
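
The Δ percentages in tables A and B are relative gains over the baseline hit count, not percentage-point changes in hit rate. A small sketch of that arithmetic, where `relativeGainPercent` is a hypothetical helper name:

```typescript
// Relative gain in BM25 hits, as a percentage of the baseline hit count.
function relativeGainPercent(baselineHits: number, expandedHits: number): number {
  if (baselineHits <= 0) throw new Error("baseline must be positive");
  return ((expandedHits - baselineHits) / baselineHits) * 100;
}

console.log(relativeGainPercent(8, 17).toFixed(1)); // "112.5" (benchmark A, 8/24 → 17/24)
console.log(relativeGainPercent(6, 11).toFixed(1)); // "83.3"  (benchmark B, 6/16 → 11/16)
```

In absolute terms the hit rate moves from about 33% to about 71% in benchmark A, and from 37.5% to 68.75% in benchmark B.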

### 💡 Proposed Solution: Generalize `expandQuery()` with Configurable Mapping Table

[PR #112](https://github.com/CortexReach/memory-lancedb-pro/pull/112) and [#292](https://github.com/CortexReach/memory-lancedb-pro/pull/292) showed that query expansion works: `expandQuery()` uses a hardcoded `SYNONYM_MAP` of ~15 colloquial Chinese synonyms (挂了→crash, 踩坑→troubleshoot).

**This proposal generalizes that approach:** replace the hardcoded `SYNONYM_MAP` with a **configurable JSON mapping table** (49K+ entries from [MUSE](https://github.com/facebookresearch/MUSE) + OpenCC), covering the **cross-script BM25=0 problem** — a bigger issue affecting 51% of bilingual queries.

The existing `expandQuery()` handles same-language colloquial→technical synonyms. This extends it to handle **cross-script expansion** (EN↔TC↔SC) using a static bilingual dictionary. The original query is still used unchanged for the embedding search, so vector similarity is not diluted.

```
Query:    "频道发文规则" (SC)
                ↓ QueryExpanderFn (static lookup, ~0ms)
Expanded: "频道发文规则 channel posting rules 頻道 發文 規則"
                ↓
BM25:     matches "channel posting rules" (EN) ✅
BM25:     matches "頻道發文規則" (TC) ✅
Embedding: uses original "频道发文规则" (no dilution) ✅
```

### 🔧 Implementation

#### 1. New `QueryExpanderFn` type

```typescript
// retriever.ts — new type, coexists with existing QueryTranslatorFn
export type QueryExpanderFn = (query: string) => string[];

export interface MemoryRetrieverConfig {
  // ... existing fields ...
  queryTranslator?: QueryTranslatorFn;   // existing: async LLM translation
  queryExpander?: QueryExpanderFn;        // NEW: sync static expansion
}
```
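
To make the contract concrete, here is a minimal sketch of a function that satisfies `QueryExpanderFn`. The three-entry `toyTable` and the name `toyExpander` are hypothetical and only illustrate the shape of the return value: one expanded variant, or an empty array when nothing matches.

```typescript
type QueryExpanderFn = (query: string) => string[];

// Hypothetical three-entry table, for illustration only.
const toyTable: Record<string, string[]> = {
  "频道": ["channel", "頻道"],
  "发文": ["posting", "發文"],
  "规则": ["rules", "規則"],
};

// Returns one expanded variant (original query plus mapped terms),
// or an empty array when nothing in the table matches.
const toyExpander: QueryExpanderFn = (query) => {
  const extra: string[] = [];
  for (const [term, mapped] of Object.entries(toyTable)) {
    if (query.includes(term)) extra.push(...mapped);
  }
  return extra.length > 0 ? [`${query} ${extra.join(" ")}`] : [];
};

console.log(toyExpander("频道发文规则"));
// ["频道发文规则 channel 頻道 posting 發文 rules 規則"]
console.log(toyExpander("hello")); // []
```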

#### 2. Integration in `fusedSearch()`

```typescript
// retriever.ts fusedSearch() — add before BM25 search
async fusedSearch(query, limit, scopeFilter, category) {
  const queries = [query];

  // Step 1: Static expansion (sync, ~0ms) — NEW
  if (this.config.queryExpander) {
    const expanded = this.config.queryExpander(query);
    for (const e of expanded) {
      if (e.trim() && e !== query && !queries.includes(e)) {
        queries.push(e);
      }
    }
  }

  // Step 2: LLM translation (async, existing)
  if (this.config.queryTranslator) {
    const translated = await this.config.queryTranslator(query);
    for (const t of translated) {
      if (t.trim() && t !== query && !queries.includes(t)) {
        queries.push(t);
      }
    }
  }

  // Run BM25 for all variants (existing parallel logic)
  const allResults = await Promise.all(
    queries.map(q => this.store.bm25Search(q, limit, scopeFilter))
  );
  // ... existing dedup + fusion logic ...
}
```
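
Steps 1 and 2 apply the same filter: drop blanks, drop duplicates, keep the original query first. That filter could be factored into one helper, sketched below. `collectVariants` is a hypothetical name and not part of the existing retriever.

```typescript
// Merge the original query with candidate variants, dropping blank strings
// and duplicates while keeping the original query in first position.
function collectVariants(query: string, ...candidateLists: string[][]): string[] {
  const variants = [query];
  for (const list of candidateLists) {
    for (const candidate of list) {
      // `variants` already contains `query`, so one includes() check covers both cases
      if (candidate.trim() && !variants.includes(candidate)) {
        variants.push(candidate);
      }
    }
  }
  return variants;
}

const expanded = ["频道发文规则 channel posting rules"]; // e.g. from a queryExpander
const translated = ["channel posting rules", "  ", "频道发文规则"]; // e.g. from a queryTranslator
console.log(collectVariants("频道发文规则", expanded, translated));
// ["频道发文规则", "频道发文规则 channel posting rules", "channel posting rules"]
```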

#### 3. Static mapping table expander

```typescript
// query-expander.ts (new file)
import { readFileSync } from "node:fs";
import type { QueryExpanderFn } from "./retriever.js";

// Runs of CJK ideographs (Unified Ideographs + Extension A)
const CJK_RE = /[\u4e00-\u9fff\u3400-\u4dbf]+/g;

interface MappingTable {
  _meta?: Record<string, unknown>;
  lookup: Record<string, string[]>;
}

export function createMappingTableExpander(
  tablePath: string,
  options?: { maxExpansions?: number }
): QueryExpanderFn {
  const max = options?.maxExpansions ?? 20;
  // Accept either { _meta, lookup: {...} } or a flat { term: [...] } table
  const raw = JSON.parse(readFileSync(tablePath, "utf-8")) as Partial<MappingTable>;
  const lookup = (raw.lookup ?? raw) as Record<string, unknown>;

  // Build a bidirectional index (key → values and value → key) so that
  // either side of an entry can be found with an O(1) lookup
  const index = new Map<string, Set<string>>();
  for (const [key, values] of Object.entries(lookup)) {
    if (key === "_meta" || !Array.isArray(values)) continue;
    if (!index.has(key)) index.set(key, new Set());
    for (const v of values as string[]) {
      index.get(key)!.add(v);
      if (!index.has(v)) index.set(v, new Set());
      index.get(v)!.add(key);
    }
  }

  return (query: string): string[] => {
    const expansions = new Set<string>();

    // Adds the hits for one term; returns true once the cap is reached
    const addHits = (term: string): boolean => {
      for (const h of index.get(term) ?? []) {
        if (expansions.size >= max) return true;
        expansions.add(h);
      }
      return expansions.size >= max;
    };

    // CJK substring matching: for each CJK run, try the full run first,
    // then every shorter substring down to 2 characters
    scan: for (const match of query.matchAll(CJK_RE)) {
      const cjk = match[0];
      for (let len = cjk.length; len >= 2; len--) {
        for (let start = 0; start <= cjk.length - len; start++) {
          if (addHits(cjk.substring(start, start + len))) break scan;
        }
      }
    }

    // EN word matching (the query is lowercased, so EN table keys are
    // expected to be lowercase as well)
    for (const word of query.toLowerCase().match(/[a-z]{2,}/g) ?? []) {
      if (addHits(word)) break;
    }

    if (expansions.size === 0) return [];
    // Return expanded query as single string (BM25 treats as OR)
    return [`${query} ${[...expansions].join(" ")}`];
  };
}
```
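
The index built above is single-hop: a term maps only to entries it shares a table row with. The sketch below isolates that step with a hypothetical one-row table (`buildBidirectionalIndex` is an illustrative name) to show what that means for SC↔TC lookups.

```typescript
// Build a two-way index: each key maps to its values and each value maps
// back to its key. Lookups are single-hop (no transitive closure).
function buildBidirectionalIndex(
  lookup: Record<string, string[]>
): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  const link = (from: string, to: string): void => {
    const bucket = index.get(from) ?? new Set<string>();
    bucket.add(to);
    index.set(from, bucket);
  };
  for (const [key, values] of Object.entries(lookup)) {
    for (const value of values) {
      link(key, value);
      link(value, key);
    }
  }
  return index;
}

// Hypothetical single row keyed by the English term.
const demoIndex = buildBidirectionalIndex({ channel: ["頻道", "频道"] });
console.log(Array.from(demoIndex.get("channel") ?? [])); // ["頻道", "频道"]
console.log(Array.from(demoIndex.get("频道") ?? [])); // ["channel"], not "頻道"
```

With a row shaped like this, an SC query reaches the TC form only if the table also contains a row that links the two directly (for example one derived from OpenCC), since 频道 and 頻道 are connected here only through `channel`.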

#### 4. Plugin config

```jsonc
// openclaw.json — plugin configuration
{
  "memory": {
    "provider": "memory-lancedb-pro",
    "queryExpansion": {
      "enabled": true,                    // default: false (opt-in)
      "mappingTable": "builtin",          // "builtin" ships with plugin (~37KB)
                                          // or absolute path to custom JSON
      "maxExpansions": 20                  // cap expansion terms per query
    },
    // existing queryTranslator config still works alongside
    "queryTranslator": { ... }
  }
}
```
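
One way the plugin might turn this config block into a table path is sketched below. `resolveMappingTablePath`, the `QueryExpansionConfig` interface, and the `builtinPath` argument are assumptions for illustration. The proposal does not specify where the built-in table lives or how the config is parsed.

```typescript
import { isAbsolute } from "node:path";

// Shape of the `queryExpansion` block shown above.
interface QueryExpansionConfig {
  enabled?: boolean;
  mappingTable?: string;
  maxExpansions?: number;
}

// Returns the table path to load, or null when expansion is off.
// `builtinPath` is wherever the plugin ships its table (assumption).
function resolveMappingTablePath(
  cfg: QueryExpansionConfig | undefined,
  builtinPath: string
): string | null {
  if (!cfg || cfg.enabled !== true) return null; // opt-in: disabled by default
  const table = cfg.mappingTable ?? "builtin";
  if (table === "builtin") return builtinPath;
  if (!isAbsolute(table)) {
    throw new Error(
      `queryExpansion.mappingTable must be "builtin" or an absolute path: ${table}`
    );
  }
  return table;
}

console.log(resolveMappingTablePath(undefined, "/plugin/builtin-table.json")); // null
console.log(resolveMappingTablePath({ enabled: true }, "/plugin/builtin-table.json"));
// "/plugin/builtin-table.json"
```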

### 🔄 Adaptive Mapping via Query History (User-Side Enhancement)

A **cron script** mines `memory_recall` query history to build a **personalized** mapping table — only terms the user actually uses, not the full 49K dictionary.

**Pipeline:**

```
Session JSONL → jieba word segmentation → MUSE base table lookup
                                        → Ollama batch translate (unmatched)
                                        → user-mapping.json
```

**Why jieba?** Without word segmentation, CJK strings get split into meaningless 2-char fragments (`道发`, `臟疾`). With jieba: **189 fragments → 54 meaningful words** (−71%), translations needed **150 → 18** (−88%), **zero garbage output**.

| | Without jieba | With jieba |
|---|---|---|
| Terms extracted | 189 | **54** (−71%) |
| Base table hits | 39 | **36** |
| Need translation | 150 | **18** (−88%) |
| Garbage terms | `道发→Dao Fa`, `臟疾→Disease` | **0** ✅ |
| Translation quality | Mixed | **18/18 meaningful** ✅ |
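
The garbage fragments come from sliding a fixed window across word boundaries. A small sketch of that unsegmented extraction, with `bigramFragments` as a hypothetical helper (the actual cron script is Python and uses jieba):

```typescript
// All overlapping 2-character windows of a string, which is what term
// extraction produces when no word segmenter is available.
function bigramFragments(text: string): string[] {
  const chars = Array.from(text); // split by code point
  const out: string[] = [];
  for (let i = 0; i + 2 <= chars.length; i++) {
    out.push(chars.slice(i, i + 2).join(""));
  }
  return out;
}

console.log(bigramFragments("频道发文规则"));
// ["频道", "道发", "发文", "文规", "规则"]
```

Two of the five windows (`道发`, `文规`) straddle word boundaries and carry no meaning on their own, which is the kind of term a segmenter avoids.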

**Cron script concept** (`update-mapping-from-history.py`):

```python
# 1. Scan session JSONL for memory_recall queries
queries = extract_queries("~/.openclaw/agents/main/sessions/")
# → Found 31 queries (16 with CJK, 51%)

# 2. Word segmentation with jieba
terms = extract_cjk_terms(queries, use_jieba=True)
# → 54 meaningful CJK terms

# 3. Look up in MUSE base table (49K EN↔TC↔SC entries)
matched, unmatched = match_terms(terms, base_table)
# → 36 matched, 18 unmatched

# 4. Batch-translate unmatched via Ollama (optional)
translated = translate_batch(unmatched, model="rinex20/translategemma3:12b")
# → 18 new translations, all meaningful

# 5. Merge into user-mapping.json
merge_tables(base_table, {**matched, **translated}, "user-mapping.json")
```

**Schedule:** Daily at 2AM. Each run is incremental — new terms added, existing preserved.
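
The incremental behaviour can be sketched as a merge that only ever adds. The example is in TypeScript for consistency with the plugin code, although the cron script itself is Python. `mergeLookup` and the sample entries are hypothetical.

```typescript
type Lookup = Record<string, string[]>;

// Merge newly found translations into an existing table without removing
// or reordering anything that is already present.
function mergeLookup(existing: Lookup, additions: Lookup): Lookup {
  const merged: Lookup = {};
  for (const [term, values] of Object.entries(existing)) {
    merged[term] = [...values];
  }
  for (const [term, values] of Object.entries(additions)) {
    const current = merged[term] ?? [];
    for (const v of values) {
      if (!current.includes(v)) current.push(v);
    }
    merged[term] = current;
  }
  return merged;
}

const before: Lookup = { "頻道": ["channel"] };
const tonight: Lookup = { "頻道": ["channel", "频道"], "電瓶": ["battery"] };
console.log(mergeLookup(before, tonight));
// { "頻道": ["channel", "频道"], "電瓶": ["battery"] }
```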

### 📊 Performance Comparison — All Approaches Tested

| Approach | BM25 Hits | Latency | Dependencies | Migration | False Positives | Cost |
|----------|-----------|---------|-------------|-----------|-----------|------|
| Baseline (simple) | 8/24 | 0ms | none | — | low | $0 |
| **⭐ Static mapping table** | **17/24** | **~0ms** | **none** | **none** | **low** | **$0** |
| Write-time enrichment | ~18/24 | 0ms read | LLM at write | re-index all | low | LLM per write |
| LLM query translation | ~16/24 | +2-4s | Ollama 7GB | none | low | LLM per query |
| ngram(2,3) tokenizer | ~20/24 | 0ms | upstream fix | re-index | ⚠️ medium | $0 |
| **Mapping + ngram** | **~22-24** | **~0ms** | **upstream** | **re-index** | **medium** | **$0** |

### ⚡ Why This Over Other Approaches?

| | Mapping Table | Existing `QueryTranslatorFn` |
|---|---|---|
| Latency | **~0ms** (hash lookup) | +2-4s (LLM inference) |
| Dependencies | **None** (JSON file) | Ollama + 7GB model |
| Offline | **Yes** | Needs running Ollama |
| Deterministic | **Yes** | No (LLM varies) |
| Cost per query | **$0** | GPU time |
| Coexistence | ✅ Can use both | ✅ Can use both |

The two are **complementary**: mapping table handles known terms instantly, `QueryTranslatorFn` handles novel/creative queries via LLM. Users can enable either or both.

### ⚠️ Known Limitations

1. **Cannot fix same-script CJK matching** (TC→TC, SC→SC) — `simple` tokenizer still treats `伺服器備份設定` as one token. Needs ngram tokenizer ([lancedb#1315](https://github.com/lancedb/lancedb/issues/1315), [lancedb#2329](https://github.com/lancedb/lancedb/issues/2329)).
2. **Static vocabulary** — new terms need table updates (mitigated by cron script).
3. **Expansion can dilute BM25 ranking** — more tokens = lower per-token weight. Mitigated by `maxExpansions` cap (default 20).

**Mapping table and ngram solve orthogonal problems:**
- Mapping table → **cross-script** (SC↔TC, EN↔CJK)
- ngram → **same-script CJK** partial matching
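
The same-script limitation can be illustrated with a short sketch. It assumes, as described in limitation 1, that `simple` indexes the whole CJK run as one token. `charNgrams` is a hypothetical helper and not the LanceDB ngram tokenizer itself.

```typescript
// Character n-grams of a string for each n in [minN, maxN].
function charNgrams(text: string, minN: number, maxN: number): string[] {
  const chars = Array.from(text);
  const grams: string[] = [];
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      grams.push(chars.slice(i, i + n).join(""));
    }
  }
  return grams;
}

const memoryToken: string = "伺服器備份設定"; // indexed as a single token by `simple`
const partialQuery: string = "備份";

console.log(memoryToken === partialQuery); // false: whole-token match fails
console.log(charNgrams(memoryToken, 2, 3).includes(partialQuery)); // true: a bigram matches
```

A mapping table cannot help here, because the missing piece is sub-token matching within one script rather than a translation between scripts.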

### 📝 Reproduction

| Resource | URL |
|----------|-----|
| Full reproduction repo | https://github.com/andychu666/memory-lancedb-pro-bm25-cjk-repro |
| Mapping table data | `mapping-tables/` in repro repo |
| Benchmark scripts | `repro-mapping-table.sh`, `benchmark-ab-test.py` |
| Cron script | `update-mapping-from-history.py` |
| Live A/B test | `benchmark-ab-test.py` (10 real memories, 8 query pairs) |
| MUSE dictionaries | https://github.com/facebookresearch/MUSE |

### 🔗 Related

- [#292](https://github.com/CortexReach/memory-lancedb-pro/pull/292) — `expandQuery()` with hardcoded synonyms (this proposal generalizes it)
- [#112](https://github.com/CortexReach/memory-lancedb-pro/pull/112) — Original `expandQuery()` PR (merged to `main`)
- [#271](https://github.com/CortexReach/memory-lancedb-pro/issues/271) — CJK BM25 cross-script issue (LLM translation approach)
- [lancedb#1315](https://github.com/lancedb/lancedb/issues/1315) — Enable choosing tokenizer in LanceDB FTS
- [lancedb#2329](https://github.com/lancedb/lancedb/issues/2329) — Support Chinese/CJK text search in BM25

### 🎯 Summary

| Metric | Value |
|--------|-------|
| Affected queries | **51%** of bilingual user queries |
| BM25 improvement | **+83% to +112%** (verified) |
| Implementation size | ~150 LOC (new expander) + JSON table |
| Query latency impact | **~0ms** |
| External dependencies | **None** |
| Migration cost | **None** (works on existing memories) |
| Coexists with QueryTranslatorFn | **Yes** |

**Plugin version:** 1.1.0-beta.8  
**Test date:** 2026-03-21  
**Tester:** [@andychu666](https://github.com/andychu666)

