
feat: configurable mapping table for cross-script BM25 query expansion (generalizes #292) #297

@andychu666

🚀 Static Mapping Table for Cross-Script BM25 Query Expansion: Rescue 51% of Bilingual Queries (Verified)

🎯 Problem

The simple FTS tokenizer produces BM25=0 for cross-script queries (EN↔CJK, SC↔TC). For bilingual users, whose memories are mostly English but whose queries are often in Chinese, 51% of real queries lose the BM25 signal entirely, causing a 29-50% fused-score penalty in hybrid retrieval.

Real-world scenario (verified with 314 live memories, 33 real queries):

Memory stored:  "Kidney disease sodium intake: limit daily sodium to 2000mg"  (EN)
Query:          "腎臟疾病鈉攝取"                                                (TC)
BM25 result:    0.00 ❌ (no token overlap between scripts)

Query distribution from actual memory_recall usage:

| Script | Count | % | BM25 Status |
| --- | --- | --- | --- |
| EN only | 15 | 48% | ✅ Works (EN memories match) |
| CJK only | 11 | 35% | ❌ BM25=0 against EN memories |
| Mixed CJK+EN | 5 | 16% | ⚠️ Partial (only EN tokens match) |
| SC→TC cross-script | 2 | 6% | ❌ BM25=0 against TC memories |

Result: 51% of real queries lose BM25 signal due to script mismatch.
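
A tally like the one above could be produced with a small script classifier. The following is a minimal sketch, not the script actually used for these numbers: `classifyScript` and `tallyScripts` are hypothetical names, and the CJK ranges are the same ones used by `CJK_RE` in the implementation section. A regex alone cannot tell SC from TC, so the SC→TC row would need a conversion table such as OpenCC.

```typescript
type ScriptClass = "en-only" | "cjk-only" | "mixed" | "other";

// Same CJK ranges as CJK_RE in the proposed query-expander.ts.
const HAS_CJK = /[\u4e00-\u9fff\u3400-\u4dbf]/;
const HAS_LATIN = /[A-Za-z]/;

// Classify a query by which scripts it contains (hypothetical helper).
function classifyScript(query: string): ScriptClass {
  const cjk = HAS_CJK.test(query);
  const latin = HAS_LATIN.test(query);
  if (cjk && latin) return "mixed";
  if (cjk) return "cjk-only";
  if (latin) return "en-only";
  return "other";
}

// Count how many queries fall into each class.
function tallyScripts(queries: string[]): Record<ScriptClass, number> {
  const counts: Record<ScriptClass, number> = {
    "en-only": 0,
    "cjk-only": 0,
    mixed: 0,
    other: 0,
  };
  for (const q of queries) {
    counts[classifyScript(q)] += 1;
  }
  return counts;
}
```

Queries in the "cjk-only" and "mixed" classes are the ones that lose some or all BM25 signal against English memories.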

📊 Verified Test Results

A. 24-query controlled benchmark (16 bilingual TC+EN memories):

| Direction | Baseline | With Mapping Table | Δ |
| --- | --- | --- | --- |
| EN→TC | 8/8 | 8/8 | no change |
| SC→TC | 0/8 | 7/8 | +7 rescued ✅ |
| TC→TC | 0/8 | 2/8 | +2 (limited by tokenizer) |
| Total BM25 hits | 8/24 | 17/24 | +112% |

B. Live A/B test (10 real memories, 8 EN↔CJK query pairs from session history):

| | Baseline | With Mapping Table | Δ |
| --- | --- | --- | --- |
| BM25 > 0 | 6/16 | 11/16 | +83% |
| Queries rescued | n/a | 5 | from 0.000 to 0.09-0.40 |
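
The Δ percentages in both benchmarks are relative gains over the baseline hit count. A minimal sketch of that arithmetic (`relativeGainPercent` is a hypothetical helper, not part of the plugin):

```typescript
// Relative improvement of improvedHits over baselineHits, in percent.
function relativeGainPercent(baselineHits: number, improvedHits: number): number {
  if (baselineHits <= 0) {
    throw new RangeError("baselineHits must be positive");
  }
  return ((improvedHits - baselineHits) / baselineHits) * 100;
}

const controlledGain = relativeGainPercent(8, 17); // benchmark A: 8/24 -> 17/24
const liveGain = relativeGainPercent(6, 11);       // benchmark B: 6/16 -> 11/16
```

For benchmark A this gives 112.5, which the table reports as +112%. For benchmark B it gives roughly 83.3, reported as +83%.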

The 5 rescued queries:

| Query (CJK) | BM25 before | BM25 after | Matched Memory |
| --- | --- | --- | --- |
| 频道发文规则 (SC) | 0.000 | 0.111 | Discord 頻道發文規則… (TC) |
| 每日新聞摘要 | 0.000 | 0.286 | Daily news digest cron… (EN) |
| 腎臟疾病鈉攝取 | 0.000 | 0.364 | Kidney disease sodium… (EN) |
| 汽車電瓶換電瓶 | 0.000 | 0.400 | P2135 throttle body… Battery… (EN) |
| 使用者的程式風格偏好 | 0.000 | 0.095 | User prefers interactive… (EN) |

💡 Proposed Solution: Generalize expandQuery() with Configurable Mapping Table

PR #112 / #292 proved that query expansion works: expandQuery() with a hardcoded SYNONYM_MAP of ~15 colloquial Chinese synonyms (挂了→crash, 踩坑→troubleshoot).

This proposal generalizes that approach by replacing the hardcoded SYNONYM_MAP with a configurable JSON mapping table (49K+ entries from MUSE + OpenCC). The table targets the cross-script BM25=0 problem, a larger issue that affects 51% of bilingual queries.

The existing expandQuery() handles same-language colloquial→technical synonyms. This extends it to cross-script expansion (EN↔TC↔SC) using a static bilingual dictionary. The original query is preserved for embedding, so the embedding is not diluted.

Query:    "频道发文规则" (SC)
                ↓ QueryExpanderFn (static lookup, ~0ms)
Expanded: "频道发文规则 channel posting rules 頻道 發文 規則"
                ↓
BM25:     matches "channel posting rules" (EN) ✅
BM25:     matches "頻道發文規則" (TC) ✅
Embedding: uses original "频道发文规则" (no dilution) ✅

🔧 Implementation

1. New QueryExpanderFn type

// retriever.ts: new type, coexists with existing QueryTranslatorFn
export type QueryExpanderFn = (query: string) => string[];

export interface MemoryRetrieverConfig {
  // ... existing fields ...
  queryTranslator?: QueryTranslatorFn;   // existing: async LLM translation
  queryExpander?: QueryExpanderFn;        // NEW: sync static expansion
}

2. Integration in fusedSearch()

// retriever.ts fusedSearch(): add before BM25 search
async fusedSearch(query, limit, scopeFilter, category) {
  const queries = [query];

  // Step 1: Static expansion (sync, ~0ms), NEW
  if (this.config.queryExpander) {
    const expanded = this.config.queryExpander(query);
    for (const e of expanded) {
      if (e.trim() && e !== query && !queries.includes(e)) {
        queries.push(e);
      }
    }
  }

  // Step 2: LLM translation (async, existing)
  if (this.config.queryTranslator) {
    const translated = await this.config.queryTranslator(query);
    for (const t of translated) {
      if (t.trim() && t !== query && !queries.includes(t)) {
        queries.push(t);
      }
    }
  }

  // Run BM25 for all variants (existing parallel logic)
  const allResults = await Promise.all(
    queries.map(q => this.store.bm25Search(q, limit, scopeFilter))
  );
  // ... existing dedup + fusion logic ...
}
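
Steps 1 and 2 apply the same filter (non-empty, not already present) to two candidate lists. If that duplication is unwanted, the filter could be pulled into one helper. This is a sketch only; `collectVariants` is a hypothetical name and is not part of the existing retriever:

```typescript
// Merge candidate query strings into one ordered, de-duplicated list.
// The original query always comes first; blank candidates are dropped.
function collectVariants(original: string, ...candidateLists: string[][]): string[] {
  const variants: string[] = [original];
  for (const list of candidateLists) {
    for (const candidate of list) {
      if (candidate.trim() && !variants.includes(candidate)) {
        variants.push(candidate);
      }
    }
  }
  return variants;
}

// Example: static expansions first, then LLM translations.
const variants = collectVariants(
  "频道发文规则",
  ["频道发文规则 channel posting rules"],
  ["channel posting rules", "  "],
);
// variants holds the original plus the two non-blank, distinct candidates.
```

Inside fusedSearch() the two loops would then collapse to a single call that passes the expander output and the awaited translator output.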

3. Static mapping table expander

// query-expander.ts (new file)
import * as fs from "node:fs";
import type { QueryExpanderFn } from "./retriever.js";

const CJK_RE = /[\u4e00-\u9fff\u3400-\u4dbf]+/g;

interface MappingTable {
  _meta?: Record<string, unknown>;
  lookup: Record<string, string[]>;
}

export function createMappingTableExpander(
  tablePath: string,
  options?: { maxExpansions?: number }
): QueryExpanderFn {
  const max = options?.maxExpansions ?? 20;
  const raw = JSON.parse(fs.readFileSync(tablePath, "utf-8")) as MappingTable;
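  // Accept both { lookup: {...} } and a bare { term: [...] } object
  const lookup = raw.lookup ?? (raw as unknown as Record<string, string[]>);

  // Build a bidirectional index (key -> values and value -> key) for O(1) lookup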
  const index = new Map<string, Set<string>>();
  for (const [key, values] of Object.entries(lookup)) {
    if (key === "_meta") continue;
    if (!index.has(key)) index.set(key, new Set());
    for (const v of values) {
      index.get(key)!.add(v);
      if (!index.has(v)) index.set(v, new Set());
      index.get(v)!.add(key);
    }
  }

  return (query: string): string[] => {
    const expansions = new Set<string>();

    // CJK substring matching
    for (const match of query.matchAll(CJK_RE)) {
      const cjk = match[0];
      // Try the full CJK run first, then progressively shorter substrings down to 2 chars
      for (let len = cjk.length; len >= 2; len--) {
        for (let start = 0; start <= cjk.length - len; start++) {
          const sub = cjk.substring(start, start + len);
          const hits = index.get(sub);
          if (hits) {
            for (const h of hits) {
              if (expansions.size >= max) break;
              expansions.add(h);
            }
          }
        }
      }
    }

    // EN word matching
    for (const word of query.toLowerCase().match(/[a-z]{2,}/g) ?? []) {
      const hits = index.get(word);
      if (hits) {
        for (const h of hits) {
          if (expansions.size >= max) break;
          expansions.add(h);
        }
      }
    }

    if (expansions.size === 0) return [];
    // Return expanded query as single string (BM25 treats as OR)
    return [`${query} ${[...expansions].join(" ")}`];
  };
}
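
To make the lookup behaviour concrete, here is a self-contained, in-memory sketch of the same indexing and substring scan. The two-entry table is hypothetical, and the helper names `buildBidirectionalIndex` and `expandCjkRun` do not exist in the plugin. The index is one hop deep: an SC term reaches its TC form only if the table links them directly. Whether the shipped table contains such direct SC→TC entries is an assumption here.

```typescript
// Hypothetical sample table; the real MUSE/OpenCC-derived table is far larger.
const sampleLookup: Record<string, string[]> = {
  channel: ["頻道", "频道"], // EN -> TC, SC
  "频道": ["頻道"],           // SC -> TC (direct link, assumed to exist)
};

// Index every key -> value and every value -> key.
function buildBidirectionalIndex(
  lookup: Record<string, string[]>,
): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  const link = (from: string, to: string): void => {
    const bucket = index.get(from) ?? new Set<string>();
    bucket.add(to);
    index.set(from, bucket);
  };
  for (const [key, values] of Object.entries(lookup)) {
    for (const value of values) {
      link(key, value);
      link(value, key);
    }
  }
  return index;
}

// Scan all substrings of a CJK run, longest first, down to 2 characters.
function expandCjkRun(run: string, index: Map<string, Set<string>>): string[] {
  const found: string[] = [];
  for (let len = run.length; len >= 2; len--) {
    for (let start = 0; start + len <= run.length; start++) {
      const hits = index.get(run.substring(start, start + len));
      if (!hits) continue;
      hits.forEach((h) => {
        if (!found.includes(h)) found.push(h);
      });
    }
  }
  return found;
}

const sampleIndex = buildBidirectionalIndex(sampleLookup);
const sampleExpansions = expandCjkRun("频道发文规则", sampleIndex);
// sampleExpansions contains "channel" and "頻道"; 发文 and 规则 are not in the sample table.
```

Without the direct 频道→頻道 entry, the SC query would expand to "channel" only, and the TC memory would still score BM25=0.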

4. Plugin config

// openclaw.json: plugin configuration
{
  "memory": {
    "provider": "memory-lancedb-pro",
    "queryExpansion": {
      "enabled": true,                    // default: false (opt-in)
      "mappingTable": "builtin",          // "builtin" ships with plugin (~37KB)
                                          // or absolute path to custom JSON
      "maxExpansions": 20                  // cap expansion terms per query
    },
    // existing queryTranslator config still works alongside
    "queryTranslator": { ... }
  }
}

🔄 Adaptive Mapping via Query History (User-Side Enhancement)

A cron script mines memory_recall query history to build a personalized mapping table containing only the terms the user actually uses, not the full 49K dictionary.

Pipeline:

Session JSONL → jieba word segmentation → MUSE base table lookup
                                        → Ollama batch translate (unmatched)
                                        → user-mapping.json

Why jieba? Without word segmentation, CJK strings get split into meaningless 2-char fragments (道发, 臟疾). With jieba: 189 fragments → 54 meaningful words (−71%), translations needed 150 → 18 (−88%), zero garbage output.

| | Without jieba | With jieba |
| --- | --- | --- |
| Terms extracted | 189 | 54 (−71%) |
| Base table hits | 39 | 36 |
| Need translation | 150 | 18 (−88%) |
| Garbage terms | 道发→Dao Fa, 臟疾→Disease | 0 ✅ |
| Translation quality | Mixed | 18/18 meaningful ✅ |
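
The garbage fragments come from sliding a fixed 2-character window across word boundaries. The sketch below assumes the no-segmentation extraction works that way (`cjkBigrams` is a hypothetical helper) and assumes the word split 频道 / 发文 / 规则 implied by the expansion example earlier:

```typescript
// Every overlapping 2-character window of a CJK run.
function cjkBigrams(run: string): string[] {
  const grams: string[] = [];
  for (let i = 0; i + 2 <= run.length; i++) {
    grams.push(run.substring(i, i + 2));
  }
  return grams;
}

const grams = cjkBigrams("频道发文规则");
// → ["频道", "道发", "发文", "文规", "规则"]

// Windows that straddle a word boundary are not real words.
const words = ["频道", "发文", "规则"]; // assumed segmentation
const garbage = grams.filter((g) => !words.includes(g));
// → ["道发", "文规"]
```

Two of the five windows are not words, and each one would otherwise be sent for translation. A segmenter such as jieba emits only the real words.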

Cron script concept (update-mapping-from-history.py):

# 1. Scan session JSONL for memory_recall queries
queries = extract_queries("~/.openclaw/agents/main/sessions/")
# → Found 31 queries (16 with CJK, 51%)

# 2. Word segmentation with jieba
terms = extract_cjk_terms(queries, use_jieba=True)
# → 54 meaningful CJK terms

# 3. Look up in MUSE base table (49K EN↔TC↔SC entries)
matched, unmatched = match_terms(terms, base_table)
# → 36 matched, 18 unmatched

# 4. Batch-translate unmatched via Ollama (optional)
translated = translate_batch(unmatched, model="rinex20/translategemma3:12b")
# → 18 new translations, all meaningful

# 5. Merge into user-mapping.json
merge_tables(base_table, {**matched, **translated}, "user-mapping.json")

Schedule: Daily at 2AM. Each run is incremental: new terms are added and existing entries are preserved.
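
An incremental merge of this kind can be sketched as follows. `mergeLookup` is a hypothetical helper, written in TypeScript to match the plugin code rather than the Python cron script, and it assumes the `{ term: [translations] }` shape used by the `lookup` field of the mapping table:

```typescript
type Lookup = Record<string, string[]>;

// Return a new table containing everything in `existing`, plus any new
// terms or new translations from `incoming`. Nothing is removed.
function mergeLookup(existing: Lookup, incoming: Lookup): Lookup {
  const merged: Lookup = {};
  for (const [term, translations] of Object.entries(existing)) {
    merged[term] = translations.slice(); // copy so the input stays untouched
  }
  for (const [term, translations] of Object.entries(incoming)) {
    const current = merged[term] ?? [];
    for (const t of translations) {
      if (!current.includes(t)) current.push(t);
    }
    merged[term] = current;
  }
  return merged;
}
```

Running the merge twice with the same input leaves the table unchanged, which is what keeps a daily schedule safe to repeat.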

📊 Performance Comparison: All Approaches Tested

| Approach | BM25 Hits | Latency | Dependencies | Migration | False Pos | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (simple) | 8/24 | 0ms | none | n/a | low | $0 |
| ⭐ Static mapping table | 17/24 | ~0ms | none | none | low | $0 |
| Write-time enrichment | ~18/24 | 0ms read | LLM at write | re-index all | low | LLM per write |
| LLM query translation | ~16/24 | +2-4s | Ollama 7GB | none | low | LLM per query |
| ngram(2,3) tokenizer | ~20/24 | 0ms | upstream fix | re-index | ⚠️ medium | $0 |
| Mapping + ngram | ~22-24 | ~0ms | upstream | re-index | medium | $0 |

⚡ Why This Over Other Approaches?

| | Mapping Table | Existing QueryTranslatorFn |
| --- | --- | --- |
| Latency | ~0ms (hash lookup) | +2-4s (LLM inference) |
| Dependencies | None (JSON file) | Ollama + 7GB model |
| Offline | Yes | Needs running Ollama |
| Deterministic | Yes | No (LLM varies) |
| Cost per query | $0 | GPU time |
| Coexistence | ✅ Can use both | ✅ Can use both |

The two are complementary: mapping table handles known terms instantly, QueryTranslatorFn handles novel/creative queries via LLM. Users can enable either or both.

⚠️ Known Limitations

  1. Cannot fix same-script CJK matching (TC→TC, SC→SC): the simple tokenizer still treats 伺服器備份設定 as one token. This needs an ngram tokenizer (lancedb#1315, lancedb#2329).
  2. Static vocabulary: new terms need table updates (mitigated by the cron script).
  3. Expansion can dilute BM25 ranking: more tokens = lower per-token weight. Mitigated by the maxExpansions cap (default 20).

Mapping table and ngram solve orthogonal problems:

  • Mapping table → cross-script (SC↔TC, EN↔CJK)
  • ngram → same-script CJK partial matching

Reproduction

| Resource | URL |
| --- | --- |
| Full reproduction repo | https://github.com/andychu666/memory-lancedb-pro-bm25-cjk-repro |
| Mapping table data | mapping-tables/ in repro repo |
| Benchmark scripts | repro-mapping-table.sh, benchmark-ab-test.py |
| Cron script | update-mapping-from-history.py |
| Live A/B test | benchmark-ab-test.py (10 real memories, 8 query pairs) |
| MUSE dictionaries | https://github.com/facebookresearch/MUSE |

🔗 Related

  • #292: expandQuery() with hardcoded synonyms (this proposal generalizes it)
  • #112: Original expandQuery() PR (merged to main)
  • #271: CJK BM25 cross-script issue (LLM translation approach)
  • lancedb#1315: Enable choosing tokenizer in LanceDB FTS
  • lancedb#2329: Support Chinese/CJK text search in BM25

🎯 Summary

| Metric | Value |
| --- | --- |
| Affected queries | 51% of bilingual user queries |
| BM25 improvement | +83% to +112% (verified) |
| Implementation size | ~150 LOC (new expander) + JSON table |
| Query latency impact | ~0ms |
| External dependencies | None |
| Migration cost | None (works on existing memories) |
| Coexists with QueryTranslatorFn | Yes |

Plugin version: 1.1.0-beta.8
Test date: 2026-03-21
Tester: @andychu666
