diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..881331c
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,9 @@
+# Changelog
+
+## Unreleased
+
+- Add support for Office files, source code, and config files.
+- Allow ingesting directories with the ingest_file tool.
+- Add custom parser config support and a sample parser.
+- Add a companion CLI for bulk ingest to avoid MCP tool timeouts.
+- Skip common dependency/build directories by default during directory scans (MCP + CLI).
diff --git a/README.md b/README.md
index 81042aa..855e5e2 100644
--- a/README.md
+++ b/README.md
@@ -25,9 +25,13 @@ Semantic search with keyword boost for exact technical terms — fully private,
 - **Zero-friction setup**
   One `npx` command. No Docker, no Python, no servers to manage. Designed for Cursor, Codex, and Claude Code via MCP.
 
+- **More file formats**
+  Supports Office files (DOCX, PPTX, XLSX/XLS), source code, and common config files (JSON, YAML, INI, TOML).
+
 ## Quick Start
 
 Set `BASE_DIR` to the folder you want to search. Documents must live under it.
+To index Downloads, Documents, and Desktop, set `BASE_DIR` to your home folder.
 
 Add the MCP server to your AI coding tool:
 
@@ -88,16 +92,105 @@ You want AI to search your documents—technical specs, research papers, interna
 
 ## Usage
 
-The server provides 6 MCP tools: ingest file, ingest data, search, list, delete, status
+The server provides 6 MCP tools: `ingest_file` (files and directories), `ingest_data`, `query_documents`, `list_files`, `delete_file`, and `status`.
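For context, the server is registered through the MCP client's config file. A hedged sketch of a typical entry (the top-level key, here `mcpServers`, and the file location vary by client; `/Users/me` is a placeholder):

```json
{
  "mcpServers": {
    "local-rag": {
      "command": "npx",
      "args": ["mcp-local-rag"],
      "env": {
        "BASE_DIR": "/Users/me"
      }
    }
  }
}
```

Optional variables such as `MCP_LOCAL_RAG_PARSERS` and `RAG_MAX_DISTANCE` go in the same `env` block.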
+### Bulk Ingest (CLI) + +For large collections (tens of thousands of files), use the companion CLI to avoid MCP timeouts: + +``` +npx mcp-local-rag ingest --path /Users/me/Desktop +``` + +Common options: + +``` +npx mcp-local-rag ingest --path /Users/me/Desktop --extensions .pdf,.md +npx mcp-local-rag ingest --path /Users/me/Desktop --exclude node_modules,dist +npx mcp-local-rag ingest --path /Users/me/Desktop --no-recursive --dry-run +``` + +The CLI reuses the same parser/chunker/embedder pipeline as the MCP server, but runs directly (no tool timeout). +By default it skips files already indexed; use `--force` to re-ingest and replace existing chunks. +Directory scans also skip common dependency/build folders by default (applies to MCP and CLI): +`.git`, `node_modules`, `dist`, `build`, `out`, `.next`, `.nuxt`, `.svelte-kit`, `target`, `.gradle`, +`.mvn`, `bin`, `obj`, `.vs`, `__pycache__`, `.venv`/`venv`, `coverage`, `vendor`, `.cache`. +Use `--exclude` to add more; you can still ingest a specific file inside excluded folders by passing its path. +Custom parsers work here too: set `MCP_LOCAL_RAG_PARSERS` or pass `--parsers` to the CLI. + ### Ingesting Documents ``` "Ingest the document at /Users/me/docs/api-spec.pdf" ``` -Supports PDF, DOCX, TXT, and Markdown. The server extracts text, splits it into chunks, generates embeddings locally, and stores everything in a local vector database. +Supports PDF, DOCX, PPTX, XLSX/XLS, TXT, Markdown, source code, and common config files (JSON, YAML, INI, TOML). The server extracts text, splits it into chunks, generates embeddings locally, and stores everything in a local vector database. + +You can also ingest a full directory: + +``` +"Ingest /Users/me/Downloads" +``` + +For large folders, you can limit to specific extensions (for example: `.md`, `.ts`) and control recursion. +Directory scans skip common dependency/build folders by default (see above). + +### Custom Parsers + +For new file types, add a custom parser. 
Create a JSON file anywhere and set
+`MCP_LOCAL_RAG_PARSERS` to the **absolute path** in your MCP client config.
+If you run from source, you can also use `config/file_parsers.json` from the repo root.
+
+Example (place this file somewhere you control, e.g. `~/.config/mcp-local-rag/file_parsers.json`):
+
+```json
+{
+  ".note": {
+    "module": "/Users/me/.config/mcp-local-rag/parsers/note-parser.mjs",
+    "export": "parseFile"
+  }
+}
+```
+
+Add `MCP_LOCAL_RAG_PARSERS=/Users/me/.config/mcp-local-rag/file_parsers.json`
+alongside `BASE_DIR` in your MCP client config.
+
+Minimal parser example (ESM):
+
+```js
+// /Users/me/.config/mcp-local-rag/parsers/note-parser.mjs
+import { readFile } from 'node:fs/promises'
+
+export async function parseFile(filePath) {
+  const raw = await readFile(filePath, 'utf-8')
+  return raw
+}
+```
+
+CommonJS is fine too:
+
+```js
+// /Users/me/.config/mcp-local-rag/parsers/note-parser.cjs
+const { readFile } = require('node:fs/promises')
+
+async function parseFile(filePath) {
+  const raw = await readFile(filePath, 'utf-8')
+  return raw
+}
+
+module.exports = { parseFile }
+```
+
+Notes:
+- The parser must return a string (the extracted text).
+- Use absolute paths in `file_parsers.json` to avoid working-directory issues.
+- If you need extra npm dependencies, install them alongside the parser or bundle it.
+- Restart the MCP server after changing the parser config.
+- If a custom parser fails to load, the server returns a clear error with fix hints (for example, missing dependency and install command).
+
+You can also use the included sample parser at `src/parser/custom/sample-note-parser.ts`
+(build it to JS and reference the compiled file).
+
 Re-ingesting the same file replaces the old version automatically.
@@ -174,7 +267,7 @@ Keyword boost is applied *after* semantic filtering, so it improves precision wi
 
 ### Details
 
-When you ingest a document, the parser extracts text based on file type (PDF via `pdfjs-dist`, DOCX via `mammoth`, text files directly).
+When you ingest a document, the parser extracts text based on file type (PDF via `pdfjs-dist`, DOCX via `mammoth`, PPTX via XML extraction, XLSX via `xlsx`, text/code/config files directly).
 
 The semantic chunker splits text into sentences, then groups them using embedding similarity. It finds natural topic boundaries where the meaning shifts—keeping related content together instead of cutting at arbitrary character limits. This produces chunks that are coherent units of meaning, typically 500-1000 characters. Markdown code blocks are kept intact—never split mid-block—preserving copy-pastable code in search results.
 
@@ -356,7 +449,7 @@ Yes, after the first model download (~90MB).
 
 Cloud services offer better accuracy at scale but require sending data externally. This trades some accuracy for complete privacy and zero runtime cost.
 
 **What file formats are supported?**
-PDF, DOCX, TXT, Markdown, and HTML (via `ingest_data`). Not yet: Excel, PowerPoint, images.
+PDF, DOCX, PPTX, XLSX/XLS, TXT, Markdown, source code, JSON/YAML/TOML/INI, and HTML (via `ingest_data`). Not yet: images.
 
 **Can I change the embedding model?**
 Yes, but you must delete your database and re-ingest all documents. Different models produce incompatible vector dimensions.
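That last point can be made concrete: similarity scoring is only defined for vectors of equal length, so chunks embedded at 384 dimensions can never be compared against queries embedded at, say, 768 dimensions. A minimal sketch (illustrative only, not the project's search code):

```javascript
// Cosine similarity, the core scoring operation in vector search.
// It requires vectors of the same dimension, which is why switching
// embedding models means re-ingesting every document.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

const miniLmVec = new Array(384).fill(0.1) // e.g. all-MiniLM-L6-v2 output size
const otherVec = new Array(768).fill(0.1)  // e.g. a 768-dimension model's output

console.log(cosineSimilarity(miniLmVec, miniLmVec)) // ≈ 1.0 for identical vectors
// cosineSimilarity(miniLmVec, otherVec) throws: Dimension mismatch: 384 vs 768
```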
@@ -405,7 +498,7 @@ pnpm run check:all # Full quality check src/ index.ts # Entry point server/ # MCP tool handlers - parser/ # PDF, DOCX, TXT, MD parsing + parser/ # PDF, DOCX, PPTX, XLSX, and text parsing chunker/ # Text splitting embedder/ # Transformers.js embeddings vectordb/ # LanceDB operations diff --git a/config/file_parsers.json b/config/file_parsers.json new file mode 100644 index 0000000..0967ef4 --- /dev/null +++ b/config/file_parsers.json @@ -0,0 +1 @@ +{} diff --git a/package.json b/package.json index 3303bda..a584f8e 100644 --- a/package.json +++ b/package.json @@ -66,10 +66,12 @@ "@lancedb/lancedb": "^0.22.2", "@modelcontextprotocol/sdk": "^1.25.1", "@mozilla/readability": "0.6.0", + "jszip": "^3.10.1", "jsdom": "^27.4.0", "mammoth": "^1.11.0", "pdfjs-dist": "^5.4.530", - "turndown": "7.2.2" + "turndown": "7.2.2", + "xlsx": "^0.18.5" }, "devDependencies": { "@biomejs/biome": "^1.9.4", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 6ce8824..64ed669 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -23,6 +23,9 @@ importers: jsdom: specifier: ^27.4.0 version: 27.4.0 + jszip: + specifier: ^3.10.1 + version: 3.10.1 mammoth: specifier: ^1.11.0 version: 1.11.0 @@ -32,6 +35,9 @@ importers: turndown: specifier: 7.2.2 version: 7.2.2 + xlsx: + specifier: ^0.18.5 + version: 0.18.5 devDependencies: '@biomejs/biome': specifier: ^1.9.4 @@ -979,6 +985,10 @@ packages: engines: {node: '>=0.4.0'} hasBin: true + adler-32@1.3.1: + resolution: {integrity: sha512-ynZ4w/nUUv5rrsR8UUGoe1VC9hZj6V5hU9Qw1HlMDJGEJw5S7TfTErWTjMys6M7vr0YWcPqs3qAr4ss0nDfP+A==} + engines: {node: '>=0.8'} + agent-base@7.1.4: resolution: {integrity: sha512-MnA+YT8fwfJPgBx3m60MNqakm30XOkyIoH1y6huTQvC0PwZG7ki8NacLBcrPbNoo8vEZy7Jpuk7+jMO+CUovTQ==} engines: {node: '>= 14'} @@ -1114,6 +1124,10 @@ packages: resolution: {integrity: sha512-P8BjAsXvZS+VIDUI11hHCQEv74YT67YUi5JJFNWIqL235sBmjX4+qx9Muvls5ivyNENctx46xQLQ3aTuE7ssaQ==} engines: {node: '>=6'} + cfb@1.2.2: + resolution: {integrity: 
sha512-KfdUZsSOw19/ObEWasvBP/Ac4reZvAGauZhs6S/gqNhXhI7cKwvlH7ulj+dOEYnca4bm4SGo8C1bTAQvnTjgQA==} + engines: {node: '>=0.8'} + chai@5.3.3: resolution: {integrity: sha512-4zNhdJD/iOjSH0A05ea+Ke6MU5mmpQcbQsSOkgdaUMJ9zTlDTD/GYlwohmIE2u0gaxHYiVHEn1Fw9mZ/ktJWgw==} engines: {node: '>=18'} @@ -1161,6 +1175,10 @@ packages: code-block-writer@11.0.3: resolution: {integrity: sha512-NiujjUFB4SwScJq2bwbYUtXbZhBSlY6vYzm++3Q6oC+U+injTqfPYFK8wS9COOmb2lueqp0ZRB4nK1VYeHgNyw==} + codepage@1.15.0: + resolution: {integrity: sha512-3g6NUTPd/YtuuGrhMnOMRjFc+LJw/bnMp3+0r/Wcz3IXUuCosKRJvMphm5+Q+bvTVGcJJuRvVLuYba+WojaFaA==} + engines: {node: '>=0.8'} + color-convert@2.0.1: resolution: {integrity: sha512-RRECPsj7iu/xb5oKYcsFHSppFNnsj/52OVTRKb4zP5onXwVF3zVmmToNcOfGC+CRDpfK/U584fMg38ZHCaElKQ==} engines: {node: '>=7.0.0'} @@ -1232,6 +1250,11 @@ packages: resolution: {integrity: sha512-AdmX6xUzdNASswsFtmwSt7Vj8po9IuqXm0UXz7QKPuEUmPB4XyjGfaAr2PSuELMwkRMVH1EpIkX5bTZGRB3eCA==} engines: {node: '>=10'} + crc-32@1.2.2: + resolution: {integrity: sha512-ROmzCKrTnOwybPcJApAA6WBWij23HVfGVNKqqrZpuyZOHqK2CwHSvpGuyt/UNNvaIjEd8X5IFGp4Mh+Ie1IHJQ==} + engines: {node: '>=0.8'} + hasBin: true + create-require@1.1.1: resolution: {integrity: sha512-dcKFX3jn0MpIaXjisoRvexIJVEKzaq7z2rZKxf+MSr9TkdmHmsU4m2lcLojrj/FHl8mk5VxMmYA+ftRkP/3oKQ==} @@ -1530,6 +1553,10 @@ packages: resolution: {integrity: sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow==} engines: {node: '>= 0.6'} + frac@1.1.2: + resolution: {integrity: sha512-w/XBfkibaTl3YDqASwfDUqkna4Z2p9cFSr1aHDt0WoMTECnRfBOv2WArlZILlqgWlmdIlALXGpM2AOhEk5W3IA==} + engines: {node: '>=0.8'} + fresh@2.0.0: resolution: {integrity: sha512-Rx/WycZ60HOaqLKAi6cHRKKI7zxWbJ31MhntmtwMoaTeF7XFH9hhBp8vITaMidfljRQ6eYWCKkaTK+ykVJHP2A==} engines: {node: '>= 0.8'} @@ -2326,6 +2353,10 @@ packages: sprintf-js@1.1.3: resolution: {integrity: sha512-Oo+0REFV59/rz3gfJNKQiBlwfHaSESl1pcGyABQsnnIfWOFt6JNj5gCog2U6MLZ//IGYD+nA8nI+mTShREReaA==} + ssf@0.11.2: 
+ resolution: {integrity: sha512-+idbmIXoYET47hH+d7dfm2epdOMUDjqcB4648sTZ+t2JwoyBFL/insLfB/racrDmsKB3diwsDA696pZMieAC5g==} + engines: {node: '>=0.8'} + stackback@0.0.2: resolution: {integrity: sha512-1XMJE5fQo1jGH6Y/7ebnwPOBEkIEnT4QF32d5R1+VXdXveM0IBMJt8zfaxX1P3QhVwrYe+576+jkANtSS2mBbw==} @@ -2659,6 +2690,14 @@ packages: engines: {node: '>=8'} hasBin: true + wmf@1.0.2: + resolution: {integrity: sha512-/p9K7bEh0Dj6WbXg4JG0xvLQmIadrner1bi45VMJTfnbVHsc7yIajZyoSoK60/dtVBs12Fm6WkUI5/3WAVsNMw==} + engines: {node: '>=0.8'} + + word@0.3.0: + resolution: {integrity: sha512-OELeY0Q61OXpdUfTp+oweA/vtLVg5VDOXh+3he3PNzLGG/y0oylSOC1xRVj0+l4vQ3tj/bB1HVHv1ocXkQceFA==} + engines: {node: '>=0.8'} + wordwrapjs@5.1.1: resolution: {integrity: sha512-0yweIbkINJodk27gX9LBGMzyQdBDan3s/dEAiwBOj+Mf0PPyWL6/rikalkv8EeD0E8jm4o5RXEOrFTP3NXbhJg==} engines: {node: '>=12.17'} @@ -2682,6 +2721,11 @@ packages: utf-8-validate: optional: true + xlsx@0.18.5: + resolution: {integrity: sha512-dmg3LCjBPHZnQp5/F/+nnTa+miPJxUXB6vtk42YjBBKayDNagxGEeIdWApkYPOf3Z3pm3k62Knjzp7lMeTEtFQ==} + engines: {node: '>=0.8'} + hasBin: true + xml-name-validator@5.0.0: resolution: {integrity: sha512-EvGK8EJ3DhaHfbRlETOWAS5pO9MZITeauHKJyb8wyajUfQUenkIg2MvLDTZ4T/TgIcm3HU0TFBgWWboAZ30UHg==} engines: {node: '>=18'} @@ -3442,6 +3486,8 @@ snapshots: acorn@8.15.0: {} + adler-32@1.3.1: {} + agent-base@7.1.4: {} ajv-formats@3.0.1(ajv@8.17.1): @@ -3574,6 +3620,11 @@ snapshots: callsites@3.1.0: {} + cfb@1.2.2: + dependencies: + adler-32: 1.3.1 + crc-32: 1.2.2 + chai@5.3.3: dependencies: assertion-error: 2.0.1 @@ -3626,6 +3677,8 @@ snapshots: code-block-writer@11.0.3: {} + codepage@1.15.0: {} + color-convert@2.0.1: dependencies: color-name: 1.1.4 @@ -3685,6 +3738,8 @@ snapshots: path-type: 4.0.0 yaml: 1.10.2 + crc-32@1.2.2: {} + create-require@1.1.1: {} cross-spawn@7.0.6: @@ -4026,6 +4081,8 @@ snapshots: forwarded@0.2.0: {} + frac@1.1.2: {} + fresh@2.0.0: {} fs.realpath@1.0.0: {} @@ -4904,6 +4961,10 @@ snapshots: sprintf-js@1.1.3: {} + 
ssf@0.11.2: + dependencies: + frac: 1.1.2 + stackback@0.0.2: {} statuses@2.0.2: {} @@ -5228,6 +5289,10 @@ snapshots: siginfo: 2.0.0 stackback: 0.0.2 + wmf@1.0.2: {} + + word@0.3.0: {} + wordwrapjs@5.1.1: {} wrap-ansi@9.0.2: @@ -5240,6 +5305,16 @@ snapshots: ws@8.18.3: {} + xlsx@0.18.5: + dependencies: + adler-32: 1.3.1 + cfb: 1.2.2 + codepage: 1.15.0 + crc-32: 1.2.2 + ssf: 0.11.2 + wmf: 1.0.2 + word: 0.3.0 + xml-name-validator@5.0.0: {} xmlbuilder@10.1.1: {} diff --git a/scripts/check-unused-exports.js b/scripts/check-unused-exports.js index 07b9df0..e647a85 100755 --- a/scripts/check-unused-exports.js +++ b/scripts/check-unused-exports.js @@ -10,7 +10,7 @@ const { execSync } = require('child_process') try { // Run ts-prune const output = execSync( - 'npx ts-prune --project tsconfig.json --ignore "src/index.ts|__tests__|test|vitest"', + 'npx ts-prune --project tsconfig.json --ignore "src/index.ts|__tests__|test|vitest|src/parser/custom"', { encoding: 'utf8' } ) @@ -66,4 +66,4 @@ try { } catch (error) { console.error('Error occurred:', error.message) process.exit(1) -} \ No newline at end of file +} diff --git a/server.json b/server.json index 5194bcf..d36e9e7 100644 --- a/server.json +++ b/server.json @@ -54,6 +54,13 @@ "format": "string", "isSecret": false }, + { + "name": "MCP_LOCAL_RAG_PARSERS", + "description": "Path to a JSON file that maps file extensions to custom parsers", + "isRequired": false, + "format": "string", + "isSecret": false + }, { "name": "RAG_MAX_DISTANCE", "description": "Maximum distance threshold for filtering search results. Results with distance greater than this value will be excluded. Lower values mean stricter filtering (e.g., 0.5 for high relevance only)", diff --git a/src/bin/ingest.ts b/src/bin/ingest.ts new file mode 100644 index 0000000..7719794 --- /dev/null +++ b/src/bin/ingest.ts @@ -0,0 +1,560 @@ +/** + * MCP Local RAG Bulk Ingest CLI + * + * Ingests large folders without MCP tool timeouts. 
+ * + * Usage: + * npx mcp-local-rag ingest --path /Users/me/Desktop + * npx mcp-local-rag ingest --path /Users/me/Desktop --extensions .pdf,.md + * npx mcp-local-rag ingest --path /Users/me/Desktop --no-recursive --dry-run + */ + +import { randomUUID } from 'node:crypto' +import { stat } from 'node:fs/promises' +import { dirname, resolve } from 'node:path' +import { SemanticChunker } from '../chunker/index.js' +import { Embedder } from '../embedder/index.js' +import { DocumentParser } from '../parser/index.js' +import { type VectorChunk, VectorStore } from '../vectordb/index.js' + +// ============================================ +// Types +// ============================================ + +interface Options { + path?: string + baseDir?: string + dbPath?: string + cacheDir?: string + modelName?: string + maxFileSize?: number + batchSize?: number + recursive: boolean + includeHidden: boolean + extensions: string[] + excludes: string[] + maxFiles?: number + skipExisting: boolean + dryRun: boolean + progressEvery: number + failFast: boolean + failOnError: boolean + json: boolean + parsers?: string + help: boolean +} + +interface IngestStats { + total: number + processed: number + succeeded: number + failed: number + skipped: number + failures: { filePath: string; error: string }[] + startTimeMs: number +} + +// ============================================ +// Helpers +// ============================================ + +function splitList(value: string | undefined): string[] { + if (!value) return [] + return value + .split(',') + .map((item) => item.trim()) + .filter(Boolean) +} + +function formatDuration(ms: number): string { + const totalSeconds = Math.max(0, Math.round(ms / 1000)) + const hours = Math.floor(totalSeconds / 3600) + const minutes = Math.floor((totalSeconds % 3600) / 60) + const seconds = totalSeconds % 60 + return [hours, minutes, seconds].map((v) => String(v).padStart(2, '0')).join(':') +} + +function formatRate(processed: number, elapsedMs: number): 
string {
+  if (elapsedMs <= 0) return '0.0'
+  return (processed / (elapsedMs / 1000)).toFixed(1)
+}
+
+function printHelp(): void {
+  console.log(`
+MCP Local RAG Bulk Ingest
+
+Usage:
+  npx mcp-local-rag ingest --path <path> [options]
+
+Options:
+  --path, -p        File or directory to ingest (required)
+  --base-dir        Base directory boundary (defaults to path or its parent)
+  --db-path         LanceDB path (default: ./lancedb or DB_PATH)
+  --cache-dir       Model cache dir (default: ./models or CACHE_DIR)
+  --model           Embedding model (default: Xenova/all-MiniLM-L6-v2)
+  --max-file-size   Max file size in bytes (default: 104857600)
+  --batch-size      Embedding batch size (default: 8)
+  --extensions      Comma-separated extensions (e.g., .pdf,.md)
+  --exclude         Comma-separated path substrings to skip (added to defaults)
+  --no-recursive    Do not traverse directories
+  --include-hidden  Include hidden files and folders
+  --max-files       Limit number of files processed
+  --skip-existing   Skip files already indexed (default)
+  --force           Re-ingest even if already indexed
+  --dry-run         List file counts without ingesting
+  --progress-every  Print progress every N files (default: 25)
+  --parsers         Path to custom parser config JSON
+  --fail-fast       Stop at first failure
+  --fail-on-error   Exit with non-zero code if any failures
+  --json            Output final summary as JSON
+  --help, -h        Show this help message
+
+Examples:
+  npx mcp-local-rag ingest --path /Users/me/Desktop
+  npx mcp-local-rag ingest --path /Users/me/Desktop --extensions .pdf,.md
+  npx mcp-local-rag ingest --path /Users/me/Desktop --exclude node_modules,dist
+`)
+}
+
+function parseArgs(args: string[]): Options {
+  const options: Options = {
+    recursive: true,
+    includeHidden: false,
+    extensions: [],
+    excludes: [],
+    skipExisting: true,
+    dryRun: false,
+    progressEvery: 25,
+    failFast: false,
+    failOnError: false,
+    json: false,
+    help: false,
+  }
+
+  for (let i = 0; i < args.length; i++) {
+    const arg = args[i]
+
+    switch (arg) {
+      case '--help':
+      case '-h':
+        options.help = 
true + break + + case '--path': + case '-p': { + const value = args[i + 1] + if (!value) { + console.error('Error: --path requires a value') + process.exit(1) + } + options.path = value + i++ + break + } + + case '--base-dir': { + const value = args[i + 1] + if (!value) { + console.error('Error: --base-dir requires a value') + process.exit(1) + } + options.baseDir = value + i++ + break + } + + case '--db-path': { + const value = args[i + 1] + if (!value) { + console.error('Error: --db-path requires a value') + process.exit(1) + } + options.dbPath = value + i++ + break + } + + case '--cache-dir': { + const value = args[i + 1] + if (!value) { + console.error('Error: --cache-dir requires a value') + process.exit(1) + } + options.cacheDir = value + i++ + break + } + + case '--model': { + const value = args[i + 1] + if (!value) { + console.error('Error: --model requires a value') + process.exit(1) + } + options.modelName = value + i++ + break + } + + case '--max-file-size': { + const value = args[i + 1] + if (!value || Number.isNaN(Number(value))) { + console.error('Error: --max-file-size requires a numeric value') + process.exit(1) + } + options.maxFileSize = Number.parseInt(value, 10) + i++ + break + } + + case '--batch-size': { + const value = args[i + 1] + if (!value || Number.isNaN(Number(value))) { + console.error('Error: --batch-size requires a numeric value') + process.exit(1) + } + options.batchSize = Number.parseInt(value, 10) + i++ + break + } + + case '--extensions': + case '--ext': { + const value = args[i + 1] + if (!value) { + console.error('Error: --extensions requires a comma-separated list') + process.exit(1) + } + options.extensions.push(...splitList(value)) + i++ + break + } + + case '--exclude': { + const value = args[i + 1] + if (!value) { + console.error('Error: --exclude requires a comma-separated list') + process.exit(1) + } + options.excludes.push(...splitList(value)) + i++ + break + } + + case '--max-files': { + const value = args[i + 1] + if 
(!value || Number.isNaN(Number(value))) { + console.error('Error: --max-files requires a numeric value') + process.exit(1) + } + options.maxFiles = Number.parseInt(value, 10) + i++ + break + } + + case '--no-recursive': + options.recursive = false + break + + case '--recursive': + options.recursive = true + break + + case '--include-hidden': + options.includeHidden = true + break + + case '--skip-existing': + options.skipExisting = true + break + + case '--force': + options.skipExisting = false + break + + case '--dry-run': + options.dryRun = true + break + + case '--progress-every': { + const value = args[i + 1] + if (!value || Number.isNaN(Number(value))) { + console.error('Error: --progress-every requires a numeric value') + process.exit(1) + } + options.progressEvery = Number.parseInt(value, 10) + i++ + break + } + + case '--parsers': { + const value = args[i + 1] + if (!value) { + console.error('Error: --parsers requires a path') + process.exit(1) + } + options.parsers = value + i++ + break + } + + case '--fail-fast': + options.failFast = true + break + + case '--fail-on-error': + options.failOnError = true + break + + case '--json': + options.json = true + break + + default: { + if (arg?.startsWith('-')) { + console.error(`Unknown option: ${arg}`) + process.exit(1) + } + if (!options.path) { + if (!arg) { + console.error('Error: Missing path argument') + process.exit(1) + } + options.path = arg + } else { + console.error(`Unexpected argument: ${arg}`) + process.exit(1) + } + } + } + } + + return options +} + +function printProgress(stats: IngestStats): void { + const elapsedMs = Date.now() - stats.startTimeMs + const rate = formatRate(stats.processed, elapsedMs) + const remaining = stats.total - stats.processed + const etaMs = stats.processed > 0 ? (elapsedMs / stats.processed) * remaining : 0 + const eta = stats.processed > 0 ? 
formatDuration(etaMs) : '--:--:--'
+
+  console.error(
+    `[ingest] ${stats.processed}/${stats.total} ` +
+      `ok:${stats.succeeded} fail:${stats.failed} skip:${stats.skipped} ` +
+      `${rate} files/s ETA ${eta}`
+  )
+}
+
+// ============================================
+// CLI Runner
+// ============================================
+
+export async function run(args: string[]): Promise<void> {
+  const options = parseArgs(args)
+
+  if (options.help) {
+    printHelp()
+    process.exit(0)
+  }
+
+  if (!options.path) {
+    console.error('Error: --path is required')
+    printHelp()
+    process.exit(1)
+  }
+
+  if (options.parsers) {
+    process.env['MCP_LOCAL_RAG_PARSERS'] = options.parsers
+  }
+
+  const targetPath = resolve(options.path)
+  const targetStats = await stat(targetPath).catch((error) => {
+    console.error(`Error: Failed to access path ${targetPath}`)
+    throw error
+  })
+
+  const baseDir = options.baseDir || (targetStats.isDirectory() ? targetPath : dirname(targetPath))
+
+  const dbPath = options.dbPath || process.env['DB_PATH'] || './lancedb/'
+  const cacheDir = options.cacheDir || process.env['CACHE_DIR'] || './models/'
+  const modelName = options.modelName || process.env['MODEL_NAME'] || 'Xenova/all-MiniLM-L6-v2'
+  const maxFileSize =
+    options.maxFileSize || Number.parseInt(process.env['MAX_FILE_SIZE'] || '104857600', 10)
+  const batchSize = options.batchSize || 8
+
+  const parser = new DocumentParser({ baseDir, maxFileSize })
+
+  let files: string[]
+  if (targetStats.isDirectory()) {
+    const listOptions: {
+      directoryPath: string
+      recursive?: boolean
+      includeHidden?: boolean
+      extensions?: string[]
+      excludes?: string[]
+    } = {
+      directoryPath: targetPath,
+      recursive: options.recursive,
+      includeHidden: options.includeHidden,
+    }
+    if (options.extensions.length > 0) {
+      listOptions.extensions = options.extensions
+    }
+    if (options.excludes.length > 0) {
+      listOptions.excludes = options.excludes
+    }
+    files = await parser.listFilesInDirectory(listOptions)
+  } else if 
(targetStats.isFile()) { + files = [targetPath] + } else { + console.error(`Error: Path is not a file or directory: ${targetPath}`) + process.exit(1) + } + + if (options.excludes.length > 0) { + files = files.filter((filePath) => !options.excludes.some((skip) => filePath.includes(skip))) + } + + if (options.maxFiles !== undefined) { + files = files.slice(0, Math.max(0, options.maxFiles)) + } + + if (options.dryRun) { + const summary = { + totalFiles: files.length, + baseDir, + dbPath, + cacheDir, + modelName, + recursive: options.recursive, + includeHidden: options.includeHidden, + extensions: options.extensions, + excludes: options.excludes, + } + if (options.json) { + console.log(JSON.stringify(summary, null, 2)) + } else { + console.log('Dry run summary:') + console.log(summary) + } + process.exit(0) + } + + const vectorStore = new VectorStore({ dbPath, tableName: 'chunks' }) + await vectorStore.initialize() + + const embedder = new Embedder({ modelPath: modelName, batchSize, cacheDir }) + const chunker = new SemanticChunker() + + let existing = new Set() + if (options.skipExisting) { + const existingFiles = await vectorStore.listFiles() + existing = new Set(existingFiles.map((entry) => entry.filePath)) + } + + const stats: IngestStats = { + total: files.length, + processed: 0, + succeeded: 0, + failed: 0, + skipped: 0, + failures: [], + startTimeMs: Date.now(), + } + + for (const filePath of files) { + if (options.skipExisting && existing.has(filePath)) { + stats.skipped++ + stats.processed++ + if (stats.processed % options.progressEvery === 0) { + printProgress(stats) + } + continue + } + + try { + const isPdf = filePath.toLowerCase().endsWith('.pdf') + const text = isPdf + ? await parser.parsePdf(filePath, embedder) + : await parser.parseFile(filePath) + + const chunks = await chunker.chunkText(text, embedder) + if (chunks.length === 0) { + throw new Error( + 'No chunks generated (minimum 50 characters required). File may be empty or filtered.' 
+ ) + } + + const embeddings = await embedder.embedBatch(chunks.map((chunk) => chunk.text)) + + if (!options.skipExisting) { + await vectorStore.deleteChunks(filePath) + } + + const timestamp = new Date().toISOString() + const vectorChunks: VectorChunk[] = chunks.map((chunk, index) => { + const embedding = embeddings[index] + if (!embedding) { + throw new Error(`Missing embedding for chunk ${index}`) + } + return { + id: randomUUID(), + filePath, + chunkIndex: chunk.index, + text: chunk.text, + vector: embedding, + metadata: { + fileName: filePath.split('/').pop() || filePath, + fileSize: text.length, + fileType: filePath.split('.').pop() || '', + }, + timestamp, + } + }) + + await vectorStore.insertChunks(vectorChunks) + + stats.succeeded++ + } catch (error) { + stats.failed++ + stats.failures.push({ + filePath, + error: (error as Error).message, + }) + if (options.failFast) { + stats.processed++ + printProgress(stats) + break + } + } + + stats.processed++ + if (stats.processed % options.progressEvery === 0) { + printProgress(stats) + } + } + + const durationMs = Date.now() - stats.startTimeMs + const summary = { + total: stats.total, + processed: stats.processed, + succeeded: stats.succeeded, + failed: stats.failed, + skipped: stats.skipped, + duration: formatDuration(durationMs), + filesPerSecond: formatRate(stats.processed, durationMs), + failures: stats.failures.slice(0, 20), + } + + if (options.json) { + console.log(JSON.stringify(summary, null, 2)) + } else { + console.log('Ingest summary:') + console.log(summary) + } + + if (options.failOnError && stats.failed > 0) { + process.exit(1) + } +} diff --git a/src/index.ts b/src/index.ts index 5d4ad99..fa14c25 100644 --- a/src/index.ts +++ b/src/index.ts @@ -1,6 +1,7 @@ #!/usr/bin/env node // Entry point for RAG MCP Server +import { run as runBulkIngest } from './bin/ingest.js' import { run as runSkillsInstall } from './bin/install-skills.js' import { RAGServer } from './server/index.js' import type { 
GroupingMode } from './vectordb/index.js'
@@ -10,12 +11,14 @@ import type { GroupingMode } from './vectordb/index.js'
 // ============================================
 
 const args = process.argv.slice(2)
+let handled = false
 
 // Handle "skills" subcommand
 if (args[0] === 'skills') {
   if (args[1] === 'install') {
     // npx mcp-local-rag skills install [options]
     runSkillsInstall(args.slice(2))
+    handled = true
     process.exit(0)
   } else {
     console.error('Unknown skills subcommand. Usage: npx mcp-local-rag skills install [options]')
@@ -24,6 +27,17 @@ if (args[0] === 'skills') {
   }
 }
 
+// Handle "ingest" subcommand
+if (args[0] === 'ingest') {
+  handled = true
+  runBulkIngest(args.slice(1))
+    .then(() => process.exit(0))
+    .catch((error) => {
+      console.error('Bulk ingest failed:', error)
+      process.exit(1)
+    })
+}
+
 // ============================================
 // MCP Server (default behavior)
 // ============================================
@@ -125,5 +139,7 @@ process.on('uncaughtException', (error) => {
   process.exit(1)
 })
 
-// Execute main
-main()
+// Execute main (only if no subcommand)
+if (!handled) {
+  main()
+}
diff --git a/src/parser/__tests__/parser.test.ts b/src/parser/__tests__/parser.test.ts
index c7a9ec2..9af761e 100644
--- a/src/parser/__tests__/parser.test.ts
+++ b/src/parser/__tests__/parser.test.ts
@@ -9,11 +9,15 @@ describe('DocumentParser', () => {
   let parser: DocumentParser
   const testDir = join(process.cwd(), 'tmp', 'test-parser')
   const maxFileSize = 100 * 1024 * 1024 // 100MB
+  let originalParserEnv: string | undefined
 
   beforeEach(async () => {
     // Create test directory
     await mkdir(testDir, { recursive: true })
 
+    originalParserEnv = process.env['MCP_LOCAL_RAG_PARSERS']
+    delete process.env['MCP_LOCAL_RAG_PARSERS']
+
     parser = new DocumentParser({
       baseDir: testDir,
       maxFileSize,
@@ -21,6 +25,12 @@ describe('DocumentParser', () => {
   })
 
   afterEach(async () => {
+    if (originalParserEnv === undefined) {
+      delete process.env['MCP_LOCAL_RAG_PARSERS']
+    } else {
+      
process.env['MCP_LOCAL_RAG_PARSERS'] = originalParserEnv + } + // Cleanup test directory await rm(testDir, { recursive: true, force: true }) }) @@ -116,6 +126,33 @@ describe('DocumentParser', () => { expect(result).toBe(content) }) + it('should parse JSON file successfully', async () => { + const filePath = join(testDir, 'test.json') + const content = '{"name":"local-rag","version":"1.0.0"}' + await writeFile(filePath, content, 'utf-8') + + const result = await parser.parseFile(filePath) + expect(result).toBe(content) + }) + + it('should parse YAML file successfully', async () => { + const filePath = join(testDir, 'test.yaml') + const content = 'name: local-rag\nversion: 1.0.0' + await writeFile(filePath, content, 'utf-8') + + const result = await parser.parseFile(filePath) + expect(result).toBe(content) + }) + + it('should parse source code file successfully', async () => { + const filePath = join(testDir, 'test.ts') + const content = 'export const value = 42' + await writeFile(filePath, content, 'utf-8') + + const result = await parser.parseFile(filePath) + expect(result).toBe(content) + }) + it('should throw ValidationError for unsupported file format', async () => { const filePath = join(testDir, 'test.xyz') await writeFile(filePath, 'fake xyz content') @@ -148,6 +185,31 @@ describe('DocumentParser', () => { const nonExistentFile = join(testDir, 'nonexistent.txt') await expect(parser.parseFile(nonExistentFile)).rejects.toThrow(FileOperationError) }) + + it('should surface custom parser load failures with guidance', async () => { + const configPath = join(testDir, 'file_parsers.json') + await writeFile( + configPath, + JSON.stringify({ '.note': { module: '/no/such/parser.js', export: 'parseFile' } }, null, 2), + 'utf-8' + ) + + process.env['MCP_LOCAL_RAG_PARSERS'] = configPath + parser = new DocumentParser({ + baseDir: testDir, + maxFileSize, + }) + + const filePath = join(testDir, 'test.note') + await writeFile(filePath, 'note content', 'utf-8') + + await 
expect(parser.parseFile(filePath)).rejects.toThrow( + expect.objectContaining({ + name: 'FileOperationError', + message: expect.stringMatching(/Custom parser for \.note failed to load/), + }) + ) + }) }) describe('parseTxt', () => { diff --git a/src/parser/custom/sample-note-parser.ts b/src/parser/custom/sample-note-parser.ts new file mode 100644 index 0000000..562bc4e --- /dev/null +++ b/src/parser/custom/sample-note-parser.ts @@ -0,0 +1,5 @@ +import { readFile } from 'node:fs/promises' + +export async function parseFile(filePath: string): Promise<string> { + return await readFile(filePath, 'utf-8') +} diff --git a/src/parser/index.ts b/src/parser/index.ts index 0a2c707..2589e8c 100644 --- a/src/parser/index.ts +++ b/src/parser/index.ts @@ -1,17 +1,119 @@ -// DocumentParser implementation with PDF/DOCX/TXT/MD support +// DocumentParser implementation with PDF/DOCX/PPTX/XLSX/TXT/MD and text-based config/code support -import { statSync } from 'node:fs' -import { readFile } from 'node:fs/promises' -import { extname, isAbsolute, resolve } from 'node:path' +import { existsSync, statSync } from 'node:fs' +import { readFile, readdir, stat } from 'node:fs/promises' +import { dirname, extname, isAbsolute, resolve } from 'node:path' +import { pathToFileURL } from 'node:url' +import JSZip from 'jszip' import mammoth from 'mammoth' import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs' import type { TextItem } from 'pdfjs-dist/types/src/display/api' +import * as XLSX from 'xlsx' import { type EmbedderInterface, type PageData, filterPageBoundarySentences } from './pdf-filter.js' // ============================================ // Type Definitions // ============================================ +type CustomParser = (filePath: string) => Promise<string> + +interface CustomParserSpec { + module: string + export?: string +} + +const MARKDOWN_EXTENSIONS = new Set(['.md', '.markdown', '.mdx']) +const TEXT_EXTENSIONS = new Set(['.txt', '.log', '.rst']) +const CODE_EXTENSIONS = new Set([ +
'.py', + '.pyi', + '.js', + '.jsx', + '.ts', + '.tsx', + '.java', + '.kt', + '.kts', + '.go', + '.rs', + '.c', + '.h', + '.hpp', + '.cpp', + '.cc', + '.cs', + '.rb', + '.php', + '.swift', + '.scala', + '.lua', + '.sh', + '.bash', + '.zsh', + '.ps1', + '.sql', + '.graphql', + '.gql', + '.vue', + '.svelte', + '.dart', + '.r', + '.m', + '.mm', + '.pl', + '.pm', + '.t', +]) +const CONFIG_EXTENSIONS = new Set([ + '.json', + '.jsonl', + '.yaml', + '.yml', + '.toml', + '.ini', + '.cfg', + '.conf', + '.config', + '.settings', + '.env', +]) +const CSV_EXTENSIONS = new Set(['.csv', '.tsv']) +const EXCEL_EXTENSIONS = new Set(['.xlsx', '.xls']) +const POWERPOINT_EXTENSIONS = new Set(['.pptx']) +const DEFAULT_EXCLUDES = new Set([ + '.git', + '.hg', + '.svn', + 'node_modules', + 'dist', + 'build', + 'out', + '.next', + '.nuxt', + '.svelte-kit', + '.astro', + 'target', + '.gradle', + '.mvn', + 'bin', + 'obj', + '.vs', + '.cache', + '.venv', + 'venv', + '__pycache__', + '.pytest_cache', + '.mypy_cache', + '.ruff_cache', + 'coverage', + 'vendor', + '.DS_Store', + 'Thumbs.db', + 'desktop.ini', +]) + +const DEFAULT_PARSER_CONFIG = resolve(process.cwd(), 'config', 'file_parsers.json') + /** * DocumentParser configuration */ @@ -53,18 +155,25 @@ export class FileOperationError extends Error { // ============================================ /** - * Document parser class (PDF/DOCX/TXT/MD support) + * Document parser class (PDF/DOCX/PPTX/XLSX/TXT/MD + config/code support) * * Responsibilities: * - File path validation (path traversal prevention) * - File size validation (100MB limit) - * - Parse 4 formats (PDF/DOCX/TXT/MD) + * - Parse common formats (PDF/DOCX/PPTX/XLSX/TXT/MD + config/code) */ export class DocumentParser { private readonly config: ParserConfig + private readonly customParserConfigPath: string + private customParsersLoaded = false + private readonly customParsers = new Map<string, CustomParser>() + private readonly customParserErrors = new Map<string, Error>() + private readonly customParserModules = new
Map<string, string>() + private readonly customParserSpecs = new Map<string, CustomParserSpec>() constructor(config: ParserConfig) { this.config = config + this.customParserConfigPath = process.env['MCP_LOCAL_RAG_PARSERS'] || DEFAULT_PARSER_CONFIG } /** @@ -115,6 +224,245 @@ export class DocumentParser { } } + /** + * Directory path validation (Absolute path requirement + Path traversal prevention) + * + * @param directoryPath - Directory path to validate (must be absolute) + * @throws ValidationError - When path is not absolute or outside BASE_DIR + */ + validateDirectoryPath(directoryPath: string): void { + // Check if path is absolute + if (!isAbsolute(directoryPath)) { + throw new ValidationError( + `Directory path must be absolute path (received: ${directoryPath}). Please provide an absolute path within BASE_DIR.` + ) + } + + // Check if path is within BASE_DIR + const baseDir = resolve(this.config.baseDir) + const normalizedPath = resolve(directoryPath) + + if (!normalizedPath.startsWith(baseDir)) { + throw new ValidationError( + `Directory path must be within BASE_DIR (${baseDir}). Received path outside BASE_DIR: ${directoryPath}` + ) + } + } + + private normalizeExtension(extension: string): string | null { + const trimmed = extension.trim() + if (!trimmed) { + return null + } + return trimmed.startsWith('.') ? trimmed.toLowerCase() : `.${trimmed.toLowerCase()}` + } + + private async ensureCustomParsersLoaded(): Promise<void> { + if (this.customParsersLoaded) { + return + } + + this.customParsersLoaded = true + + if (!existsSync(this.customParserConfigPath)) { + return + } + + try { + const raw = await readFile(this.customParserConfigPath, 'utf-8') + const data = JSON.parse(raw) as Record<string, CustomParserSpec | string> + + for (const [extension, spec] of Object.entries(data)) { + const normalized = this.normalizeExtension(extension) + if (!normalized) { + continue + } + + const moduleSpec: CustomParserSpec = typeof spec === 'string' ?
{ module: spec } : spec + if (!moduleSpec?.module) { + console.warn(`Custom parser for ${normalized} missing module path`) + this.customParserErrors.set( + normalized, + new Error(`Custom parser for ${normalized} missing module path`) + ) + continue + } + + this.customParserSpecs.set(normalized, moduleSpec) + + const isFilePath = moduleSpec.module.startsWith('.') || moduleSpec.module.startsWith('/') + const resolvedPath = isFilePath + ? resolve(process.cwd(), moduleSpec.module) + : moduleSpec.module + const importTarget = isFilePath ? pathToFileURL(resolvedPath).href : resolvedPath + this.customParserModules.set(normalized, resolvedPath) + + try { + const mod = await import(importTarget) + const handler = + (moduleSpec.export && mod[moduleSpec.export]) || + mod.default || + mod.parseFile || + mod.parse + + if (typeof handler !== 'function') { + console.warn(`Custom parser for ${normalized} did not export a function`) + this.customParserErrors.set( + normalized, + new Error(`Custom parser for ${normalized} did not export a function`) + ) + continue + } + + this.customParsers.set(normalized, handler as CustomParser) + this.customParserErrors.delete(normalized) + } catch (error) { + console.warn(`Failed to load custom parser for ${normalized}:`, error) + this.customParserErrors.set(normalized, error as Error) + } + } + } catch (error) { + console.warn(`Failed to read custom parser config: ${this.customParserConfigPath}`, error) + } + } + + private extractMissingModuleName(error: Error): string | null { + const message = error.message || '' + const match = + message.match(/Cannot find module '([^']+)'/) || + message.match(/Cannot find module "([^"]+)"/) || + message.match(/Cannot find package '([^']+)'/) || + message.match(/Cannot find package "([^"]+)"/) + return match?.[1] ?? 
null + } + + private extractUnknownFileExtension(error: Error): string | null { + const message = error.message || '' + const match = message.match(/Unknown file extension ["']?(\.[^"']+)["']?/) + return match?.[1] ?? null + } + + private buildCustomParserHint(extension: string, error: Error): string { + const missingModule = this.extractMissingModuleName(error) + if (missingModule) { + const modulePath = this.customParserModules.get(extension) + const parserDir = + modulePath && (modulePath.startsWith('/') || modulePath.startsWith('.')) + ? dirname(modulePath) + : undefined + const installCmd = parserDir + ? `cd ${parserDir} && npm i ${missingModule}` + : `npm i ${missingModule}` + return `Missing dependency "${missingModule}". Install it where the parser lives (e.g. ${installCmd}) or bundle the parser.` + } + + const unknownExtension = this.extractUnknownFileExtension(error) + if (unknownExtension) { + return `Node cannot import "${unknownExtension}" files. Compile the parser to JS or use a .mjs/.cjs file.` + } + + if (error.message.includes('did not export a function')) { + const moduleSpec = this.customParserSpecs.get(extension) + const exportHint = moduleSpec?.export + ? `"${moduleSpec.export}"` + : 'default export, parseFile, or parse' + return `Ensure the module exports a function (${exportHint}).` + } + + return 'Check the parser module path and its dependencies.' + } + + private formatCustomParserError( + phase: 'load' | 'run', + extension: string, + error: Error, + filePath?: string + ): string { + const hint = this.buildCustomParserHint(extension, error) + if (phase === 'load') { + return `Custom parser for ${extension} failed to load. ${hint}` + } + return `Custom parser for ${extension} failed while parsing ${filePath || 'file'}. 
${hint}` + } + + async getSupportedExtensions(): Promise<string[]> { + await this.ensureCustomParsersLoaded() + const builtIn = new Set([ + '.pdf', + '.docx', + ...MARKDOWN_EXTENSIONS, + ...TEXT_EXTENSIONS, + ...CODE_EXTENSIONS, + ...CONFIG_EXTENSIONS, + ...CSV_EXTENSIONS, + ...EXCEL_EXTENSIONS, + ...POWERPOINT_EXTENSIONS, + ]) + + for (const customExt of this.customParsers.keys()) { + builtIn.add(customExt) + } + + return Array.from(builtIn).sort() + } + + async listFilesInDirectory(options: { + directoryPath: string + recursive?: boolean + includeHidden?: boolean + extensions?: string[] + excludes?: string[] + }): Promise<string[]> { + const { directoryPath, recursive = true, includeHidden = false, excludes = [] } = options + this.validateDirectoryPath(directoryPath) + + const stats = await stat(directoryPath) + if (!stats.isDirectory()) { + throw new ValidationError(`Path is not a directory: ${directoryPath}`) + } + + const supported = options.extensions + ? options.extensions + .map((ext) => this.normalizeExtension(ext)) + .filter((ext): ext is string => Boolean(ext)) + : await this.getSupportedExtensions() + const supportedSet = new Set(supported) + const excludePatterns = excludes.filter((pattern) => pattern.length > 0) + + const results: string[] = [] + const walk = async (dir: string): Promise<void> => { + const entries = await readdir(dir, { withFileTypes: true }) + for (const entry of entries) { + if (!includeHidden && entry.name.startsWith('.')) { + continue + } + const fullPath = resolve(dir, entry.name) + if (DEFAULT_EXCLUDES.has(entry.name)) { + continue + } + if (excludePatterns.some((pattern) => fullPath.includes(pattern))) { + continue + } + if (entry.isDirectory()) { + if (recursive) { + await walk(fullPath) + } + continue + } + if (!entry.isFile()) { + continue + } + const ext = extname(entry.name).toLowerCase() + if (supportedSet.has(ext)) { + results.push(fullPath) + } + } + } + + await walk(directoryPath) + return results + } + /** * File parsing (auto format
detection) * @@ -129,15 +477,49 @@ this.validateFileSize(filePath) // Format detection (PDF uses parsePdf directly) + await this.ensureCustomParsersLoaded() const ext = extname(filePath).toLowerCase() + + const loadError = this.customParserErrors.get(ext) + if (loadError) { + throw new FileOperationError(this.formatCustomParserError('load', ext, loadError), loadError) + } + + const customParser = this.customParsers.get(ext) + if (customParser) { + try { + return await customParser(filePath) + } catch (error) { + throw new FileOperationError( + this.formatCustomParserError('run', ext, error as Error, filePath), + error as Error + ) + } + } + switch (ext) { case '.docx': return await this.parseDocx(filePath) - case '.txt': - return await this.parseTxt(filePath) - case '.md': - return await this.parseMd(filePath) + case '.pptx': + return await this.parsePptx(filePath) + case '.xlsx': + case '.xls': + return await this.parseXlsx(filePath) default: + if (ext === '.txt') { + return await this.parseTxt(filePath) + } + if (MARKDOWN_EXTENSIONS.has(ext)) { + return await this.parseMd(filePath) + } + if ( + TEXT_EXTENSIONS.has(ext) || + CODE_EXTENSIONS.has(ext) || + CONFIG_EXTENSIONS.has(ext) || + CSV_EXTENSIONS.has(ext) + ) { + return await this.parseText(filePath, 'TXT') + } throw new ValidationError(`Unsupported file format: ${ext}`) } } @@ -217,22 +599,112 @@ export class DocumentParser { } /** - * TXT parsing (using fs.readFile) + * PPTX parsing (slides text) * - * @param filePath - TXT file path + * @param filePath - PPTX file path * @returns Parsed text - * @throws FileOperationError - File read failed + * @throws FileOperationError - File read failed, parse failed */ - private async parseTxt(filePath: string): Promise<string> { + private async parsePptx(filePath: string): Promise<string> { + try { + const buffer = await readFile(filePath) + const zip = await JSZip.loadAsync(buffer) + const slideEntries = Object.keys(zip.files) + .filter((name) =>
name.startsWith('ppt/slides/slide') && name.endsWith('.xml')) + .sort((a, b) => a.localeCompare(b, undefined, { numeric: true })) + + const notesEntries = Object.keys(zip.files) + .filter((name) => name.startsWith('ppt/notesSlides/notesSlide') && name.endsWith('.xml')) + .sort((a, b) => a.localeCompare(b, undefined, { numeric: true })) + + const sections: string[] = [] + for (const name of [...slideEntries, ...notesEntries]) { + const xml = await zip.files[name]?.async('string') + if (!xml) { + continue + } + const text = this.extractPptxText(xml) + if (text.trim()) { + sections.push(text) + } + } + + const combined = sections.join('\n\n') + console.error(`Parsed PPTX: ${filePath} (${combined.length} characters)`) + return combined + } catch (error) { + throw new FileOperationError(`Failed to parse PPTX: ${filePath}`, error as Error) + } + } + + /** + * XLSX/XLS parsing (sheet text) + * + * @param filePath - Excel file path + * @returns Parsed text + * @throws FileOperationError - File read failed, parse failed + */ + private async parseXlsx(filePath: string): Promise<string> { + try { + const buffer = await readFile(filePath) + const workbook = XLSX.read(buffer, { type: 'buffer' }) + const sections: string[] = [] + for (const sheetName of workbook.SheetNames) { + const sheet = workbook.Sheets[sheetName] + if (!sheet) { + continue + } + const csv = XLSX.utils.sheet_to_csv(sheet) + if (csv.trim()) { + sections.push(`Sheet: ${sheetName}\n${csv}`) + } + } + const combined = sections.join('\n\n') + console.error(`Parsed XLSX: ${filePath} (${combined.length} characters)`) + return combined + } catch (error) { + throw new FileOperationError(`Failed to parse XLSX: ${filePath}`, error as Error) + } + } + + private decodeXmlEntities(text: string): string { + return text + .replace(/&amp;/g, '&') + .replace(/&lt;/g, '<') + .replace(/&gt;/g, '>') + .replace(/&quot;/g, '"') + .replace(/&apos;/g, "'") + .replace(/&#39;/g, "'") + .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) =>
String.fromCharCode(Number.parseInt(hex, 16))) + .replace(/&#([0-9]+);/g, (_, num) => String.fromCharCode(Number.parseInt(num, 10))) + } + + private extractPptxText(xml: string): string { + const matches = Array.from(xml.matchAll(/<a:t[^>]*>(.*?)<\/a:t>/g)) + return matches.map((match) => this.decodeXmlEntities(match[1] || '')).join(' ') + } + + private async parseText(filePath: string, label: string): Promise<string> { try { const text = await readFile(filePath, 'utf-8') - console.error(`Parsed TXT: ${filePath} (${text.length} characters)`) + console.error(`Parsed ${label}: ${filePath} (${text.length} characters)`) return text } catch (error) { - throw new FileOperationError(`Failed to parse TXT: ${filePath}`, error as Error) + throw new FileOperationError(`Failed to parse ${label}: ${filePath}`, error as Error) } } + /** + * TXT parsing (using fs.readFile) + * + * @param filePath - TXT file path + * @returns Parsed text + * @throws FileOperationError - File read failed + */ + private async parseTxt(filePath: string): Promise<string> { + return await this.parseText(filePath, 'TXT') + } + /** * MD parsing (using fs.readFile) * @@ -241,12 +713,6 @@ * @param filePath - MD file path * @returns Parsed text * @throws FileOperationError - File read failed */ private async parseMd(filePath: string): Promise<string> { - try { - const text = await readFile(filePath, 'utf-8') - console.error(`Parsed MD: ${filePath} (${text.length} characters)`) - return text - } catch (error) { - throw new FileOperationError(`Failed to parse MD: ${filePath}`, error as Error) - } + return await this.parseText(filePath, 'MD') } } diff --git a/src/server/index.ts b/src/server/index.ts index ea30fd9..5a6bcf0 100644 --- a/src/server/index.ts +++ b/src/server/index.ts @@ -1,7 +1,7 @@ // RAGServer implementation with MCP tools import { randomUUID } from 'node:crypto' -import { readFile, unlink } from 'node:fs/promises' +import { readFile, stat, unlink } from 'node:fs/promises' import { Server } from '@modelcontextprotocol/sdk/server/index.js' import {
StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js' import { @@ -65,6 +65,26 @@ export interface QueryDocumentsInput { export interface IngestFileInput { /** File path */ filePath: string + /** Recursive scan for directories (default true) */ + recursive?: boolean + /** Include hidden files when ingesting directories (default false) */ + includeHidden?: boolean + /** Restrict to extensions when ingesting directories (e.g., [".md", ".ts"]) */ + extensions?: string[] +} + +/** + * ingest_directory (via ingest_file when path is a directory) + */ +export interface IngestDirectoryInput { + /** Directory path */ + directoryPath: string + /** Recursive scan (default true) */ + recursive?: boolean + /** Include hidden files (default false) */ + includeHidden?: boolean + /** Restrict to extensions (e.g., [".md", ".ts"]) */ + extensions?: string[] } /** @@ -110,6 +130,22 @@ export interface IngestResult { timestamp: string } +/** + * ingest_directory result + */ +export interface IngestDirectoryResult { + /** Directory path */ + directoryPath: string + /** Total files found */ + filesProcessed: number + /** Files ingested successfully */ + filesSucceeded: number + /** Files failed */ + filesFailed: number + /** Error details for failed files */ + failures: { filePath: string; error: string }[] +} + /** * query_documents tool output */ @@ -134,7 +170,7 @@ export interface QueryResult { * RAG server compliant with MCP Protocol * * Responsibilities: - * - MCP tool integration (4 tools) + * - MCP tool integration (6 tools) * - Tool handler implementation * - Error handling * - Initialization (LanceDB, Transformers.js) @@ -214,14 +250,28 @@ export class RAGServer { { name: 'ingest_file', description: - 'Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. 
Supports re-ingestion to update existing documents.', + 'Ingest a document file (PDF, DOCX, PPTX, XLSX/XLS, TXT, MD, JSON, YAML, config files, source code) into the vector database for semantic search. File path must be an absolute path within BASE_DIR. You can also pass a directory path to ingest all supported files inside it (common dependency/build folders are skipped by default, e.g., node_modules, dist, build, target, bin, obj).', inputSchema: { type: 'object', properties: { filePath: { type: 'string', description: - 'Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"', + 'Absolute path to the file or directory to ingest. Example: "/Users/user/documents/manual.pdf" or "/Users/user/Documents"', }, + recursive: { + type: 'boolean', + description: 'When filePath is a directory, scan subfolders (default true).', + }, + includeHidden: { + type: 'boolean', + description: 'When filePath is a directory, include hidden files (default false).', + }, + extensions: { + type: 'array', + items: { type: 'string' }, + description: + 'When filePath is a directory, limit files to these extensions (e.g., [".md", ".ts"]).', }, }, required: ['filePath'], @@ -380,129 +430,209 @@ export class RAGServer { } } - /** - * ingest_file tool handler (re-ingestion support, transaction processing, rollback capability) - */ - async handleIngestFile( - args: IngestFileInput - ): Promise<{ content: [{ type: 'text'; text: string }] }> { + private async ingestSingleFile(filePath: string): Promise<IngestResult> { let backup: VectorChunk[] | null = null + // Parse file (with header/footer filtering for PDFs) + // For raw-data files (from ingest_data), read directly without validation + // since the path is internally generated and content is already processed + const isPdf = filePath.toLowerCase().endsWith('.pdf') + let text: string + if (isRawDataPath(filePath)) { + // Raw-data files: skip validation, read directly + text = await readFile(filePath, 'utf-8') + console.error(`Read
raw-data file: ${filePath} (${text.length} characters)`) + } else if (isPdf) { + text = await this.parser.parsePdf(filePath, this.embedder) + } else { + text = await this.parser.parseFile(filePath) + } + + // Split text into semantic chunks + const chunks = await this.chunker.chunkText(text, this.embedder) + + // Fail-fast: Prevent data loss when chunking produces 0 chunks + // This check must happen BEFORE delete to preserve existing data on re-ingest + if (chunks.length === 0) { + throw new McpError( + ErrorCode.InvalidParams, + `No chunks generated from file: ${filePath}. The file may be empty or all content was filtered (minimum 50 characters required). Existing data has been preserved.` + ) + } + + // Generate embeddings for final chunks + const embeddings = await this.embedder.embedBatch(chunks.map((chunk) => chunk.text)) + + // Create backup (if existing data exists) try { - // Parse file (with header/footer filtering for PDFs) - // For raw-data files (from ingest_data), read directly without validation - // since the path is internally generated and content is already processed - const isPdf = args.filePath.toLowerCase().endsWith('.pdf') - let text: string - if (isRawDataPath(args.filePath)) { - // Raw-data files: skip validation, read directly - text = await readFile(args.filePath, 'utf-8') - console.error(`Read raw-data file: ${args.filePath} (${text.length} characters)`) - } else if (isPdf) { - text = await this.parser.parsePdf(args.filePath, this.embedder) - } else { - text = await this.parser.parseFile(args.filePath) + const existingFiles = await this.vectorStore.listFiles() + const existingFile = existingFiles.find((file) => file.filePath === filePath) + if (existingFile && existingFile.chunkCount > 0) { + // Backup existing data (retrieve via search) + const queryVector = embeddings[0] || [] + if (queryVector.length > 0) { + const allChunks = await this.vectorStore.search(queryVector, undefined, 20) // Retrieve max 20 items + backup = allChunks + 
.filter((chunk) => chunk.filePath === filePath) + .map((chunk) => ({ + id: randomUUID(), + filePath: chunk.filePath, + chunkIndex: chunk.chunkIndex, + text: chunk.text, + vector: queryVector, // Use dummy vector since actual vector cannot be retrieved + metadata: chunk.metadata, + timestamp: new Date().toISOString(), + })) + } + console.error(`Backup created: ${backup?.length || 0} chunks for ${filePath}`) } + } catch (error) { + // Backup creation failure is warning only (for new files) + console.warn('Failed to create backup (new file?):', error) + } - // Split text into semantic chunks - const chunks = await this.chunker.chunkText(text, this.embedder) + // Delete existing data + await this.vectorStore.deleteChunks(filePath) + console.error(`Deleted existing chunks for: ${filePath}`) - // Fail-fast: Prevent data loss when chunking produces 0 chunks - // This check must happen BEFORE delete to preserve existing data on re-ingest - if (chunks.length === 0) { - throw new McpError( - ErrorCode.InvalidParams, - `No chunks generated from file: ${args.filePath}. The file may be empty or all content was filtered (minimum 50 characters required). 
Existing data has been preserved.` - ) + // Create vector chunks + const timestamp = new Date().toISOString() + const vectorChunks: VectorChunk[] = chunks.map((chunk, index) => { + const embedding = embeddings[index] + if (!embedding) { + throw new Error(`Missing embedding for chunk ${index}`) } + return { + id: randomUUID(), + filePath, + chunkIndex: chunk.index, + text: chunk.text, + vector: embedding, + metadata: { + fileName: filePath.split('/').pop() || filePath, + fileSize: text.length, + fileType: filePath.split('.').pop() || '', + }, + timestamp, + } + }) - // Generate embeddings for final chunks - const embeddings = await this.embedder.embedBatch(chunks.map((chunk) => chunk.text)) + // Insert vectors (transaction processing) + try { + await this.vectorStore.insertChunks(vectorChunks) + console.error(`Inserted ${vectorChunks.length} chunks for: ${filePath}`) + + // Delete backup on success + backup = null + } catch (insertError) { + // Rollback on error + if (backup && backup.length > 0) { + console.error('Ingestion failed, rolling back...', insertError) + try { + await this.vectorStore.insertChunks(backup) + console.error(`Rollback completed: ${backup.length} chunks restored`) + } catch (rollbackError) { + console.error('Rollback failed:', rollbackError) + throw new Error( + `Failed to ingest file and rollback failed: ${(insertError as Error).message}` + ) + } + } + throw insertError + } + + // Result + return { + filePath, + chunkCount: chunks.length, + timestamp, + } + } + + private async ingestDirectory(input: IngestDirectoryInput): Promise<IngestDirectoryResult> { + const listOptions: { + directoryPath: string + recursive?: boolean + includeHidden?: boolean + extensions?: string[] + } = { directoryPath: input.directoryPath } + if (input.recursive !== undefined) { + listOptions.recursive = input.recursive + } + if (input.includeHidden !== undefined) { + listOptions.includeHidden = input.includeHidden + } + if (input.extensions !== undefined) { + listOptions.extensions =
input.extensions + } + + const files = await this.parser.listFilesInDirectory(listOptions) - // Create backup (if existing data exists) + const failures: { filePath: string; error: string }[] = [] + let succeeded = 0 + + for (const filePath of files) { try { - const existingFiles = await this.vectorStore.listFiles() - const existingFile = existingFiles.find((file) => file.filePath === args.filePath) - if (existingFile && existingFile.chunkCount > 0) { - // Backup existing data (retrieve via search) - const queryVector = embeddings[0] || [] - if (queryVector.length > 0) { - const allChunks = await this.vectorStore.search(queryVector, undefined, 20) // Retrieve max 20 items - backup = allChunks - .filter((chunk) => chunk.filePath === args.filePath) - .map((chunk) => ({ - id: randomUUID(), - filePath: chunk.filePath, - chunkIndex: chunk.chunkIndex, - text: chunk.text, - vector: queryVector, // Use dummy vector since actual vector cannot be retrieved - metadata: chunk.metadata, - timestamp: new Date().toISOString(), - })) - } - console.error(`Backup created: ${backup?.length || 0} chunks for ${args.filePath}`) - } + await this.ingestSingleFile(filePath) + succeeded += 1 } catch (error) { - // Backup creation failure is warning only (for new files) - console.warn('Failed to create backup (new file?):', error) + failures.push({ + filePath, + error: (error as Error).message, + }) } + } - // Delete existing data - await this.vectorStore.deleteChunks(args.filePath) - console.error(`Deleted existing chunks for: ${args.filePath}`) + return { + directoryPath: input.directoryPath, + filesProcessed: files.length, + filesSucceeded: succeeded, + filesFailed: failures.length, + failures, + } + } - // Create vector chunks - const timestamp = new Date().toISOString() - const vectorChunks: VectorChunk[] = chunks.map((chunk, index) => { - const embedding = embeddings[index] - if (!embedding) { - throw new Error(`Missing embedding for chunk ${index}`) - } - return { - id: randomUUID(), 
- filePath: args.filePath, - chunkIndex: chunk.index, - text: chunk.text, - vector: embedding, - metadata: { - fileName: args.filePath.split('/').pop() || args.filePath, - fileSize: text.length, - fileType: args.filePath.split('.').pop() || '', - }, - timestamp, + /** + * ingest_file tool handler (re-ingestion support, transaction processing, rollback capability) + */ + async handleIngestFile( + args: IngestFileInput + ): Promise<{ content: [{ type: 'text'; text: string }] }> { + try { + if (!isRawDataPath(args.filePath)) { + let statsResult: Awaited<ReturnType<typeof stat>> | null = null + try { + statsResult = await stat(args.filePath) + } catch { + statsResult = null }  - }) - // Insert vectors (transaction processing) - try { - await this.vectorStore.insertChunks(vectorChunks) - console.error(`Inserted ${vectorChunks.length} chunks for: ${args.filePath}`) - - // Delete backup on success - backup = null - } catch (insertError) { - // Rollback on error - if (backup && backup.length > 0) { - console.error('Ingestion failed, rolling back...', insertError) - try { - await this.vectorStore.insertChunks(backup) - console.error(`Rollback completed: ${backup.length} chunks restored`) - } catch (rollbackError) { - console.error('Rollback failed:', rollbackError) - throw new Error( - `Failed to ingest file and rollback failed: ${(insertError as Error).message}` - ) + if (statsResult?.isDirectory()) { + const directoryInput: IngestDirectoryInput = { directoryPath: args.filePath } + if (args.recursive !== undefined) { + directoryInput.recursive = args.recursive + } + if (args.includeHidden !== undefined) { + directoryInput.includeHidden = args.includeHidden + } + if (args.extensions !== undefined) { + directoryInput.extensions = args.extensions + } + + const result = await this.ingestDirectory(directoryInput) + + return { + content: [ + { + type: 'text', + text: JSON.stringify(result, null, 2), + }, + ], } } - throw insertError } - // Result - const result: IngestResult = { - filePath:
args.filePath, - chunkCount: chunks.length, - timestamp, - } + const result = await this.ingestSingleFile(args.filePath) return { content: [