feat: Complete RAG System Implementation with Local Embeddings by olivermontes · Pull Request #179 · levante-hub/levante

olivermontes · 2026-01-17T21:05:55Z

RAG System Implementation

Closes #178

Overview

Complete implementation of a local Retrieval-Augmented Generation (RAG) system for Levante, with offline support and an extensible file-processing architecture.

Key Features

✅ Fully Local RAG

HuggingFace Transformers embeddings (Xenova/all-MiniLM-L6-v2)
LanceDB vector database with offline support
All data stays local, with no external API calls after the one-time embedding model download
Privacy-first approach

✅ Extensible File Support

PDF, DOCX, TXT, MD, JSON, CSV, HTML, HTM (8 extensions; DOCX is a placeholder until the mammoth library is integrated)
FileReader pattern following vectorstores.org architecture
Add new file types without database migrations
Individual reader classes for separation of concerns

✅ Robust Document Management

Upload with 100MB limit and validation
Chunk tracking with precise deletion
Delete all with confirmation modal
Real-time stats (total, indexed, processing, failed)
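
The upload limit and type validation could be expressed as a pure check that runs before a file reaches the processor. The sketch below is illustrative only: `validateUpload`, `MAX_FILE_SIZE_BYTES`, and `SUPPORTED_EXTENSIONS` are hypothetical names, the extension list is copied from the file types named in this PR, and treating 100MB as 100 * 1024 * 1024 bytes is an assumption.

```typescript
// Hypothetical names; the PR's actual validation code may differ.
const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // assumed meaning of the 100MB limit
const SUPPORTED_EXTENSIONS = ["pdf", "docx", "txt", "md", "json", "csv", "html", "htm"];

type ValidationResult = { ok: true } | { ok: false; reason: string };

function validateUpload(filename: string, sizeInBytes: number): ValidationResult {
  // Take everything after the last dot as the extension, case-insensitively.
  const dot = filename.lastIndexOf(".");
  const extension = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    return { ok: false, reason: `Unsupported file type: ${extension || "(none)"}` };
  }
  if (sizeInBytes > MAX_FILE_SIZE_BYTES) {
    return { ok: false, reason: "File exceeds the 100MB limit" };
  }
  return { ok: true };
}
```

Returning a result object rather than throwing keeps the check usable from both the renderer store and the IPC layer, which the PR says both validate file types.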

✅ Developer Experience

Dedicated RAG logging category (DEBUG_RAG)
Comprehensive error messages with possible causes
Database migrations (0006, 0007, 0008)
Migration runner in UI

Architecture

RAG Flow

Upload → DocumentProcessor → FileReader → Chunker → HuggingFace → LanceDB
                                                                       ↓
Chat → RAG Tool → Vector Search → Context Retrieval → LLM Response

File Reader Pattern (vectorstores.org)

DocumentProcessor → FileReaderFactory → Individual Readers
                          ↓
       Extension Mapping (pdf→PDFReader, csv→CSVReader, etc.)
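
A minimal sketch of how the extension mapping above might look. The class names mirror the files listed in this PR, but the `extensions`/`read` contract, the `register` method, and the string-in, string-out signature are assumptions made for illustration, not the PR's actual interface.

```typescript
// Sketch only: names follow the PR's file list; the signatures are assumed.
interface IFileReader {
  readonly extensions: string[];
  read(content: string): string;
}

class TextFileReader implements IFileReader {
  readonly extensions = ["txt", "md"];
  read(content: string): string {
    return content;
  }
}

class JSONFileReader implements IFileReader {
  readonly extensions = ["json"];
  read(content: string): string {
    // Re-serialize so downstream chunking sees consistently formatted text.
    return JSON.stringify(JSON.parse(content), null, 2);
  }
}

class FileReaderFactory {
  private readonly readers = new Map<string, IFileReader>();

  register(reader: IFileReader): void {
    for (const ext of reader.extensions) {
      this.readers.set(ext.toLowerCase(), reader);
    }
  }

  getReader(filename: string): IFileReader {
    const ext = filename.split(".").pop()?.toLowerCase() ?? "";
    const reader = this.readers.get(ext);
    if (!reader) throw new Error(`Unsupported file type: .${ext}`);
    return reader;
  }

  supportedExtensions(): string[] {
    return Array.from(this.readers.keys());
  }
}

const factory = new FileReaderFactory();
factory.register(new TextFileReader());
factory.register(new JSONFileReader());
```

Because the set of supported types lives in the map rather than in a database constraint, adding a reader only requires one more `register` call.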

Commits

bcbb8d4 - RAG system foundation
- Dependencies and database setup
- Core services (RAGService, DocumentProcessor, DocumentService)
- IPC layer and preload API
eca3427 - Complete RAG implementation
- UI components (KnowledgePage, KnowledgeStore)
- Chunk ID tracking for precise deletion
- RAG logging category
- Delete all functionality
5894574 - Extensible file reader architecture
- FileReader pattern implementation
- CSV and HTML support
- Removed file_type CHECK constraint (Migration 0008)
- Code-based validation only

Database Changes

Migration 0006: documents table

CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  filename TEXT NOT NULL,
  filepath TEXT NOT NULL,
  file_type TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL DEFAULT 'processing',
  chunk_count INTEGER DEFAULT 0,
  error_message TEXT DEFAULT NULL,
  uploaded_at INTEGER NOT NULL,
  indexed_at INTEGER DEFAULT NULL
);

Migration 0007: chunk_ids column

Added chunk_ids TEXT DEFAULT NULL for LanceDB chunk tracking
Enables precise deletion using chunk IDs
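
Since chunk_ids is a single TEXT column, the list of LanceDB chunk IDs has to be encoded somehow. The PR does not state the encoding; the sketch below assumes a JSON array of strings, and `serializeChunkIds` and `parseChunkIds` are hypothetical helper names.

```typescript
// Assumption: chunk_ids holds a JSON array of strings (encoding not stated in the PR).
function serializeChunkIds(chunkIds: string[]): string | null {
  // Store NULL for documents without chunks, matching DEFAULT NULL.
  return chunkIds.length > 0 ? JSON.stringify(chunkIds) : null;
}

function parseChunkIds(column: string | null): string[] {
  if (column === null || column === "") return [];
  const parsed: unknown = JSON.parse(column);
  if (!Array.isArray(parsed) || !parsed.every((id) => typeof id === "string")) {
    throw new Error("chunk_ids column does not contain a JSON array of strings");
  }
  return parsed as string[];
}
```

Validating the parsed shape matters here because the IDs are later used to decide which vectors to delete.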

Migration 0008: Remove file_type constraint

Removed CHECK constraint on file_type
Validation now in application code (FileReaderFactory)
No DB migration needed for new file types

File Structure

New Files

database/migrations/
├── 0006_add_documents.sql
├── 0007_add_chunk_ids.sql
└── 0008_expand_file_types.sql

src/main/services/rag/
├── documentProcessor.ts
├── ragService.ts
└── readers/
    ├── IFileReader.ts
    ├── TextFileReader.ts
    ├── JSONFileReader.ts
    ├── PDFFileReader.ts
    ├── DOCXFileReader.ts
    ├── CSVFileReader.ts
    ├── HTMLFileReader.ts
    ├── FileReaderFactory.ts
    └── index.ts

src/main/services/documentService.ts
src/main/ipc/documentHandlers.ts
src/preload/api/documents.ts

src/renderer/pages/KnowledgePage.tsx
src/renderer/stores/knowledgeStore.ts

Modified Files

src/types/database.ts - Document types
src/main/services/databaseService.ts - Migrations
src/renderer/stores/knowledgeStore.ts - State management
package.json - Dependencies

Configuration

Environment Variables

DEBUG_RAG=true    # Enable RAG logging
DEBUG_ENABLED=true
LOG_LEVEL=debug
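
One way these flags could gate a dedicated RAG log category is sketched below. How DEBUG_ENABLED and DEBUG_RAG actually interact is not described in this PR; the sketch assumes both must be the literal string "true", and `isRagDebugEnabled` and `createRagLogger` are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: RAG logging needs both the global switch and the category flag.
function isRagDebugEnabled(env: Env): boolean {
  return env["DEBUG_ENABLED"] === "true" && env["DEBUG_RAG"] === "true";
}

function createRagLogger(env: Env, sink: (line: string) => void) {
  const enabled = isRagDebugEnabled(env);
  return {
    debug(message: string): void {
      // Drop messages entirely when the category is disabled.
      if (enabled) sink(`[RAG] ${message}`);
    },
  };
}
```

In the main process this would be called with `process.env` and `console.log`; passing both in keeps the helper easy to test.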

Storage Paths

Documents: ~/levante/documents/
Vector DB: ~/levante/lancedb/
HuggingFace cache: ~/.cache/huggingface/

Testing

✅ Tested

Upload flow (PDF, MD, JSON, CSV tested)
File type validation
Document deletion with chunk cleanup
Delete all with confirmation
Chunk ID tracking
Database migrations
Stats display
Error handling

🔄 Pending

RAG tool integration in chat (AI queries)
Settings toggle
System prompt integration
E2E RAG queries
DOCX support (needs mammoth library)
Large file performance (90MB+)

Breaking Changes

None. This is a new feature with no impact on existing functionality.

Migration Guide

First run:

App will download HuggingFace model (~100MB) to ~/.cache/huggingface/
Migrations 0006, 0007, 0008 run automatically
Navigate to Knowledge page to upload documents

Adding new file types (developers):

Create reader class in src/main/services/rag/readers/ (e.g., ExcelFileReader.ts)
Export in readers/index.ts
Update frontend validation in knowledgeStore.ts
Update TypeScript types in database.ts (optional)
No database migration needed!
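
As an illustration of the first step, a new reader might look like the sketch below. `TSVFileReader` is a hypothetical example (tab-separated files are not part of this PR), and the `IFileReader` contract shown is assumed rather than copied from the codebase.

```typescript
// Assumed reader contract; the real IFileReader in the PR may differ.
interface IFileReader {
  readonly extensions: string[];
  read(content: string): string;
}

// Hypothetical new reader for tab-separated files.
class TSVFileReader implements IFileReader {
  readonly extensions = ["tsv"];

  read(content: string): string {
    const rows = content.split(/\r?\n/).filter((row) => row.trim().length > 0);
    if (rows.length === 0) return "";
    const headers = rows[0].split("\t");
    // Emit "header: value" pairs so each row stays self-describing after chunking.
    return rows
      .slice(1)
      .map((row) =>
        row
          .split("\t")
          .map((cell, i) => `${headers[i] ?? `column${i + 1}`}: ${cell}`)
          .join(", ")
      )
      .join("\n");
  }
}
```

The remaining steps would then be exporting the class from readers/index.ts and adding "tsv" to the frontend's accepted extensions.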

Known Issues

DOCX: Placeholder implementation, requires mammoth library
First Run: Model download may take 1-2 minutes
Large Files: Processing 90MB+ files may be slow

Future Work

RAG tool in chat interface (#178)
Settings integration (#178)
System prompt with RAG context (#178)
DOCX support with mammoth
Excel, PowerPoint support
Semantic chunking
Document preview in UI
Export/import knowledge base

Screenshots

TODO: Add screenshots of KnowledgePage, upload flow, delete confirmation

References

Issue: feat: Complete RAG System Implementation with Local Embeddings #178
Plan: /.claude/plans/glittery-foraging-wadler.md
LanceDB Documentation
HuggingFace Transformers.js
Vectorstores.org Readers

Ready for review ✅

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add local RAG (Retrieval-Augmented Generation) system with ChromaDB and HuggingFace embeddings.

## Phase 1: Dependencies & Foundation
- Install dependencies: chromadb, @chroma-core/default-embed, @xenova/transformers
- Create database migration v6 for documents table
- Add Document types to database.ts (Document, CreateDocumentInput, UpdateDocumentInput, GetDocumentsQuery)
- Update DirectoryService with documents/ and chromadb/ subdirectory paths

## Phase 2: RAG Service Layer
- Create DocumentProcessor service for file processing and text chunking
  - Support for TXT, MD, JSON file types
  - Chunking with 512 tokens, 128 overlap
  - Placeholders for PDF/DOCX (to be implemented)
- Create RAGService with ChromaDB + HuggingFace embeddings
  - Uses Xenova/all-MiniLM-L6-v2 embedding model
  - Document ingestion, embedding generation, knowledge retrieval
  - ChromaDB collection management
- Update DatabaseService with migration v6
  - Documents table with status tracking (processing/indexed/failed)
  - Indexes for efficient querying
- Create DocumentService for CRUD operations
  - Document creation, retrieval, update, deletion
  - Status management and document counting

## Technical Details
- Embedding model: Xenova/all-MiniLM-L6-v2 (local, offline)
- Vector database: ChromaDB (persists to ~/levante/chromadb/)
- Document storage: ~/levante/documents/
- File size limit: 100MB per file
- Chunk configuration: 512 tokens with 128 overlap (hardcoded)

## Remaining Work
- Phase 3: IPC handlers and preload API
- Phase 4: AI integration (RAG tool for chat)
- Phase 5: UI components (KnowledgePage, stores)
- Phase 6: Settings integration and navigation
- Phase 7: Polish, translations, error handling

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
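
The chunk configuration in the commit message above (512 tokens with 128 overlap) corresponds to a sliding window that advances 384 tokens at a time. The sketch below illustrates that windowing only: it splits on whitespace as a stand-in for real tokenization, which is an assumption, and `chunkTokens` and `chunkText` are hypothetical names rather than the DocumentProcessor's actual methods.

```typescript
// Sliding-window chunking: each chunk repeats the last `overlap` tokens of the previous one.
function chunkTokens(tokens: string[], chunkSize = 512, overlap = 128): string[][] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// Assumption: whitespace-separated words approximate tokens.
function chunkText(text: string, chunkSize = 512, overlap = 128): string[] {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  return chunkTokens(tokens, chunkSize, overlap).map((c) => c.join(" "));
}
```

The early `break` avoids emitting a trailing chunk that would contain only tokens already covered by the previous window.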

…ddings

- Add RAG service with local embeddings (Xenova/all-MiniLM-L6-v2)
- Integrate LanceDB for vector storage
- Create Knowledge Base UI for document management
- Add chunk ID tracking in SQLite for precise deletion
- Implement dedicated RAG logging category (DEBUG_RAG)
- Add comprehensive logging for critical paths:
  - Embedding model download with error diagnostics
  - PDF extraction with detailed error handling
  - Progress tracking for large document processing
- Support PDF, TXT, MD, JSON file types
- Add RAG search toggle in chat interface
- Include delete all documents with confirmation modal

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implement vectorstores.org pattern with individual file readers for better maintainability and extensibility. Remove database CHECK constraint on file_type to allow adding new file types without migrations.

Key changes:
- Create dedicated reader classes (TextFileReader, PDFFileReader, JSONFileReader, DOCXFileReader, CSVFileReader, HTMLFileReader)
- Implement FileReaderFactory with extension-to-reader mapping
- Add support for CSV, HTML, and HTM file types
- Remove file_type CHECK constraint from database (validation now in code)
- Update DocumentProcessor to use reader factory pattern
- Update frontend and IPC layer validation for new file types

Migration 0008 removes CHECK constraint, allowing future file types to be added by creating new reader classes without requiring database migrations.

Benefits:
- Separation of concerns: each file type has dedicated reader
- Easy extensibility: add new file types by creating reader class
- No DB migrations needed for new file types
- Follows vectorstores.org recommended patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This commit implements a robust solution for handling complex native dependencies (LanceDB, Xenova, libSQL) in the RAG system, resolving module resolution issues and preparing the build system for production.

Key Changes:
- Disable ASAR entirely (asar: false) for guaranteed module resolution
- Include peerDependencies in dependency scanning (fixes apache-arrow)
- Remove tslib from build-time filters (provides runtime helpers)
- Add comprehensive native dependencies documentation

Technical Details:
- forge.config.js:
  - ASAR disabled (line 194) - most reliable for complex deps
  - peerDependencies included in getAllDependencies() (line 78)
  - tslib removed from UNNECESSARY_DEPS blacklist (line 9)
  - Enhanced build logging and metrics tracking
- vite.main.config.ts:
  - Updated comments to reflect ASAR disabled configuration
  - Maintained external package declarations
- docs/architecture/native-dependencies.md: (NEW)
  - Complete documentation with Quick Reference section
  - Production Readiness Checklist
  - Troubleshooting guide for apache-arrow and tslib issues
  - Build metrics and platform-specific handling
  - Critical lessons learned
- CLAUDE.md:
  - Updated Native Dependencies section
  - Added critical configuration examples
  - Documented build results (180 packages, 17 bindings)

Build Results:
- Packages copied: 180 (up from 155)
- Build-time filtered: 5 (typescript, @types/*, vite, esbuild, rollup)
- Native bindings: 17 .node files
- App size: ~1.1 GB
- Build time: ~2m30s

Issues Resolved:
- ✅ "Cannot find module 'apache-arrow'" - peerDependency not scanned
- ✅ "Cannot find module 'tslib'" - incorrectly filtered as build-only
- ✅ ASAR unpack patterns unreliable - disabled ASAR completely

Testing:
- Verified on macOS (darwin-arm64)
- RAG functionality tested (document upload, indexing, search)
- No runtime module errors

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

creative-CLAi

🔍 Review: RAG System Implementation

✅ Overall Assessment: APPROVE

This is an excellent PR with a very well thought-out architecture. The RAG system is implemented in a robust and extensible way.

🌟 Highlights

1. FileReader Architecture (Factory Pattern)
The Factory pattern implementation for the readers is very clean. It allows new file types to be added without DB migrations, which is an excellent decision.

2. Native Dependency Handling
The documentation in docs/architecture/native-dependencies.md is impressive. It covers all the edge cases around ASAR and transitive dependencies.

3. Logging System
Adding DEBUG_RAG as a separate category makes debugging easier, without noise from other components.

4. DB Migrations
The 3 migrations (0006-0008) are well structured. Removing the CHECK constraint for extensibility is especially good.

💡 Minor Suggestions (Non-Blockers)

1. SQL Injection in deleteDocument
In ragService.ts, around line 260:

const idsString = chunkIds.map(id => `'${id}'`).join(', ');
await this.table.delete(`id IN (${idsString})`);

The IDs come from the DB (generated internally), but it would be safer to use parameters if LanceDB supports them. This is not critical, since the IDs are internal UUIDs.
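
If parameterized deletes turn out not to be available, one defensive option is to validate every ID before it is interpolated into the predicate. The sketch below assumes chunk IDs are standard hyphenated UUIDs, as the review describes; `buildChunkDeletePredicate` and `UUID_PATTERN` are hypothetical names.

```typescript
// Assumption: chunk IDs are hyphenated UUIDs, so anything else is rejected outright.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function buildChunkDeletePredicate(chunkIds: string[]): string {
  if (chunkIds.length === 0) {
    throw new Error("No chunk IDs to delete");
  }
  for (const id of chunkIds) {
    // A UUID cannot contain quotes or SQL syntax, so validated IDs are safe to quote.
    if (!UUID_PATTERN.test(id)) {
      throw new Error(`Refusing to delete: invalid chunk ID "${id}"`);
    }
  }
  return `id IN (${chunkIds.map((id) => `'${id}'`).join(", ")})`;
}
```

The call site from the snippet above would then become `await this.table.delete(buildChunkDeletePredicate(chunkIds))`.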

2. Consider a timeout for the model download

// In initialize()
this.embedder = await pipeline('feature-extraction', DEFAULT_EMBEDDING_MODEL);

A timeout/retry could be added for slow connections. The comment mentions it, but there is no specific handling.
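
A generic wrapper along these lines could provide that handling. This is a sketch under assumptions: `withTimeout` and `withRetry` are hypothetical helpers, the retry has no backoff, and the timeout only stops waiting (it does not cancel the underlying download).

```typescript
// Rejects if `task` does not settle within `timeoutMs`; the task itself is not cancelled.
async function withTimeout<T>(task: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Runs `task` up to `attempts` times, each with its own timeout, rethrowing the last error.
async function withRetry<T>(task: () => Promise<T>, attempts: number, timeoutMs: number): Promise<T> {
  if (attempts < 1) throw new Error("attempts must be at least 1");
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(task, timeoutMs);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Applied to the snippet above, the call might read `this.embedder = await withRetry(() => pipeline('feature-extraction', DEFAULT_EMBEDDING_MODEL), 3, 120000)`, where the attempt count and timeout are placeholder values.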

3. DOCXFileReader placeholder
It is well documented as a placeholder. Just make sure it is not listed as "supported" in the UI until mammoth is implemented.

📊 Metrics

+5783 lines - Complete, well-documented feature
55 files - Good modularization
3 migrations - Well-versioned DB changes
8 file types - Solid initial coverage

🧪 Recommended Testing

Upload of large files (90MB+) to verify performance
Offline mode after the first model download
Delete all with confirmation
Concurrent uploads

Approved ✅ - Excellent architecture work. The suggestions are optional improvements for future iterations.

Co-Reviewed-By: CLAi 🤖

olivermontes and others added 4 commits January 17, 2026 13:42

creative-CLAi approved these changes Jan 27, 2026

olivermontes wants to merge 4 commits into develop from feat/rag-system
