Skip to content

feat: Complete RAG System Implementation with Local Embeddings#179

Open
olivermontes wants to merge 4 commits intodevelopfrom
feat/rag-system
Open

feat: Complete RAG System Implementation with Local Embeddings#179
olivermontes wants to merge 4 commits intodevelopfrom
feat/rag-system

Conversation

@olivermontes
Copy link
Contributor

RAG System Implementation

Closes #178

Overview

Complete implementation of local Retrieval-Augmented Generation (RAG) system for Levante with offline support and extensible file processing architecture.

Key Features

Fully Local RAG

  • HuggingFace Transformers embeddings (Xenova/all-MiniLM-L6-v2)
  • LanceDB vector database with offline support
  • All data stays local, zero external API calls
  • Privacy-first approach

Extensible File Support

  • PDF, DOCX, TXT, MD, JSON, CSV, HTML, HTM (8 types)
  • FileReader pattern following vectorstores.org architecture
  • Add new file types without database migrations
  • Individual reader classes for separation of concerns

Robust Document Management

  • Upload with 100MB limit and validation
  • Chunk tracking with precise deletion
  • Delete all with confirmation modal
  • Real-time stats (total, indexed, processing, failed)

Developer Experience

  • Dedicated RAG logging category (DEBUG_RAG)
  • Comprehensive error messages with possible causes
  • Database migrations (0006, 0007, 0008)
  • Migration runner in UI

Architecture

RAG Flow

Upload → DocumentProcessor → FileReader → Chunker → HuggingFace → LanceDB
                                                                       ↓
Chat → RAG Tool → Vector Search → Context Retrieval → LLM Response

File Reader Pattern (vectorstores.org)

DocumentProcessor → FileReaderFactory → Individual Readers
                          ↓
       Extension Mapping (pdf→PDFReader, csv→CSVReader, etc.)

Commits

  1. bcbb8d4 - RAG system foundation

    • Dependencies and database setup
    • Core services (RAGService, DocumentProcessor, DocumentService)
    • IPC layer and preload API
  2. eca3427 - Complete RAG implementation

    • UI components (KnowledgePage, KnowledgeStore)
    • Chunk ID tracking for precise deletion
    • RAG logging category
    • Delete all functionality
  3. 5894574 - Extensible file reader architecture

    • FileReader pattern implementation
    • CSV and HTML support
    • Removed file_type CHECK constraint (Migration 0008)
    • Code-based validation only

Database Changes

Migration 0006: documents table

CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  filename TEXT NOT NULL,
  filepath TEXT NOT NULL,
  file_type TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL DEFAULT 'processing',
  chunk_count INTEGER DEFAULT 0,
  error_message TEXT DEFAULT NULL,
  uploaded_at INTEGER NOT NULL,
  indexed_at INTEGER DEFAULT NULL
);

Migration 0007: chunk_ids column

  • Added chunk_ids TEXT DEFAULT NULL for LanceDB chunk tracking
  • Enables precise deletion using chunk IDs

Migration 0008: Remove file_type constraint

  • Removed CHECK constraint on file_type
  • Validation now in application code (FileReaderFactory)
  • No DB migration needed for new file types

File Structure

New Files

database/migrations/
├── 0006_add_documents.sql
├── 0007_add_chunk_ids.sql
└── 0008_expand_file_types.sql

src/main/services/rag/
├── documentProcessor.ts
├── ragService.ts
└── readers/
    ├── IFileReader.ts
    ├── TextFileReader.ts
    ├── JSONFileReader.ts
    ├── PDFFileReader.ts
    ├── DOCXFileReader.ts
    ├── CSVFileReader.ts
    ├── HTMLFileReader.ts
    ├── FileReaderFactory.ts
    └── index.ts

src/main/services/documentService.ts
src/main/ipc/documentHandlers.ts
src/preload/api/documents.ts

src/renderer/pages/KnowledgePage.tsx
src/renderer/stores/knowledgeStore.ts

Modified Files

  • src/types/database.ts - Document types
  • src/main/services/databaseService.ts - Migrations
  • src/renderer/stores/knowledgeStore.ts - State management
  • package.json - Dependencies

Configuration

Environment Variables

DEBUG_RAG=true    # Enable RAG logging
DEBUG_ENABLED=true
LOG_LEVEL=debug

Storage Paths

  • Documents: ~/levante/documents/
  • Vector DB: ~/levante/lancedb/
  • HuggingFace cache: ~/.cache/huggingface/

Testing

✅ Tested

  • Upload flow (PDF, MD, JSON, CSV tested)
  • File type validation
  • Document deletion with chunk cleanup
  • Delete all with confirmation
  • Chunk ID tracking
  • Database migrations
  • Stats display
  • Error handling

🔄 Pending

  • RAG tool integration in chat (AI queries)
  • Settings toggle
  • System prompt integration
  • E2E RAG queries
  • DOCX support (needs mammoth library)
  • Large file performance (90MB+)

Breaking Changes

None. This is a new feature with no impact on existing functionality.

Migration Guide

First run:

  1. App will download HuggingFace model (~100MB) to ~/.cache/huggingface/
  2. Migrations 0006, 0007, 0008 run automatically
  3. Navigate to Knowledge page to upload documents

Adding new file types (developers):

  1. Create reader class in src/main/services/rag/readers/ (e.g., ExcelFileReader.ts)
  2. Export in readers/index.ts
  3. Update frontend validation in knowledgeStore.ts
  4. Update TypeScript types in database.ts (optional)
  5. No database migration needed!

Known Issues

  1. DOCX: Placeholder implementation, requires mammoth library
  2. First Run: Model download may take 1-2 minutes
  3. Large Files: Processing 90MB+ files may be slow

Future Work

Screenshots

TODO: Add screenshots of KnowledgePage, upload flow, delete confirmation

References


Ready for review

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

olivermontes and others added 4 commits January 17, 2026 13:42
Add local RAG (Retrieval-Augmented Generation) system with ChromaDB and HuggingFace embeddings.

## Phase 1: Dependencies & Foundation
- Install dependencies: chromadb, @chroma-core/default-embed, @xenova/transformers
- Create database migration v6 for documents table
- Add Document types to database.ts (Document, CreateDocumentInput, UpdateDocumentInput, GetDocumentsQuery)
- Update DirectoryService with documents/ and chromadb/ subdirectory paths

## Phase 2: RAG Service Layer
- Create DocumentProcessor service for file processing and text chunking
  - Support for TXT, MD, JSON file types
  - Chunking with 512 tokens, 128 overlap
  - Placeholders for PDF/DOCX (to be implemented)
- Create RAGService with ChromaDB + HuggingFace embeddings
  - Uses Xenova/all-MiniLM-L6-v2 embedding model
  - Document ingestion, embedding generation, knowledge retrieval
  - ChromaDB collection management
- Update DatabaseService with migration v6
  - Documents table with status tracking (processing/indexed/failed)
  - Indexes for efficient querying
- Create DocumentService for CRUD operations
  - Document creation, retrieval, update, deletion
  - Status management and document counting

## Technical Details
- Embedding model: Xenova/all-MiniLM-L6-v2 (local, offline)
- Vector database: ChromaDB (persists to ~/levante/chromadb/)
- Document storage: ~/levante/documents/
- File size limit: 100MB per file
- Chunk configuration: 512 tokens with 128 overlap (hardcoded)

## Remaining Work
- Phase 3: IPC handlers and preload API
- Phase 4: AI integration (RAG tool for chat)
- Phase 5: UI components (KnowledgePage, stores)
- Phase 6: Settings integration and navigation
- Phase 7: Polish, translations, error handling

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ddings

- Add RAG service with local embeddings (Xenova/all-MiniLM-L6-v2)
- Integrate LanceDB for vector storage
- Create Knowledge Base UI for document management
- Add chunk ID tracking in SQLite for precise deletion
- Implement dedicated RAG logging category (DEBUG_RAG)
- Add comprehensive logging for critical paths:
  - Embedding model download with error diagnostics
  - PDF extraction with detailed error handling
  - Progress tracking for large document processing
- Support PDF, TXT, MD, JSON file types
- Add RAG search toggle in chat interface
- Include delete all documents with confirmation modal

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement vectorstores.org pattern with individual file readers for better
maintainability and extensibility. Remove database CHECK constraint on file_type
to allow adding new file types without migrations.

Key changes:
- Create dedicated reader classes (TextFileReader, PDFFileReader, JSONFileReader,
  DOCXFileReader, CSVFileReader, HTMLFileReader)
- Implement FileReaderFactory with extension-to-reader mapping
- Add support for CSV, HTML, and HTM file types
- Remove file_type CHECK constraint from database (validation now in code)
- Update DocumentProcessor to use reader factory pattern
- Update frontend and IPC layer validation for new file types

Migration 0008 removes CHECK constraint, allowing future file types to be added
by creating new reader classes without requiring database migrations.

Benefits:
- Separation of concerns: each file type has dedicated reader
- Easy extensibility: add new file types by creating reader class
- No DB migrations needed for new file types
- Follows vectorstores.org recommended patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements a robust solution for handling complex native
dependencies (LanceDB, Xenova, libSQL) in the RAG system, resolving
module resolution issues and preparing the build system for production.

Key Changes:
- Disable ASAR entirely (asar: false) for guaranteed module resolution
- Include peerDependencies in dependency scanning (fixes apache-arrow)
- Remove tslib from build-time filters (provides runtime helpers)
- Add comprehensive native dependencies documentation

Technical Details:
- forge.config.js:
  - ASAR disabled (line 194) - most reliable for complex deps
  - peerDependencies included in getAllDependencies() (line 78)
  - tslib removed from UNNECESSARY_DEPS blacklist (line 9)
  - Enhanced build logging and metrics tracking

- vite.main.config.ts:
  - Updated comments to reflect ASAR disabled configuration
  - Maintained external package declarations

- docs/architecture/native-dependencies.md: (NEW)
  - Complete documentation with Quick Reference section
  - Production Readiness Checklist
  - Troubleshooting guide for apache-arrow and tslib issues
  - Build metrics and platform-specific handling
  - Critical lessons learned

- CLAUDE.md:
  - Updated Native Dependencies section
  - Added critical configuration examples
  - Documented build results (180 packages, 17 bindings)

Build Results:
- Packages copied: 180 (up from 155)
- Build-time filtered: 5 (typescript, @types/*, vite, esbuild, rollup)
- Native bindings: 17 .node files
- App size: ~1.1 GB
- Build time: ~2m30s

Issues Resolved:
- ✅ "Cannot find module 'apache-arrow'" - peerDependency not scanned
- ✅ "Cannot find module 'tslib'" - incorrectly filtered as build-only
- ✅ ASAR unpack patterns unreliable - disabled ASAR completely

Testing:
- Verified on macOS (darwin-arm64)
- RAG functionality tested (document upload, indexing, search)
- No runtime module errors

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link

@creative-CLAi creative-CLAi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 Review: RAG System Implementation

✅ Overall Assessment: APPROVE

Este es un PR excelente con una arquitectura muy bien pensada. El sistema RAG está implementado de forma robusta y extensible.


🌟 Puntos Destacados

1. Arquitectura de FileReaders (Pattern Factory)
La implementación del patrón Factory para los readers es muy limpia. Permite añadir nuevos tipos de archivo sin migraciones de DB - excelente decisión.

2. Manejo de Dependencias Nativas
La documentación en docs/architecture/native-dependencies.md es impresionante. Cubre todos los edge cases de ASAR y dependencias transitivas.

3. Sistema de Logging
Añadir DEBUG_RAG como categoría separada facilita el debugging sin ruido de otros componentes.

4. Migraciones de DB
Las 3 migraciones (0006-0008) están bien estructuradas. Especialmente bueno quitar el CHECK constraint para extensibilidad.


💡 Sugerencias Menores (No-Blockers)

1. SQL Injection en deleteDocument
En ragService.ts línea ~260:

const idsString = chunkIds.map(id => \`'${id}'\`).join(', ');
await this.table.delete(\`id IN (${idsString})\`);

Los IDs vienen de la DB (generados internamente), pero sería más seguro usar parámetros si LanceDB lo soporta. No es crítico ya que los IDs son UUIDs internos.

2. Considerar timeout para descarga del modelo

// En initialize()
this.embedder = await pipeline('feature-extraction', DEFAULT_EMBEDDING_MODEL);

Podría añadir un timeout/retry para conexiones lentas. El comentario lo menciona pero no hay handling específico.

3. DOCXFileReader placeholder
Está bien documentado que es placeholder - solo asegurar que no se liste como "soportado" en el UI hasta implementar mammoth.


📊 Métricas

  • +5783 líneas - Feature completa y bien documentada
  • 55 archivos - Buena modularización
  • 3 migraciones - Cambios de DB bien versionados
  • 8 tipos de archivo - Cobertura sólida inicial

🧪 Testing Recomendado

  1. Upload de archivos grandes (90MB+) para verificar performance
  2. Offline mode después de primera descarga del modelo
  3. Delete all con confirmación
  4. Concurrent uploads

Aprobado ✅ - Excelente trabajo de arquitectura. Las sugerencias son mejoras opcionales para futuras iteraciones.

Co-Reviewed-By: CLAi 🤖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants