@ahmedjawedaj
Contributor

Summary

This PR fixes folder sync functionality with two key improvements:

1. Fixed New File Detection (Source Path Tracking)

Problem: Files always appeared as "new" after sync because:

  • detect_folder_changes checks source paths (in linked folder)
  • synced_files stored destination paths (in KB raw directory)
  • These never matched, so files were perpetually marked as new

Solution: Sync now stores source paths in synced_files, which is what detect_folder_changes compares against.
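To illustrate the mismatch and the fix, here is a minimal sketch. It assumes synced_files behaves like a mapping keyed by path; the helper names `detect_new_files` and `record_synced`, the example paths, and the exact shape of the sync state are hypothetical and not taken from the PR's code.

```python
from pathlib import PurePosixPath


def detect_new_files(folder_files: list[str], synced_files: dict[str, str]) -> list[str]:
    """Return paths from the linked folder that are not recorded as synced."""
    return [path for path in folder_files if path not in synced_files]


def record_synced(synced_files: dict[str, str], source_path: str, raw_dir: str) -> None:
    """Key the sync state by SOURCE path; keep the destination only as the value."""
    destination = str(PurePosixPath(raw_dir) / PurePosixPath(source_path).name)
    synced_files[source_path] = destination


# Hypothetical paths, for illustration only.
synced: dict[str, str] = {}
record_synced(synced, "/home/user/notes/paper.pdf", "/data/kb/raw")

new_files = detect_new_files(
    ["/home/user/notes/paper.pdf", "/home/user/notes/draft.md"], synced
)
print(new_files)  # only draft.md is reported as new
```

Had the state been keyed by the destination path (`/data/kb/raw/paper.pdf`), the lookup for `/home/user/notes/paper.pdf` would never succeed and both files would be reported as new on every sync.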

2. LightRAG Fallback for RAGAnything

Problem: When the RAGAnything module was unavailable, sync failed with an ImportError.

Solution: Sync now falls back to the LightRAG pipeline instead of failing:

  • Extracts text from PDF (PyMuPDF), DOCX (python-docx), TXT, MD files
  • Indexes content via LightRAG for knowledge graph building
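The fallback decision can be sketched as an import probe. This is a rough illustration, not the code in add_documents.py: the function name `select_pipeline` is hypothetical, and the module names `raganything` and `lightrag` are assumed import names for the two packages.

```python
import importlib


def select_pipeline(preferred: str = "raganything", fallback: str = "lightrag") -> str:
    """Return the name of the first pipeline module that can be imported."""
    for module_name in (preferred, fallback):
        try:
            importlib.import_module(module_name)
            return module_name
        except ImportError:
            # Module is not installed; try the next candidate.
            continue
    raise RuntimeError(f"Neither {preferred!r} nor {fallback!r} is installed")
```

Catching ImportError at the selection point keeps the rest of the sync code free of per-call import checks.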

Files Changed

  • src/api/routers/knowledge.py - Added folder_id param and source path tracking
  • src/knowledge/add_documents.py - Added LightRAG fallback methods

Testing

  • New file detection works correctly after sync
  • Progress bar shows proper status
  • Files no longer appear as perpetually "new"
  • LightRAG fallback processes documents when RAGAnything is unavailable

- Add get_kb_content() to list documents/images
- Add /content API endpoint
- Fix folder_id parameter in upload task
- All tested and working on macOS

## Changes since last push:

### 1. Fixed folder sync state tracking (knowledge.py)
- Added 'folder_id' parameter to run_upload_processing_task
- Implemented source path tracking: Now stores original source paths
  from linked folders instead of destination paths in synced_files
- This fixes the bug where files always appeared as 'new' after sync
  because detect_folder_changes checks source paths but synced_files
  contained destination paths

### 2. Added LightRAG fallback (add_documents.py)
- When the RAGAnything module is unavailable, sync falls back to the
  LightRAG pipeline instead of raising ImportError
- New _process_with_lightrag_fallback method handles text extraction
  and indexing for PDF, DOCX, TXT, and MD files
- New _extract_text_content helper for extracting text from various
  document formats using PyMuPDF and python-docx
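For reference, a text extraction helper along these lines could look like the sketch below. The name `extract_text` and the error handling are assumptions; the actual _extract_text_content may differ. PyMuPDF and python-docx are imported lazily so that plain-text files work without them.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Extract plain text from a TXT, MD, PDF, or DOCX file."""
    suffix = Path(path).suffix.lower()

    if suffix in {".txt", ".md"}:
        return Path(path).read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        import fitz  # PyMuPDF

        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)

    if suffix == ".docx":
        import docx  # python-docx

        return "\n".join(p.text for p in docx.Document(path).paragraphs)

    raise ValueError(f"Unsupported file type: {suffix!r}")
```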

### 3. Improved error handling
- Added null checks for processed_files in progress messages
- Better logging for sync state updates
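A null-safe progress message might be built as in the sketch below. The function name `progress_message` and the message wording are hypothetical; the only detail taken from the change list is that processed_files may be None.

```python
from typing import Optional, Sequence


def progress_message(processed_files: Optional[Sequence[str]], total: int) -> str:
    """Format a progress string, treating a missing file list as zero processed."""
    done = len(processed_files) if processed_files else 0
    return f"Processed {done}/{total} files"
```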