Redundant Parsing When Re-loading the Same Document #1

@gidea

Description

Documents are fully re-parsed every time they are opened, even when the same file has already been loaded in the current session. This results in unnecessary work for PDF/DOCX/PPTX parsing and slows down the overall processing pipeline.

Details

  • The app does not cache parsed document state (tokens, metadata, chunking config, etc.).
  • Loading a file triggers a full parse regardless of file identity or modification time.
  • Re-loading large files significantly increases processing time and blocks the UI during parsing.

Expected Behavior

If a document with the same absolute path (or hash) has already been parsed, the existing parsed model should be reused unless the underlying file has changed.
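As a rough illustration of what "same absolute path (or hash)" could mean in practice, here is a minimal Python sketch. The function names `document_key` and `file_checksum` are hypothetical and not part of the app; SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def document_key(path):
    """Resolve a path so different spellings of the same file share one key."""
    # Hypothetical helper: Path.resolve() makes the path absolute and
    # collapses "." / ".." segments and symlinks.
    return str(Path(path).resolve())


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size blocks so large PDF/DOCX/PPTX files are not
        # pulled into memory all at once.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

A path key is cheap but misses copies of the same file stored elsewhere; a checksum catches those but requires reading the whole file on every load.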

Proposed Direction

  • Introduce a document cache keyed by path or checksum.
  • Store parsed content + metadata + chunking settings in memory for the session.
  • Add basic change detection (mtime or hash comparison) so that a cached entry is invalidated only when the underlying file has changed.
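
The direction above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `DocumentCache`, `CacheEntry`, the `parse` callable, and the `chunking` tuple are hypothetical names, not existing parts of the app, and change detection here uses mtime plus file size rather than a content hash.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


@dataclass
class CacheEntry:
    fingerprint: Tuple[int, int]  # (mtime_ns, size) observed at parse time
    chunking: Tuple               # hypothetical chunking settings used for the parse
    parsed: Any                   # parsed content + metadata, whatever the parser returns


class DocumentCache:
    """In-memory, per-session cache keyed by resolved absolute path."""

    def __init__(self, parse: Callable[[str], Any]):
        self._parse = parse  # assumed: the app's existing full-parse function
        self._entries: Dict[str, CacheEntry] = {}

    @staticmethod
    def _fingerprint(path: str) -> Tuple[int, int]:
        st = os.stat(path)
        return (st.st_mtime_ns, st.st_size)

    def load(self, path, chunking: Tuple = ()):
        key = str(Path(path).resolve())
        fingerprint = self._fingerprint(key)
        entry = self._entries.get(key)
        # Reuse the parsed model only if the file and the settings are unchanged.
        if (
            entry is not None
            and entry.fingerprint == fingerprint
            and entry.chunking == chunking
        ):
            return entry.parsed
        # Cache miss or stale entry: parse again and overwrite the entry.
        parsed = self._parse(key)
        self._entries[key] = CacheEntry(fingerprint, chunking, parsed)
        return parsed
```

Comparing mtime and size is fast but can miss an edit that preserves both; swapping the fingerprint for a content checksum would trade speed for accuracy.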
