Labels
epic (large feature or initiative spanning multiple tasks), priority:high (high priority task)
Description
Overview
Add format-aware parsing for document and office file formats to extract metadata, text, and embedded resources.
Parent Project
Part of #76 - Post-1.0 Non-Executable Binary Format Awareness & Entropy Filtering
Formats to Support
- PDF (structure parsing, metadata, JavaScript)
- Microsoft Office (DOC, XLS, PPT via OLE2)
- OpenDocument (ODT, ODS, ODP)
- RTF (Rich Text Format)
- PostScript/EPS
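As a starting point for format-aware dispatch, the formats above can be told apart by their leading signature bytes. The following is a minimal sketch, not the project's implementation: the `DocFormat` enum and `sniff_format` function are hypothetical names introduced here for illustration. It assumes the signature sits at offset 0, which does not cover PDFs with leading junk before the header or EPS files with a binary preview header.

```rust
/// Container formats named in this epic (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DocFormat {
    Pdf,
    /// OLE2 compound file (legacy DOC, XLS, PPT).
    Ole2,
    /// ZIP container (OpenDocument files are ZIP archives).
    Zip,
    Rtf,
    PostScript,
    Unknown,
}

/// Guess the document format from the first bytes of `data`.
/// Only checks signatures at offset 0 (an assumption of this sketch).
pub fn sniff_format(data: &[u8]) -> DocFormat {
    // Well-known OLE2 compound file signature.
    const OLE2_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

    if data.starts_with(b"%PDF-") {
        DocFormat::Pdf
    } else if data.starts_with(&OLE2_MAGIC) {
        DocFormat::Ole2
    } else if data.starts_with(b"PK\x03\x04") {
        // ZIP local file header; telling ODT/ODS/ODP apart needs the
        // archive's `mimetype` entry, which this sketch does not read.
        DocFormat::Zip
    } else if data.starts_with(b"{\\rtf") {
        DocFormat::Rtf
    } else if data.starts_with(b"%!PS") {
        DocFormat::PostScript
    } else {
        DocFormat::Unknown
    }
}
```

A caller would pass the first few bytes of the input and route to the matching parser, falling back to raw extraction on `DocFormat::Unknown`.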
Goals
- Parse document structure trees
- Extract metadata (author, title, keywords)
- Identify text streams vs. binary content
- Handle embedded objects
- Parse font tables and resource dictionaries
- Extract embedded scripts (JavaScript in PDFs)
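One way to approach the "text streams vs. binary content" goal is a cheap per-stream heuristic. The sketch below is an assumption about how this could be done, not the planned design: `shannon_entropy` and `looks_like_text` are hypothetical helpers, and the 90% printable-byte threshold is an arbitrary illustrative value that would need tuning against real documents.

```rust
/// Shannon entropy of `data` in bits per byte (0.0 for empty input).
/// Compressed or encrypted streams tend toward 8.0; plain text is lower.
pub fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count occurrences of each byte value.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Heuristic: treat a stream as text when at least 90% of its bytes are
/// printable ASCII or common whitespace. The threshold is an assumption.
pub fn looks_like_text(data: &[u8]) -> bool {
    if data.is_empty() {
        return false;
    }
    let printable = data
        .iter()
        .filter(|&&b| b == b'\n' || b == b'\r' || b == b'\t' || (0x20..0x7F).contains(&b))
        .count();
    printable as f64 / data.len() as f64 >= 0.9
}
```

The printable-byte check is ASCII-only, so UTF-16 or other multi-byte text streams would be misclassified as binary; a real implementation would likely decode known stream encodings first.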
Success Criteria
- Extract at least 95% of available document metadata fields
- Identify and skip binary image data
- Extract embedded JavaScript from PDFs
- Handle password-protected documents (metadata only)
- Performance: <40% overhead vs. raw extraction
- Support large documents (>100 MB)
Feature Issues
This epic will be broken down into the following features:
- PDF structure parser and metadata extraction
- Microsoft Office (OLE2) format support
- OpenDocument format support
- RTF format parser
- PostScript/EPS format support
- Embedded JavaScript extraction (PDF)
Dependencies
- Rust crates: lopdf (for PDF), zip (for ZIP-based containers: OpenDocument and OOXML), cfb (for OLE2)
- Phase 1 (Entropy Analysis) completion
Timeline
Target: Q2