Epic: Phase 3 - Document & Office Format Awareness #91

@coderabbitai

Overview

Add format-aware parsing for document and office file formats to extract metadata, text, and embedded resources.

Parent Project

Part of #76 - Post-1.0 Non-Executable Binary Format Awareness & Entropy Filtering

Formats to Support

  • PDF (structure parsing, metadata, JavaScript)
  • Microsoft Office (DOC, XLS, PPT via OLE2)
  • OpenDocument (ODT, ODS, ODP)
  • RTF (Rich Text Format)
  • PostScript/EPS
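
As a starting point for routing files to the right parser, the formats above can usually be told apart by their leading bytes. The following is a minimal sketch, not a proposed API: the `DocFormat` enum and `detect_format` function are hypothetical names, and it assumes the signature sits at offset 0 (real-world PDFs sometimes carry leading junk, and binary EPS files with a DOS header are not covered). OpenDocument files are ZIP packages, so they are reported here as a generic ZIP container that a later step would need to inspect further.

```rust
/// Document families named in this epic. The enum and its variant names are
/// hypothetical, chosen for this sketch only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DocFormat {
    Pdf,
    /// Legacy DOC/XLS/PPT compound files.
    Ole2,
    /// ZIP-based packages such as OpenDocument.
    ZipContainer,
    Rtf,
    PostScript,
}

/// Guess the format from the leading bytes of a file.
/// Assumes the signature starts at offset 0.
pub fn detect_format(data: &[u8]) -> Option<DocFormat> {
    // Compound File Binary (OLE2) header signature.
    const OLE2_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

    if data.starts_with(b"%PDF-") {
        Some(DocFormat::Pdf)
    } else if data.starts_with(&OLE2_MAGIC) {
        Some(DocFormat::Ole2)
    } else if data.starts_with(b"PK\x03\x04") {
        // ZIP local file header.
        Some(DocFormat::ZipContainer)
    } else if data.starts_with(b"{\\rtf") {
        Some(DocFormat::Rtf)
    } else if data.starts_with(b"%!PS") {
        Some(DocFormat::PostScript)
    } else {
        None
    }
}
```

A caller would pass the first few bytes of the file and fall back to raw extraction when the result is `None`.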

Goals

  • Parse document structure trees
  • Extract metadata (author, title, keywords)
  • Identify text streams vs. binary content
  • Handle embedded objects
  • Parse font tables and resource dictionaries
  • Extract embedded scripts (JavaScript in PDFs)
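
One simple way to approach the "text streams vs. binary content" goal is a printable-byte ratio. This is a swapped-in heuristic for illustration, not the entropy analysis from Phase 1, and the function names and the threshold parameter are assumptions. It only recognizes ASCII, so UTF-8 or UTF-16 text would need a more careful check.

```rust
/// Fraction of bytes that are printable ASCII or common whitespace.
/// Returns 0.0 for empty input.
pub fn printable_ratio(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let printable = data
        .iter()
        .filter(|&&b| matches!(b, 0x20..=0x7E | b'\t' | b'\n' | b'\r'))
        .count();
    printable as f64 / data.len() as f64
}

/// Treat a stream as text when its printable ratio reaches `threshold`.
/// The threshold is a tuning knob chosen by the caller, not a value from this epic.
pub fn looks_like_text(data: &[u8], threshold: f64) -> bool {
    printable_ratio(data) >= threshold
}
```

Streams that fail the check could be skipped or handed to the entropy filter instead of the string extractor.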

Success Criteria

  • Extract 95%+ of document metadata
  • Identify and skip binary image data
  • Extract embedded JavaScript from PDFs
  • Handle password-protected documents (metadata only)
  • Performance: <40% overhead vs. raw extraction
  • Support for large documents (>100MB)
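
The performance criterion can be read as relative overhead: (format-aware time - raw time) / raw time, expressed as a percentage. The sketch below shows that arithmetic under this reading; the function names are hypothetical and the 40% limit is taken from the criterion above.

```rust
use std::time::Duration;

/// Relative overhead of format-aware extraction over the raw baseline, in percent.
/// Returns `None` when the baseline is zero, since the ratio is undefined.
pub fn overhead_percent(raw: Duration, format_aware: Duration) -> Option<f64> {
    let base = raw.as_secs_f64();
    if base == 0.0 {
        return None;
    }
    Some((format_aware.as_secs_f64() - base) / base * 100.0)
}

/// True when the measured overhead is below the 40% budget.
pub fn within_budget(raw: Duration, format_aware: Duration) -> bool {
    matches!(overhead_percent(raw, format_aware), Some(p) if p < 40.0)
}
```

For example, a raw pass of 100 ms and a format-aware pass of 130 ms gives roughly 30% overhead, which is inside the budget; 150 ms gives roughly 50%, which is not.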

Feature Issues

This epic will be broken down into the following features:

  • PDF structure parser and metadata extraction
  • Microsoft Office (OLE2) format support
  • OpenDocument format support
  • RTF format parser
  • PostScript/EPS format support
  • Embedded JavaScript extraction (PDF)
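
Before full extraction, a cheap pre-filter can flag PDFs that might contain JavaScript by scanning for the `/JavaScript` and `/JS` name tokens used by JavaScript actions. This is only a rough sketch with hypothetical function names: it scans raw bytes, so it will miss names written with `#xx` hex escapes and anything stored inside compressed object streams, and `/JS` can also match the prefix of an unrelated name. Actual extraction would need a structural parser.

```rust
/// Returns true if `haystack` contains `needle` as a contiguous byte run.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // Guard against an empty needle, since `windows(0)` panics.
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Rough pre-filter: does the raw PDF mention a JavaScript action?
/// False positives and false negatives are both possible.
pub fn has_javascript_marker(pdf: &[u8]) -> bool {
    contains_bytes(pdf, b"/JavaScript") || contains_bytes(pdf, b"/JS")
}
```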

Dependencies

  • Rust crates: lopdf (for PDF), zip (for ZIP-based containers such as OpenDocument and OOXML), cfb (for OLE2)
  • Phase 1 (Entropy Analysis) completion

Timeline

Target: Q2

Labels

epic, priority:high
