Woodchipper

All PDF files must be fed to the woodchipper.

description = 'A Python library, CLI tool, and HTTP API for analyzing PDF files. It extracts metadata, computes file hashes, finds embedded URLs, detects forms (AcroForm and XFA), JavaScript, embedded files, and rich media, and flags suspicious actions and anomalies. All report output is automatically defanged for safe handling.'
author = 'Ryan C. Moon'
version = '1.1.2'
date = '2026-02-17'

To report a bug, please open an issue or submit a PR.

Installation

pip install woodchipper

With HTTP server support:

pip install "woodchipper[server]"

Or from source:

pip install -r requirements.txt
pip install -e .

Note: Requires libmagic for file type detection:

Debian/Ubuntu: apt install libmagic1
macOS: brew install libmagic
Fedora: dnf install file-libs

CLI Usage

woodchipper <path-to-pdf>

Output is JSON with all string fields defanged:

{
  "filename": "suspicious.pdf",
  "filesize": 142857,
  "md5": "d41d8cd98f00b204e9800998ecf8427e",
  "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
  "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "urls": [
    "hXXps://example[.]com",
    "hXXps://test[.]org/page"
  ],
  "metadata": {
    "author": "John Doe",
    "creator": "Microsoft Word",
    "producer": "Microsoft: Print To PDF",
    "subject": null,
    "title": "Important Document",
    "creation_date": "2024-01-15 12:00:00+00:00",
    "modification_date": "2024-01-16 09:00:00+00:00",
    "spoofing_indicators": []
  },
  "anomalies": {
    "anomalies_present": true,
    "anomalies": [
      "PDF header not at byte 0 (found at byte 18)",
      "Rich media detected: /3D (3D content). Rich media includes 3D or Flash streams which are no longer supported by modern OS or in common usage. This is suspicious.",
      "Stream length mismatches detected (1 stream(s)). Mismatched stream lengths may indicate PDF tampering, corruption, or malicious manipulation."
    ],
    "additional_actions_detected": [
      {
        "description": "Document Open: JavaScript execution (code: app.alert('Hello');...)",
        "offset": 361,
        "raw_bytes": "35 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 4a 53 ...",
        "decoded_bytes": null
      }
    ],
    "external_actions": [
      {
        "description": "/Launch at Document OpenAction: Launches external application (file: cmd.exe)",
        "offset": 274,
        "raw_bytes": "34 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 46 ...",
        "decoded_bytes": null
      },
      {
        "description": "/URI at Page 1 Annotation 1: Opens URL (hXXps://malicious[.]com)",
        "offset": 37039,
        "raw_bytes": "31 30 20 30 20 6f 62 6a 0d 0a 3c 3c ...",
        "decoded_bytes": null
      }
    ],
    "javascript_detected": [
      {
        "description": "Document OpenAction: displays alert (code: app.alert('Hello');)",
        "offset": 361,
        "raw_bytes": "35 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 4a 53 ...",
        "decoded_bytes": null
      }
    ],
    "embedded_files": [
      {
        "name": "malware.exe",
        "file_type": "PE32 executable (GUI) Intel 80386",
        "mime_type": "application/x-dosexec",
        "size": 45056,
        "description": "Click to open",
        "offset": 892,
        "raw_bytes": "37 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 54 79 70 65 ...",
        "decoded_bytes": "4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 ..."
      }
    ],
    "acroform_details": [
      {
        "description": "AcroForm detected in document catalog",
        "offset": 512,
        "raw_bytes": "34 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 46 69 65 6c ...",
        "decoded_bytes": null
      },
      {
        "description": "Form contains 3 top-level field(s)",
        "offset": 512,
        "raw_bytes": "34 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 46 69 65 6c ...",
        "decoded_bytes": null
      },
      {
        "description": "Field 'username': type=Text, flags=[Required]",
        "offset": 1024,
        "raw_bytes": "38 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 46 54 2f 54 ...",
        "decoded_bytes": null
      }
    ],
    "xfa_details": [
      {
        "description": "XFA (XML Forms Architecture) detected",
        "offset": 512,
        "raw_bytes": "34 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 58 46 41 ...",
        "decoded_bytes": null
      },
      {
        "description": "XFA JavaScript script in template - behaviors: contains URL: var url = 'http://...'...",
        "offset": 1280,
        "raw_bytes": "39 20 30 20 6f 62 6a 0d 0a 3c 3c 2f 4c 65 6e ...",
        "decoded_bytes": "3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 ..."
      }
    ]
  },
  "forms": {
    "forms_present": true,
    "form_submission_targets": [
      "hXXps://collect[.]malicious[.]com/submit"
    ]
  }
}
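The md5, sha1, and sha256 fields can be cross-checked independently with Python's hashlib. The following is a minimal sketch, not Woodchipper's own code; `file_digests` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def file_digests(path: str, chunk_size: int = 65536) -> dict[str, str]:
    """Return hex MD5, SHA-1 and SHA-256 digests of a file, read in chunks."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with Path(path).open("rb") as handle:
        # Reading in chunks keeps memory use flat for large PDFs.
        while chunk := handle.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Comparing `file_digests("suspicious.pdf")["sha256"]` with the report's sha256 field confirms that the report describes the file you expect.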

Exit codes:

0 - Success
1 - Validation error (file not found, not readable, or not a PDF)
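When calling the CLI from a script, the exit code can decide whether to parse the output. This sketch assumes the JSON report is written to stdout and any error text to stderr; `run_woodchipper` and its `command` parameter are hypothetical names used for illustration.

```python
import json
import subprocess
from typing import Sequence


def run_woodchipper(pdf_path: str, command: Sequence[str] = ("woodchipper",)) -> dict:
    """Run the CLI on one file and return the parsed JSON report.

    Raises ValueError when the process exits non-zero.
    """
    result = subprocess.run([*command, pdf_path], capture_output=True, text=True)
    if result.returncode != 0:
        # Exit code 1 is documented above as a validation error.
        raise ValueError(
            f"woodchipper exited with {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)
```

The `command` parameter exists only so the wrapper can be pointed at a different executable, for example in tests.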

HTTP API

Woodchipper includes a FastAPI-based HTTP server for remote PDF analysis.

Running the Server

# Install with server dependencies
pip install "woodchipper[server]"

# Run with the built-in command
woodchipper-server

# Or run directly with uvicorn
uvicorn woodchipper.server:app --host 0.0.0.0 --port 8080

# Or with auto-reload for development
uvicorn woodchipper.server:app --host 0.0.0.0 --port 8080 --reload

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check |
| POST | `/analyze` | Upload PDF as multipart form |
| POST | `/analyze/raw` | Send raw PDF bytes in body |
| GET | `/docs` | Interactive Swagger UI documentation |

Usage Examples

# Multipart upload
curl -X POST -F "file=@suspicious.pdf" http://localhost:8080/analyze

# Raw bytes
curl -X POST -H "Content-Type: application/pdf" \
  --data-binary @suspicious.pdf \
  http://localhost:8080/analyze/raw

# Health check
curl http://localhost:8080/health

Response

Both /analyze and /analyze/raw return the same JSON report as the CLI:

{
  "filename": "suspicious.pdf",
  "filesize": 142857,
  "md5": "...",
  "sha1": "...",
  "sha256": "...",
  "urls": ["hXXps://example[.]com"],
  "metadata": { ... },
  "anomalies": { ... },
  "forms": { ... }
}

Error Handling

400 Bad Request - Invalid file, empty body, or not a PDF
200 OK - Analysis successful, returns JSON report
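The same raw-bytes call can be made from Python with only the standard library. This is a sketch under the assumptions shown in the curl examples (server on localhost:8080, `Content-Type: application/pdf`); `build_raw_request` and `analyze_raw` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.error
import urllib.request


def build_raw_request(pdf_bytes: bytes, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST to /analyze/raw carrying the PDF bytes as the body."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/analyze/raw",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def analyze_raw(pdf_bytes: bytes, base_url: str = "http://localhost:8080") -> dict:
    """Send the request and return the JSON report; raise ValueError on HTTP 400."""
    try:
        request = build_raw_request(pdf_bytes, base_url)
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 400:
            # Documented above: invalid file, empty body, or not a PDF.
            raise ValueError("server rejected the upload") from err
        raise
```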

Library API

`process(file_path) -> PdfReport`

Process a PDF file and return a full report with all analysis.

from woodchipper import process

report = process("document.pdf")
print(report["sha256"])
print(report["urls"])  # Defanged URLs
print(report["metadata"]["spoofing_indicators"])

# Action fields return ActionDetail dicts with description, offset, raw_bytes, and decoded_bytes
for action in report["anomalies"]["additional_actions_detected"]:
    print(f"{action['description']} @ offset {action['offset']}")
for action in report["anomalies"]["external_actions"]:
    print(f"{action['description']} @ offset {action['offset']}")
for js in report["anomalies"]["javascript_detected"]:
    print(f"{js['description']} @ offset {js['offset']}")

print(report["anomalies"]["embedded_files"])
print(report["anomalies"]["acroform_details"])
print(report["anomalies"]["xfa_details"])

Note: All string fields in the returned report are defanged for safe handling.
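For readers unfamiliar with defanging, the transformation visible in the sample output (`https://` becomes `hXXps://`, dots become `[.]`) can be approximated as below. This is an illustrative sketch, not Woodchipper's implementation; the `defang` helper and its URL regex are assumptions, and it only rewrites substrings that look like http(s) URLs.

```python
import re

# Assumed URL shape: scheme, then everything up to whitespace, a quote, an angle bracket, or ")".
_URL = re.compile(r"https?://[^\s'\"<>)]+", re.IGNORECASE)


def defang(text: str) -> str:
    """Rewrite http(s) URLs inside text so they are no longer clickable."""

    def _defang_one(match: re.Match) -> str:
        scheme, rest = match.group(0).split("://", 1)
        # "http" -> "hXXp", "https" -> "hXXps"; every dot in the remainder -> "[.]"
        return "hXXp" + scheme[4:] + "://" + rest.replace(".", "[.]")

    return _URL.sub(_defang_one, text)
```

Strings without a URL, such as a filename, pass through unchanged, which matches the sample report above.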

`get_urls(file_path) -> list[str]`

Extract URLs from a PDF file (returns raw URLs, not defanged).

from woodchipper import get_urls

urls = get_urls("document.pdf")
for url in urls:
    print(url)

Extracts URLs from /Link annotations and /A (Action) dictionaries with /URI entries.
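To make the /URI lookup concrete, here is a deliberately naive sketch that scans raw bytes for `/URI (...)` literal strings. It is not how Woodchipper parses PDFs: it misses URIs inside compressed object streams and hex strings, and it treats every backslash escape as "keep the next character". `find_literal_uris` is a hypothetical name.

```python
import re

# /URI followed by a PDF literal string; "\\." lets escaped parentheses stay inside the match.
_URI = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def find_literal_uris(pdf_bytes: bytes) -> list[str]:
    """Return the contents of uncompressed /URI (...) literal strings."""
    uris = []
    for match in _URI.finditer(pdf_bytes):
        # Simplified unescape: drop the backslash and keep the escaped character.
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
        uris.append(raw.decode("latin-1"))
    return uris
```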

`get_pdf_metadata(file_path) -> PdfMetadata`

Extract document metadata with spoofing detection.

from woodchipper import get_pdf_metadata

metadata = get_pdf_metadata("document.pdf")
print(f"Author: {metadata['author']}")
print(f"Creator: {metadata['creator']}")

if metadata["spoofing_indicators"]:
    print("Potential spoofing detected:")
    for indicator in metadata["spoofing_indicators"]:
        print(f"  - {indicator}")

Spoofing detection includes:

Creation date after modification date
Timestamps in the future
Creator/producer mismatches (e.g., claims Microsoft Word but produced by LibreOffice)
Creation date before PDF format existed (pre-1993)
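The date-based checks in this list are simple comparisons. The sketch below shows one way they could be written; the function name, the indicator wording, and the 1993-01-01 cutoff are assumptions, and the creator/producer mismatch check is not covered. Inputs are expected to be timezone-aware datetimes.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed cutoff for "before the PDF format existed".
PDF_FIRST_RELEASE = datetime(1993, 1, 1, tzinfo=timezone.utc)


def date_spoofing_indicators(
    created: Optional[datetime],
    modified: Optional[datetime],
    now: Optional[datetime] = None,
) -> list[str]:
    """Return human-readable indicators for implausible timestamps."""
    now = now or datetime.now(timezone.utc)
    indicators = []
    if created and modified and created > modified:
        indicators.append("Creation date is after modification date")
    for label, value in (("Creation", created), ("Modification", modified)):
        if value and value > now:
            indicators.append(f"{label} date is in the future")
    if created and created < PDF_FIRST_RELEASE:
        indicators.append("Creation date predates the PDF format (pre-1993)")
    return indicators
```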

`check_anomalies(file_path) -> PdfAnomalies`

Check for PDF structural anomalies and suspicious content.

from woodchipper import check_anomalies

anomalies = check_anomalies("document.pdf")
if anomalies["anomalies_present"]:
    for anomaly in anomalies["anomalies"]:
        print(f"Anomaly: {anomaly}")
    for action in anomalies["additional_actions_detected"]:
        print(f"Action: {action['description']} (offset: {action['offset']})")
    for external in anomalies["external_actions"]:
        print(f"External: {external['description']} (offset: {external['offset']})")
    for js in anomalies["javascript_detected"]:
        print(f"JavaScript: {js['description']} (offset: {js['offset']})")
    for ef in anomalies["embedded_files"]:
        print(f"Embedded: {ef['name']} ({ef['mime_type']}) (offset: {ef['offset']})")
    for form in anomalies["acroform_details"]:
        print(f"AcroForm: {form['description']} (offset: {form['offset']})")
    for xfa in anomalies["xfa_details"]:
        print(f"XFA: {xfa['description']} (offset: {xfa['offset']})")

Detects:

PDF header not at byte 0 (embedded content)
Invalid PDF version
Missing or malformed binary marker
Missing %%EOF marker
Data after %%EOF (appended content)
Rich media (3D, Flash) - no longer supported, suspicious
Stream length mismatches (tampering indicator)
/OpenAction and /AA (Additional Actions) triggers
External actions (/Launch, /URI, /GoToR, /GoToE)
Embedded JavaScript with behavior analysis
Embedded files with file type detection
AcroForm details (field types, flags, actions)
XFA (XML Forms Architecture) with script extraction
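The first few structural checks in this list amount to locating the `%PDF-` header and the last `%%EOF` marker. A rough stdlib-only sketch follows; `structural_anomalies` is a hypothetical name and the real detector may apply different rules (for example, around trailing whitespace).

```python
import re


def structural_anomalies(data: bytes) -> list[str]:
    """Check header position and %%EOF placement in raw PDF bytes."""
    findings = []
    header = re.search(rb"%PDF-\d\.\d", data)
    if header is None:
        findings.append("No %PDF- header found")
    elif header.start() != 0:
        findings.append(f"PDF header not at byte 0 (found at byte {header.start()})")

    eof = data.rfind(b"%%EOF")
    if eof == -1:
        findings.append("Missing %%EOF marker")
    elif data[eof + len(b"%%EOF"):].strip(b"\r\n \t"):
        # Anything other than trailing whitespace after the last marker counts as appended data.
        findings.append("Data after %%EOF")
    return findings
```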

`detect_additionalactions(file_path) -> list[ActionDetail]`

Detect automatic actions triggered by PDF events. Each result is an ActionDetail dict with description, offset (byte offset of the PDF object, or null for inline objects), raw_bytes (space-separated hex at that offset, or null), and decoded_bytes (decompressed stream content as hex, or null for non-stream objects).

from woodchipper import detect_additionalactions

actions = detect_additionalactions("document.pdf")
for action in actions:
    print(f"{action['description']} @ offset {action['offset']}")

Detects actions triggered by:

Document Open, Close, Save, Print
Page Open, Close

Action types identified:

JavaScript execution
Launch external application
Open URL
Submit form data
And more
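The ActionDetail shape described above can be written as a TypedDict, and the space-separated hex format is what `bytes.hex(" ")` produces. This is a sketch for illustration only: the class mirrors the documented fields, while `hex_preview` and `make_detail` are hypothetical helpers; the 256-byte limit comes from the Output Schema.

```python
from typing import Optional, TypedDict


class ActionDetail(TypedDict):
    description: str
    offset: Optional[int]
    raw_bytes: Optional[str]
    decoded_bytes: Optional[str]


def hex_preview(data: bytes, limit: int = 256) -> str:
    """Format up to `limit` bytes as space-separated lowercase hex."""
    return data[:limit].hex(" ")


def make_detail(description: str, pdf_bytes: bytes, offset: Optional[int]) -> ActionDetail:
    """Build a detail record; inline objects have no offset and no raw bytes."""
    raw = None if offset is None else hex_preview(pdf_bytes[offset:])
    return {
        "description": description,
        "offset": offset,
        "raw_bytes": raw,
        "decoded_bytes": None,  # only stream objects would carry decoded content
    }
```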

`detect_external_actions(file_path) -> list[ActionDetail]`

Detect external action tags that access resources outside the PDF. Returns ActionDetail dicts with forensic offset, raw bytes, and decoded stream bytes.

from woodchipper import detect_external_actions

actions = detect_external_actions("document.pdf")
for action in actions:
    print(f"{action['description']} @ offset {action['offset']}")

Detects:

/Launch - Launches external applications (e.g., cmd.exe, executables)
/URI - Opens URLs in browser
/GoToR - Opens remote PDF documents
/GoToE - Opens embedded documents

`detect_javascript(file_path) -> list[ActionDetail]`

Detect JavaScript code embedded in a PDF with behavior analysis. Returns ActionDetail dicts with forensic offset, raw bytes, and decoded stream bytes.

from woodchipper import detect_javascript

scripts = detect_javascript("document.pdf")
for script in scripts:
    print(f"{script['description']} @ offset {script['offset']}")

Detects /JavaScript and /JS tags in:

Document OpenAction
Document and page Additional Actions (/AA)
Named JavaScript in the Names tree
Annotation actions

Behavior analysis identifies:

displays alert - app.alert() calls
launches URL - app.launchURL() calls
submits form data - this.submitForm() calls
exports data - exportDataAsObject() calls
evaluates dynamic code - eval() usage
decodes obfuscated content - unescape(), fromCharCode()
makes network request - XMLHttp, SOAP calls
accesses form fields - this.getField() calls
sets timer/delayed execution - setInterval/setTimeout
potential heap spray - Collab.getIcon exploit pattern
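Behavior labels like these can be produced by matching the script text against a table of patterns. The sketch below covers a subset of the list; the regular expressions are assumptions chosen for illustration and are not taken from Woodchipper's source.

```python
import re

# Label -> assumed pattern. Dict order determines the order of the returned labels.
BEHAVIOR_PATTERNS = {
    "displays alert": r"app\.alert\s*\(",
    "launches URL": r"app\.launchURL\s*\(",
    "submits form data": r"submitForm\s*\(",
    "evaluates dynamic code": r"\beval\s*\(",
    "decodes obfuscated content": r"\bunescape\s*\(|fromCharCode",
    "sets timer/delayed execution": r"\bset(?:Interval|Timeout)\s*\(",
}


def classify_javascript(code: str) -> list[str]:
    """Return the behavior labels whose pattern occurs in the script."""
    return [
        label
        for label, pattern in BEHAVIOR_PATTERNS.items()
        if re.search(pattern, code)
    ]
```

Pattern matching of this kind is easy to evade with string concatenation or encoding, so an empty result does not mean a script is harmless.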

`detect_embedded_file(file_path) -> list[EmbeddedFile]`

Detect files embedded in or attached to a PDF.

from woodchipper import detect_embedded_file

files = detect_embedded_file("document.pdf")
for f in files:
    print(f"Name: {f['name']}")
    print(f"Type: {f['file_type']}")
    print(f"MIME: {f['mime_type']}")
    print(f"Size: {f['size']} bytes")
    print(f"Offset: {f['offset']}")

Detects embedded files in:

/Names tree /EmbeddedFiles entries
/FileSpec dictionaries with /EF streams
/FileAttachment annotations

Each returned EmbeddedFile dict contains:

name - Filename from the PDF
file_type - File type description (from magic bytes)
mime_type - MIME type (from magic bytes)
size - File size in bytes
description - Description from PDF metadata
offset - Byte offset of the FileSpec object in the PDF (or null)
raw_bytes - Space-separated hex bytes at the offset (or null)
decoded_bytes - Decompressed embedded file content as space-separated hex (up to 1024 bytes, or null)
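Because decoded_bytes is plain hex, a consumer can turn it back into bytes and inspect the leading magic number. A small sketch, assuming the field format shown in the sample report; both helper names are hypothetical, and the `...` handling is only there in case a value carries the elision marker used in the samples.

```python
from typing import Optional


def decode_hex_field(value: Optional[str]) -> Optional[bytes]:
    """Turn a space-separated hex field back into bytes; None stays None."""
    if value is None:
        return None
    tokens = [token for token in value.split() if token != "..."]
    return bytes.fromhex("".join(tokens))


def looks_like_windows_executable(embedded: dict) -> bool:
    """True when the decoded content starts with the DOS/PE 'MZ' signature."""
    content = decode_hex_field(embedded.get("decoded_bytes"))
    return bool(content) and content.startswith(b"MZ")
```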

`detect_acroform(file_path) -> list[ActionDetail]`

Detect and analyze AcroForm structures in a PDF. Returns ActionDetail dicts with forensic offset, raw bytes, and decoded stream bytes.

from woodchipper import detect_acroform

details = detect_acroform("document.pdf")
for detail in details:
    print(f"{detail['description']} (offset: {detail['offset']})")

Detects:

Presence of AcroForm in document catalog
XFA forms (XML Forms Architecture)
NeedAppearances flag (dynamic appearance generation)
Signature flags (SignaturesExist, AppendOnly)
Calculation Order (/CO) - automatic calculation scripts
Form field details:
- Field types (Text, Button, Choice, Signature)
- Field flags (ReadOnly, Required, Password, Multiline, etc.)
- Actions attached to fields
- JavaScript in field actions
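Field flags are stored in the field dictionary's /Ff entry as a bit mask. The sketch below decodes a few of the bit positions defined in the PDF specification (ReadOnly, Required, NoExport for all fields; Multiline and Password for text fields); `decode_field_flags` is a hypothetical helper, and the table is intentionally incomplete.

```python
# Bit value -> flag name, using the 1-based bit positions from the PDF specification:
# bit 1 ReadOnly, bit 2 Required, bit 3 NoExport, bit 13 Multiline, bit 14 Password.
FIELD_FLAGS = {
    1 << 0: "ReadOnly",
    1 << 1: "Required",
    1 << 2: "NoExport",
    1 << 12: "Multiline",  # meaningful for text fields only
    1 << 13: "Password",   # meaningful for text fields only
}


def decode_field_flags(ff: int) -> list[str]:
    """Return the names of the known flags set in an /Ff integer."""
    return [name for bit, name in FIELD_FLAGS.items() if ff & bit]
```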

`detect_xmlforms(file_path) -> list[ActionDetail]`

Detect and analyze XFA (XML Forms Architecture) in a PDF. Returns ActionDetail dicts with forensic offset, raw bytes, and decoded stream bytes.

from woodchipper import detect_xmlforms

details = detect_xmlforms("document.pdf")
for detail in details:
    print(f"{detail['description']} (offset: {detail['offset']})")

Detects:

XFA presence and structure (array or single stream)
XFA components (template, config, datasets, localeSet, etc.)
Embedded scripts (JavaScript and FormCalc)
Event handlers (onClick, onEnter, onChange, etc.)
Submit actions and URL targets
Dangerous operations:
- xfa.host.messageBox - message display
- xfa.host.exportData / importData - data operations
- xfa.host.gotoURL - URL navigation
- app.launchURL, app.execMenuItem - application actions
- ADBC.* - database connectivity
- Net.HTTP, SOAP.* - network operations
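Once an XFA template stream has been decoded to XML, its scripts can be pulled out by walking the element tree. This sketch uses the standard library parser and is not Woodchipper's implementation; `xfa_scripts` is a hypothetical name. For untrusted input, a hardened parser such as defusedxml is the safer choice.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def xfa_scripts(xml_text: str) -> list[tuple[Optional[str], str]]:
    """Return (contentType, source) pairs for every non-empty <script> element."""
    root = ET.fromstring(xml_text)
    scripts = []
    for element in root.iter():
        # Tags look like "{namespace}script"; compare on the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "script" and element.text and element.text.strip():
            scripts.append((element.get("contentType"), element.text.strip()))
    return scripts
```

The extracted source can then be searched for the operations listed above, such as `xfa.host.gotoURL`.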

`detect_richmedia(file_path) -> list[str]`

Detect rich media content (3D, Flash, multimedia) in a PDF.

from woodchipper import detect_richmedia

findings = detect_richmedia("document.pdf")
for finding in findings:
    print(finding)

Detects:

/RichMedia annotations and related tags
/3D, /3DD, /3DA, /3DV, /3DI - 3D content
/U3D (Universal 3D), /PRC (Product Representation Compact)
/Flash, /Movie, /Sound, /Screen - multimedia
/Rendition, /GoTo3DView - multimedia actions

Note: Rich media such as 3D and Flash content is no longer supported by modern operating systems and PDF readers, which makes its presence suspicious.
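A simple way to picture this check is a scan for PDF name tokens in the raw bytes. The sketch below is an approximation that only sees uncompressed objects and covers a subset of the names above; `rich_media_names` and its name list are assumptions, not the library's code.

```python
import re

RICH_MEDIA_NAMES = (b"RichMedia", b"3D", b"U3D", b"PRC", b"Flash", b"Movie", b"Sound")


def rich_media_names(data: bytes) -> list[str]:
    """Return the rich media name tokens that occur in raw PDF bytes."""
    found = []
    for name in RICH_MEDIA_NAMES:
        # A PDF name ends at whitespace or a delimiter, so /3D must not match /3DD.
        pattern = rb"/" + re.escape(name) + rb"(?=[\s/<>\[\](){}%]|$)"
        if re.search(pattern, data):
            found.append("/" + name.decode("ascii"))
    return found
```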

`detect_stream_mismatches(file_path) -> list[str]`

Detect mismatches between declared and actual PDF stream lengths.

from woodchipper import detect_stream_mismatches

mismatches = detect_stream_mismatches("document.pdf")
for mismatch in mismatches:
    print(mismatch)

Detects:

Missing endstream markers (malformed structure)
Declared length exceeds actual (truncated/tampered data)
Actual length exceeds declared (injected data/buffer overflow attempt)
Indirect length reference mismatches

Note: Mismatched stream lengths may indicate PDF tampering, corruption, or malicious manipulation to hide content or exploit PDF parsers.
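The core comparison is between the /Length value in a stream dictionary and the number of bytes found before `endstream`. The sketch below handles only direct integer lengths in simple dictionaries and locates `endstream` with a plain byte search, so it can misjudge binary streams; it illustrates the idea rather than reproducing the library's detector. `stream_length_mismatches` is a hypothetical name.

```python
import re

# Direct /Length N (not "N G R" references), then the rest of the dictionary and the stream keyword.
_STREAM = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)[^<>]*>>\s*stream\r?\n")


def stream_length_mismatches(data: bytes) -> list[str]:
    """Compare declared and observed stream lengths in raw PDF bytes."""
    findings = []
    for match in _STREAM.finditer(data):
        declared = int(match.group(1))
        start = match.end()
        end = data.find(b"endstream", start)
        if end == -1:
            findings.append(f"Stream at byte {start}: missing endstream marker")
            continue
        body = data[start:end]
        # One end-of-line marker before endstream is not part of the stream data.
        for eol in (b"\r\n", b"\n", b"\r"):
            if body.endswith(eol):
                body = body[: -len(eol)]
                break
        if len(body) != declared:
            findings.append(
                f"Stream at byte {start}: declared {declared}, actual {len(body)}"
            )
    return findings
```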

`extract_forms(file_path) -> PdfForms`

Detect PDF forms and extract submission targets.

from woodchipper import extract_forms

forms = extract_forms("document.pdf")
if forms["forms_present"]:
    print("Form submission targets:")
    for target in forms["form_submission_targets"]:
        print(f"  {target}")

`validate_pdf(file_path) -> Path`

Validate that a file exists, is readable, and is a PDF.

from woodchipper import validate_pdf, ValidationError

try:
    path = validate_pdf("document.pdf")
except ValidationError as e:
    print(f"Invalid: {e}")

`ValidationError`

Exception raised when file validation fails:

File not found
Path is not a file
File is not readable
File is not a PDF (based on magic bytes)
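The four failure cases can be mirrored with a few standard library checks. This sketch is a stand-in, not the library's code: it looks for `%PDF-` in the first 1024 bytes instead of using libmagic, and `PdfValidationError` and `validate_pdf_sketch` are hypothetical names chosen to avoid clashing with the real ones.

```python
import os
from pathlib import Path
from typing import Union


class PdfValidationError(Exception):
    """Raised when a path fails one of the validation checks."""


def validate_pdf_sketch(file_path: Union[str, Path]) -> Path:
    """Return the path if it exists, is a readable file, and looks like a PDF."""
    path = Path(file_path)
    if not path.exists():
        raise PdfValidationError(f"File not found: {path}")
    if not path.is_file():
        raise PdfValidationError(f"Path is not a file: {path}")
    if not os.access(path, os.R_OK):
        raise PdfValidationError(f"File is not readable: {path}")
    with path.open("rb") as handle:
        # Assumption: accept a header anywhere in the first 1024 bytes.
        if b"%PDF-" not in handle.read(1024):
            raise PdfValidationError(f"File is not a PDF: {path}")
    return path
```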

Output Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "filename": { "type": "string" },
    "filesize": { "type": "integer" },
    "md5": { "type": "string" },
    "sha1": { "type": "string" },
    "sha256": { "type": "string" },
    "urls": {
      "type": "array",
      "items": { "type": "string" }
    },
    "metadata": {
      "type": "object",
      "properties": {
        "author": { "type": ["string", "null"] },
        "creator": { "type": ["string", "null"] },
        "producer": { "type": ["string", "null"] },
        "subject": { "type": ["string", "null"] },
        "title": { "type": ["string", "null"] },
        "creation_date": { "type": ["string", "null"] },
        "modification_date": { "type": ["string", "null"] },
        "spoofing_indicators": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    },
    "anomalies": {
      "type": "object",
      "properties": {
        "anomalies_present": { "type": "boolean" },
        "anomalies": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Structural anomalies, rich media findings, and stream length mismatches"
        },
        "additional_actions_detected": {
          "type": "array",
          "items": { "$ref": "#/$defs/ActionDetail" }
        },
        "external_actions": {
          "type": "array",
          "items": { "$ref": "#/$defs/ActionDetail" }
        },
        "javascript_detected": {
          "type": "array",
          "items": { "$ref": "#/$defs/ActionDetail" }
        },
        "embedded_files": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": ["string", "null"] },
              "file_type": { "type": ["string", "null"] },
              "mime_type": { "type": ["string", "null"] },
              "size": { "type": ["integer", "null"] },
              "description": { "type": ["string", "null"] },
              "offset": { "type": ["integer", "null"] },
              "raw_bytes": { "type": ["string", "null"] },
              "decoded_bytes": { "type": ["string", "null"] }
            }
          }
        },
        "acroform_details": {
          "type": "array",
          "items": { "$ref": "#/$defs/ActionDetail" },
          "description": "AcroForm field details, types, flags, and actions"
        },
        "xfa_details": {
          "type": "array",
          "items": { "$ref": "#/$defs/ActionDetail" },
          "description": "XFA (XML Forms Architecture) components and scripts"
        }
      }
    },
    "forms": {
      "type": "object",
      "properties": {
        "forms_present": { "type": "boolean" },
        "form_submission_targets": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  },
  "required": ["filename", "filesize", "md5", "sha1", "sha256", "urls", "metadata", "anomalies", "forms"],
  "$defs": {
    "ActionDetail": {
      "type": "object",
      "description": "Forensic detail for a detected PDF action",
      "properties": {
        "description": {
          "type": "string",
          "description": "Human-readable description of the action and its trigger"
        },
        "offset": {
          "type": ["integer", "null"],
          "description": "Byte offset of the PDF object in the file, or null for inline objects"
        },
        "raw_bytes": {
          "type": ["string", "null"],
          "description": "Space-separated hex-encoded raw bytes at the offset (up to 256 bytes), or null"
        },
        "decoded_bytes": {
          "type": ["string", "null"],
          "description": "Space-separated hex-encoded decompressed stream content (up to 1024 bytes), or null for non-stream objects"
        }
      },
      "required": ["description", "offset", "raw_bytes", "decoded_bytes"]
    }
  }
}

Development

Install in editable mode with all dependencies:

pip install -e ".[all]"

Or just dev dependencies:

pip install -e ".[dev]"

Run tests:

pytest

License

MIT
