This project is a small, composable static document scanner implemented in Go.
For a step‑by‑step walkthrough of installation and usage, see Guide.md.
For a detailed list of currently supported document types and how to extend them, see Supported_Document_Types.md.
- Walks a directory tree recursively
- Uses a worker pool to scan files concurrently based on `runtime.NumCPU()`
- Analyzes files using pluggable analyzers
- Computes SHA-256 hashes for every scanned file
- Detects macro-enabled Word documents (`.docx`, `.docm`) by locating `vbaProject.bin` inside the ZIP structure
- Detects suspicious PDF indicators in `.pdf` files using simple heuristic string matching
- Emits structured JSON results on stdout
From the docscanner directory (after cloning the repo):

```
go run ./cmd/scanner <directory>
```

Example (scan the provided samples directory from the repo root):

```
go run ./cmd/scanner ../samples
```

The CLI prints an array of JSON objects, each matching the ScanResult model defined in internal/model/result.go.

To save the JSON output to a file instead of just printing it:

```
go run ./cmd/scanner ../samples > results.json
```

You can then open results.json in your editor.
For more detailed usage patterns (different directories, troubleshooting, etc.) see Guide.md.
Out of the box, the scanner understands:
- Microsoft Word: `.docx`, `.docm`
- PDF: `.pdf`
Other file types are walked but ignored unless you implement and register a new analyzer. See Supported_Document_Types.md for details.
- New document types can be added by implementing the `Analyzer` interface in `internal/analyzer/analyzer.go`.
- The worker pool and directory walker do not need to change when new analyzers are introduced.
This is intentionally a foundation: a minimal but solid base to grow more advanced detection logic.
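As an illustration of the extension point, here is a hypothetical RTF analyzer. The `ScanResult` fields and the exact `Analyzer` method signatures below are assumptions inferred from the scan flow; check `internal/analyzer/analyzer.go` and `internal/model/result.go` for the real definitions:

```go
package main

import (
	"fmt"
	"strings"
)

// ScanResult mirrors the model serialized to JSON (fields are illustrative).
type ScanResult struct {
	Path       string   `json:"path"`
	Suspicious bool     `json:"suspicious"`
	Indicators []string `json:"indicators,omitempty"`
}

// Analyzer is the extension point: Supports filters by path, Analyze
// inspects the raw file bytes. (Assumed shape; see analyzer.go.)
type Analyzer interface {
	Supports(path string) bool
	Analyze(path string, data []byte) (*ScanResult, error)
}

// RTFAnalyzer is a hypothetical new analyzer that flags RTF files
// containing embedded OLE objects (a common malware carrier).
type RTFAnalyzer struct{}

func (RTFAnalyzer) Supports(path string) bool {
	return strings.HasSuffix(strings.ToLower(path), ".rtf")
}

func (RTFAnalyzer) Analyze(path string, data []byte) (*ScanResult, error) {
	res := &ScanResult{Path: path}
	if strings.Contains(string(data), `\objdata`) {
		res.Suspicious = true
		res.Indicators = append(res.Indicators, `\objdata (embedded OLE object)`)
	}
	return res, nil
}

func main() {
	var a Analyzer = RTFAnalyzer{}
	res, _ := a.Analyze("doc.rtf", []byte(`{\rtf1 \objdata 0105...}`))
	fmt.Println(res.Suspicious) // true
}
```

Registering such an analyzer would mean appending `RTFAnalyzer{}` to the `[]Analyzer` slice built in main.go; the walker and worker pool stay untouched.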
The code is structured so the core scanning logic under internal/ can be reused from a long‑running backend service (for example, an HTTP API). A typical next step is to add a cmd/server entrypoint that:
- Listens on an HTTP port
- Accepts scan requests (e.g. directory paths or uploaded files)
- Invokes the existing walker, worker pool, and analyzers to produce `ScanResult` JSON
Once such a server entrypoint exists, you can deploy it to platforms like Render as a Go web service.
High-level flow:
```
main.go
 ├─ parses CLI args
 ├─ creates channels (files, results)
 ├─ builds analyzers []Analyzer
 ├─ starts directory walker (WalkDirectory)
 ├─ starts worker pool (StartWorkerPool)
 └─ aggregates []ScanResult and prints JSON

WalkDirectory (internal/scanner/walker.go)
 └─ walks the filesystem and pushes file paths into the files chan

StartWorkerPool (internal/scanner/workerpool.go)
 └─ spins up N workers
     └─ for each file:
         ├─ os.ReadFile
         ├─ pick matching Analyzer via Supports
         └─ Analyzer.Analyze → *ScanResult → results chan

Analyzers (internal/analyzer/*.go)
 ├─ WordAnalyzer – detects vbaProject.bin in .docx/.docm
 └─ PDFAnalyzer – scans for suspicious PDF keywords

Model (internal/model/result.go)
 └─ ScanResult – structure that is serialized to JSON
```
```mermaid
flowchart TD
    CLI["CLI (main.go)"] --> ARGS["Parse args (root dir)"]
    ARGS --> CHANNELS["Create channels: files, results"]
    CHANNELS --> ANALYZERS["Build analyzers []Analyzer"]
    CHANNELS --> WALKER["WalkDirectory (walker.go)"]
    WALKER --> FILES["files chan <- file paths"]
    FILES --> WORKERPOOL["StartWorkerPool (workerpool.go)"]
    WORKERPOOL --> WORKER["N workers"]
    WORKER --> READ["os.ReadFile(file)"]
    READ --> DISPATCH{"Supports(file)?"}
    DISPATCH -->|yes| ANALYZE["Analyzer.Analyze(file, data)"]
    ANALYZE --> RESULTS["results chan <- *ScanResult"]
    RESULTS --> AGG["Aggregate []ScanResult"]
    AGG --> JSON["Marshal JSON & print"]
```