Extract Anything

A versatile data extraction plugin for the xyOps Workflow Automation System.

This Plugin uses the amazing open-source Kreuzberg library to perform structured data extractions on files. It can pull text and markdown out of PDFs, Word Docs or Powerpoint slides, cell grids or CSV out of Excel spreadsheets, perform OCR inside images to extract text, you name it!

To use this Plugin in your workflows, it requires one or more files as inputs. Once the data is extracted, it is returned as job output data (which is passed to subsequent jobs in linked chains and workflows).

Requirements

docker

Input Formats

This Plugin can extract structured data from many different formats, including:

Office Documents

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
PDF	`.pdf`	`application/pdf`	Native Rust (pdfium-render)	Yes	Metadata extraction, image extraction, text layer detection
Excel	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xlam`, `.xla`, `.ods`	Various Excel MIME types	Native Rust (calamine)	No	Multi-sheet support, formula preservation
PowerPoint	`.pptx`, `.pptm`, `.ppsx`	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	Native Rust (roxmltree)	Yes (for embedded images)	Slide extraction, image OCR, table detection
Word (Modern)	`.docx`	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Native Rust	No	Preserves formatting, extracts metadata
Word (Legacy)	`.doc`	`application/msword`	LibreOffice conversion	No	Converts to DOCX then extracts
PowerPoint (Legacy)	`.ppt`	`application/vnd.ms-powerpoint`	LibreOffice conversion	No	Converts to PPTX then extracts
OpenDocument Text	`.odt`	`application/vnd.oasis.opendocument.text`	Native Rust	No	Full OpenDocument support
OpenDocument Spreadsheet	`.ods`	`application/vnd.oasis.opendocument.spreadsheet`	Native Rust (calamine)	No	Multi-sheet support

Text & Markup

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
Plain Text	`.txt`	`text/plain`	Native Rust (streaming)	No	Line/word/character counting, memory-efficient streaming
Markdown	`.md`, `.markdown`	`text/markdown`, `text/x-markdown`	Native Rust (streaming)	No	Header extraction, link detection, code block detection
HTML	`.html`, `.htm`	`text/html`, `application/xhtml+xml`	Native Rust (html-to-markdown-rs)	No	Converts to Markdown, metadata extraction
XML	`.xml`	`application/xml`, `text/xml`	Native Rust (quick-xml streaming)	No	Element counting, unique element tracking
SVG	`.svg`	`image/svg+xml`	Native Rust (XML parser)	No	Treated as XML document
reStructuredText	`.rst`	`text/x-rst`	Native (rst-parser)	No	Full reST syntax support
Org Mode	`.org`	`text/x-org`	Native (org)	No	Emacs Org mode support
Rich Text Format	`.rtf`	`application/rtf`, `text/rtf`	Native (rtf-parser)	No	RTF 1.x support

Structured Data

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
JSON	`.json`	`application/json`, `text/json`	Native Rust (serde_json)	No	Field counting, nested structure extraction
YAML	`.yaml`	`application/x-yaml`, `text/yaml`, `text/x-yaml`	Native Rust (serde_yaml)	No	Multi-document support, field counting
TOML	`.toml`	`application/toml`, `text/toml`	Native Rust (toml crate)	No	Configuration file support
CSV	`.csv`	`text/csv`	Native Rust	No	Tabular data extraction
TSV	`.tsv`	`text/tab-separated-values`	Native Rust	No	Tab-separated data extraction

Email

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
EML	`.eml`	`message/rfc822`	Native Rust (mail-parser)	No	Header extraction, attachment listing, body text
MSG	`.msg`	`application/vnd.ms-outlook`	Native Rust (mail-parser)	No	Outlook message support, metadata extraction

Images

All image formats support OCR when configured with ocr parameter in ExtractionConfig.

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
PNG	`.png`	`image/png`	Native Rust (image-rs)	Yes	EXIF metadata extraction
JPEG	`.jpg`, `.jpeg`	`image/jpeg`, `image/jpg`	Native Rust (image-rs)	Yes	EXIF metadata extraction
WebP	`.webp`	`image/webp`	Native Rust (image-rs)	Yes	Modern format support
BMP	`.bmp`	`image/bmp`, `image/x-bmp`, `image/x-ms-bmp`	Native Rust (image-rs)	Yes	Uncompressed format
TIFF	`.tiff`, `.tif`	`image/tiff`, `image/x-tiff`	Native Rust (image-rs)	Yes	Multi-page support
GIF	`.gif`	`image/gif`	Native Rust (image-rs)	Yes	Animation frame extraction
JPEG 2000	`.jp2`, `.jpx`, `.jpm`, `.mj2`	`image/jp2`, `image/jpx`, `image/jpm`, `image/mj2`	Native Rust (image-rs)	Yes	Advanced JPEG format
PNM Family	`.pnm`, `.pbm`, `.pgm`, `.ppm`	`image/x-portable-anymap`, etc.	Native Rust (image-rs)	Yes	NetPBM formats

Academic & Publishing (Native)

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
LaTeX	`.tex`, `.latex`	`application/x-latex`, `text/x-tex`	Native (manual parser)	No	Full LaTeX document support
EPUB	`.epub`	`application/epub+zip`	Native (zip + roxmltree + html-to-markdown-rs)	No	E-book format, metadata extraction
BibTeX	`.bib`	`application/x-bibtex`, `application/x-biblatex`	Native (biblatex)	No	Bibliography database support
Typst	`.typst`	`application/x-typst`	Native (typst-syntax)	No	Modern typesetting format
Jupyter Notebook	`.ipynb`	`application/x-ipynb+json`	Native (JSON parsing)	No	Code cells, markdown cells, output extraction
FictionBook	-	`application/x-fictionbook+xml`	Native (fb2)	No	XML-based e-book format
DocBook	-	`application/docbook+xml`	Native (roxmltree)	No	Technical documentation format
JATS	-	`application/x-jats+xml`	Native (roxmltree)	No	Journal article XML format
OPML	-	`application/x-opml+xml`	Native (roxmltree)	No	Outline format
RIS	-	`application/x-research-info-systems`	Native (ris-parser)	No	Citation format
EndNote XML	-	`application/x-endnote+xml`	Native (XML parser)	No	Reference manager format
CSL JSON	-	`application/csl+json`	Native (JSON parser)	No	Citation Style Language JSON

Markdown Variants (Native)

Format	MIME Type	Extraction Method	Special Features
CommonMark	`text/x-commonmark`	Native (pulldown-cmark)	Standard Markdown spec
GitHub Flavored Markdown	`text/x-gfm`	Native (pulldown-cmark)	GFM extensions (tables, strikethrough, etc.)
MultiMarkdown	`text/x-multimarkdown`	Native (pulldown-cmark)	MMD extensions
Markdown Extra	`text/x-markdown-extra`	Native (pulldown-cmark)	PHP Markdown Extra extensions

Other Formats

Format	MIME Type	Extraction Method	Special Features
Man Pages	`text/x-mdoc`	Native (mdoc-parser)	Unix manual page format
Troff	`text/troff`	Native (troff-parser)	Unix document format
POD	`text/x-pod`	Native (pod-parser)	Perl documentation format
DokuWiki	`text/x-dokuwiki`	Native (dokuwiki-parser)	Wiki markup format

See Kreuzberg Format Support for more details.

Output Format

Your extracted file data will be in the job output data, which can then be used in linked jobs or attached workflow nodes. The output format looks like this:

{
	"filename": "document.pdf",
	"content": "Denny Gunawan\n221 Queen St\nMelbourne VIC 3000\n123 Somewhere St, Melbourne VIC 3000 $39.60\n(03) 1234 5678\nInvoice Number: #20130304\nOrganic Items Price/kg Quantity(kg) Subtotal\nApple $5.00 1 $5.00\nOrange $1.99 2 $3.98\nWatermelon $1.69 3 $5.07\nMango $9.56 2 $19.12\nPeach $2.99 1 $2.99\nSubtotal $36.00\nGST (10%) $3.60\n* Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam sodales Total $39.60\ndapibus fermentum. Nunc adipiscing, magna sed scelerisque cursus, erat\nlectus dapibus urna, sed facilisis leo dui et ipsum.",
	"mimeType": "application/pdf",
	"metadata": {
		"format_type": "pdf",
		"height": 842,
		"is_encrypted": false,
		"page_count": 1,
		"pdf_version": "1.5",
		"producer": "Prince 16 (www.princexml.com)",
		"quality_score": 1,
		"title": "Sunny Farm Invoice Sample",
		"width": 595
	},
	"tables": [
		{
			"cells": [
				[ "Denny Gunawan" ],
				[ "221 Queen St" ],
				[ "Melbourne VIC 3000" ],
				[ "123 Somewhere St, Melbourne VIC 3000 $39.60" ],
				[ "(03) 1234 5678" ],
				[ "Invoice Number: #20130304" ],
				[ "Organic Items Price/kg Quantity(kg) Subtotal" ],
				[ "Apple $5.00 1 $5.00" ],
				[ "Orange $1.99 2 $3.98" ],
				[ "Watermelon $1.69 3 $5.07" ],
				[ "Mango $9.56 2 $19.12" ],
				[ "Peach $2.99 1 $2.99" ],
				[ "Subtotal $36.00" ],
				[ "GST (10%) $3.60" ],
				[ "* Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam sodales Total $39.60" ],
				[ "dapibus fermentum. Nunc adipiscing, magna sed scelerisque cursus, erat" ],
				[ "lectus dapibus urna, sed facilisis leo dui et ipsum." ]
			],
			"markdown": "| Denny Gunawan |\n| --- |\n| 221 Queen St |\n| Melbourne VIC 3000 |\n| 123 Somewhere St, Melbourne VIC 3000 $39.60 |\n| (03) 1234 5678 |\n| Invoice Number: #20130304 |\n| Organic Items Price/kg Quantity(kg) Subtotal |\n| Apple $5.00 1 $5.00 |\n| Orange $1.99 2 $3.98 |\n| Watermelon $1.69 3 $5.07 |\n| Mango $9.56 2 $19.12 |\n| Peach $2.99 1 $2.99 |\n| Subtotal $36.00 |\n| GST (10%) $3.60 |\n| * Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam sodales Total $39.60 |\n| dapibus fermentum. Nunc adipiscing, magna sed scelerisque cursus, erat |\n| lectus dapibus urna, sed facilisis leo dui et ipsum. |\n",
			"pageNumber": 1
		}
	],
	"detectedLanguages": null,
	"chunks": null,
	"images": null,
	"pages": null
}

When batch mode is enabled, or when multiple files are provided, each file's extractions are pushed onto an extractions array. Example:

{
	"extractions": [
		{
			"filename": "document1.pdf",
			"content": "..."
		},
		{
			"filename": "document2.pdf",
			"content": "..."
		}
	]
}

If you also check the "Attach Files" checkbox in the plugin parameters, this will attach all text output as files when the job completes.

Local Testing

Use this sample JSON to pass a local file into the Plugin (the file should be in the current directory):

{
	"xy": 1,
	"input": {
		"files": [
			{ "filename": "document.pdf" }
		]
	}
}

Example pipe command:

echo '{ "xy":1, "input":{ "files":[ { "filename": "document.pdf" } ] } }' | node index.js

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
README.md		README.md
document.pdf		document.pdf
index.js		index.js
logo.png		logo.png
package-lock.json		package-lock.json
package.json		package.json
xyops.json		xyops.json

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
ZIP	`.zip`	`application/zip`, `application/x-zip-compressed`	Native Rust (zip crate)	No	File listing, text content extraction
TAR	`.tar`, `.tgz`	`application/x-tar`, `application/tar`, `application/x-gtar`, `application/x-ustar`	Native Rust (tar crate)	No	Unix archive support, compression detection
7-Zip	`.7z`	`application/x-7z-compressed`	Native Rust (sevenz-rust)	No	High compression format support
Gzip	`.gz`	`application/gzip`	Native Rust	No	Gzip compression support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Extract Anything

Requirements

Input Formats

Office Documents

Text & Markup

Structured Data

Email

Images

Archives

Academic & Publishing (Native)

Markdown Variants (Native)

Other Formats

Output Format

Local Testing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Extract Anything

Requirements

Input Formats

Office Documents

Text & Markup

Structured Data

Email

Images

Archives

Academic & Publishing (Native)

Markdown Variants (Native)

Other Formats

Output Format

Local Testing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages