PDF Document Extraction

The project combines model-based recognition with program-based computation to accurately understand and extract structured content from PDF documents.

At its core, it uses deep learning-based document layout inference models together with several algorithms to identify structural elements such as titles, tables, lists, headers, and footers. The extracted content is then segmented into coherent chunks and output as well-structured Markdown or HTML, enabling more precise and meaningful prompts when LLMs process PDF documents.
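
The segmentation step described above can be illustrated with a minimal Python sketch. This is not the project's implementation: the `Element` type, the `kind` labels, the `chunk_elements` function, and the choice to drop page headers and footers are all assumptions made for illustration. It only shows the general idea of turning labeled layout elements into Markdown chunks that start at each title.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A layout element as a detector might label it (hypothetical type)."""
    kind: str  # assumed labels: "title", "paragraph", "list_item", "header", "footer"
    text: str


def to_markdown(el: Element) -> str:
    """Render one element as a Markdown fragment."""
    if el.kind == "title":
        return f"## {el.text}"
    if el.kind == "list_item":
        return f"- {el.text}"
    return el.text


def chunk_elements(elements: list[Element]) -> list[str]:
    """Group elements into Markdown chunks, starting a new chunk at each title.

    Page headers and footers are skipped here; that is an assumption of this
    sketch, not documented behavior of the project.
    """
    chunks: list[str] = []
    current: list[str] = []
    for el in elements:
        if el.kind in ("header", "footer"):
            continue
        if el.kind == "title" and current:
            chunks.append("\n\n".join(current))
            current = []
        current.append(to_markdown(el))
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each returned string is a self-contained chunk beginning with its heading, so it could be passed to an LLM prompt on its own.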

Installation

To install img2table.sharp, clone the repository and build the project with .NET.

# Clone the repository
git clone https://github.com/your-username/img2table.sharp.git

# Navigate to the project directory
cd img2table.sharp

# Restore dependencies
dotnet restore

# Build the project
dotnet build

