rumperto/syppp

syppp

A tiny tool to parse PDF files, identify blocks of content, and annotate PDF files visually.

Getting Started

Setup

In Development

Run ./install.sh

  • Installs OS dependencies.
  • Regenerates the Poetry lock file to fetch the latest dependency versions.
  • Runs poetry install.

For quick testing

Run poetry install to install only the Python dependencies as defined in the lock file.

Start the Server

Run poetry run python -m uvicorn api.main_raw:app --host 0.0.0.0 --port 8090 --reload in the root of the project.

The first start can take quite a while because several integrations load transformer models. Startup is complete when this log message appears: INFO: Application startup complete.

Now open your browser at http://127.0.0.1:8090/docs#/ to use the API.
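The exact request parameters of each endpoint are best read from the server itself. The sketch below assumes the server also exposes its OpenAPI schema at /openapi.json, which is the default for FastAPI-style apps that serve /docs; that route is an assumption and is not confirmed by this README. The helper names fetch_schema and list_operations are hypothetical.

```python
import json
from urllib.request import urlopen

# Host and port taken from the start command above.
BASE_URL = "http://127.0.0.1:8090"


def fetch_schema(base_url=BASE_URL):
    """Download the OpenAPI schema (assumes the default /openapi.json route)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_operations(schema):
    """Return sorted (METHOD, path) pairs described by an OpenAPI schema."""
    http_methods = {"get", "put", "post", "delete", "patch", "head", "options"}
    return sorted(
        (method.upper(), path)
        for path, item in schema.get("paths", {}).items()
        for method in item
        if method in http_methods  # skip non-method keys such as "parameters"
    )


# Usage, with the server running:
#     for method, path in list_operations(fetch_schema()):
#         print(method, path)
```

The same schema also lists each endpoint's parameter names and types, so it is a safer source than guessing field names when scripting against the API.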

Use the app

/parse-file accepts a PDF file and an eps value and parses the file's content.

  • eps stands for epsilon, a parameter of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm. It defines the maximum distance, in pixels, between two components for them to count as neighbors in the same cluster.
  • Choosing the right eps is crucial: too small, and nothing clusters; too large, and unrelated elements get grouped.
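To make the effect of eps concrete, here is a minimal pure-Python DBSCAN sketch. It is only an illustration of the algorithm, not the implementation used by syppp; the function name dbscan, the min_samples default, and the sample coordinates are all assumptions made for this example.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples=2):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_samples:
                queue.extend(more)  # j is a core point, keep growing
        cluster += 1
    return labels


# Hypothetical component centers in pixels: two tight groups and one outlier.
centers = [(10, 10), (12, 11), (11, 14), (200, 200), (203, 198), (500, 40)]

for eps in (1, 10, 1000):
    print(eps, dbscan(centers, eps))
# eps=1    -> every point is noise (nothing clusters)
# eps=10   -> two clusters plus one noise point
# eps=1000 -> everything merges into a single cluster
```

With these sample points, only the middle value separates the two groups while leaving the outlier alone, which is the trade-off described above.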

/annotate merges the original PDF file with the output of the /parse-file endpoint. It generates a new PDF file with bounding boxes and comments drawn inside.
