A tiny tool to parse PDF files, identify blocks of content, and annotate PDF files visually.
Run ./install.sh, which does the following:
- Installs OS dependencies.
- Creates a new Poetry lock file to fetch the latest dependency versions.
- Runs poetry install.
Alternatively, run poetry install to install only the Python dependencies as defined in the existing lock file.
In the root of the project, start the API with: poetry run python -m uvicorn api.main_raw:app --host 0.0.0.0 --port 8090 --reload
The first start takes quite a while because several integrations load transformer models. Startup is complete when this log message appears:
INFO: Application startup complete.
Now open your browser at http://127.0.0.1:8090/docs#/ to use the API.
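Because the first start can be slow, a script that depends on the API may want to wait until the server answers before sending requests. The following is a minimal sketch, assuming the docs page at the URL above returns HTTP 200 once startup has completed. The helper name wait_until_ready and its timeout values are illustrative and not part of this project.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse.

    Hypothetical helper for illustration; not shipped with the tool.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # The server is not accepting connections yet (or returned an
            # error status); fall through and retry after a short pause.
            pass
        time.sleep(interval)
    return False


# Example call (assumes the server was started as described above):
# wait_until_ready("http://127.0.0.1:8090/docs")
```

The generous default timeout reflects the slow first start; later starts will usually return much sooner.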
/parse-file accepts a PDF file and an eps value and parses the content.
- eps stands for epsilon, a parameter of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm. It defines the maximum distance in pixels between two components for them to be treated as neighbours.
- Choosing the right eps is crucial: too small, and nothing clusters; too large, and unrelated elements get grouped.
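To make the effect of eps concrete, here is a small pure-Python DBSCAN run on made-up points. This is only a sketch and not this tool's implementation: the points stand in for hypothetical component centres in pixels, and min_samples=2 is an assumption chosen for the example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples=2):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Hypothetical component centres (x, y) in pixels.
points = [
    (10, 10), (14, 10), (18, 10),  # three fragments on one line
    (10, 60), (14, 60),            # two fragments on a lower line
    (200, 200),                    # an isolated fragment
]

print(dbscan(points, eps=1))   # [-1, -1, -1, -1, -1, -1]  too small: nothing clusters
print(dbscan(points, eps=5))   # [0, 0, 0, 1, 1, -1]       each line is its own cluster
print(dbscan(points, eps=60))  # [0, 0, 0, 0, 0, -1]       too large: both lines merge
```

A reasonable starting value for eps is slightly more than the typical gap between components that belong together, adjusted after inspecting the results.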
/annotate is a simple merge of the original PDF file with the output from the /parse-file endpoint. It generates a new PDF file with bounding boxes and comments inside.
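The endpoints can also be called from a script instead of the browser. The sketch below posts a PDF to /parse-file using only the standard library. The form field names "file" and "eps", and sending eps as a form field rather than a query parameter, are assumptions; check the schema shown at /docs for the actual request shape. A call to /annotate would follow the same pattern with whatever inputs that endpoint declares.

```python
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:8090"


def encode_multipart(fields, file_field, filename, file_bytes):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def parse_file(pdf_path, eps):
    """POST a PDF to /parse-file and return the raw response body.

    The field names "file" and "eps" are assumptions, not confirmed API.
    """
    with open(pdf_path, "rb") as handle:
        body, content_type = encode_multipart(
            {"eps": eps}, "file", "input.pdf", handle.read()
        )
    request = urllib.request.Request(
        f"{BASE_URL}/parse-file",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()  # response format is defined by the API


# Example call (assumes the server is running and document.pdf exists):
# result = parse_file("document.pdf", eps=20)
```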