A Streamlit-based tool for converting PDF documents to DOCX format using PaddleX and PaddleOCR for layout analysis and text extraction.
This app is built to run on CPU; contributions that add support for CUDA-compatible GPUs are welcome.
- Convert PDF documents to DOCX while preserving text content
- Advanced layout detection using PaddleX
- Optical Character Recognition (OCR) with PaddleOCR
- Support for both single-column and two-column document layouts
- Customizable parameters for optimizing extraction quality
- Clean, intuitive user interface built with Streamlit
- Python 3.10
- PaddlePaddle
- PaddleX
- PaddleOCR
- PyMuPDF
- OpenCV
- Streamlit
- Other dependencies listed in requirements.txt
- Create a new conda environment:
    conda create -n pdf2docx_paddlex_env python=3.10
    conda activate pdf2docx_paddlex_env
- Install PaddlePaddle (check the PaddlePaddle install guide):
    python -m pip install paddlepaddle==3.0.0rc1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
- Install PaddleX (check the PaddleX install guide):
    pip install https://paddle-model-ecology.bj.bcebos.com/paddlex/whl/paddlex-3.0.0rc0-py3-none-any.whl
- Install other dependencies:
    pip install paddleocr==2.10.0 pymupdf opencv-python numpy pillow python-docx streamlit albucore==0.0.16
- Run the application:
    streamlit run app.py
- Configure the Streamlit theme (optional):
  Create a file named .streamlit/config.toml with the following content:
    [browser]
    gatherUsageStats = false
    [theme]
    base="dark"
    primaryColor="#336699"
    [server]
    maxUploadSize = 512
- Upload your PDF files and click "Start Conversion"
- Document Layout: Choose between one-column or two-column layouts
- DPI for PDF Rendering: Higher values give better quality but require more processing time
- Margin Size: Extra padding around detected text regions
- Confidence Threshold: Minimum confidence score for text detection
- Box Overlap Threshold: Controls removal of overlapping text regions
- Box Types: Select which types of content to extract (text, titles, footnotes, etc.)
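
To make the DPI trade-off concrete, here is a minimal sketch of how render resolution scales the page image. It relies only on the fact that a PDF point is 1/72 inch; the helper name `render_size` is hypothetical and is not taken from this app's code, and a real renderer may round slightly differently.

```python
from typing import Tuple


def render_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel size of a PDF page rendered at `dpi`.

    Hypothetical helper for illustration: PDF page sizes are given in
    points (1/72 inch), so the scale factor is dpi / 72.
    """
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page is roughly 595 x 842 points.
low = render_size(595, 842, 150)
high = render_size(595, 842, 300)
print(low, high)
```

Doubling the DPI roughly quadruples the number of pixels per page, which is why higher values cost noticeably more memory and processing time.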
- Padding around the boxes is necessary: if the box image passed to OCR is cropped too tightly, letters are misread. For example, "Altay" is read as "Altav" because the tail of the letter "y" is cut off
- The overlap percentage measures how much area two boxes share. Because box creation is imperfect, one box is not always cleanly contained inside another, so comparing box coordinates alone is not enough to determine overlap
- The "2 Columns" setting assigns each box to one of three categories: Monolithic (spanning the whole page width), Column 1, or Column 2, based on where the middle point of the box lies relative to the middle point of the page. The "Monolithic Threshold (%)" slider allows some tolerance in this assignment, as Monolithic boxes might not be perfectly aligned with the page middle point
- Great care was taken to purge unnecessary objects from memory after each processing step, so that PDFs with hundreds or even thousands of pages can be converted.
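
The box handling described in the notes above (padding, overlap measurement, and two-column assignment) can be sketched roughly as follows. This is an illustrative sketch, not the app's actual code: the function names, the `(x0, y0, x1, y1)` box format, the choice to divide the intersection by the smaller box's area, and the reading of the Monolithic threshold as a percentage of page width are all assumptions.

```python
from typing import Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels, origin at the top-left.
Box = Tuple[float, float, float, float]


def pad_box(box: Box, margin: float, page_w: float, page_h: float) -> Box:
    """Grow a box by `margin` on every side, clamped to the page bounds."""
    x0, y0, x1, y1 = box
    return (max(0.0, x0 - margin), max(0.0, y0 - margin),
            min(page_w, x1 + margin), min(page_h, y1 + margin))


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the smaller box's area (assumed metric)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # the boxes do not intersect
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (iw * ih) / smaller if smaller > 0 else 0.0


def assign_column(box: Box, page_w: float, monolithic_pct: float) -> str:
    """Classify a box by its horizontal centre relative to the page centre.

    Assumption: a box whose centre lies within `monolithic_pct` percent of
    the page width from the page centre counts as monolithic.
    """
    centre = (box[0] + box[2]) / 2
    page_mid = page_w / 2
    tolerance = page_w * monolithic_pct / 100
    if abs(centre - page_mid) <= tolerance:
        return "monolithic"
    return "column_1" if centre < page_mid else "column_2"


# A box that pokes slightly outside a larger one still scores a high overlap.
print(overlap_ratio((0, 0, 10, 10), (1, 1, 11, 9)))
print(assign_column((40, 0, 280, 20), page_w=600, monolithic_pct=5))
```

Using an area ratio rather than a strict containment check means a small box that sticks out of a larger one by a pixel or two is still treated as a duplicate once the ratio exceeds the Box Overlap Threshold.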
- Large PDF files may require significant memory and processing time
- Complex layouts with non-standard formatting may not be perfectly preserved
- Mathematical formulas and special characters may not be accurately recognized
- Contributions are welcome! Please feel free to submit a Pull Request.
- Right now, I am also exploring saving objects like tables, charts, graphs, and formulas as images and inserting them back into the DOCX file at their approximate original location.
CC BY-NC 4.0


