generating plain text from PDF files using the ScienceBeam parser

taoo0316/plain_text_from_PDF

plain_text_from_PDF

This guide describes how to set up ScienceBeam using Docker, process PDF files to extract text, and convert ALTO XML to plain text using Python.

🚀 Step 1: Install Docker

ScienceBeam runs inside a Docker container. If you haven't installed Docker yet:

  1. Download Docker Desktop from https://www.docker.com/products/docker-desktop.
  2. Install and start Docker.
  3. Verify Docker is installed by running:
    docker --version

🏃 Step 2: Run ScienceBeam (PdfAlto Mode)

Run the following command to pull and start the ScienceBeam container:

docker run -p 8070:8070 --rm elifesciences/sciencebeam-parser

This will start the ScienceBeam server at http://localhost:8070/.
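The container can take a while to start, so it may help to wait until the port accepts connections before sending a PDF. The following is a minimal sketch, not part of this repository: `wait_for_port` is a hypothetical helper, and an open TCP port only suggests that the server is up, not that the API is ready to process files.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8070,
                  timeout: float = 60.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(1.0)
    return False
```

Calling `wait_for_port()` with its defaults checks `localhost:8070`, the address used in the `docker run` command above.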

📂 Step 3: Process a PDF File

Use curl to send a PDF file for processing:

curl -X POST "http://localhost:8070/api/pdfalto" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
     -o /Users/zwt2000/Desktop/output.json

Input: a PDF file. Output: ALTO XML. Note that the command above saves the response as output.json, although the content is XML rather than JSON.
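The same upload can be done from Python. The sketch below uses only the standard library and assumes the endpoint accepts a single multipart field named `file`, as in the curl command; `build_multipart` and `pdf_to_alto` are hypothetical helper names, not functions from this repository.

```python
import uuid
import urllib.request
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_alto(pdf_path: str, out_path: str,
                url: str = "http://localhost:8070/api/pdfalto") -> None:
    """POST the PDF to the endpoint from Step 3 and save the raw response bytes."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())
```

For example, `pdf_to_alto("yourfile.pdf", "output.json")` would mirror the curl call, with paths adjusted to your machine.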

🔄 Step 4: Convert ALTO XML to Plain Text

Since ScienceBeam returns ALTO XML, we extract the text using Python.

python3 extract_from_json.py

This will generate a plain text file (extracted_text_json.txt) on your Desktop.
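The contents of extract_from_json.py are not shown in this guide. As an illustration of what such a conversion can look like, here is a minimal sketch that relies on the standard ALTO layout (`TextBlock` > `TextLine` > `String` elements carrying a `CONTENT` attribute). The function names are hypothetical, namespaces are ignored so that any ALTO version is accepted, and hyphenation marks are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """One output line per TextLine, with a blank line between TextBlocks."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each String element holds one word in its CONTENT attribute.
            words = [
                el.get("CONTENT", "")
                for el in line
                if local_name(el.tag) == "String"
            ]
            lines.append(" ".join(w for w in words if w))
        blocks.append("\n".join(lines))
    return "\n\n".join(b for b in blocks if b)
```

To use it on the file from Step 3, read the saved response as text and pass it in, for example `alto_to_text(open("output.json", encoding="utf-8").read())`.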

Alternatively, process the PDF file as TEI XML, which preserves the section structure:

curl -X POST "http://localhost:8070/api/processFulltextDocument" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
     -F "output=tei" \
     -o /Users/zwt2000/Desktop/output.xml

The text can then be extracted from the TEI XML using Python:

python3 extract_from_xml.py

This will generate a plain text file (extracted_text_xml.txt) on your Desktop.
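The contents of extract_from_xml.py are likewise not shown here. A minimal sketch of a TEI-to-text conversion follows, assuming the response uses the standard TEI namespace and places section headings in `head` elements and paragraphs in `p` elements under `body`. `tei_to_text` is a hypothetical name, and headings of figures or tables inside the body would be included as well.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace (assumed for the output)


def tei_to_text(tei_xml: str) -> str:
    """Return headings and paragraphs from the TEI body, in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NS}}}body")
    if body is None:
        return ""
    wanted = {f"{{{TEI_NS}}}head", f"{{{TEI_NS}}}p"}
    parts = []
    for el in body.iter():
        if el.tag in wanted:
            # itertext() also collects text inside nested elements such as <ref>;
            # split/join collapses runs of whitespace and line breaks.
            text = " ".join("".join(el.itertext()).split())
            if text:
                parts.append(text)
    return "\n\n".join(parts)
```

Because headings and paragraphs are emitted in order, each section title appears directly above its text in the output.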
