plain_text_from_PDF

This guide describes how to set up ScienceBeam using Docker, process PDF files to extract text, and convert ALTO XML to plain text using Python.

🚀 Step 1: Install Docker

ScienceBeam runs inside a Docker container. If you haven’t installed Docker yet:

1. Download Docker Desktop from https://www.docker.com/products/docker-desktop.
2. Install and start Docker.
3. Verify Docker is installed by running:
```
docker --version
```

🏃 Step 2: Run ScienceBeam (PdfAlto Mode)

Run the following command to pull and start the ScienceBeam container:

```
docker run -p 8070:8070 --rm elifesciences/sciencebeam-parser
```

This will start the ScienceBeam server at http://localhost:8070/.
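The container can take a while to start, so it can help to wait until the port accepts connections before sending a PDF. The following is a minimal sketch using only the standard library; the function name `wait_for_port` and its default timeout values are illustrative assumptions, not part of ScienceBeam. An open port only shows that something is listening on 8070, not that the parser has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=8070, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: pause, then retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` with no arguments checks `localhost:8070`, the address used in the `docker run` command above.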

📂 Step 3: Process a PDF File

Use curl to send a PDF file for processing:

```
curl -X POST "http://localhost:8070/api/pdfalto" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
     -o /Users/zwt2000/Desktop/output.json
```

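Input: a PDF file. Output: ALTO XML (saved here as output.json; the content is XML despite the file extension). Replace the example paths with the location of your own PDF and the place where you want the output saved.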
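The same request can be sent from Python, for example when processing many PDFs in a loop. This is a rough sketch using only the standard library. The helper names `build_multipart` and `pdf_to_alto` are hypothetical; the URL and the `file` field name are taken from the curl command above, and the `application/pdf` part type is an assumption.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body; return (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_alto(pdf_path, out_path, url="http://localhost:8070/api/pdfalto"):
    """POST the PDF to the endpoint from the curl command and save the response."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())
```

With the container running, `pdf_to_alto("yourfile.pdf", "output.json")` should behave like the curl command, writing the server's response to the given output path.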

🔄 Step 4: Convert ALTO XML to Plain Text

Since ScienceBeam returns ALTO XML, we extract the text using Python.

```
python3 extract_from_json.py
```

This will generate a plain text file (extracted_text_json.txt) on your Desktop.
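The contents of extract_from_json.py are not shown here, so the following is only a sketch of how such a conversion might work, not the repository's script. It assumes the usual ALTO layout, in which each `TextLine` element holds `String` elements whose `CONTENT` attribute carries one word. The function names are illustrative, and namespaces are ignored so that the ALTO version does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Join the CONTENT of each String element, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Collect the words of this line; spacing (SP) elements are skipped.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)
```

`alto_to_text` accepts either a string or bytes, so the saved response can be read with `Path("output.json").read_bytes()` and passed in directly. Words hyphenated across line ends are not rejoined in this sketch.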

Alternatively: Process a PDF File as TEI XML (with sections):

```
curl -X POST "http://localhost:8070/api/processFulltextDocument" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
     -F "output=tei" \
     -o /Users/zwt2000/Desktop/output.xml
```

The text can then be extracted from the TEI XML using Python:

```
python3 extract_from_xml.py
```

This will generate a plain text file (extracted_text_xml.txt) on your Desktop.
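As with the ALTO case, extract_from_xml.py itself is not shown, so this is a hedged sketch of one possible approach rather than the repository's code. It assumes the response uses the standard TEI namespace (`http://www.tei-c.org/ns/1.0`) and that section titles appear as `head` elements and paragraphs as `p` elements inside `body`. The function name `tei_to_text` is illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the TEI output.
TEI = "{http://www.tei-c.org/ns/1.0}"


def tei_to_text(xml_string):
    """Return section headings and paragraphs from the TEI <body> as plain text."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI}body")
    if body is None:
        return ""
    blocks = []
    for element in body.iter():
        if element.tag in (f"{TEI}head", f"{TEI}p"):
            # Flatten inline markup (e.g. citation refs) and normalise whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                blocks.append(text)
    return "\n\n".join(blocks)
```

Headings and paragraphs are returned in document order, separated by blank lines, so the section structure stays visible in the plain text. Any `head` inside the body is included, which may also pick up figure or table titles.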
