This guide describes how to set up ScienceBeam using Docker, process PDF files to extract text, and convert ALTO XML to plain text using Python.
ScienceBeam runs inside a Docker container. If you haven't installed Docker yet:
- Download Docker Desktop from https://www.docker.com/products/docker-desktop.
- Install and start Docker.
- Verify Docker is installed by running:
docker --version
Run the following command to pull and start the ScienceBeam container:
docker run -p 8070:8070 --rm elifesciences/sciencebeam-parser
This will start the ScienceBeam server at http://localhost:8070/.
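The container can take a moment to start accepting requests. As a minimal sketch, the check below only tests whether something is listening on the published port (8070, from the command above); it does not confirm that the ScienceBeam API itself is ready. The function name `server_is_up` is hypothetical.

```python
import socket


def server_is_up(host: str = "localhost", port: int = 8070,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `server_is_up()` before sending a PDF gives a quick yes/no answer without needing curl.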
Use curl to send a PDF file for processing:
curl -X POST "http://localhost:8070/api/pdfalto" \
-H "Content-Type: multipart/form-data" \
-F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
-o /Users/zwt2000/Desktop/output.json
Input: a PDF file. Output: ALTO XML (saved as output.json; despite the .json extension, the file contains XML).
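The same upload can be done from Python. The sketch below uses only the standard library and builds the multipart/form-data body by hand; the endpoint URL and the `file` field name are taken from the curl command above, while the helper names `build_multipart` and `pdf_to_alto` are hypothetical.

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf"):
    """Encode one file as multipart/form-data; return (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_alto(pdf_path: str, out_path: str,
                url: str = "http://localhost:8070/api/pdfalto") -> None:
    """POST a PDF to the server and save the raw response to out_path."""
    with open(pdf_path, "rb") as f:
        body, ctype = build_multipart("file", "input.pdf", f.read())
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": ctype}, method="POST")
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

For example, `pdf_to_alto("yourfile.pdf", "output.json")` would mirror the curl call, assuming the server is running locally.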
Since ScienceBeam returns ALTO XML, we extract the text using Python.
python3 extract_from_json.py
This will generate a plain text file (extracted_text_json.txt) on your Desktop.
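The contents of extract_from_json.py are not shown in this guide. As a rough sketch of what such a script might do, the function below reads an ALTO file and joins the CONTENT attribute of each String element, one output line per TextLine and a blank line between TextBlock elements. It assumes the server's output follows the standard ALTO element names; the names `alto_to_text` and `local` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip an XML namespace, e.g. '{uri}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_source) -> str:
    """Return plain text from an ALTO file path or binary file object."""
    root = ET.parse(xml_source).getroot()
    blocks = []
    for block in root.iter():
        if local(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local(line.tag) != "TextLine":
                continue
            # Each word is a String element; its text lives in CONTENT.
            words = [s.get("CONTENT", "") for s in line
                     if local(s.tag) == "String"]
            lines.append(" ".join(words))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Matching on local tag names rather than a fixed namespace keeps the sketch independent of the ALTO schema version. The result of `alto_to_text("output.json")` could then be written to a text file.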
To get TEI XML instead, send the PDF to the processFulltextDocument endpoint with output=tei:
curl -X POST "http://localhost:8070/api/processFulltextDocument" \
-H "Content-Type: multipart/form-data" \
-F "file=@/Users/zwt2000/Desktop/paper/example/yourfile.pdf" \
-F "output=tei" \
-o /Users/zwt2000/Desktop/output.xml
We can also extract the text using Python.
python3 extract_from_xml.py
This will generate a plain text file (extracted_text_xml.txt) on your Desktop.
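The contents of extract_from_xml.py are likewise not shown here. The following is a minimal sketch of one way to pull text out of a TEI file: it collects the title from titleStmt and every paragraph under body. It assumes the output uses the standard TEI namespace (http://www.tei-c.org/ns/1.0); the function name `tei_to_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to match the server's output.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_text(xml_source) -> str:
    """Return title and body paragraphs from a TEI file path or binary file object."""
    root = ET.parse(xml_source).getroot()
    parts = []
    title = root.find(".//tei:titleStmt/tei:title", TEI)
    if title is not None:
        title_text = "".join(title.itertext()).strip()
        if title_text:
            parts.append(title_text)
    for p in root.iterfind(".//tei:body//tei:p", TEI):
        # itertext() includes text inside nested elements such as <ref>;
        # split/join collapses line breaks and repeated spaces.
        text = " ".join("".join(p.itertext()).split())
        if text:
            parts.append(text)
    return "\n\n".join(parts)
```

This sketch skips section headings, figures, and references; `tei_to_text("output.xml")` returns the paragraphs separated by blank lines.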