jw81 / pup-scraper Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Dockerized pup used for scraping HTML

0 stars 0 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md

Repository files navigation

pup-scraper

Dockerized pup used for scraping HTML

Usage

If you want to build the Docker image yourself and run it ::

run docker build -t my-pup-scraper . to build the Docker image
run docker run --rm -e URL='http://www.google.com' -e FILTER='body' my-pup-scraper to create a Docker container and run pup
- --rm
  - removes/deletes the container after it finishes running
- -e URL='http://www.google.com
  - sets an environment variable named URL to the value http://www.google.com (change this to whatever url you need to scrape)
- -e FILTER='body'
  - sets an environment variable named FILTER to the value body (change this to whatever HTML/CSS selectors you need to scrape)

Or if you want to just run the Docker image stored in DockerHub ::

run docker run --rm -e URL='http://www.google.com' -e FILTER='body' jeffreywallace81/pup-scraper

Notes

If you want to ignore all of the HTML tags and just extract the raw text, you can run the command like this ::
- docker run --rm -e URL='http://www.google.com' -e FILTER='body text{}' my-pup-scraper
For an example of a more complex HTML/CSS selector, go look at the default value for the FILTER environment variable in the Dockerfile. REF

About

Dockerized pup used for scraping HTML

Report repository

Releases

No releases published

Packages

No packages published

Languages

Dockerfile 100.0%