Dockerized pup used for scraping HTML
- run
docker build -t my-pup-scraper .
to build the Docker image
- run
docker run --rm -e URL='http://www.google.com' -e FILTER='body' my-pup-scraper
to create a Docker container and run pup
  - --rm: removes/deletes the container after it finishes running
  - -e URL='http://www.google.com': sets an environment variable named URL to the value http://www.google.com (change this to whatever URL you need to scrape)
  - -e FILTER='body': sets an environment variable named FILTER to the value body (change this to whatever HTML/CSS selectors you need to scrape)
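Since only the URL and FILTER values change between runs, the docker run command can be wrapped in a small shell function. This is a minimal sketch, not something shipped with the repository: the function name scrape and the IMAGE and DOCKER variables are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): wraps the docker run
# command so the URL and the filter are passed as arguments.
IMAGE="${IMAGE:-my-pup-scraper}"   # image tag used in the build step
DOCKER="${DOCKER:-docker}"         # set DOCKER=echo for a dry run

scrape() {
    # $1: page to fetch; $2: pup selector, defaulting to 'body'
    "$DOCKER" run --rm -e URL="$1" -e FILTER="${2:-body}" "$IMAGE"
}
```

After sourcing the file, scrape 'http://www.google.com' 'body text{}' should behave like the command typed out by hand. Setting DOCKER=echo first prints the arguments instead of starting a container, which is a cheap way to check what would be passed.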
Or, if you just want to run the Docker image stored on Docker Hub ::
- run
docker run --rm -e URL='http://www.google.com' -e FILTER='body' jeffreywallace81/pup-scraper
- If you want to ignore all of the HTML tags and just extract the raw text, you can run the command like this ::
docker run --rm -e URL='http://www.google.com' -e FILTER='body text{}' my-pup-scraper
- For an example of a more complex HTML/CSS selector, look at the default value of the FILTER environment variable in the Dockerfile.
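Besides text{}, pup also has the attr{...} and json{} display functions, and these can be passed through FILTER in the same way. The sketch below only prints the resulting commands so they can be inspected before running; the function name print_examples and the three filters are illustrative choices, not values taken from the Dockerfile.

```shell
#!/bin/sh
# Print (do not run) one docker command per example filter.
# print_examples is a hypothetical name used only for this illustration.
print_examples() {
    image="${1:-my-pup-scraper}"
    url="${2:-http://www.google.com}"
    # title text{}  : text of the <title> element
    # a attr{href}  : href attribute of every link
    # body json{}   : the <body> subtree rendered as JSON
    for filter in 'title text{}' 'a attr{href}' 'body json{}'; do
        printf "docker run --rm -e URL='%s' -e FILTER='%s' %s\n" \
            "$url" "$filter" "$image"
    done
}

print_examples
```

Each printed line can be pasted into a terminal as is. Pass a different image name as the first argument, for example jeffreywallace81/pup-scraper, to target the Docker Hub image instead of a local build.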