URL Scrub

A tool for parsing the web page at a given URL into JSON and RDF.

Setup

Dependencies

Python: 3.10
geckodriver or chromedriver

Installation Process

Install urlscrub with pip
```
python3.10 -m pip install urlscrub
```
Install geckodriver
- Download and install Firefox.
  - Linux (Ubuntu):
```
sudo apt-get install firefox
```
- Download geckodriver.zip.
- Unzip the geckodriver file (geckodriver.exe on Windows) into a directory of your choice.
- Append the directory containing geckodriver to your PATH variable (see Guides).
Install chromedriver
- Download and install Google Chrome.
- Find the version of Google Chrome you have installed.
  - Open the Google Chrome web browser.
  - Click the three vertical dots at the top right.
  - At the bottom of the dropdown, select Help, then About Google Chrome.
  - Note the version number displayed (e.g. 102.0.5005.115).
- Download the chromedriver.zip whose version number most closely matches your Chrome version.
  - An exact match is not required (e.g. chromedriver 102.0.5005.61 works with Google Chrome 102.0.5005.115).
- Unzip the chromedriver file (chromedriver.exe on Windows) into a directory of your choice.
- Append the directory containing chromedriver to your PATH variable (see Guides).
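
The "closest version" rule in the chromedriver steps can be sketched in Python. This is only an illustration and not part of urlscrub: the helper names `shared_prefix_length` and `closest_driver_version` are hypothetical, and it assumes "closest" means sharing the most leading dot-separated components with the installed Chrome version. In the usage example, only 102.0.5005.61 and 102.0.5005.115 come from the steps above; the other candidate is a placeholder.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Count how many leading dot-separated components two version strings share."""
    count = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        count += 1
    return count


def closest_driver_version(chrome_version: str, driver_versions: list[str]) -> str:
    """Pick the driver version sharing the longest leading prefix with Chrome's.

    Assumption: a longer shared prefix means a better match.
    """
    return max(driver_versions, key=lambda v: shared_prefix_length(chrome_version, v))


if __name__ == "__main__":
    chrome = "102.0.5005.115"
    # The first candidate is a placeholder for illustration only.
    candidates = ["101.0.0.0", "102.0.5005.61"]
    print(closest_driver_version(chrome, candidates))  # prints 102.0.5005.61
```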

Command Line Usage

Command:

urlscrub --skip-cookies --driver "chrome" -l "https://www.amazon.com/All-new-Kindle-Oasis-now-with-adjustable-warm-light/dp/B07GRSK3HC"

Response:

{
  "results": [
    {
      "type": "product",
      "productTitle": "Kindle Oasis \u2013 With adjustable warm light",
      "availability": "In Stock.",
      "rating": "19,734 ratings",
      "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg"
    }
  ]
}
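
The command and response above can also be driven from a script. The following is a minimal sketch, not an API provided by urlscrub: the function names `build_command`, `parse_response`, and `scrub` are hypothetical, only the flags shown in the example command are used, and it assumes the JSON response is the only thing the command writes to standard output.

```python
import json
import subprocess


def build_command(url: str, driver: str = "chrome", skip_cookies: bool = True) -> list[str]:
    """Build the argument list using only the flags shown in the example command."""
    cmd = ["urlscrub"]
    if skip_cookies:
        cmd.append("--skip-cookies")
    cmd += ["--driver", driver, "-l", url]
    return cmd


def parse_response(text: str) -> list[dict]:
    """Return the list stored under the "results" key of a JSON response."""
    return json.loads(text)["results"]


def scrub(url: str, driver: str = "chrome") -> list[dict]:
    """Run urlscrub (it must be installed and on PATH) and parse its output.

    Assumption: stdout contains nothing but the JSON response.
    """
    completed = subprocess.run(
        build_command(url, driver), capture_output=True, text=True, check=True
    )
    return parse_response(completed.stdout)
```

Calling `scrub(url)` with the product URL from the example would then return a list of dictionaries, each with keys such as `type` and `productTitle`, provided the output has the shape shown above.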

Guides

Appending directories to your PATH environment variable.
- Windows Guide
- Linux:
  - Add the following line to your .bashrc or .zshrc, then open a new shell:
```
export PATH="<geckodriver_dir>/:$PATH"
```
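
After updating PATH, it can be useful to confirm that the drivers are actually discoverable. A small sketch using Python's standard `shutil.which`; the helper name `find_drivers` is hypothetical and is not part of urlscrub.

```python
import shutil


def find_drivers(names=("geckodriver", "chromedriver")) -> dict:
    """Map each driver name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Only one of the two drivers needs to be found, since the dependencies list either geckodriver or chromedriver.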
Guide to install VcXsrv for running Firefox on WSL2
