PyWebScraper

A simple web scraper built on the BeautifulSoup library. The project is in early development and was initially written for personal use.

Features

Scrape a website's content
Save the content to a file in Markdown or HTML format
Download the images found in the content
Get all image links in the website content
Get all links in the website content, with or without relative links

Requirements

Python 3.10+

Installation

To install the package, run the following command:

pip install git+https://github.com/fadhilyori/pywebscraper.git

Usage

Initialize the PyWebScraper class

from pywebscraper import PyWebScraper

url = 'https://www.example.com'

scraper = PyWebScraper(url)

Note:
By default, files are saved to the output directory.

Scrape website content and save it to a file in Markdown format

output_file = 'output.md'
scraper.save_markdown(filename=output_file)

Scrape website content and save it to a file in HTML format

output_file = 'output.html'
scraper.save_content_html(filename=output_file)

Scrape website content and download its images

output_file = 'output.md'
scraper.save_markdown(filename=output_file, download_images=True)

Get the website Markdown content in a string

content = scraper.get_content_markdown()
print(content)

Example output:

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)

Get all images in the website content

images = scraper.extract_images()
print(images)

Example output:

[
    ('alt_text1', 'https://www.example.com/image1.jpg'),
    ('alt_text2', 'https://www.example.com/image2.jpg'),
]
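To illustrate the shape of this result, here is a minimal sketch of how (alt text, URL) pairs could be collected from a page. It is not PyWebScraper's implementation: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, and the names ImageCollector, collect_images, and sample_html are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect (alt_text, absolute_url) pairs from <img> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:  # skip <img> tags that have no source
            alt = attributes.get("alt") or ""
            # Resolve relative sources against the page URL.
            self.images.append((alt, urljoin(self.base_url, src)))


def collect_images(html, base_url):
    """Return every image in `html` as an (alt_text, absolute_url) tuple."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


# Assumed sample markup, not taken from a real page.
sample_html = '<p><img alt="alt_text1" src="/image1.jpg"><img src="image2.jpg"></p>'
print(collect_images(sample_html, "https://www.example.com/"))
```

An image without an alt attribute is kept with an empty string as its alt text in this sketch; the library may handle that case differently.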

Get all the links in the content (including the relative links)

links = scraper.extract_links()
print(links)

Example output:

[
    # External links
    'https://www.example.org/about',

    # Relative links
    'https://www.example.com/page3',    # original: /page3
    'https://www.example.com/#section', # original: #section
    'https://www.example.com/?search=python', # original: ?search=python
]
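The resolved URLs in the example output match what the standard library's urllib.parse.urljoin produces. The sketch below reproduces that mapping; whether PyWebScraper uses urljoin internally is an assumption, and the base URL and href values are the illustrative ones from the example.

```python
from urllib.parse import urljoin

# Page URL that relative links are resolved against (illustrative value).
base_url = "https://www.example.com/"

# Raw href values as they might appear in the page markup.
hrefs = [
    "https://www.example.org/about",  # already absolute, left unchanged
    "/page3",                         # path relative to the site root
    "#section",                       # fragment on the current page
    "?search=python",                 # query string on the current page
]

resolved = [urljoin(base_url, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(f"{href!r} -> {url!r}")
```

The trailing slash on the base URL matters here: it gives the base a path, so the fragment and query links resolve to `https://www.example.com/#section` and `https://www.example.com/?search=python`.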

Get all the links in the content (excluding the relative links)

links = scraper.extract_links(include_relative=False)
print(links)

Example output:

[
    'https://www.example.org/about',
]
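One way to tell absolute links from relative ones is to check whether the raw href carries both a scheme and a host. The sketch below shows that check with urllib.parse.urlparse; the helper is_absolute is hypothetical, and how PyWebScraper itself decides which links count as relative is an assumption.

```python
from urllib.parse import urlparse


def is_absolute(href):
    """Return True when the href has both a scheme and a network location."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


# Raw href values as they might appear in the page markup (illustrative).
hrefs = [
    "https://www.example.org/about",
    "/page3",
    "#section",
    "?search=python",
]

absolute_only = [href for href in hrefs if is_absolute(href)]
print(absolute_only)
```

Note that this check treats protocol-relative links such as `//cdn.example.com/x` as relative, because they have a host but no scheme.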

License

This project is licensed under the MIT License - see the LICENSE file for details.
