Skip to content

A simple web scraper that uses the BeautifulSoup library to scrape the web. This project is still early in development and initially for personal use.

License

Notifications You must be signed in to change notification settings

fadhilyori/pywebscraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyWebScraper

A simple web scraper that uses the BeautifulSoup library to scrape the web. This project is still early in development and initially for personal use.

Features

  • Scrape a website content
  • Save to a file in Markdown or HTML format
  • Image download support
  • Get all images links in the website content
  • Get all links in the website content with or without the relative links

Requirements

  • Python 3.10+

Installation

To install the package, run the following command:

pip install git+https://github.com/fadhilyori/pywebscraper.git

Usage

Initialize the PyWebScraper class

from pywebscraper import PyWebScraper

url = 'https://www.example.com'

scraper = PyWebScraper(url)

Note:
The default output directory is in the output directory.

Scrape a website content and save to a file in Markdown format

output_file = 'output.md'
scraper.save_markdown(filename=output_file)

Scrape a website content and save to a file in HTML format

output_file = 'output.html'
scraper.save_content_html(filename='content.html')

Scrape a website content and download images

output_file = 'output.md'
scraper.save_markdown(filename=output_file, download_images=True)

Get the website Markdown content in a string

content = scraper.get_content_markdown()
print(content)

Example output:

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)

Get all images in the website content

images = scraper.extract_images()
print(images)

Example output:

[
    ('alt_text1', 'https://www.example.com/image1.jpg'),
    ('alt_text2', 'https://www.example.com/image2.jpg'),
]

Get all the links in the content (including the relative links)

links = scraper.extract_links()
print(links)

Example output:

[
    # External links
    'https://www.example.org/about',

    # Relative links
    'https://www.example.com/page3',    # original: /page3
    'https://www.example.com/#section', # original: # #section
    'https://www.example.com/?search=python', # original: # ?search=python
]

Get all the links in the content (exclude the relative links)

links = scraper.extract_links(include_relative=False)
print(links)

Example output:

[
    'https://www.example.org/about',
]

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A simple web scraper that uses the BeautifulSoup library to scrape the web. This project is still early in development and initially for personal use.

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

  •  
  •  

Languages