A simple web scraper built on the BeautifulSoup library. This project is in early development and was initially written for personal use.
- Scrape a website's content
- Save to a file in Markdown or HTML format
- Image download support
- Get all image links in the website content
- Get all links in the website content, with or without relative links

Requirements:
- Python 3.10+
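
The link and image features listed above can be illustrated with a minimal, standard-library-only sketch. It is not pywebscraper's implementation: `LinkAndImageCollector`, `collect`, `SAMPLE_HTML`, and `BASE_URL` are hypothetical names chosen for this example, and treating every `href` without a host as "relative" is a simplifying assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndImageCollector(HTMLParser):
    """Collects raw href values from <a> tags and (alt, src) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.hrefs.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append((attributes.get("alt") or "", attributes["src"]))


def collect(html, base_url, include_relative=True):
    """Return (links, images) with every URL resolved against base_url."""
    parser = LinkAndImageCollector()
    parser.feed(html)

    links = []
    for href in parser.hrefs:
        # Assumption: an href without a host part counts as relative.
        is_relative = not urlparse(href).netloc
        if is_relative and not include_relative:
            continue
        links.append(urljoin(base_url, href))

    images = [(alt, urljoin(base_url, src)) for alt, src in parser.images]
    return links, images


# Hypothetical sample page, mirroring the kinds of links a real page may contain.
SAMPLE_HTML = """
<a href="https://www.example.org/about">About</a>
<a href="/page3">Page 3</a>
<a href="#section">Section</a>
<a href="?search=python">Search</a>
<img alt="alt_text1" src="/image1.jpg">
"""
BASE_URL = "https://www.example.com/"

all_links, images = collect(SAMPLE_HTML, BASE_URL)
external_only, _ = collect(SAMPLE_HTML, BASE_URL, include_relative=False)

print(all_links)
print(external_only)
print(images)
```

Resolving with `urljoin` turns `/page3`, `#section`, and `?search=python` into absolute URLs, while passing `include_relative=False` to this sketch keeps only links that already carry a host.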
To install the package, run the following command:
pip install git+https://github.com/fadhilyori/pywebscraper.git

from pywebscraper import PyWebScraper
url = 'https://www.example.com'
scraper = PyWebScraper(url)

Note: By default, files are saved in the `output` directory.
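
As a rough illustration of saving into a default output directory, the sketch below writes text to a file under a directory that is created on demand. It uses only the standard library and is an assumption about the general pattern, not pywebscraper's actual code; `save_text` and its `output_dir` parameter are hypothetical.

```python
import tempfile
from pathlib import Path


def save_text(content, filename, output_dir="output"):
    """Write content to output_dir/filename, creating the directory if needed."""
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_text(content, encoding="utf-8")
    return target


# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("# Example Domain\n", "output.md", output_dir=Path(tmp) / "output")
    saved_name = saved.name
    saved_parent = saved.parent.name
    saved_content = saved.read_text(encoding="utf-8")

print(saved_parent, saved_name)
```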
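# Save the scraped content as Markdown
output_file = 'output.md'
scraper.save_markdown(filename=output_file)

# Save the scraped content as HTML
output_file = 'output.html'
scraper.save_content_html(filename=output_file)

# Save as Markdown and download images as well
output_file = 'output.md'
scraper.save_markdown(filename=output_file, download_images=True)

# Get the content as a Markdown string
content = scraper.get_content_markdown()
print(content)

Example output: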
# Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)

images = scraper.extract_images()
print(images)

Example output:
[
('alt_text1', 'https://www.example.com/image1.jpg'),
('alt_text2', 'https://www.example.com/image2.jpg'),
]

links = scraper.extract_links()
print(links)

Example output:
[
# External links
'https://www.example.org/about',
# Relative links
'https://www.example.com/page3', # original: /page3
'https://www.example.com/#section', # original: #section
'https://www.example.com/?search=python', # original: ?search=python
]

links = scraper.extract_links(include_relative=False)
print(links)

Example output:
[
'https://www.example.org/about',
]

This project is licensed under the MIT License - see the LICENSE file for details.