This repository contains a set of Python scripts and an HTML viewer designed to scrape all links from a blog, filter them to find potential download links, and display the results in a clean, user-friendly interface.
The project is a complete, three-stage workflow:
- Scrape (`scrape.py`): Asynchronously fetches all post URLs from a blog's sitemap and scrapes every link from each page.
- Clean (`clean.py`): Uses the high-performance Polars library to process the raw scraped data, filtering out unwanted links based on a comprehensive ignore list.
- Visualize (`html_csv.html`): An interactive, browser-based tool to load the final CSV and view the cleaned links, neatly grouped by their original blog post.
The entire process is designed to be linear and easy to follow. Each script generates an output file that becomes the input for the next step.
```mermaid
graph TD
    A[scrape.py] --> B(scraped_links_async.csv);
    B --> C[clean.py];
    C --> D(filtered_download_links.csv);
    D --> E[html_csv.html];

    subgraph "1. Scraping"
        A
    end
    subgraph "2. Cleaning"
        C
    end
    subgraph "3. Viewing"
        E
    end
```
- Asynchronous Scraping: Uses `httpx` and `asyncio` for fast, concurrent scraping of hundreds of pages without getting blocked.
- Powerful Filtering: Leverages `polars` for high-speed data manipulation and regex-based filtering to remove unwanted links (internal links, social media, ad-fly, etc.).
- Interactive Viewer: A zero-dependency (besides your browser) HTML file that parses and displays the final data in a clean, card-based layout.
- Easy Dependency Management: Uses `uv`, the extremely fast Python package installer and resolver.
- Highly Customizable: Easily change the target sitemap, fine-tune the ignore list, and adjust scraping performance settings.
Before you begin, ensure you have the following installed:
- Python 3.12 or newer.
- `uv`, the fast Python package manager. You can install it with:

  ```bash
  # On macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows
  irm https://astral.sh/uv/install.ps1 | iex
  ```
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd <your-repo-directory>
  ```

- Install dependencies using `uv`:

  ```bash
  uv sync
  ```

  This command creates a virtual environment and installs all the packages listed in `pyproject.toml` from the `uv.lock` file, ensuring a reproducible setup.
That's it! Your environment is ready.
Run the scripts in order. The magic of `uv` is that you don't even need to manually activate the virtual environment! `uv run` handles it for you.
This script reads the sitemap, finds all post URLs, and scrapes every single link from them. This may take a few minutes depending on the number of posts.
```bash
uv run python scrape.py
```

- Input: `SITEMAP_URL` defined inside `scrape.py`.
- Output: A new file named `scraped_links_async.csv` containing the raw, unfiltered data.
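For orientation, the scraping stage boils down to something like the sketch below. This is a minimal illustration, not a copy of `scrape.py`: the function names, CSV column names, and default values shown here are assumptions.

```python
# Minimal sketch of the scraping flow (illustrative names and values, not scrape.py's actual code).
import asyncio
import csv

import httpx
from bs4 import BeautifulSoup

SITEMAP_URL = "https://example-blog.blogspot.com/sitemap.xml"  # placeholder
CONCURRENCY_LIMIT = 10  # assumed default

async def fetch_post_links(client, sem, url):
    """Fetch one post and return (post_url, link) pairs for every <a href> on the page."""
    async with sem:
        resp = await client.get(url, follow_redirects=True)
    soup = BeautifulSoup(resp.text, "html.parser")
    return [(url, a["href"]) for a in soup.find_all("a", href=True)]

async def main():
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    async with httpx.AsyncClient(timeout=30) as client:
        # The sitemap lists every post URL inside <loc> elements.
        sitemap = await client.get(SITEMAP_URL)
        post_urls = [loc.text for loc in BeautifulSoup(sitemap.text, "html.parser").find_all("loc")]
        results = await asyncio.gather(*(fetch_post_links(client, sem, u) for u in post_urls))

    # Column names are assumed; check the header of the CSV that scrape.py actually writes.
    with open("scraped_links_async.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["post_url", "link"])
        for rows in results:
            writer.writerows(rows)

asyncio.run(main())
```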
This script takes the raw data and applies the filtering rules from the `IGNORE_LIST`.
```bash
uv run python clean.py
```

- Input: `scraped_links_async.csv`.
- Output: A new file named `filtered_download_links.csv` containing only the relevant links.
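Conceptually, the cleaning step is a single Polars filter. The sketch below shows the idea with a trimmed ignore list; the column name `link` and the regex construction are assumptions about how `clean.py` is written.

```python
# Sketch of the Polars filtering idea; the "link" column name and regex approach are assumed.
import re

import polars as pl

IGNORE_LIST = ["blogger.com", "google.com"]  # trimmed example; the real list is much longer

df = pl.read_csv("scraped_links_async.csv")

# Combine every ignored substring into one regex and keep only rows that do NOT match it.
ignore_pattern = "|".join(re.escape(item) for item in IGNORE_LIST)
filtered = df.filter(~pl.col("link").str.contains(ignore_pattern))

filtered.write_csv("filtered_download_links.csv")
```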
Now you can visualize the clean data.
- Open the `html_csv.html` file in your web browser (e.g., Chrome, Firefox, Safari).
- Click the "Choose File" button.
- Select the `filtered_download_links.csv` file you just created.
The page will instantly populate with cards, each representing a blog post, listing all the potential download links found within it.
This project is designed to be easily adapted for other blogs or different filtering needs.
To scrape a different blog, simply change the `SITEMAP_URL` variable in `scrape.py`:
```python
# scrape.py

# The sitemap URL for the blogspot blog
SITEMAP_URL = "https://some-other-blog.blogspot.com/sitemap.xml"
```

You can also adjust the `CONCURRENCY_LIMIT` and `RANDOM_DELAY_RANGE` to be more or less aggressive in your scraping.
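How those two settings interact is roughly as follows (a hypothetical illustration, not lifted from `scrape.py`): the semaphore caps how many requests run at once, and the random delay spaces requests out so the scraper looks less like a bot.

```python
# Hypothetical illustration of how CONCURRENCY_LIMIT and RANDOM_DELAY_RANGE could be applied.
import asyncio
import random

import httpx

CONCURRENCY_LIMIT = 5            # maximum simultaneous requests
RANDOM_DELAY_RANGE = (0.5, 2.0)  # seconds to wait before each request

async def polite_get(client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str) -> httpx.Response:
    async with sem:
        # Sleep a random amount inside the semaphore so requests are both capped and spaced out.
        await asyncio.sleep(random.uniform(*RANDOM_DELAY_RANGE))
        return await client.get(url)
```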
The most powerful customization is in the `clean.py` script. The `IGNORE_LIST` is a list of strings; any link containing one of these strings will be removed.
To add a new domain or pattern to ignore, just add it to the list:
```python
# clean.py

IGNORE_LIST = [
    # ... existing rules
    "elrincondelkitsune.blogspot.com",
    "blogger.com",
    "google.com",
    # Add your new rule here, for example:
    "some-other-unwanted-site.com",
    "specific-page.html",  # Can also be a partial URL
]
```

- `pyproject.toml`: Project definition file, listing dependencies for `uv`.
- `uv.lock`: A lockfile generated by `uv` for reproducible installations.
- `scrape.py`: The first step. Asynchronously scrapes a blog's sitemap for all links.
- `clean.py`: The second step. Filters the raw links using Polars and a comprehensive ignore list.
- `html_csv.html`: The final step. A local webpage to visualize the filtered CSV data.
- `read.py`: A simple utility script to quickly test-read the final CSV in the terminal.
- `README.md`: This file.
- httpx: A modern, async-capable HTTP client used for all network requests in `scrape.py`.
- BeautifulSoup4: Used to parse HTML and extract links from the scraped pages.
- Polars: An extremely fast DataFrame library (written in Rust) used in `clean.py` for efficient data cleaning and filtering.
- tqdm: Provides a simple and elegant progress bar for the asynchronous scraping process.
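For what it's worth, tqdm ships an asyncio-aware wrapper, so a progress bar over the concurrent fetches can be nearly free; the snippet below is illustrative and not necessarily how `scrape.py` wires it up.

```python
# Illustrative use of tqdm's asyncio wrapper to show progress over concurrent tasks.
import asyncio

from tqdm.asyncio import tqdm

async def work(i: int) -> int:
    await asyncio.sleep(0.1)
    return i

async def main() -> None:
    # tqdm.gather behaves like asyncio.gather but draws a progress bar as tasks finish.
    results = await tqdm.gather(*(work(i) for i in range(50)))
    print(f"completed {len(results)} tasks")

asyncio.run(main())
```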