VOA Dari Web Scraper

A Python-based web scraping project for extracting and processing articles from darivoa.com. The project includes tools for downloading articles, extracting clean text content, and formatting text with proper handling of Persian/Arabic text.

Project Structure

The project consists of four main scripts:

articleDownloader.py: Downloads archived article HTML for each Wayback Machine snapshot URL
textExtracter.py: Extracts clean text content from HTML files
format_text.py: Formats and segments the extracted text
wayback_fetcher.py: (Debugging tool) Used to check the availability of Wayback Machine snapshots without downloading them

Setup

Virtual Environment

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On Unix or macOS:
source venv/bin/activate

# Install requirements
pip install -r requirements.txt

Component Details

Article Downloader

articleDownloader.py downloads the archived HTML from the Wayback Machine:

Creates 'downloaded_html/date' directory to store raw HTML files
Iterates over snapshot URLs from wayback_fetcher.py
Downloads each HTML file, retrying up to 3 times if there are connection issues
Saves each file in UTF-8 encoding for Persian text
Logs successful and failed downloads
Implements timeouts and delays to prevent server overload
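The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in articleDownloader.py: the function names fetch_html, download_with_retries and save_html, and the 30-second timeout and 2-second delay, are assumptions made for the example.

```python
import time
import urllib.request
from pathlib import Path
from urllib.error import URLError


def fetch_html(url, timeout=30):
    """Fetch one URL and return its body decoded as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_with_retries(url, fetch=fetch_html, retries=3, delay=2.0):
    """Try to fetch `url` up to `retries` times, pausing between attempts.

    Returns the HTML text, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except (URLError, TimeoutError, ConnectionError) as exc:
            # Log the failure; a real script might use the logging module.
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay)
    return None


def save_html(html, path):
    """Write HTML to `path` as UTF-8 so Persian text survives intact."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
```

Passing the fetch function as a parameter keeps the retry loop easy to exercise without touching the network.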

Text Extractor

textExtracter.py processes the downloaded HTML:

Creates 'article_texts' directory for cleaned content
Processes HTML files from 'downloaded_html/date' directory
Removes unwanted elements (scripts, styles, nav menus)
Extracts text from relevant HTML tags
Cleans whitespace and formatting
Saves cleaned text to separate files
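As an illustration of the extraction steps above, here is a small sketch built on the standard library's html.parser. It is an assumption-laden stand-in: textExtracter.py may well use a different parser, and the class name ArticleTextParser, the function extract_text, and the choice of which tags to skip or keep are hypothetical.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect text from content tags while ignoring scripts, styles and nav."""

    SKIP_TAGS = {"script", "style", "nav"}   # assumed "unwanted" elements
    KEEP_TAGS = {"p", "h1", "h2", "h3"}      # assumed "relevant" tags

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside an unwanted element
        self.keep_depth = 0   # > 0 while inside a relevant tag
        self.current = []     # text pieces of the block being read
        self.blocks = []      # finished, whitespace-cleaned blocks

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.KEEP_TAGS:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.KEEP_TAGS and self.keep_depth > 0:
            self.keep_depth -= 1
            if self.keep_depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self.current).split())
                if text:
                    self.blocks.append(text)
                self.current = []

    def handle_data(self, data):
        if self.skip_depth == 0 and self.keep_depth > 0:
            self.current.append(data)


def extract_text(html):
    """Return the cleaned article text, one block per line."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)
```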

Text Formatter

format_text.py provides final text processing:

Uses NLTK for sentence segmentation
Preserves Persian/Arabic characters (Unicode range \u0600-\u06FF)
Maintains punctuation in both English and Persian
Removes unwanted special characters
Formats text with one sentence per line
Preserves proper spacing and formatting
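The character filtering and one-sentence-per-line layout can be sketched as below. This is only an approximation: format_text.py uses NLTK for sentence segmentation, whereas this example substitutes a simple regular-expression split so that it runs without extra downloads. The names UNWANTED, SENTENCE_END and format_text, and the exact set of punctuation kept, are assumptions.

```python
import re

# Keep Persian/Arabic characters (U+0600-U+06FF), ASCII letters and digits,
# whitespace, and common punctuation; every other character is dropped.
UNWANTED = re.compile(r"[^\u0600-\u06FFA-Za-z0-9\s.,!?:;()\"'-]")

# Split on whitespace that follows ., !, ? or the Arabic question mark.
# This is a stand-in for NLTK's sentence tokenizer, not a replacement for it.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def format_text(text):
    """Strip unwanted characters and return one sentence per line."""
    cleaned = UNWANTED.sub("", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    sentences = [s for s in SENTENCE_END.split(cleaned) if s]
    return "\n".join(sentences)
```

Persian punctuation such as the Arabic comma (U+060C) and question mark (U+061F) lies inside the preserved range, so it survives the filter. The zero-width non-joiner (U+200C) lies outside it and would be removed by this sketch.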

Wayback Machine Fetcher

wayback_fetcher.py queries the Wayback Machine for available snapshots:

Queries the Wayback Machine for archived URLs
Lists found snapshots and their HTTP status codes
Does NOT download any files (for debugging purposes only)
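One way to list snapshots without downloading them is the Wayback Machine's CDX search endpoint, which returns JSON rows with a header row first. The sketch below only builds the query URL and interprets such rows. Whether wayback_fetcher.py uses this endpoint is an assumption, and the helper names build_cdx_query, list_snapshots and only_ok are hypothetical.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, start, end):
    """Build a CDX query URL for captures of `site` between two YYYYMMDD dates."""
    params = {
        "url": site,
        "matchType": "prefix",   # include pages under the given URL prefix
        "from": start,
        "to": end,
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def list_snapshots(rows):
    """Turn CDX JSON rows into (snapshot_url, status_code) pairs.

    The first row of the JSON output names the columns.
    """
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    original = header.index("original")
    status = header.index("statuscode")
    return [
        (f"https://web.archive.org/web/{r[ts]}/{r[original]}", r[status])
        for r in records
    ]


def only_ok(snapshots):
    """Keep only snapshots that were captured with HTTP 200."""
    return [snap for snap in snapshots if snap[1] == "200"]
```

The JSON body of the query would be fetched and decoded separately, for example with urllib.request and json.loads, before being passed to list_snapshots.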

Usage

Set up the virtual environment and install requirements
Run the scripts in sequence:

python articleDownloader.py  # Downloads HTML from the Wayback Machine
# You will be prompted for a start date and an end date (both in YYYYMMDD format). The script then fetches and downloads all available articles within that range.
python textExtracter.py      # Extracts clean article text
python format_text.py        # Formats the extracted text
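The date prompt expects YYYYMMDD values and accepts only past dates. A check along those lines might look like the sketch below. The function names parse_past_date and parse_range are hypothetical, and treating today's date as acceptable is an assumption; the script's own rule may differ.

```python
from datetime import date, datetime


def parse_past_date(text, today=None):
    """Parse a YYYYMMDD string and reject malformed or future dates."""
    today = today or date.today()
    # strptime alone is lenient about digit counts, so check the shape first.
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"{text!r} is not in YYYYMMDD format")
    try:
        parsed = datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        raise ValueError(f"{text!r} is not a valid calendar date") from None
    if parsed > today:
        raise ValueError(f"{text!r} is in the future")
    return parsed


def parse_range(start_text, end_text, today=None):
    """Validate both ends of the range and check their order."""
    start = parse_past_date(start_text, today)
    end = parse_past_date(end_text, today)
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end
```

The optional today parameter exists only so the check can be exercised against a fixed reference date.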

Output Structure

NLP-Article-Extractor/
├── downloaded_html/    # Raw HTML files from Wayback Machine
│   └── YYYY-MM-DD/
│       └── article_timestamp.html
├── articles/           # Extracted clean text from HTML
│   └── YYYY-MM-DD/
│       └── article_timestamp.txt
├── article_texts/      # Fully formatted and segmented text
│   └── YYYY-MM-DD/
│       └── article_timestamp_formatted.txt
└── scripts/            # Python scripts for each processing step

Features and Enhancements

Data Validation: Accepts only past dates as input.
Wayback Machine Support: Fetches and downloads archived versions of articles, keeping only HTTP 200 snapshots to avoid broken downloads.
Error Handling: Implements retry logic for failed downloads.
Batch Processing: Organizes files by date for efficient storage and access.
Persian/Arabic Text Processing: Ensures proper text extraction and formatting.

Important Notes

The scripts include delays between requests to avoid overwhelming the server.
All text is processed with UTF-8 encoding for proper Persian text handling.
The NLTK library will download required data on first run.
Ensure you have proper permissions to create directories and files.
