
πŸ” Roshan Saba - AI-Powered News Scraper

A modern, scalable news scraping and aggregation platform built with Django REST Framework and powered by Scrapy. Roshan Saba intelligently collects, processes, and delivers news from multiple sources with advanced filtering and task scheduling capabilities.

Website: roshan-ai.ir


📋 Table of Contents

  • Features
  • Technology Stack
  • Project Structure
  • Prerequisites
  • Installation
  • Configuration
  • Usage
  • API Documentation
  • Docker Deployment
  • Celery Task Scheduling
  • Monitoring
  • Development
  • Troubleshooting
  • Dependencies Overview
  • Contributing
  • License

✨ Features

Core Capabilities

  • πŸ•·οΈ Multi-Source Web Scraping - Integrated Scrapy spiders for efficient data collection
  • πŸ“‘ RESTful API - Complete REST API for news management and retrieval
  • πŸ”„ Asynchronous Task Queue - Powered by Celery with Redis for background processing
  • ⏰ Scheduled Scraping - Django-Celery-Beat for periodic scraping tasks
  • πŸ” Secure - Authentication, encryption, and secure data handling
  • πŸ“Š Advanced Filtering - Django-Filter for flexible news filtering and search
  • πŸš€ Scalable Architecture - Containerized with Docker for easy deployment
  • πŸ“ˆ Error Tracking - Integrated Sentry for production monitoring
  • 🌐 Real-time Updates - WebSocket-ready architecture for live news feeds
  • πŸ›’οΈ Database Support - MySQL backend for reliable data persistence

🛠 Technology Stack

Backend

  • Django 4.2+ - Web framework
  • Django REST Framework 3.15+ - RESTful API development
  • Scrapy 2.11+ - Web scraping framework
  • Celery 5.3+ - Distributed task queue
  • Django-Celery-Beat 2.5+ - Periodic task scheduler

Database & Cache

  • MySQL - Primary data store
  • Redis 5.0+ - Caching and Celery broker

Infrastructure

  • Docker - Containerization
  • Gunicorn - WSGI application server
  • Python 3.13 - Runtime environment

Additional Tools

  • Selenium 4.18+ - Browser automation for JavaScript-heavy websites
  • Faker 24.0+ - Data generation for testing
  • DRF Spectacular - API schema generation
  • Flower 2.0+ - Celery monitoring dashboard
  • Sentry - Error tracking and monitoring
  • Khayyam - Persian date/time utilities

πŸ“ Project Structure

roshan-saba/
├── news_api/                # API application
│   ├── models.py            # Data models
│   ├── views.py             # API endpoints
│   ├── serializers.py       # DRF serializers
│   ├── filters.py           # Django filters
│   └── tasks.py             # Celery tasks
├── saba/                    # Core Django project
│   ├── settings.py          # Django settings
│   ├── urls.py              # URL routing
│   ├── celery.py            # Celery configuration
│   └── wsgi.py              # WSGI configuration
├── scrapers/                # Scrapy spiders
│   ├── spiders/             # Individual scrapers
│   ├── pipelines.py         # Data processing pipelines
│   └── settings.py          # Scrapy configuration
├── static/                  # Static files
├── docker-compose.yml       # Multi-container setup
├── Dockerfile               # Container image definition
├── entrypoint.sh            # Container startup script
├── requirements.txt         # Python dependencies
├── manage.py                # Django management
└── README.md                # This file

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Python 3.10+
  • Docker & Docker Compose (for containerized deployment)
  • Git

Optional but recommended:

  • MySQL Server (for local development without Docker)
  • Redis Server (for local Celery testing)

🚀 Installation

Option 1: Local Development Setup

1. Clone the Repository

git clone https://github.com/dwin-gharibi/roshan-saba.git
cd roshan-saba

2. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Configure Environment

Create a .env file in the project root:

# Django Settings
DEBUG=True
SECRET_KEY=your-secret-key-here
ALLOWED_HOSTS=localhost,127.0.0.1

# Database
DB_ENGINE=django.db.backends.mysql
DB_NAME=roshan_saba
DB_USER=root
DB_PASSWORD=your_password
DB_HOST=127.0.0.1
DB_PORT=3306

# Redis/Celery
REDIS_URL=redis://127.0.0.1:6379/0
CELERY_BROKER_URL=redis://127.0.0.1:6379/0
CELERY_RESULT_BACKEND=redis://127.0.0.1:6379/0

# Sentry (Optional)
SENTRY_DSN=your-sentry-dsn

# Scrapy Settings
SCRAPY_DOWNLOAD_DELAY=2
SCRAPY_CONCURRENT_REQUESTS=16
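
These variables are typically read in saba/settings.py through os.environ. A minimal sketch of that pattern (the env and env_bool helpers here are illustrative, not the project's actual code):

```python
import os

def env(name, default=None, cast=str):
    """Read a configuration value from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return cast(raw)

def env_bool(name, default=False):
    """Interpret common truthy strings ('1', 'true', 'yes') as booleans."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Example usage mirroring the .env template above
DEBUG = env_bool("DEBUG")
ALLOWED_HOSTS = env("ALLOWED_HOSTS", default=[], cast=lambda v: v.split(","))
SCRAPY_DOWNLOAD_DELAY = env("SCRAPY_DOWNLOAD_DELAY", default=2, cast=int)
```

Keeping all environment access behind one helper makes missing or malformed values fail in a single, predictable place.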

5. Initialize Database

python manage.py migrate
python manage.py createsuperuser

6. Collect Static Files

python manage.py collectstatic --noinput

Option 2: Docker Deployment (Recommended)

1. Clone the Repository

git clone https://github.com/dwin-gharibi/roshan-saba.git
cd roshan-saba

2. Configure Environment

Create a .env file (see the template above).

3. Build and Run with Docker Compose

docker-compose up -d

This will start:

  • Django application (port 8000)
  • MySQL database
  • Redis cache
  • Celery worker
  • Celery Beat scheduler
  • Flower monitoring dashboard (port 5555)

4. Initialize Database

docker-compose exec web python manage.py migrate
docker-compose exec web python manage.py createsuperuser

βš™οΈ Configuration

Django Settings

Key configuration options in saba/settings.py:

# Logging
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': 'logs/news_scraper.log',
        },
    },
    'root': {
        'handlers': ['file'],  # route all loggers to the file handler
        'level': 'INFO',
    },
}

# REST Framework
REST_FRAMEWORK = {
    'DEFAULT_FILTER_BACKENDS': ['django_filters.rest_framework.DjangoFilterBackend'],
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 20,
}

Celery Configuration

Located in saba/celery.py:

from celery.schedules import crontab

app.conf.beat_schedule = {
    'scrape-news-every-hour': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(minute=0),  # at minute 0 of every hour
    },
}

Scrapy Spiders

Custom spiders should be created in scrapers/spiders/ directory. Example:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    allowed_domains = ['news-site.com']
    start_urls = ['https://news-site.com']

    def parse(self, response):
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'url': article.css('a::attr(href)').get(),
                # getall() collects every paragraph, not just the first
                'content': ' '.join(article.css('p::text').getall()),
            }
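
Items yielded by a spider pass through the pipelines in scrapers/pipelines.py before being persisted. A dependency-free sketch of a cleaning step (the class and field names are illustrative; a real Scrapy pipeline would raise scrapy.exceptions.DropItem instead of ValueError and would typically save valid items to the Article model):

```python
import re

def clean_text(value):
    """Collapse runs of whitespace and strip the result."""
    return re.sub(r"\s+", " ", value or "").strip()

class NewsCleaningPipeline:
    """Normalizes scraped items; mirrors Scrapy's process_item(item, spider) hook."""

    def process_item(self, item, spider):
        item["title"] = clean_text(item.get("title"))
        item["content"] = clean_text(item.get("content"))
        if not item["title"] or not item.get("url"):
            # In a real Scrapy project: raise scrapy.exceptions.DropItem(...)
            raise ValueError("missing title or url")
        return item
```

Pipelines keep validation out of the spiders, so every spider's output is normalized the same way.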

💻 Usage

Starting Development Server

# Run Django development server
python manage.py runserver

# In another terminal, start Celery worker
celery -A saba worker -l info

# In another terminal, start Celery Beat
celery -A saba beat -l info

Running Scrapers

# Run a specific Scrapy spider
scrapy crawl news_spider

# Run Celery task
python manage.py shell
>>> from news_api.tasks import scrape_news
>>> scrape_news.delay()

Accessing the Application

  • API root: http://localhost:8000/api/
  • Django admin: http://localhost:8000/admin/
  • Flower dashboard: http://localhost:5555/

📑 API Documentation

Base URL

http://localhost:8000/api/

Endpoints

Articles

GET    /api/articles/          # List all articles
POST   /api/articles/          # Create new article
GET    /api/articles/{id}/     # Retrieve article details
PUT    /api/articles/{id}/     # Update article
DELETE /api/articles/{id}/     # Delete article

Search & Filter

GET /api/articles/?search=keyword
GET /api/articles/?title__icontains=python
GET /api/articles/?date_published__gte=2024-01-01
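
Lookups like title__icontains and date_published__gte are declared with a django-filter FilterSet, conventionally in news_api/filters.py. A sketch of what such a declaration could look like (the Article model and its field names are assumed from the example response in this section, not confirmed from the source):

```python
import django_filters

from news_api.models import Article  # assumed model location

class ArticleFilter(django_filters.FilterSet):
    class Meta:
        model = Article
        # Each entry maps a model field to the lookup expressions
        # exposed as query parameters, e.g. ?title__icontains=python
        fields = {
            'title': ['icontains'],
            'date_published': ['gte', 'lte'],
            'source': ['exact'],
        }
```

Wiring this FilterSet into a view via filterset_class (with DjangoFilterBackend already set in REST_FRAMEWORK above) is what turns those query parameters into ORM filters.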

Example Request

curl -H "Authorization: Token YOUR_TOKEN" \
  "http://localhost:8000/api/articles/?search=news"   # quote the URL so the shell doesn't glob '?'

Example Response

{
  "count": 42,
  "next": "http://localhost:8000/api/articles/?page=2",
  "previous": null,
  "results": [
    {
      "id": 1,
      "title": "Breaking News Title",
      "content": "Article content here...",
      "source": "news-source.com",
      "url": "https://news-source.com/article",
      "date_published": "2024-03-19T10:30:00Z",
      "created_at": "2024-03-19T12:00:00Z"
    }
  ]
}
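
With PageNumberPagination, clients follow the next link until it is null. A small client-side sketch of that walk (the fetch callable is injected, e.g. a wrapper around requests.get(url).json(), so the pagination logic itself stays transport-agnostic):

```python
def iter_articles(fetch, url):
    """Yield every article across all pages of a paginated DRF response.

    `fetch` is any callable mapping a URL to the decoded JSON dict
    for that page.
    """
    while url:
        page = fetch(url)
        yield from page["results"]
        url = page["next"]  # None on the last page ends the loop
```

Because the generator stops as soon as next is null, callers can also break early without fetching pages they do not need.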

🐳 Docker Deployment

Docker Compose Services

The docker-compose.yml includes:

  1. web - Django application server
  2. db - MySQL database
  3. redis - Cache and Celery broker
  4. celery_worker - Task execution
  5. celery_beat - Task scheduling
  6. flower - Celery monitoring

Common Docker Commands

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f web

# Execute Django commands
docker-compose exec web python manage.py migrate

# Stop services
docker-compose down

# Remove all volumes (WARNING: deletes data)
docker-compose down -v

Production Deployment

For production, update docker-compose.yml:

environment:
  DEBUG: 'False'
  ALLOWED_HOSTS: 'your-domain.com'
  SECURE_SSL_REDIRECT: 'True'
  SESSION_COOKIE_SECURE: 'True'

⏰ Celery Task Scheduling

Available Tasks

Located in news_api/tasks.py:

@shared_task
def scrape_news():
    """Scrape news from all configured sources"""
    pass

@shared_task
def process_articles():
    """Process and clean scraped articles"""
    pass

@shared_task
def send_notifications():
    """Send notifications for new articles"""
    pass
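
One dependency-light way to fill in the scrape_news stub is to shell out to scrapy crawl, which sidesteps Twisted reactor conflicts inside a long-lived Celery worker. A sketch under assumptions (the spider name and the subprocess approach are illustrative, not the project's confirmed implementation; the @shared_task decorator is omitted so the snippet stays importable without Celery):

```python
import subprocess

def scrapy_crawl_command(spider_name):
    """Build the argv for running one spider in a fresh process."""
    return ["scrapy", "crawl", spider_name]

def scrape_news(spider_name="news_spider"):
    """Body of the scrape_news task: run the spider and surface failures."""
    # check=True makes a non-zero Scrapy exit raise, so Celery marks the task failed.
    subprocess.run(scrapy_crawl_command(spider_name), check=True)
```

Running each crawl in its own process also keeps spider memory usage from accumulating in the worker.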

Monitoring Tasks

Access Flower dashboard:

http://localhost:5555/

Features:

  • Real-time task monitoring
  • Worker status
  • Task history
  • Performance metrics

Scheduling Configuration

Edit saba/celery.py to customize schedules:

from celery.schedules import crontab

app.conf.beat_schedule = {
    'scrape-every-hour': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(minute=0),
    },
    'scrape-every-morning': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(hour=7, minute=0),
    },
}

📊 Monitoring

Sentry Integration

Configure error tracking:

# In settings.py
import os

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn=os.environ.get('SENTRY_DSN'),
    integrations=[DjangoIntegration()],
    traces_sample_rate=1.0,  # sample all transactions; lower this in high-traffic production
    send_default_pii=False,
)

Logging

Check application logs:

# Docker logs
docker-compose logs web

# Local development
tail -f logs/news_scraper.log

Performance Monitoring

Monitor Celery tasks:

celery -A saba inspect active
celery -A saba inspect stats

🔧 Development

Running Tests

python manage.py test

Code Style

# Format code with Black
black .

# Check linting
flake8 .

Database Migrations

# Create migrations
python manage.py makemigrations

# Apply migrations
python manage.py migrate

# Show migration status
python manage.py showmigrations

Creating New Spiders

# Generate new spider template
scrapy genspider my_spider example.com

Then edit scrapers/spiders/my_spider.py:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Your scraping logic
        pass

πŸ› Troubleshooting

Connection Issues

# Test MySQL connection
docker-compose exec web python manage.py dbshell

# Test Redis connection
docker-compose exec redis redis-cli ping

Celery Worker Not Running

# Check logs
docker-compose logs celery_worker

# Restart worker
docker-compose restart celery_worker

Memory Issues

# Check container resource usage
docker stats

# Adjust in docker-compose.yml
services:
  web:
    deploy:
      resources:
        limits:
          memory: 1G

📦 Dependencies Overview

Package                Version  Purpose
Django                 4.2+     Web framework
Scrapy                 2.11+    Web scraping
Celery                 5.3+     Task queue
Django REST Framework  3.15+    API development
MySQL Client           Latest   Database driver
Redis                  5.0+     Caching/Broker
Selenium               4.18+    Browser automation
Sentry SDK             Latest   Error tracking

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Standards

  • Follow PEP 8
  • Write descriptive commit messages
  • Include tests for new features
  • Update documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


👥 Support & Contact

For issues, questions, or suggestions, please open an issue on the GitHub repository.

📚 Additional Resources

  • Django documentation
  • Django REST Framework documentation
  • Scrapy documentation
  • Celery documentation
Made with ❤️ by Dwin Gharibi
