
πŸ” Roshan Saba - AI-Powered News Scraper

A modern, scalable news scraping and aggregation platform built with Django REST Framework and powered by Scrapy. Roshan Saba intelligently collects, processes, and delivers news from multiple sources with advanced filtering and task scheduling capabilities.

Website: roshan-ai.ir


📋 Table of Contents

  • Features
  • Technology Stack
  • Project Structure
  • Prerequisites
  • Installation
  • Configuration
  • Usage
  • API Documentation
  • Docker Deployment
  • Celery Task Scheduling
  • Monitoring
  • Development
  • Troubleshooting
  • Dependencies Overview
  • Contributing
  • License

✨ Features

Core Capabilities

  • πŸ•·οΈ Multi-Source Web Scraping - Integrated Scrapy spiders for efficient data collection
  • πŸ“‘ RESTful API - Complete REST API for news management and retrieval
  • πŸ”„ Asynchronous Task Queue - Powered by Celery with Redis for background processing
  • ⏰ Scheduled Scraping - Django-Celery-Beat for periodic scraping tasks
  • πŸ” Secure - Authentication, encryption, and secure data handling
  • πŸ“Š Advanced Filtering - Django-Filter for flexible news filtering and search
  • πŸš€ Scalable Architecture - Containerized with Docker for easy deployment
  • πŸ“ˆ Error Tracking - Integrated Sentry for production monitoring
  • 🌐 Real-time Updates - WebSocket-ready architecture for live news feeds
  • πŸ›’οΈ Database Support - MySQL backend for reliable data persistence

🛠 Technology Stack

Backend

  • Django 4.2+ - Web framework
  • Django REST Framework 3.15+ - RESTful API development
  • Scrapy 2.11+ - Web scraping framework
  • Celery 5.3+ - Distributed task queue
  • Django-Celery-Beat 2.5+ - Periodic task scheduler

Database & Cache

  • MySQL - Primary data store
  • Redis 5.0+ - Caching and Celery broker

Infrastructure

  • Docker - Containerization
  • Gunicorn - WSGI application server
  • Python 3.13 - Runtime environment

Additional Tools

  • Selenium 4.18+ - Browser automation for JavaScript-heavy websites
  • Faker 24.0+ - Data generation for testing
  • DRF Spectacular - API schema generation
  • Flower 2.0+ - Celery monitoring dashboard
  • Sentry - Error tracking and monitoring
  • Khayyam - Persian date/time utilities

πŸ“ Project Structure

roshan-saba/
├── news_api/                # API application
│   ├── models.py            # Data models
│   ├── views.py             # API endpoints
│   ├── serializers.py       # DRF serializers
│   ├── filters.py           # Django filters
│   └── tasks.py             # Celery tasks
├── saba/                    # Core Django project
│   ├── settings.py          # Django settings
│   ├── urls.py              # URL routing
│   ├── celery.py            # Celery configuration
│   └── wsgi.py              # WSGI configuration
├── scrapers/                # Scrapy spiders
│   ├── spiders/             # Individual scrapers
│   ├── pipelines.py         # Data processing pipelines
│   └── settings.py          # Scrapy configuration
├── static/                  # Static files
├── docker-compose.yml       # Multi-container setup
├── Dockerfile               # Container image definition
├── entrypoint.sh            # Container startup script
├── requirements.txt         # Python dependencies
├── manage.py                # Django management
└── README.md                # This file

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Python 3.10+
  • Docker & Docker Compose (for containerized deployment)
  • Git

Optional but recommended:

  • MySQL Server (for local development without Docker)
  • Redis Server (for local Celery testing)

🚀 Installation

Option 1: Local Development Setup

1. Clone the Repository

git clone https://github.com/dwin-gharibi/roshan-saba.git
cd roshan-saba

2. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Configure Environment

Create a .env file in the project root:

# Django Settings
DEBUG=True
SECRET_KEY=your-secret-key-here
ALLOWED_HOSTS=localhost,127.0.0.1

# Database
DB_ENGINE=django.db.backends.mysql
DB_NAME=roshan_saba
DB_USER=root
DB_PASSWORD=your_password
DB_HOST=127.0.0.1
DB_PORT=3306

# Redis/Celery
REDIS_URL=redis://127.0.0.1:6379/0
CELERY_BROKER_URL=redis://127.0.0.1:6379/0
CELERY_RESULT_BACKEND=redis://127.0.0.1:6379/0

# Sentry (Optional)
SENTRY_DSN=your-sentry-dsn

# Scrapy Settings
SCRAPY_DOWNLOAD_DELAY=2
SCRAPY_CONCURRENT_REQUESTS=16
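
These variables are typically read in saba/settings.py through os.environ. A minimal sketch of that pattern (the env and env_bool helpers here are illustrative, not the project's actual code):

```python
import os

def env(name, default=None, cast=str):
    """Read a configuration value from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return cast(raw)

def env_bool(name, default=False):
    """Interpret common truthy strings ('1', 'true', 'yes') as booleans."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Example usage mirroring the .env template above
DEBUG = env_bool("DEBUG")
ALLOWED_HOSTS = env("ALLOWED_HOSTS", default=[], cast=lambda v: v.split(","))
SCRAPY_DOWNLOAD_DELAY = env("SCRAPY_DOWNLOAD_DELAY", default=2, cast=int)
```

Keeping all environment access behind one helper makes missing or malformed values fail in a single, predictable place.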

5. Initialize Database

python manage.py migrate
python manage.py createsuperuser

6. Collect Static Files

python manage.py collectstatic --noinput

Option 2: Docker Deployment (Recommended)

1. Clone the Repository

git clone https://github.com/dwin-gharibi/roshan-saba.git
cd roshan-saba

2. Configure Environment

Create a .env file (see the template above).

3. Build and Run with Docker Compose

docker-compose up -d

This will start:

  • Django application (port 8000)
  • MySQL database
  • Redis cache
  • Celery worker
  • Celery Beat scheduler
  • Flower monitoring dashboard (port 5555)

4. Initialize Database

docker-compose exec web python manage.py migrate
docker-compose exec web python manage.py createsuperuser

βš™οΈ Configuration

Django Settings

Key configuration options in saba/settings.py:

# Logging
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': 'logs/news_scraper.log',
        },
    },
    'root': {
        'handlers': ['file'],  # route all loggers to the file handler
        'level': 'INFO',
    },
}

# REST Framework
REST_FRAMEWORK = {
    'DEFAULT_FILTER_BACKENDS': ['django_filters.rest_framework.DjangoFilterBackend'],
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 20,
}

Celery Configuration

Located in saba/celery.py:

from celery.schedules import crontab

app.conf.beat_schedule = {
    'scrape-news-every-hour': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(minute=0),  # at minute 0 of every hour
    },
}

Scrapy Spiders

Custom spiders should be created in scrapers/spiders/ directory. Example:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    allowed_domains = ['news-site.com']
    start_urls = ['https://news-site.com']

    def parse(self, response):
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'url': article.css('a::attr(href)').get(),
                # getall() collects every paragraph, not just the first
                'content': ' '.join(article.css('p::text').getall()),
            }
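
Items yielded by a spider pass through the pipelines in scrapers/pipelines.py before being persisted. A dependency-free sketch of a cleaning step (the class and field names are illustrative; a real Scrapy pipeline would raise scrapy.exceptions.DropItem instead of ValueError and would typically save valid items to the Article model):

```python
import re

def clean_text(value):
    """Collapse runs of whitespace and strip the result."""
    return re.sub(r"\s+", " ", value or "").strip()

class NewsCleaningPipeline:
    """Normalizes scraped items; mirrors Scrapy's process_item(item, spider) hook."""

    def process_item(self, item, spider):
        item["title"] = clean_text(item.get("title"))
        item["content"] = clean_text(item.get("content"))
        if not item["title"] or not item.get("url"):
            # In a real Scrapy project: raise scrapy.exceptions.DropItem(...)
            raise ValueError("missing title or url")
        return item
```

Pipelines keep validation out of the spiders, so every spider's output is normalized the same way.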

💻 Usage

Starting Development Server

# Run Django development server
python manage.py runserver

# In another terminal, start Celery worker
celery -A saba worker -l info

# In another terminal, start Celery Beat
celery -A saba beat -l info

Running Scrapers

# Run a specific Scrapy spider
scrapy crawl news_spider

# Run Celery task
python manage.py shell
>>> from news_api.tasks import scrape_news
>>> scrape_news.delay()

Accessing the Application

  • API root: http://localhost:8000/api/
  • Django admin: http://localhost:8000/admin/
  • Flower dashboard: http://localhost:5555/

📑 API Documentation

Base URL

http://localhost:8000/api/

Endpoints

Articles

GET    /api/articles/          # List all articles
POST   /api/articles/          # Create new article
GET    /api/articles/{id}/     # Retrieve article details
PUT    /api/articles/{id}/     # Update article
DELETE /api/articles/{id}/     # Delete article

Search & Filter

GET /api/articles/?search=keyword
GET /api/articles/?title__icontains=python
GET /api/articles/?date_published__gte=2024-01-01
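
Lookups like title__icontains and date_published__gte are declared with a django-filter FilterSet, conventionally in news_api/filters.py. A sketch of what such a declaration could look like (the Article model and its field names are assumed from the example response in this section, not confirmed from the source):

```python
import django_filters

from news_api.models import Article  # assumed model location

class ArticleFilter(django_filters.FilterSet):
    class Meta:
        model = Article
        # Each entry maps a model field to the lookup expressions
        # exposed as query parameters, e.g. ?title__icontains=python
        fields = {
            'title': ['icontains'],
            'date_published': ['gte', 'lte'],
            'source': ['exact'],
        }
```

Wiring this FilterSet into a view via filterset_class (with DjangoFilterBackend already set in REST_FRAMEWORK above) is what turns those query parameters into ORM filters.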

Example Request

curl -H "Authorization: Token YOUR_TOKEN" \
  "http://localhost:8000/api/articles/?search=news"   # quote the URL so the shell doesn't glob '?'

Example Response

{
  "count": 42,
  "next": "http://localhost:8000/api/articles/?page=2",
  "previous": null,
  "results": [
    {
      "id": 1,
      "title": "Breaking News Title",
      "content": "Article content here...",
      "source": "news-source.com",
      "url": "https://news-source.com/article",
      "date_published": "2024-03-19T10:30:00Z",
      "created_at": "2024-03-19T12:00:00Z"
    }
  ]
}
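
With PageNumberPagination, clients follow the next link until it is null. A small client-side sketch of that walk (the fetch callable is injected, e.g. a wrapper around requests.get(url).json(), so the pagination logic itself stays transport-agnostic):

```python
def iter_articles(fetch, url):
    """Yield every article across all pages of a paginated DRF response.

    `fetch` is any callable mapping a URL to the decoded JSON dict
    for that page.
    """
    while url:
        page = fetch(url)
        yield from page["results"]
        url = page["next"]  # None on the last page ends the loop
```

Because the generator stops as soon as next is null, callers can also break early without fetching pages they do not need.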

🐳 Docker Deployment

Docker Compose Services

The docker-compose.yml includes:

  1. web - Django application server
  2. db - MySQL database
  3. redis - Cache and Celery broker
  4. celery_worker - Task execution
  5. celery_beat - Task scheduling
  6. flower - Celery monitoring

Common Docker Commands

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f web

# Execute Django commands
docker-compose exec web python manage.py migrate

# Stop services
docker-compose down

# Remove all volumes (WARNING: deletes data)
docker-compose down -v

Production Deployment

For production, update docker-compose.yml:

environment:
  DEBUG: 'False'
  ALLOWED_HOSTS: 'your-domain.com'
  SECURE_SSL_REDIRECT: 'True'
  SESSION_COOKIE_SECURE: 'True'

⏰ Celery Task Scheduling

Available Tasks

Located in news_api/tasks.py:

@shared_task
def scrape_news():
    """Scrape news from all configured sources"""
    pass

@shared_task
def process_articles():
    """Process and clean scraped articles"""
    pass

@shared_task
def send_notifications():
    """Send notifications for new articles"""
    pass
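
One dependency-light way to fill in the scrape_news stub is to shell out to scrapy crawl, which sidesteps Twisted reactor conflicts inside a long-lived Celery worker. A sketch under assumptions (the spider name and the subprocess approach are illustrative, not the project's confirmed implementation; the @shared_task decorator is omitted so the snippet stays importable without Celery):

```python
import subprocess

def scrapy_crawl_command(spider_name):
    """Build the argv for running one spider in a fresh process."""
    return ["scrapy", "crawl", spider_name]

def scrape_news(spider_name="news_spider"):
    """Body of the scrape_news task: run the spider and surface failures."""
    # check=True makes a non-zero Scrapy exit raise, so Celery marks the task failed.
    subprocess.run(scrapy_crawl_command(spider_name), check=True)
```

Running each crawl in its own process also keeps spider memory usage from accumulating in the worker.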

Monitoring Tasks

Access Flower dashboard:

http://localhost:5555/

Features:

  • Real-time task monitoring
  • Worker status
  • Task history
  • Performance metrics

Scheduling Configuration

Edit saba/celery.py to customize schedules:

from celery.schedules import crontab

app.conf.beat_schedule = {
    'scrape-every-hour': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(minute=0),
    },
    'scrape-every-morning': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(hour=7, minute=0),
    },
}

📊 Monitoring

Sentry Integration

Configure error tracking:

# In settings.py
import os

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn=os.environ.get('SENTRY_DSN'),
    integrations=[DjangoIntegration()],
    traces_sample_rate=1.0,  # sample all transactions; lower this in high-traffic production
    send_default_pii=False,
)

Logging

Check application logs:

# Docker logs
docker-compose logs web

# Local development
tail -f logs/news_scraper.log

Performance Monitoring

Monitor Celery tasks:

celery -A saba inspect active
celery -A saba inspect stats

🔧 Development

Running Tests

python manage.py test

Code Style

# Format code with Black
black .

# Check linting
flake8 .

Database Migrations

# Create migrations
python manage.py makemigrations

# Apply migrations
python manage.py migrate

# Show migration status
python manage.py showmigrations

Creating New Spiders

# Generate new spider template
scrapy genspider my_spider example.com

Then edit scrapers/spiders/my_spider.py:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Your scraping logic
        pass

πŸ› Troubleshooting

Connection Issues

# Test MySQL connection
docker-compose exec web python manage.py dbshell

# Test Redis connection
docker-compose exec redis redis-cli ping

Celery Worker Not Running

# Check logs
docker-compose logs celery_worker

# Restart worker
docker-compose restart celery_worker

Memory Issues

# Check container resource usage
docker stats

# Adjust in docker-compose.yml
services:
  web:
    deploy:
      resources:
        limits:
          memory: 1G

📦 Dependencies Overview

Package                Version  Purpose
Django                 4.2+     Web framework
Scrapy                 2.11+    Web scraping
Celery                 5.3+     Task queue
Django REST Framework  3.15+    API development
MySQL Client           Latest   Database driver
Redis                  5.0+     Caching/Broker
Selenium               4.18+    Browser automation
Sentry SDK             Latest   Error tracking

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Standards

  • Follow PEP 8
  • Write descriptive commit messages
  • Include tests for new features
  • Update documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


👥 Support & Contact

For issues, questions, or suggestions, please open an issue on the GitHub repository.

📚 Additional Resources

  • Django documentation
  • Django REST Framework documentation
  • Scrapy documentation
  • Celery documentation
Made with ❤️ by Dwin Gharibi
