A modern, scalable news scraping and aggregation platform built with Django REST Framework and powered by Scrapy. Roshan Saba intelligently collects, processes, and delivers news from multiple sources with advanced filtering and task scheduling capabilities.
Website: roshan-ai.ir
- Features
- Technology Stack
- Project Structure
- Prerequisites
- Installation
- Configuration
- Usage
- API Documentation
- Docker Deployment
- Celery Task Scheduling
- Monitoring
- Contributing
- License
- Multi-Source Web Scraping - Integrated Scrapy spiders for efficient data collection
- RESTful API - Complete REST API for news management and retrieval
- Asynchronous Task Queue - Powered by Celery with Redis for background processing
- Scheduled Scraping - Django-Celery-Beat for periodic scraping tasks
- Secure - Authentication, encryption, and secure data handling
- Advanced Filtering - Django-Filter for flexible news filtering and search
- Scalable Architecture - Containerized with Docker for easy deployment
- Error Tracking - Integrated Sentry for production monitoring
- Real-time Updates - WebSocket-ready architecture for live news feeds
- Database Support - MySQL backend for reliable data persistence
- Django 4.2+ - Web framework
- Django REST Framework 3.15+ - RESTful API development
- Scrapy 2.11+ - Web scraping framework
- Celery 5.3+ - Distributed task queue
- Django-Celery-Beat 2.5+ - Periodic task scheduler
- MySQL - Primary data store
- Redis 5.0+ - Caching and Celery broker
- Docker - Containerization
- Gunicorn - WSGI application server
- Python 3.13 - Runtime environment
- Selenium 4.18+ - Browser automation for JavaScript-heavy websites
- Faker 24.0+ - Data generation for testing
- DRF Spectacular - API schema generation
- Flower 2.0+ - Celery monitoring dashboard
- Sentry - Error tracking and monitoring
- Khayyam - Persian date/time utilities
```
roshan-saba/
├── news_api/              # API application
│   ├── models.py          # Data models
│   ├── views.py           # API endpoints
│   ├── serializers.py     # DRF serializers
│   ├── filters.py         # Django filters
│   └── tasks.py           # Celery tasks
├── saba/                  # Core Django project
│   ├── settings.py        # Django settings
│   ├── urls.py            # URL routing
│   ├── celery.py          # Celery configuration
│   └── wsgi.py            # WSGI configuration
├── scrapers/              # Scrapy spiders
│   ├── spiders/           # Individual scrapers
│   ├── pipelines.py       # Data processing pipelines
│   └── settings.py        # Scrapy configuration
├── static/                # Static files
├── docker-compose.yml     # Multi-container setup
├── Dockerfile             # Container image definition
├── entrypoint.sh          # Container startup script
├── requirements.txt       # Python dependencies
├── manage.py              # Django management
└── README.md              # This file
```
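The pipelines.py module is where scraped items are cleaned before storage. A minimal sketch of such a pipeline (the field names and validation rule are assumptions, not the project's actual schema):

```python
class CleanArticlePipeline:
    """Normalize scraped article dicts before they are persisted.

    Hypothetical sketch: a real Scrapy pipeline would raise
    scrapy.exceptions.DropItem instead of ValueError, and be
    registered under ITEM_PIPELINES in scrapers/settings.py.
    """

    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            # Scrapy would drop this item and log the reason
            raise ValueError('article has no title')
        item['title'] = title
        # Collapse whitespace runs left over from HTML extraction
        item['content'] = ' '.join((item.get('content') or '').split())
        return item
```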
Before you begin, ensure you have the following installed:
- Python 3.10+
- Docker & Docker Compose (for containerized deployment)
- Git
Optional but recommended:
- MySQL Server (for local development without Docker)
- Redis Server (for local Celery testing)
```bash
git clone https://github.com/dwin-gharibi/roshan-saba.git
cd roshan-saba
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```

Create a .env file in the project root:

```bash
# Django Settings
DEBUG=True
SECRET_KEY=your-secret-key-here
ALLOWED_HOSTS=localhost,127.0.0.1

# Database
DB_ENGINE=django.db.backends.mysql
DB_NAME=roshan_saba
DB_USER=root
DB_PASSWORD=your_password
DB_HOST=127.0.0.1
DB_PORT=3306

# Redis/Celery
REDIS_URL=redis://127.0.0.1:6379/0
CELERY_BROKER_URL=redis://127.0.0.1:6379/0
CELERY_RESULT_BACKEND=redis://127.0.0.1:6379/0

# Sentry (Optional)
SENTRY_DSN=your-sentry-dsn

# Scrapy Settings
SCRAPY_DOWNLOAD_DELAY=2
SCRAPY_CONCURRENT_REQUESTS=16
```

Apply migrations and prepare static files:

```bash
python manage.py migrate
python manage.py createsuperuser
python manage.py collectstatic --noinput
```

For a containerized setup, clone the repository, create a .env file (see above for template), and start the stack:

```bash
git clone https://github.com/dwin-gharibi/roshan-saba.git
cd roshan-saba
docker-compose up -d
```

This will start:
- Django application (port 8000)
- MySQL database
- Redis cache
- Celery worker
- Celery Beat scheduler
- Flower monitoring dashboard (port 5555)
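Once the stack is up, you can verify each service is listening with a small helper (hypothetical; the ports are the defaults from this compose setup):

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP service accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == '__main__':
    for name, port in [('web', 8000), ('mysql', 3306),
                       ('redis', 6379), ('flower', 5555)]:
        print(name, 'up' if is_listening('127.0.0.1', port) else 'down')
```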
```bash
docker-compose exec web python manage.py migrate
docker-compose exec web python manage.py createsuperuser
```

Key configuration options in saba/settings.py:

```python
# Logging
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': 'logs/news_scraper.log',
        },
    },
}

# REST Framework
REST_FRAMEWORK = {
    'DEFAULT_FILTER_BACKENDS': ['django_filters.rest_framework.DjangoFilterBackend'],
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 20,
}
```

The beat schedule is located in saba/celery.py:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    'scrape-news-every-hour': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(minute=0),  # Every hour
    },
}
```

Custom spiders should be created in the scrapers/spiders/ directory. Example:
```python
import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    allowed_domains = ['news-site.com']
    start_urls = ['https://news-site.com']

    def parse(self, response):
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'url': article.css('a::attr(href)').get(),
                'content': article.css('p::text').get(),
            }
```

```bash
# Run Django development server
python manage.py runserver

# In another terminal, start the Celery worker
celery -A saba worker -l info

# In another terminal, start Celery Beat
celery -A saba beat -l info
```

```bash
# Run a specific Scrapy spider
scrapy crawl news_spider

# Run a Celery task manually
python manage.py shell
>>> from news_api.tasks import scrape_news
>>> scrape_news.delay()
```

- Django Admin: http://localhost:8000/admin/
- API Root: http://localhost:8000/api/
- API Documentation: http://localhost:8000/api/schema/
- Flower Dashboard: http://localhost:5555/
```
http://localhost:8000/api/
```

```
GET    /api/articles/        # List all articles
POST   /api/articles/        # Create a new article
GET    /api/articles/{id}/   # Retrieve article details
PUT    /api/articles/{id}/   # Update an article
DELETE /api/articles/{id}/   # Delete an article
```

Filtering and search:

```
GET /api/articles/?search=keyword
GET /api/articles/?title__icontains=python
GET /api/articles/?date_published__gte=2024-01-01
```
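List responses are paginated (PAGE_SIZE is 20) and carry a `next` URL on each page, so a client can walk every page. A sketch with `fetch` as a pluggable function (so it works with `requests`, `httpx`, or a test stub; the helper name is our own):

```python
def iter_articles(fetch, url='http://localhost:8000/api/articles/'):
    """Yield every article across all pages of a DRF
    page-number-paginated endpoint.

    fetch(url) must return the decoded JSON dict for one page:
    {'count': ..., 'next': ..., 'previous': ..., 'results': [...]}.
    """
    while url:
        page = fetch(url)
        yield from page['results']
        url = page['next']  # None on the last page ends the loop
```

For example, with requests: `list(iter_articles(lambda u: requests.get(u, headers={'Authorization': 'Token YOUR_TOKEN'}).json()))`.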
```bash
curl -H "Authorization: Token YOUR_TOKEN" \
  "http://localhost:8000/api/articles/?search=news"
```

Example response:

```json
{
  "count": 42,
  "next": "http://localhost:8000/api/articles/?page=2",
  "previous": null,
  "results": [
    {
      "id": 1,
      "title": "Breaking News Title",
      "content": "Article content here...",
      "source": "news-source.com",
      "url": "https://news-source.com/article",
      "date_published": "2024-03-19T10:30:00Z",
      "created_at": "2024-03-19T12:00:00Z"
    }
  ]
}
```

The docker-compose.yml includes:
- web - Django application server
- db - MySQL database
- redis - Cache and Celery broker
- celery_worker - Task execution
- celery_beat - Task scheduling
- flower - Celery monitoring
```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f web

# Execute Django commands
docker-compose exec web python manage.py migrate

# Stop services
docker-compose down

# Remove all volumes (WARNING: deletes data)
docker-compose down -v
```

For production, update docker-compose.yml:

```yaml
environment:
  DEBUG: 'False'
  ALLOWED_HOSTS: 'your-domain.com'
  SECURE_SSL_REDIRECT: 'True'
  SESSION_COOKIE_SECURE: 'True'
```

Celery tasks are located in news_api/tasks.py:
```python
from celery import shared_task


@shared_task
def scrape_news():
    """Scrape news from all configured sources."""
    pass


@shared_task
def process_articles():
    """Process and clean scraped articles."""
    pass


@shared_task
def send_notifications():
    """Send notifications for new articles."""
    pass
```

Access the Flower dashboard:
http://localhost:5555/
Features:
- Real-time task monitoring
- Worker status
- Task history
- Performance metrics
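As an illustration of the kind of logic a task like `process_articles` might run, here is a hypothetical text-normalization helper (the cleaning rules are assumptions, not the project's implementation; in the project it would be invoked from a `@shared_task`):

```python
import html
import re


def clean_article_text(raw: str) -> str:
    """Normalize scraped article text: unescape HTML entities,
    drop any residual tags, and collapse whitespace runs."""
    text = html.unescape(raw or '')
    text = re.sub(r'<[^>]+>', ' ', text)  # strip leftover markup
    return ' '.join(text.split())
```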
Edit saba/celery.py to customize schedules:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    'scrape-every-hour': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(minute=0),
    },
    'scrape-every-morning': {
        'task': 'news_api.tasks.scrape_news',
        'schedule': crontab(hour=7, minute=0),
    },
}
```

Configure error tracking:
```python
# In settings.py
import os

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn=os.environ.get('SENTRY_DSN'),
    integrations=[DjangoIntegration()],
    traces_sample_rate=1.0,
    send_default_pii=False,
)
```

Check application logs:
```bash
# Docker logs
docker-compose logs web

# Local development
tail -f logs/news_scraper.log
```

Monitor Celery tasks:

```bash
celery -A saba inspect active
celery -A saba inspect stats
```

Run the test suite:

```bash
python manage.py test
```

```bash
# Format code with Black
black .

# Check linting
flake8 .
```

```bash
# Create migrations
python manage.py makemigrations

# Apply migrations
python manage.py migrate

# Show migration status
python manage.py showmigrations
```

```bash
# Generate a new spider template
scrapy genspider my_spider example.com
```

Then edit scrapers/spiders/my_spider.py:
```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Your scraping logic
        pass
```

```bash
# Test MySQL connection
docker-compose exec web python manage.py dbshell

# Test Redis connection
docker-compose exec redis redis-cli ping
```

```bash
# Check logs
docker-compose logs celery_worker

# Restart worker
docker-compose restart celery_worker
```

```bash
# Check container resource usage
docker stats
```

Adjust limits in docker-compose.yml:

```yaml
services:
  web:
    deploy:
      resources:
        limits:
          memory: 1G
```

| Package | Version | Purpose |
|---|---|---|
| Django | 4.2+ | Web framework |
| Scrapy | 2.11+ | Web scraping |
| Celery | 5.3+ | Task queue |
| Django REST Framework | 3.15+ | API development |
| MySQL Client | Latest | Database driver |
| Redis | 5.0+ | Caching/Broker |
| Selenium | 4.18+ | Browser automation |
| Sentry SDK | Latest | Error tracking |
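A requirements.txt consistent with this table and the technology stack above might look like the following (the version floors come from the table; verify exact PyPI package names against the project's actual requirements.txt):

```text
Django>=4.2
djangorestframework>=3.15
Scrapy>=2.11
celery>=5.3
django-celery-beat>=2.5
mysqlclient
redis>=5.0
selenium>=4.18
sentry-sdk
flower>=2.0
django-filter
drf-spectacular
Faker>=24.0
khayyam
```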
Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
- Follow PEP 8
- Write descriptive commit messages
- Include tests for new features
- Update documentation
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or suggestions:
- GitHub Issues: Report a bug
- Website: roshan-ai.ir
Made with ❤️ by Dwin Gharibi