A comprehensive data scraping, processing, and integration system built with Python and AWS Lambda. This project consists of multiple specialized modules for web scraping, data transformation, and API integration with Notion and Prom.ua platforms.
- Project Overview
- Architecture
- Technology Stack
- Project Structure
- Modules and Components
- Dependencies
- Setup and Installation
- Deployment
- Usage Examples
- CI/CD Pipeline
- Configuration
- Data Flow
- AWS Resources
- Development Notes
Data Handler is a multi-purpose data processing system designed to automate the collection, transformation, and distribution of data from various sources. The project covers four primary use cases:
- Real Estate Scraping (Dom.ria): Automated scraping of real estate listings from Dom.ria API
- E-commerce Product Scraping (STN Craft): Web scraping of product information from STN Craft website
- Notion Integration: Data export/import workflows between Notion databases and e-commerce platforms
- Dog Breeding Data: Specialized scraper for collie breeding information with genealogy visualization
The system is built with serverless architecture using AWS Lambda functions, scheduled to run daily via CloudWatch Events, with data storage in Amazon S3.
┌─────────────────────────────────────────────────────────────────┐
│ Data Handler System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ GitHub Actions│───────▶│ AWS SAM CLI │ │
│ │ CI/CD Pipeline│ │ Build & Deploy │ │
│ └────────────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ AWS Lambda Functions │ │
│ ├──────────────────────────────────┤ │
│ │ │ │
│ │ ┌──────────────────────────┐ │ │
│ │ │ Dom.ria Scraper │ │ │
│ │ │ (Scheduled: Daily 6AM) │───┼──┐ │
│ │ └──────────────────────────┘ │ │ │
│ │ │ │ │
│ │ ┌──────────────────────────┐ │ │ │
│ │ │ STN Craft Scraper │ │ │ │
│ │ │ (Scheduled: Daily 6AM) │───┼──┤ │
│ │ └──────────────────────────┘ │ │ │
│ │ │ │ │
│ └──────────────────────────────────┘ │ │
│ ▼ │
│ ┌────────────────┐ ┌──────────────────────────────┐ │
│ │ External APIs │ │ Amazon S3 Storage │ │
│ ├────────────────┤ │ (eu-central-1-scraper-data) │ │
│ │ • Dom.ria API │ │ │ │
│ │ • Notion API │ │ /dom-ria/YYYY/MM/DD/ │ │
│ │ • STN Craft │ │ /stn-craft/YYYY/MM/DD/ │ │
│ │ • Collie.com │ └──────────────────────────────┘ │
│ └────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Local Processing Modules │ │
│ ├────────────────────────────────────────┤ │
│ │ • Notion Integration (magic_shop) │ │
│ │ • Collie Dog Data Parser │ │
│ │ • Prom.ua CSV Export │ │
│ └────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
The system follows a modular architecture with clear separation of concerns:
- API Layer: Abstraction for external API integrations (Notion, Prom)
- Scraper Layer: Lambda functions for automated data collection
- Processing Layer: Data transformation and normalization utilities
- Storage Layer: S3-based persistent storage with date-based partitioning
- Language: Python 3.9
- Cloud Platform: Amazon Web Services (AWS)
- Serverless Framework: AWS SAM (Serverless Application Model)
- Package Management: Pipenv
- CI/CD: GitHub Actions
- requests-html (0.10.0): HTML parsing and JavaScript rendering
- httpx (0.22.0): Modern HTTP client with async support
- BeautifulSoup4 (4.10.0): HTML/XML parsing
- lxml (4.7.1): Fast XML/HTML processing
- pyquery (1.4.3): jQuery-like HTML manipulation
- pandas (1.4.0): Data manipulation and analysis
- numpy (1.22.2): Numerical computing
- boto3 (1.20.52): AWS SDK for Python
- aws-lambda-powertools (1.25.0): Lambda utilities and logging
- s3fs: S3 filesystem interface
- fsspec (2022.1.0): Filesystem abstraction
- loguru (0.6.0): Simplified logging
- pydot (1.4.2): Graph visualization (for genealogy trees)
- AWS Lambda: Serverless compute for scrapers
- Amazon S3: Data storage with date-based partitioning
- CloudWatch Events: Scheduled Lambda execution (cron-based)
- AWS CloudFormation: Infrastructure as Code via SAM templates
- GitHub Actions: Automated CI/CD pipeline
data_handler/
│
├── .github/
│ └── workflows/
│ ├── build_dom_ria_scraper.yaml # CI/CD for Dom.ria scraper
│ └── build_stn_craft_scraper.yaml # CI/CD for STN Craft scraper
│
├── api/ # API Integration Layer
│ ├── notion.py # Notion API client
│ └── prom.py # Prom.ua integration
│
├── collie/ # Dog Breeding Data Module
│ ├── main.py # Collie data scraper
│ └── dogs_data.json # Dog breeding data
│
├── dom-ria-scraper-lambda/ # Real Estate Scraper
│ ├── scraper/
│ │ ├── app.py # Lambda handler
│ │ ├── requirements.txt # Dependencies
│ │ └── __init__.py
│ ├── dom.yaml # SAM template
│ ├── README.md # Module documentation
│ └── __init__.py
│
├── magic_shop/ # Notion Data Export Module
│ └── notion_magic_db.py # Notion to Prom export
│
├── stn-craft-scraper-lambda/ # E-commerce Scraper
│ ├── scraper/
│ │ ├── app.py # Lambda handler
│ │ ├── requirements.txt # Dependencies
│ │ ├── templates/
│ │ │ └── prom_import_template.csv # Prom.ua CSV template
│ │ └── __init__.py
│ ├── tests/ # Test suite
│ │ ├── unit/
│ │ ├── integration/
│ │ └── requirements.txt
│ ├── runes.yaml # SAM template
│ ├── README.md # Module documentation
│ └── __init__.py
│
├── Makefile # Build automation
├── Pipfile # Python dependencies
├── Pipfile.lock # Locked dependencies
└── helpers.py # Shared utilities
Purpose: Automated scraping of real estate listings from Dom.ria API
Key Features:
- Rate-limited API requests (1000 requests/hour)
- Pagination handling for large result sets
- Retry logic with timeout handling
- Incremental data export to S3
- Structured logging with AWS Lambda Powertools
Data Collection:
- Flat/apartment listings in Odessa region
- Search criteria: 2-3 rooms, specific metro stations
- Daily scheduled execution at 6:00 AM UTC
Implementation Details:
class DomScraper:
- get_flats_ids_df(): Retrieves all listing IDs with pagination
- get_flat_info(flat_id): Fetches detailed information per listing
- export_flats_ids_to_s3(): Saves listing IDs with checkpoint
- export_flats_info_to_s3(): Saves detailed listing data
- quota_is_not_reached(): Rate limiting management
S3 Storage Pattern: s3://bucket/dom-ria/YYYY/MM/DD/HH/MM/flat_*.csv
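To make the flow above concrete, here is a minimal sketch of rate-limited pagination under the stated 1000 requests/hour quota. The endpoint URL, class internals, and field names are illustrative assumptions, not copied from the actual scraper/app.py; the real DomScraper additionally checkpoints each page to S3.

import os
import time

import httpx
import pandas as pd

API_URL = "https://developers.ria.com/dom/search"  # assumed endpoint; see the Dom.ria API docs
MAX_REQUESTS_PER_HOUR = 1000

class DomScraperSketch:
    """Illustrative only; the real DomScraper also exports intermediate results to S3."""

    def __init__(self):
        self.api_key = os.environ["RIA_API_KEY"]
        self.requests_made = 0
        self.window_started = time.time()

    def quota_is_not_reached(self):
        # Reset the counter once an hour has passed, then compare against the quota.
        if time.time() - self.window_started > 3600:
            self.requests_made = 0
            self.window_started = time.time()
        return self.requests_made < MAX_REQUESTS_PER_HOUR

    def get_flats_ids_df(self):
        ids, page = [], 0
        while self.quota_is_not_reached():
            resp = httpx.get(API_URL, params={"api_key": self.api_key, "page": page}, timeout=30)
            resp.raise_for_status()
            self.requests_made += 1
            page_ids = resp.json().get("items", [])
            if not page_ids:
                break  # no more pages to fetch
            ids.extend(page_ids)
            page += 1
        return pd.DataFrame({"flat_id": ids})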
Purpose: E-commerce product scraping from STN Craft website
Key Features:
- HTML parsing with requests-html
- Product data extraction (name, price, description, images)
- Prom.ua CSV export format
- Multi-page scraping support
Data Collection:
- Rune sets from product category pages
- Product details: title, price, description, images
- Automatic CSV generation for Prom.ua import
Implementation Details:
class ExportSTNCraft:
- get_links(): Extract links from page with filtering
- get_page(): Fetch page HTML
- parse_product_details(): Extract product information
- export_runes_data(): Main scraping orchestration
class Import2Prom:
- build_prom_csv(): Generate Prom.ua import CSV
S3 Storage Pattern: s3://bucket/stn-craft/YYYY/MM/DD/stn_craft_runes_prom.csv
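As a rough illustration of this flow, the sketch below shows how category links and product details can be pulled with requests-html. The placeholder domain, CSS selectors, and function name are assumptions, not the real ExportSTNCraft implementation.

import pandas as pd
from requests_html import HTMLSession

# Placeholder domain; the category path comes from the data-flow section below.
CATEGORY_URL = "https://stn-craft.example/product-category/rune-sets/"

def scrape_rune_products():
    session = HTMLSession()
    category_page = session.get(CATEGORY_URL)

    # Keep only product links from the category page.
    product_links = [link for link in category_page.html.absolute_links if "/product/" in link]

    products = []
    for link in product_links:
        page = session.get(link)
        title = page.html.find("h1", first=True)       # assumed selector
        price = page.html.find(".price", first=True)   # assumed selector
        images = [img.attrs.get("src") for img in page.html.find("img")]
        products.append({
            "name": title.text if title else None,
            "price": price.text if price else None,
            "images": images,
            "url": link,
        })
    return pd.DataFrame(products)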
Purpose: Bidirectional data sync with Notion databases
Key Features:
- Database schema introspection
- Type-safe data import with schema validation
- Data export with normalization
- Support for multiple field types (text, select, multiselect, date, files)
Supported Field Types:
- Rich text
- Title
- Select / Multi-select
- Date
- Files / Images
- URL
Implementation Details:
class Notion:
- build_table_request(): Construct Notion API request
- import_data_to_table(): Insert rows into database
- export_data_from_table(): Query and retrieve data
- convert_notion_to_dict(): Normalize Notion response
- get_table_column_names_list(): Schema introspection
Purpose: E-commerce data transformation for Prom.ua platform
Key Features:
- CSV generation following Prom.ua import format
- Data normalization and validation
- Currency and measurement unit standardization
Implementation Details:
class Prom:
- convert_notion2prom(): Transform Notion data to Prom format
- build_prom_csv(): Generate importable CSV file
Purpose: Specialized scraper for dog breeding information with genealogy
Key Features:
- Multi-page dog catalog scraping
- Genealogy tree extraction and visualization
- Graph generation with pydot
- Notion database integration for dog records
Data Collection:
- Dog profiles (name, gender, category)
- Images and photo galleries
- Pedigree information (4 generations)
- Descriptive text and characteristics
Implementation Details:
Functions:
- parse_dog_page(): Main orchestration
- get_dog_details(): Extract profile and genealogy
- build_dict_data_from_links(): Organize data structure
- import_data_to_notion_db(): Upload to Notion
- visit(): Graph traversal for genealogy visualization
Genealogy Visualization: Generates PNG graph files using pydot
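The recursive traversal behind the genealogy graph could look roughly like the sketch below. The nested-dict shape assumed for breed_tree and the function body are illustrative; the real visit() in collie/main.py may differ.

from typing import Optional

import pydot

graph = pydot.Dot(graph_type="graph")

def visit(tree, parent: Optional[str] = None):
    # Assumed shape: {"Dog Name": {"Sire Name": {...}, "Dam Name": {...}}}
    for dog_name, ancestors in tree.items():
        graph.add_node(pydot.Node(dog_name))
        if parent is not None:
            # Connect each ancestor to the dog it belongs to in the pedigree.
            graph.add_edge(pydot.Edge(parent, dog_name))
        visit(ancestors, parent=dog_name)

if __name__ == "__main__":
    sample_tree = {"Example Collie": {"Example Sire": {}, "Example Dam": {}}}
    visit(sample_tree)
    graph.write_png("dog_genealogy.png")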
Purpose: Automated data export from Notion to Prom.ua
Key Features:
- Notion database querying
- Data normalization with pandas
- CSV export with timestamp organization
- Currency and measurement unit injection
Workflow:
- Export data from Notion database
- Normalize field types to flat dictionary
- Add required Prom.ua fields (currency, measure)
- Save to date-structured path
Purpose: Common functionality across modules
Key Functions:
- create_todays_path(): Generate date-based directory structure
Date Structure: YYYY/MM/DD hierarchy for organized data storage
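A minimal sketch of such a date-based path helper is shown below; the actual signature and behavior of create_todays_path() in helpers.py may differ.

from datetime import datetime, timezone
from pathlib import Path

def create_todays_path(base_dir="data"):
    """Create (if needed) and return a base_dir/YYYY/MM/DD path for today."""
    today = datetime.now(timezone.utc)
    path = Path(base_dir) / today.strftime("%Y/%m/%d")
    path.mkdir(parents=True, exist_ok=True)
    return path

print(create_todays_path())  # e.g. data/2022/03/15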
The shared Pipfile declares the following packages:
[packages]
requests-html = "*" # Web scraping with JavaScript support
pandas = "*" # Data manipulation
httpx = "*" # Modern HTTP client
pydot = "*" # Graph visualization
loguru = "*" # Structured logging
boto3 = "*" # AWS SDK
fsspec = "*" # Filesystem abstraction
aws-lambda-powertools = "*" # Lambda utilities
s3fs = "*" # S3 filesystem interface
Both Lambda functions share similar requirements:
aws-lambda-powertools==1.25.0
httpx==0.22.0
pandas==1.4.0
requests-html==0.10.0
boto3==1.20.52
s3fs
loguru==0.6.0
beautifulsoup4==4.10.0
numpy==1.22.2
- Python 3.9
- AWS Account with appropriate permissions
- AWS CLI configured
- Docker (for SAM local testing)
- Git
- Clone the repository:
git clone <repository-url>
cd data_handler
- Install pipenv:
pip install pipenv
- Install dependencies:
make install-dependencies
# Or manually:
pipenv install --dev
- Activate virtual environment:
pipenv shell
Set the required environment variables:
# Dom.ria scraper
RIA_API_KEY=<your-dom-ria-api-key>
S3_BUCKET=eu-central-1-scraper-data
# STN Craft scraper
S3_BUCKET=eu-central-1-scraper-data
# Notion integration
NOTION_TOKEN=<your-notion-integration-token>
NOTION_DB_ID=<your-database-id>
- Create S3 bucket:
aws s3 mb s3://eu-central-1-scraper-data --region eu-central-1
- Configure AWS credentials:
aws configure
The project includes two GitHub Actions workflows for automated CI/CD:
Dom.ria Scraper Deployment:
- Triggered on push to main or master branches
- Workflow file: .github/workflows/build_dom_ria_scraper.yaml
- Stack name: dom-scraper
STN Craft Scraper Deployment:
- Triggered on push to main or master branches
- Workflow file: .github/workflows/build_stn_craft_scraper.yaml
- Stack name: runes-scraper
cd dom-ria-scraper-lambda
sam build --use-container -t dom.yaml
sam package \
--s3-bucket eu-central-1-scraper-data \
--output-template-file packaged.yaml \
--region eu-central-1
sam deploy \
--template-file packaged.yaml \
--stack-name dom-scraper \
--capabilities CAPABILITY_IAM \
--region eu-central-1
cd stn-craft-scraper-lambda
sam build --use-container -t runes.yaml
sam package \
--s3-bucket eu-central-1-scraper-data \
--output-template-file packaged.yaml \
--region eu-central-1
sam deploy \
--template-file packaged.yaml \
--stack-name runes-scraper \
--capabilities CAPABILITY_IAM \
--region eu-central-1
# Dom.ria Scraper
cd dom-ria-scraper-lambda
sam local invoke DomRiaScraperFunction
# STN Craft Scraper
cd stn-craft-scraper-lambda
sam local invoke STNCraftScraperFunction
cd stn-craft-scraper-lambda
pip install -r tests/requirements.txt
python -m pytest tests/unit -v
Lambda Handler: The Lambda function runs automatically via CloudWatch Events (daily at 6:00 AM), but can also be invoked manually:
# app.py main execution
if __name__ == "__main__":
dom = DomScraper()
flats_id_df = dom.get_flats_ids_df()
dom.get_flats_data(flats_id_df)
Key Features:
- Automatic rate limiting (1000 requests/hour)
- Checkpointing for resumable scraping
- Graceful quota handling with early exit
Lambda Handler:
def lambda_handler(event, context):
parsed_df = ExportSTNCraft().export_runes_data()
Import2Prom().build_prom_csv(parsed_df)
return {
"statusCode": 200,
"body": json.dumps({
"message": "Products successfully parsed and saved to s3 bucket"
}),
}
Export from Notion Database:
from api.notion import Notion
notion = Notion(token="your-token")
notion_data = notion.export_data_from_table(_id="database-id")
normalized_data = notion.convert_notion_to_dict(input_data=notion_data)
# Convert to pandas DataFrame
import pandas as pd
df = pd.DataFrame(normalized_data)
df.to_csv('exported_data.csv')
Import to Notion Database:
notion = Notion(token="your-token")
# Build request
request_body = notion.build_table_request(
db_id="database-id",
Name={'data_type': 'title', 'value': 'Product Name'},
Price={'data_type': 'text', 'value': '100'},
Category={'data_type': 'select', 'value': 'Electronics'}
)
# Import data
notion.import_data_to_table(request_body)
Run the scraper:
from collie.main import parse_dog_page
# Scrape dog data
dog_data = parse_dog_page()
# Data structure:
# {
# "category": {
# "dog_name": {
# "link": "...",
# "details": {
# "images_list": [...],
# "breed_tree": {...},
# "dog_description": [...]
# }
# }
# }
# }
Generate genealogy graph:
import pydot
from collie.main import visit

graph = pydot.Dot(graph_type='graph')
# 'category' and 'dog_name' are placeholder keys in the parse_dog_page() result
visit(dog_data['category']['dog_name']['details']['breed_tree'])
graph.write_png('dog_genealogy.png')
Export and convert:
from magic_shop.notion_magic_db import export_data_from_notion
# Exports Notion data and saves as CSV
df = export_data_from_notion()
# Output: data/YYYY/MM/DD/exported_notion_db_data.csv
Both scrapers use an identical CI/CD pipeline structure:
Pipeline Steps:
- Checkout: Clone repository
- Setup Python: Install Python 3.9.10
- Setup SAM: Install AWS SAM CLI
- Configure AWS: Authenticate with AWS credentials
- Build: Create Lambda deployment package with Docker
- Package: Upload artifacts to S3
- Deploy: Deploy CloudFormation stack
Secrets Required:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
Deployment Region: eu-central-1
Stack Names:
- Dom.ria: dom-scraper
- STN Craft: runes-scraper
- Trigger: Push to main or master branches
- Strategy: Full stack replacement
- Rollback: CloudFormation automatic rollback on failure
- Zero Downtime: Lambda versioning with alias updates
Globals:
Function:
Timeout: 900 # 15 minutes
MemorySize: 256 # 256 MB
Environment:
Variables:
RIA_API_KEY: <api-key>
S3_BUCKET: eu-central-1-scraper-data
Resources:
DomRiaScraperFunction:
Type: AWS::Serverless::Function
Properties:
Runtime: python3.9
Handler: app.lambda_handler
Events:
Schedule:
Type: Schedule
Properties:
Schedule: cron(0 6 * * ? *) # Daily at 6 AM UTC
Globals:
Function:
Timeout: 900 # 15 minutes
MemorySize: 256 # 256 MB
Environment:
Variables:
S3_BUCKET: eu-central-1-scraper-data
Resources:
STNCraftScraperFunction:
Type: AWS::Serverless::Function
Properties:
Runtime: python3.9
Handler: app.lambda_handler
Events:
Schedule:
Type: Schedule
Properties:
Schedule: cron(0 6 * * ? *) # Daily at 6 AM UTC
The Makefile provides the following targets:
help # Display available commands
install-dependencies # Install pipenv and all dependencies
create-artifact-bucket # Create S3 bucket for SAM artifacts
deploy-stn-craft-scraper # Build and deploy STN Craft scraper
Dom.ria Scraper Flow:
1. CloudWatch Event (6:00 AM UTC)
│
▼
2. Lambda Invocation
│
▼
3. Get Listing IDs (paginated)
│ - API Rate Limiting: 1000 req/hour
│ - Checkpoint every page
│
▼
4. For Each Listing ID:
│ - Fetch detailed information
│ - Handle timeouts and retries
│ - Quota monitoring
│
▼
5. Export to S3
│ - Path: /dom-ria/YYYY/MM/DD/HH/MM/
│ - Files: flat_ids_list_*.csv, flats_data_list_*.csv
│
▼
6. Lambda Completion
STN Craft Scraper Flow:
1. CloudWatch Event (6:00 AM UTC)
│
▼
2. Lambda Invocation
│
▼
3. Scrape Product Pages
│ - Category page: /product-category/rune-sets/
│ - Extract product links
│
▼
4. For Each Product:
│ - Parse product details
│ - Extract images, price, description
│ - Build product dictionary
│
▼
5. Convert to Prom Format
│ - Load CSV template
│ - Normalize prices
│ - Add currency and units
│
▼
6. Export to S3
│ - Path: /stn-craft/YYYY/MM/DD/
│ - File: stn_craft_runes_prom.csv
│
▼
7. Lambda Completion
Notion Integration Flow:
1. Manual Execution
│
▼
2. Query Notion Database
│ - Database ID from environment
│ - API Version: 2021-08-16
│
▼
3. Convert Notion Types
│ - Rich text → plain text
│ - Select → name
│ - Files → URL list
│
▼
4. Normalize to DataFrame
│ - Flatten nested structures
│ - Handle missing values
│
▼
5. Export or Import
│ - Export: CSV to date-structured path
│ - Import: Build Notion API request
│
▼
6. Completion
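The type conversion in step 3 of this flow could be implemented roughly as in the sketch below, based on the public Notion API response format rather than the exact code in api/notion.py.

def flatten_property(prop):
    """Collapse one Notion property object into a plain Python value."""
    prop_type = prop["type"]
    if prop_type in ("rich_text", "title"):
        return "".join(part["plain_text"] for part in prop[prop_type])
    if prop_type == "select":
        return prop["select"]["name"] if prop["select"] else None
    if prop_type == "multi_select":
        return [option["name"] for option in prop["multi_select"]]
    if prop_type == "date":
        return prop["date"]["start"] if prop["date"] else None
    if prop_type == "files":
        return [f["file"]["url"] if f["type"] == "file" else f["external"]["url"]
                for f in prop["files"]]
    if prop_type == "url":
        return prop["url"]
    return None  # unsupported property types are dropped

def convert_notion_page(page):
    """Flatten one page object from a database query into a plain dict."""
    return {name: flatten_property(value) for name, value in page["properties"].items()}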
Dom.ria Scraper:
- Function Name: DomRiaScraperFunction
- Runtime: Python 3.9
- Memory: 256 MB
- Timeout: 900 seconds (15 minutes)
- Trigger: CloudWatch Events (cron)
- Permissions: S3 write access
STN Craft Scraper:
- Function Name: STNCraftScraperFunction
- Runtime: Python 3.9
- Memory: 256 MB
- Timeout: 900 seconds (15 minutes)
- Trigger: CloudWatch Events (cron)
- Permissions: S3 write access
Bucket Name: eu-central-1-scraper-data
Structure:
s3://eu-central-1-scraper-data/
├── dom-ria/
│ └── YYYY/
│ └── MM/
│ └── DD/
│ └── HH/
│ └── MM/
│ ├── flat_ids_list_HH_page_N.csv
│ └── flats_data_list_HH_page.csv
│
└── stn-craft/
└── YYYY/
└── MM/
└── DD/
└── stn_craft_runes_prom.csv
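Given this layout, one day's output can be read back with the existing boto3/pandas/s3fs dependencies, as in the sketch below (the dates in the keys are illustrative).

import boto3
import pandas as pd

# List one day's Dom.ria objects (the date in the prefix is illustrative).
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="eu-central-1-scraper-data", Prefix="dom-ria/2022/03/15/")
for obj in listing.get("Contents", []):
    print(obj["Key"])

# Read one day's STN Craft export directly via pandas + s3fs.
df = pd.read_csv("s3://eu-central-1-scraper-data/stn-craft/2022/03/15/stn_craft_runes_prom.csv")
print(df.head())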
Dom.ria Schedule:
- Rule Name: daily-run-dom-ria
- Schedule: cron(0 6 * * ? *)
- Description: Daily run at 06:00 AM UTC
- State: Enabled
STN Craft Schedule:
- Rule Name: daily-run
- Schedule: cron(0 6 * * ? *)
- Description: Daily run at 06:00 AM UTC
- State: Enabled
Auto-generated by CloudFormation with least-privilege permissions:
- CloudWatch Logs write access
- S3 bucket read/write access
- Lambda execution role
Recent commit history:
6cf02f8 - fixed pandas errors
c66c1e3 - fixed pandas errors
64bcbcb - adding env variables
a1e78b0 - adding env variables
cdff19b - adding env variables
ab773bf - adding env variables
d6956b0 - updated: change mkdir to mkdirstring
0f9d38e - changed cron name
7d46ccb - changed pipeline name
435ddf2 - added: dom-ria scraper updated: stn data
- API Keys: Hard-coded API keys present in configuration files (should be rotated before archival)
- Rate Limiting: Dom.ria API limited to 1000 requests/hour
- Pandas Deprecation: The .append() method is deprecated in newer pandas versions
- Error Handling: Some functions use recursive retry logic, which could overflow the call stack
- S3 Permissions: Lambda functions require appropriate IAM roles for S3 access
- Date-based Partitioning: S3 storage organized by date for easy data lifecycle management
- Serverless Architecture: No infrastructure management, automatic scaling
- Scheduled Execution: Daily runs at 6:00 AM UTC to capture fresh data
- CSV Output Format: Prom.ua-compatible format for direct import
- Checkpointing: Intermediate saves to handle quota limits and timeouts
- Move API keys to AWS Secrets Manager
- Implement SNS notifications for scraper failures
- Add data validation and quality checks
- Implement incremental scraping (delta detection)
- Add CloudWatch dashboards for monitoring
- Upgrade pandas code to use concat() instead of the deprecated append() (see the sketch after this list)
- Add comprehensive error handling and alerting
- Implement data archival policies for S3
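For the pandas upgrade mentioned above, the change amounts to replacing the deprecated DataFrame.append() calls with pandas.concat(), for example:

import pandas as pd

df = pd.DataFrame({"flat_id": [1, 2]})
new_rows = pd.DataFrame({"flat_id": [3]})

# Deprecated in pandas 1.4 and removed in 2.x:
# df = df.append(new_rows, ignore_index=True)

# Replacement:
df = pd.concat([df, new_rows], ignore_index=True)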
- Unit tests exist for the STN Craft scraper (stn-craft-scraper-lambda/tests/)
- Integration test scaffolding is present but has minimal coverage
- Local testing available via SAM CLI
- No automated test execution in CI/CD pipeline
Dom.ria Scraper:
- Average execution: 10-15 minutes (depends on listing count)
- API calls: ~100-150 per execution
- Data output: 2-5 MB per day
STN Craft Scraper:
- Average execution: 2-5 minutes
- HTTP requests: ~30-50 per execution
- Data output: <1 MB per day
- Lambda Powertools: Structured JSON logging (see the sketch after this list)
- Log Groups: /aws/lambda/DomRiaScraperFunction, /aws/lambda/STNCraftScraperFunction
- Metrics: Available via CloudWatch Lambda metrics
- Retention: Default CloudWatch Logs retention
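A minimal example of the structured logging setup with aws-lambda-powertools; the service name is illustrative.

from aws_lambda_powertools import Logger

logger = Logger(service="dom-ria-scraper")  # service name is illustrative

@logger.inject_lambda_context
def lambda_handler(event, context):
    # Emits structured JSON log records to the function's CloudWatch log group.
    logger.info("Scraper run started")
    return {"statusCode": 200}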
No license information provided in the repository.
ARCHIVED - This repository is being archived. All documentation has been preserved for future reference.
January 5, 2026
This project served as a personal data automation and integration system combining web scraping, cloud infrastructure, and API integrations. The codebase demonstrates serverless architecture patterns, automated data pipelines, and multi-platform data synchronization.
No contact information provided in the repository.
Archive Note: This README was generated for archival purposes and represents the complete state of the project as of the archival date. All code, configurations, and documentation are preserved as-is for historical reference.