A comprehensive data scraping, processing, and integration system built with Python and AWS Lambda. This project consists of multiple specialized modules for web scraping, data transformation, and API integration with Notion and Prom.ua platforms.
- Project Overview
- Architecture
- Technology Stack
- Project Structure
- Modules and Components
- Dependencies
- Setup and Installation
- Deployment
- Usage Examples
- CI/CD Pipeline
- Configuration
- Data Flow
- AWS Resources
- Development Notes
Data Handler is a multi-purpose data processing system designed to automate the collection, transformation, and distribution of data from various sources. The project covers four primary use cases:
- Real Estate Scraping (Dom.ria): Automated scraping of real estate listings from Dom.ria API
- E-commerce Product Scraping (STN Craft): Web scraping of product information from STN Craft website
- Notion Integration: Data export/import workflows between Notion databases and e-commerce platforms
- Dog Breeding Data: Specialized scraper for collie breeding information with genealogy visualization
The system is built with serverless architecture using AWS Lambda functions, scheduled to run daily via CloudWatch Events, with data storage in Amazon S3.
┌─────────────────────────────────────────────────────────────────┐
│ Data Handler System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ GitHub Actions│───────▶│ AWS SAM CLI │ │
│ │ CI/CD Pipeline│ │ Build & Deploy │ │
│ └────────────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ AWS Lambda Functions │ │
│ ├──────────────────────────────────┤ │
│ │ │ │
│ │ ┌──────────────────────────┐ │ │
│ │ │ Dom.ria Scraper │ │ │
│ │ │ (Scheduled: Daily 6AM) │───┼──┐ │
│ │ └──────────────────────────┘ │ │ │
│ │ │ │ │
│ │ ┌──────────────────────────┐ │ │ │
│ │ │ STN Craft Scraper │ │ │ │
│ │ │ (Scheduled: Daily 6AM) │───┼──┤ │
│ │ └──────────────────────────┘ │ │ │
│ │ │ │ │
│ └──────────────────────────────────┘ │ │
│ ▼ │
│ ┌────────────────┐ ┌──────────────────────────────┐ │
│ │ External APIs │ │ Amazon S3 Storage │ │
│ ├────────────────┤ │ (eu-central-1-scraper-data) │ │
│ │ • Dom.ria API │ │ │ │
│ │ • Notion API │ │ /dom-ria/YYYY/MM/DD/ │ │
│ │ • STN Craft │ │ /stn-craft/YYYY/MM/DD/ │ │
│ │ • Collie.com │ └──────────────────────────────┘ │
│ └────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Local Processing Modules │ │
│ ├────────────────────────────────────────┤ │
│ │ • Notion Integration (magic_shop) │ │
│ │ • Collie Dog Data Parser │ │
│ │ • Prom.ua CSV Export │ │
│ └────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
The system follows a modular architecture with clear separation of concerns:
- API Layer: Abstraction for external API integrations (Notion, Prom)
- Scraper Layer: Lambda functions for automated data collection
- Processing Layer: Data transformation and normalization utilities
- Storage Layer: S3-based persistent storage with date-based partitioning
- Language: Python 3.9
- Cloud Platform: Amazon Web Services (AWS)
- Serverless Framework: AWS SAM (Serverless Application Model)
- Package Management: Pipenv
- CI/CD: GitHub Actions
- requests-html (0.10.0): HTML parsing and JavaScript rendering
- httpx (0.22.0): Modern HTTP client with async support
- BeautifulSoup4 (4.10.0): HTML/XML parsing
- lxml (4.7.1): Fast XML/HTML processing
- pyquery (1.4.3): jQuery-like HTML manipulation
- pandas (1.4.0): Data manipulation and analysis
- numpy (1.22.2): Numerical computing
- boto3 (1.20.52): AWS SDK for Python
- aws-lambda-powertools (1.25.0): Lambda utilities and logging
- s3fs: S3 filesystem interface
- fsspec (2022.1.0): Filesystem abstraction
- loguru (0.6.0): Simplified logging
- pydot (1.4.2): Graph visualization (for genealogy trees)
- AWS Lambda: Serverless compute for scrapers
- Amazon S3: Data storage with date-based partitioning
- CloudWatch Events: Scheduled Lambda execution (cron-based)
- AWS CloudFormation: Infrastructure as Code via SAM templates
- GitHub Actions: Automated CI/CD pipeline
data_handler/
│
├── .github/
│ └── workflows/
│ ├── build_dom_ria_scraper.yaml # CI/CD for Dom.ria scraper
│ └── build_stn_craft_scraper.yaml # CI/CD for STN Craft scraper
│
├── api/ # API Integration Layer
│ ├── notion.py # Notion API client
│ └── prom.py # Prom.ua integration
│
├── collie/ # Dog Breeding Data Module
│ ├── main.py # Collie data scraper
│ └── dogs_data.json # Dog breeding data
│
├── dom-ria-scraper-lambda/ # Real Estate Scraper
│ ├── scraper/
│ │ ├── app.py # Lambda handler
│ │ ├── requirements.txt # Dependencies
│ │ └── __init__.py
│ ├── dom.yaml # SAM template
│ ├── README.md # Module documentation
│ └── __init__.py
│
├── magic_shop/ # Notion Data Export Module
│ └── notion_magic_db.py # Notion to Prom export
│
├── stn-craft-scraper-lambda/ # E-commerce Scraper
│ ├── scraper/
│ │ ├── app.py # Lambda handler
│ │ ├── requirements.txt # Dependencies
│ │ ├── templates/
│ │ │ └── prom_import_template.csv # Prom.ua CSV template
│ │ └── __init__.py
│ ├── tests/ # Test suite
│ │ ├── unit/
│ │ ├── integration/
│ │ └── requirements.txt
│ ├── runes.yaml # SAM template
│ ├── README.md # Module documentation
│ └── __init__.py
│
├── Makefile # Build automation
├── Pipfile # Python dependencies
├── Pipfile.lock # Locked dependencies
└── helpers.py # Shared utilities
Purpose: Automated scraping of real estate listings from Dom.ria API
Key Features:
- Rate-limited API requests (1000 requests/hour)
- Pagination handling for large result sets
- Retry logic with timeout handling
- Incremental data export to S3
- Structured logging with AWS Lambda Powertools
Data Collection:
- Flat/apartment listings in Odessa region
- Search criteria: 2-3 rooms, specific metro stations
- Daily scheduled execution at 6:00 AM UTC
Implementation Details:
class DomScraper:
- get_flats_ids_df(): Retrieves all listing IDs with pagination
- get_flat_info(flat_id): Fetches detailed information per listing
- export_flats_ids_to_s3(): Saves listing IDs with checkpoint
- export_flats_info_to_s3(): Saves detailed listing data
- quota_is_not_reached(): Rate limiting management
S3 Storage Pattern: s3://bucket/dom-ria/YYYY/MM/DD/HH/MM/flat_*.csv
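To make the flow above concrete, here is a minimal sketch of rate-limited pagination under the stated 1000 requests/hour quota. The endpoint URL, class internals, and field names are illustrative assumptions, not copied from the actual scraper/app.py; the real DomScraper additionally checkpoints each page to S3.

import os
import time

import httpx
import pandas as pd

API_URL = "https://developers.ria.com/dom/search"  # assumed endpoint; see the Dom.ria API docs
MAX_REQUESTS_PER_HOUR = 1000

class DomScraperSketch:
    """Illustrative only; the real DomScraper also exports intermediate results to S3."""

    def __init__(self):
        self.api_key = os.environ["RIA_API_KEY"]
        self.requests_made = 0
        self.window_started = time.time()

    def quota_is_not_reached(self):
        # Reset the counter once an hour has passed, then compare against the quota.
        if time.time() - self.window_started > 3600:
            self.requests_made = 0
            self.window_started = time.time()
        return self.requests_made < MAX_REQUESTS_PER_HOUR

    def get_flats_ids_df(self):
        ids, page = [], 0
        while self.quota_is_not_reached():
            resp = httpx.get(API_URL, params={"api_key": self.api_key, "page": page}, timeout=30)
            resp.raise_for_status()
            self.requests_made += 1
            page_ids = resp.json().get("items", [])
            if not page_ids:
                break  # no more pages to fetch
            ids.extend(page_ids)
            page += 1
        return pd.DataFrame({"flat_id": ids})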
Purpose: E-commerce product scraping from STN Craft website
Key Features:
- HTML parsing with requests-html
- Product data extraction (name, price, description, images)
- Prom.ua CSV export format
- Multi-page scraping support
Data Collection:
- Rune sets from product category pages
- Product details: title, price, description, images
- Automatic CSV generation for Prom.ua import
Implementation Details:
class ExportSTNCraft:
- get_links(): Extract links from page with filtering
- get_page(): Fetch page HTML
- parse_product_details(): Extract product information
- export_runes_data(): Main scraping orchestration
class Import2Prom:
- build_prom_csv(): Generate Prom.ua import CSV
S3 Storage Pattern: s3://bucket/stn-craft/YYYY/MM/DD/stn_craft_runes_prom.csv
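As a rough illustration of this flow, the sketch below shows how category links and product details can be pulled with requests-html. The placeholder domain, CSS selectors, and function name are assumptions, not the real ExportSTNCraft implementation.

import pandas as pd
from requests_html import HTMLSession

# Placeholder domain; the category path comes from the data-flow section below.
CATEGORY_URL = "https://stn-craft.example/product-category/rune-sets/"

def scrape_rune_products():
    session = HTMLSession()
    category_page = session.get(CATEGORY_URL)

    # Keep only product links from the category page.
    product_links = [link for link in category_page.html.absolute_links if "/product/" in link]

    products = []
    for link in product_links:
        page = session.get(link)
        title = page.html.find("h1", first=True)       # assumed selector
        price = page.html.find(".price", first=True)   # assumed selector
        images = [img.attrs.get("src") for img in page.html.find("img")]
        products.append({
            "name": title.text if title else None,
            "price": price.text if price else None,
            "images": images,
            "url": link,
        })
    return pd.DataFrame(products)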
Purpose: Bidirectional data sync with Notion databases
Key Features:
- Database schema introspection
- Type-safe data import with schema validation
- Data export with normalization
- Support for multiple field types (text, select, multiselect, date, files)
Supported Field Types:
- Rich text
- Title
- Select / Multi-select
- Date
- Files / Images
- URL
Implementation Details:
class Notion:
- build_table_request(): Construct Notion API request
- import_data_to_table(): Insert rows into database
- export_data_from_table(): Query and retrieve data
- convert_notion_to_dict(): Normalize Notion response
- get_table_column_names_list(): Schema introspection
Purpose: E-commerce data transformation for Prom.ua platform
Key Features:
- CSV generation following Prom.ua import format
- Data normalization and validation
- Currency and measurement unit standardization
Implementation Details:
class Prom:
- convert_notion2prom(): Transform Notion data to Prom format
- build_prom_csv(): Generate importable CSV file
Purpose: Specialized scraper for dog breeding information with genealogy
Key Features:
- Multi-page dog catalog scraping
- Genealogy tree extraction and visualization
- Graph generation with pydot
- Notion database integration for dog records
Data Collection:
- Dog profiles (name, gender, category)
- Images and photo galleries
- Pedigree information (4 generations)
- Descriptive text and characteristics
Implementation Details:
Functions:
- parse_dog_page(): Main orchestration
- get_dog_details(): Extract profile and genealogy
- build_dict_data_from_links(): Organize data structure
- import_data_to_notion_db(): Upload to Notion
- visit(): Graph traversal for genealogy visualization
Genealogy Visualization: Generates PNG graph files using pydot
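The recursive traversal behind the genealogy graph could look roughly like the sketch below. The nested-dict shape assumed for breed_tree and the function body are illustrative; the real visit() in collie/main.py may differ.

from typing import Optional

import pydot

graph = pydot.Dot(graph_type="graph")

def visit(tree, parent: Optional[str] = None):
    # Assumed shape: {"Dog Name": {"Sire Name": {...}, "Dam Name": {...}}}
    for dog_name, ancestors in tree.items():
        graph.add_node(pydot.Node(dog_name))
        if parent is not None:
            # Connect each ancestor to the dog it belongs to in the pedigree.
            graph.add_edge(pydot.Edge(parent, dog_name))
        visit(ancestors, parent=dog_name)

if __name__ == "__main__":
    sample_tree = {"Example Collie": {"Example Sire": {}, "Example Dam": {}}}
    visit(sample_tree)
    graph.write_png("dog_genealogy.png")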
Purpose: Automated data export from Notion to Prom.ua
Key Features:
- Notion database querying
- Data normalization with pandas
- CSV export with timestamp organization
- Currency and measurement unit injection
Workflow:
- Export data from Notion database
- Normalize field types to flat dictionary
- Add required Prom.ua fields (currency, measure)
- Save to date-structured path
Purpose: Common functionality across modules
Key Functions:
- create_todays_path(): Generate date-based directory structure
Date Structure: YYYY/MM/DD hierarchy for organized data storage
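A minimal sketch of such a date-based path helper is shown below; the actual signature and behavior of create_todays_path() in helpers.py may differ.

from datetime import datetime, timezone
from pathlib import Path

def create_todays_path(base_dir="data"):
    """Create (if needed) and return a base_dir/YYYY/MM/DD path for today."""
    today = datetime.now(timezone.utc)
    path = Path(base_dir) / today.strftime("%Y/%m/%d")
    path.mkdir(parents=True, exist_ok=True)
    return path

print(create_todays_path())  # e.g. data/2022/03/15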
The shared Pipfile declares the following packages:
[packages]
requests-html = "*" # Web scraping with JavaScript support
pandas = "*" # Data manipulation
httpx = "*" # Modern HTTP client
pydot = "*" # Graph visualization
loguru = "*" # Structured logging
boto3 = "*" # AWS SDK
fsspec = "*" # Filesystem abstraction
aws-lambda-powertools = "*" # Lambda utilities
s3fs = "*" # S3 filesystem interface
Both Lambda functions share similar requirements:
aws-lambda-powertools==1.25.0
httpx==0.22.0
pandas==1.4.0
requests-html==0.10.0
boto3==1.20.52
s3fs
loguru==0.6.0
beautifulsoup4==4.10.0
numpy==1.22.2
- Python 3.9
- AWS Account with appropriate permissions
- AWS CLI configured
- Docker (for SAM local testing)
- Git
- Clone the repository:
git clone <repository-url>
cd data_handler
- Install pipenv:
pip install pipenv
- Install dependencies:
make install-dependencies
# Or manually:
pipenv install --dev
- Activate virtual environment:
pipenv shell
Set the required environment variables:
# Dom.ria scraper
RIA_API_KEY=<your-dom-ria-api-key>
S3_BUCKET=eu-central-1-scraper-data
# STN Craft scraper
S3_BUCKET=eu-central-1-scraper-data
# Notion integration
NOTION_TOKEN=<your-notion-integration-token>
NOTION_DB_ID=<your-database-id>
- Create S3 bucket:
aws s3 mb s3://eu-central-1-scraper-data --region eu-central-1
- Configure AWS credentials:
aws configure
The project includes two GitHub Actions workflows for automated CI/CD:
Dom.ria Scraper Deployment:
- Triggered on push to main or master branches
- Workflow file: .github/workflows/build_dom_ria_scraper.yaml
- Stack name: dom-scraper
STN Craft Scraper Deployment:
- Triggered on push to main or master branches
- Workflow file: .github/workflows/build_stn_craft_scraper.yaml
- Stack name: runes-scraper
cd dom-ria-scraper-lambda
sam build --use-container -t dom.yaml
sam package \
--s3-bucket eu-central-1-scraper-data \
--output-template-file packaged.yaml \
--region eu-central-1
sam deploy \
--template-file packaged.yaml \
--stack-name dom-scraper \
--capabilities CAPABILITY_IAM \
--region eu-central-1
cd stn-craft-scraper-lambda
sam build --use-container -t runes.yaml
sam package \
--s3-bucket eu-central-1-scraper-data \
--output-template-file packaged.yaml \
--region eu-central-1
sam deploy \
--template-file packaged.yaml \
--stack-name runes-scraper \
--capabilities CAPABILITY_IAM \
--region eu-central-1
# Dom.ria Scraper
cd dom-ria-scraper-lambda
sam local invoke DomRiaScraperFunction
# STN Craft Scraper
cd stn-craft-scraper-lambda
sam local invoke STNCraftScraperFunction
cd stn-craft-scraper-lambda
pip install -r tests/requirements.txt
python -m pytest tests/unit -v
Lambda Handler: The Lambda function runs automatically via CloudWatch Events (daily at 6:00 AM), but can also be invoked manually:
# app.py main execution
if __name__ == "__main__":
dom = DomScraper()
flats_id_df = dom.get_flats_ids_df()
dom.get_flats_data(flats_id_df)
Key Features:
- Automatic rate limiting (1000 requests/hour)
- Checkpointing for resumable scraping
- Graceful quota handling with early exit
Lambda Handler:
def lambda_handler(event, context):
parsed_df = ExportSTNCraft().export_runes_data()
Import2Prom().build_prom_csv(parsed_df)
return {
"statusCode": 200,
"body": json.dumps({
"message": "Products successfully parsed and saved to s3 bucket"
}),
}
Export from Notion Database:
from api.notion import Notion
notion = Notion(token="your-token")
notion_data = notion.export_data_from_table(_id="database-id")
normalized_data = notion.convert_notion_to_dict(input_data=notion_data)
# Convert to pandas DataFrame
import pandas as pd
df = pd.DataFrame(normalized_data)
df.to_csv('exported_data.csv')
Import to Notion Database:
notion = Notion(token="your-token")
# Build request
request_body = notion.build_table_request(
db_id="database-id",
Name={'data_type': 'title', 'value': 'Product Name'},
Price={'data_type': 'text', 'value': '100'},
Category={'data_type': 'select', 'value': 'Electronics'}
)
# Import data
notion.import_data_to_table(request_body)
Run the scraper:
from collie.main import parse_dog_page
# Scrape dog data
dog_data = parse_dog_page()
# Data structure:
# {
# "category": {
# "dog_name": {
# "link": "...",
# "details": {
# "images_list": [...],
# "breed_tree": {...},
# "dog_description": [...]
# }
# }
# }
# }
Generate genealogy graph:
import pydot
from collie.main import visit

graph = pydot.Dot(graph_type='graph')
# 'category' and 'dog_name' are placeholder keys in the parse_dog_page() result
visit(dog_data['category']['dog_name']['details']['breed_tree'])
graph.write_png('dog_genealogy.png')
Export and convert:
from magic_shop.notion_magic_db import export_data_from_notion
# Exports Notion data and saves as CSV
df = export_data_from_notion()
# Output: data/YYYY/MM/DD/exported_notion_db_data.csv
Both scrapers use an identical CI/CD pipeline structure:
Pipeline Steps:
- Checkout: Clone repository
- Setup Python: Install Python 3.9.10
- Setup SAM: Install AWS SAM CLI
- Configure AWS: Authenticate with AWS credentials
- Build: Create Lambda deployment package with Docker
- Package: Upload artifacts to S3
- Deploy: Deploy CloudFormation stack
Secrets Required:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
Deployment Region: eu-central-1
Stack Names:
- Dom.ria: dom-scraper
- STN Craft: runes-scraper
- Trigger: Push to main or master branches
- Strategy: Full stack replacement
- Rollback: CloudFormation automatic rollback on failure
- Zero Downtime: Lambda versioning with alias updates
Globals:
Function:
Timeout: 900 # 15 minutes
MemorySize: 256 # 256 MB
Environment:
Variables:
RIA_API_KEY: <api-key>
S3_BUCKET: eu-central-1-scraper-data
Resources:
DomRiaScraperFunction:
Type: AWS::Serverless::Function
Properties:
Runtime: python3.9
Handler: app.lambda_handler
Events:
Schedule:
Type: Schedule
Properties:
Schedule: cron(0 6 * * ? *) # Daily at 6 AM UTC
Globals:
Function:
Timeout: 900 # 15 minutes
MemorySize: 256 # 256 MB
Environment:
Variables:
S3_BUCKET: eu-central-1-scraper-data
Resources:
STNCraftScraperFunction:
Type: AWS::Serverless::Function
Properties:
Runtime: python3.9
Handler: app.lambda_handler
Events:
Schedule:
Type: Schedule
Properties:
Schedule: cron(0 6 * * ? *) # Daily at 6 AM UTC
The Makefile provides the following targets:
help # Display available commands
install-dependencies # Install pipenv and all dependencies
create-artifact-bucket # Create S3 bucket for SAM artifacts
deploy-stn-craft-scraper # Build and deploy STN Craft scraper
Dom.ria Scraper Flow:
1. CloudWatch Event (6:00 AM UTC)
│
▼
2. Lambda Invocation
│
▼
3. Get Listing IDs (paginated)
│ - API Rate Limiting: 1000 req/hour
│ - Checkpoint every page
│
▼
4. For Each Listing ID:
│ - Fetch detailed information
│ - Handle timeouts and retries
│ - Quota monitoring
│
▼
5. Export to S3
│ - Path: /dom-ria/YYYY/MM/DD/HH/MM/
│ - Files: flat_ids_list_*.csv, flats_data_list_*.csv
│
▼
6. Lambda Completion
STN Craft Scraper Flow:
1. CloudWatch Event (6:00 AM UTC)
│
▼
2. Lambda Invocation
│
▼
3. Scrape Product Pages
│ - Category page: /product-category/rune-sets/
│ - Extract product links
│
▼
4. For Each Product:
│ - Parse product details
│ - Extract images, price, description
│ - Build product dictionary
│
▼
5. Convert to Prom Format
│ - Load CSV template
│ - Normalize prices
│ - Add currency and units
│
▼
6. Export to S3
│ - Path: /stn-craft/YYYY/MM/DD/
│ - File: stn_craft_runes_prom.csv
│
▼
7. Lambda Completion
Notion Integration Flow:
1. Manual Execution
│
▼
2. Query Notion Database
│ - Database ID from environment
│ - API Version: 2021-08-16
│
▼
3. Convert Notion Types
│ - Rich text → plain text
│ - Select → name
│ - Files → URL list
│
▼
4. Normalize to DataFrame
│ - Flatten nested structures
│ - Handle missing values
│
▼
5. Export or Import
│ - Export: CSV to date-structured path
│ - Import: Build Notion API request
│
▼
6. Completion
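The type conversion in step 3 of this flow could be implemented roughly as in the sketch below, based on the public Notion API response format rather than the exact code in api/notion.py.

def flatten_property(prop):
    """Collapse one Notion property object into a plain Python value."""
    prop_type = prop["type"]
    if prop_type in ("rich_text", "title"):
        return "".join(part["plain_text"] for part in prop[prop_type])
    if prop_type == "select":
        return prop["select"]["name"] if prop["select"] else None
    if prop_type == "multi_select":
        return [option["name"] for option in prop["multi_select"]]
    if prop_type == "date":
        return prop["date"]["start"] if prop["date"] else None
    if prop_type == "files":
        return [f["file"]["url"] if f["type"] == "file" else f["external"]["url"]
                for f in prop["files"]]
    if prop_type == "url":
        return prop["url"]
    return None  # unsupported property types are dropped

def convert_notion_page(page):
    """Flatten one page object from a database query into a plain dict."""
    return {name: flatten_property(value) for name, value in page["properties"].items()}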
Dom.ria Scraper:
- Function Name: DomRiaScraperFunction
- Runtime: Python 3.9
- Memory: 256 MB
- Timeout: 900 seconds (15 minutes)
- Trigger: CloudWatch Events (cron)
- Permissions: S3 write access
STN Craft Scraper:
- Function Name: STNCraftScraperFunction
- Runtime: Python 3.9
- Memory: 256 MB
- Timeout: 900 seconds (15 minutes)
- Trigger: CloudWatch Events (cron)
- Permissions: S3 write access
Bucket Name: eu-central-1-scraper-data
Structure:
s3://eu-central-1-scraper-data/
├── dom-ria/
│ └── YYYY/
│ └── MM/
│ └── DD/
│ └── HH/
│ └── MM/
│ ├── flat_ids_list_HH_page_N.csv
│ └── flats_data_list_HH_page.csv
│
└── stn-craft/
└── YYYY/
└── MM/
└── DD/
└── stn_craft_runes_prom.csv
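Given this layout, one day's output can be read back with the existing boto3/pandas/s3fs dependencies, as in the sketch below (the dates in the keys are illustrative).

import boto3
import pandas as pd

# List one day's Dom.ria objects (the date in the prefix is illustrative).
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="eu-central-1-scraper-data", Prefix="dom-ria/2022/03/15/")
for obj in listing.get("Contents", []):
    print(obj["Key"])

# Read one day's STN Craft export directly via pandas + s3fs.
df = pd.read_csv("s3://eu-central-1-scraper-data/stn-craft/2022/03/15/stn_craft_runes_prom.csv")
print(df.head())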
Dom.ria Schedule:
- Rule Name: daily-run-dom-ria
- Schedule: cron(0 6 * * ? *)
- Description: Daily run at 06:00 AM UTC
- State: Enabled
STN Craft Schedule:
- Rule Name: daily-run
- Schedule: cron(0 6 * * ? *)
- Description: Daily run at 06:00 AM UTC
- State: Enabled
Auto-generated by CloudFormation with least-privilege permissions:
- CloudWatch Logs write access
- S3 bucket read/write access
- Lambda execution role
Recent commit history:
6cf02f8 - fixed pandas errors
c66c1e3 - fixed pandas errors
64bcbcb - adding env variables
a1e78b0 - adding env variables
cdff19b - adding env variables
ab773bf - adding env variables
d6956b0 - updated: change mkdir to mkdirstring
0f9d38e - changed cron name
7d46ccb - changed pipeline name
435ddf2 - added: dom-ria scraper updated: stn data
- API Keys: Hard-coded API keys present in configuration files (should be rotated before archival)
- Rate Limiting: Dom.ria API limited to 1000 requests/hour
- Pandas Deprecation: The .append() method is deprecated in newer pandas versions
- Error Handling: Some functions use recursive retry logic, which could overflow the call stack
- S3 Permissions: Lambda functions require appropriate IAM roles for S3 access
- Date-based Partitioning: S3 storage organized by date for easy data lifecycle management
- Serverless Architecture: No infrastructure management, automatic scaling
- Scheduled Execution: Daily runs at 6:00 AM UTC to capture fresh data
- CSV Output Format: Prom.ua-compatible format for direct import
- Checkpointing: Intermediate saves to handle quota limits and timeouts
- Move API keys to AWS Secrets Manager
- Implement SNS notifications for scraper failures
- Add data validation and quality checks
- Implement incremental scraping (delta detection)
- Add CloudWatch dashboards for monitoring
- Upgrade pandas code to use concat() instead of the deprecated append() (see the sketch after this list)
- Add comprehensive error handling and alerting
- Implement data archival policies for S3
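For the pandas upgrade mentioned above, the change amounts to replacing the deprecated DataFrame.append() calls with pandas.concat(), for example:

import pandas as pd

df = pd.DataFrame({"flat_id": [1, 2]})
new_rows = pd.DataFrame({"flat_id": [3]})

# Deprecated in pandas 1.4 and removed in 2.x:
# df = df.append(new_rows, ignore_index=True)

# Replacement:
df = pd.concat([df, new_rows], ignore_index=True)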
- Unit tests exist for the STN Craft scraper (stn-craft-scraper-lambda/tests/)
- Integration test scaffolding is present but has minimal coverage
- Local testing available via SAM CLI
- No automated test execution in CI/CD pipeline
Dom.ria Scraper:
- Average execution: 10-15 minutes (depends on listing count)
- API calls: ~100-150 per execution
- Data output: 2-5 MB per day
STN Craft Scraper:
- Average execution: 2-5 minutes
- HTTP requests: ~30-50 per execution
- Data output: <1 MB per day
- Lambda Powertools: Structured JSON logging (see the sketch after this list)
- Log Groups: /aws/lambda/DomRiaScraperFunction, /aws/lambda/STNCraftScraperFunction
- Metrics: Available via CloudWatch Lambda metrics
- Retention: Default CloudWatch Logs retention
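A minimal example of the structured logging setup with aws-lambda-powertools; the service name is illustrative.

from aws_lambda_powertools import Logger

logger = Logger(service="dom-ria-scraper")  # service name is illustrative

@logger.inject_lambda_context
def lambda_handler(event, context):
    # Emits structured JSON log records to the function's CloudWatch log group.
    logger.info("Scraper run started")
    return {"statusCode": 200}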
No license information provided in the repository.
ARCHIVED - This repository is being archived. All documentation has been preserved for future reference.
January 5, 2026
This project served as a personal data automation and integration system combining web scraping, cloud infrastructure, and API integrations. The codebase demonstrates serverless architecture patterns, automated data pipelines, and multi-platform data synchronization.
No contact information provided in the repository.
Archive Note: This README was generated for archival purposes and represents the complete state of the project as of the archival date. All code, configurations, and documentation are preserved as-is for historical reference.