Marketing analytics for a large eCommerce events dataset from REES46 Marketing Platform.
- A production-style data pipeline that processes 400M+ e-commerce events to generate customer analytics and business insights.
- Built using modern data engineering tools (dbt, Dagster, BigQuery) to demonstrate scalable analytics infrastructure and best practices.
- The pipeline automates data ingestion, transformation, and metric calculation for customer segmentation (RFM analysis), conversion funnel tracking, churn identification, and other KPIs.
- RFM Segmentation: Champions (12% of customers) generate 3x higher revenue ($3,333 avg); 135K high-value "At Risk" customers identified for retention campaigns
- Churn: 88% of early customers did not make a repeat purchase within 90 days
- Conversion Funnel: 6.1% view-to-purchase rate; 88% drop-off before cart, 49% cart abandonment
- Data Warehouse: BigQuery - serverless, scales to petabytes, native partitioning/clustering
- Transformation: dbt Core - version-controlled SQL, built-in testing, lineage tracking
- Orchestration: Dagster - asset-based paradigm, first-class dbt integration, superior observability
- Infrastructure: Google Cloud Platform - seamless BigQuery integration, cost-effective compute
- Infrastructure as Code: Terraform - declarative, reproducible infrastructure with state management
- CI/CD: GitHub Actions - native repo integration, matrix builds for parallel testing
- Visualization: Tableau - handles large datasets, flexible for both operational and strategic dashboards
Staging Layer
stg_events- Cleaned event data
Dimension Tables
dim_users- User-level metrics (LTV, churn status, purchase history)dim_products- Product attributes and category hierarchydim_categories- Category taxonomydim_user_rfm- RFM scores and customer segments
Fact Tables
fct_events- Event-level facts with purchase/cart/view flagsfct_sessions- Session-level aggregations with conversion funnel
Metrics
metrics_conversion_rates- Daily/overall conversion metricsmetrics_churn- Churn rates by cohortmetrics_rfm_segments- Aggregated segment-level metrics
Snapshots (SCD Type 2)
snap_user_rfm- Tracks changes to RFM segments over timesnap_user_status- Tracks changes to user activity/churn status
- Python 3.9 - 3.13
- Terraform >= 1.0
- gcloud CLI authenticated with your GCP project
- GCP Project with BigQuery and Cloud Storage APIs enabled
- Service Account with roles: BigQuery Admin, Storage Admin
Provision the required GCP resources using Terraform:
cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project ID, bucket name, and Slack token
terraform init
terraform applyThis provisions a GCP VM with Dagster running as systemd services. Access the Dagster UI via SSH tunnel:
gcloud compute ssh terraform-instance --zone=us-central1-a -- -L 3000:localhost:3000
# Then open http://localhost:3000 in your browserSee terraform/README.md for detailed setup instructions.
Configure dbt to connect to your BigQuery instance:
cd dbt-project
cp profiles.yml.example profiles.yml
# Edit profiles.yml with your GCP project ID and service account key path- Upload raw data CSV files to the GCS bucket
- Dagster sensors automatically detect new files and trigger data loads
- Transformations run automatically after successful loads
- Analysis & Visualizations - Dashboard screenshots and detailed insights
- Orchestration - Dagster sensors, jobs, and scheduling patterns
- dbt Docs - Run
cd dbt-project && dbt docs generate && dbt docs serveto view model documentation and lineage locally
