E-commerce Analytics Pipeline

(CI badges: Terraform CI · dbt CI · Dagster CD)

Overview

Marketing analytics for a large e-commerce events dataset from the REES46 Marketing Platform.

(Screenshot: raw events snippet)

  • A production-style data pipeline that processes 400M+ e-commerce events to generate customer analytics and business insights.
  • Built using modern data engineering tools (dbt, Dagster, BigQuery) to demonstrate scalable analytics infrastructure and best practices.
  • The pipeline automates data ingestion, transformation, and metric calculation for customer segmentation (RFM analysis; see the sketch below), conversion funnel tracking, churn identification, and other KPIs.
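
To make the RFM idea concrete, the sketch below scores customers into quintiles and assigns segments. The project computes this in dbt/BigQuery; this pandas version, including the column names (last_purchase_at, order_count, total_spend) and the segment cut-offs, is an illustrative assumption, not the repo's actual logic.

# Illustrative RFM scoring (the project does this in dbt/BigQuery).
# Column names last_purchase_at, order_count, total_spend are assumed.
import pandas as pd

def score_rfm(users: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    out = users.copy()
    recency_days = (as_of - out["last_purchase_at"]).dt.days
    # Quintile scores 1-5; rank first to avoid duplicate bin edges.
    # Recency labels are inverted: more recent purchases score higher.
    out["r"] = pd.qcut(recency_days.rank(method="first"), 5, labels=[5, 4, 3, 2, 1]).astype(int)
    out["f"] = pd.qcut(out["order_count"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
    out["m"] = pd.qcut(out["total_spend"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
    # Example segment rules (the real cut-offs live in the dbt models).
    out["segment"] = "Other"
    out.loc[(out["r"] >= 4) & (out["f"] >= 4), "segment"] = "Champions"
    out.loc[(out["r"] <= 2) & (out["m"] >= 4), "segment"] = "At Risk"
    return out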

Contents

  • Pipeline Architecture
  • Key Findings
  • Tech Stack
  • Data Models
  • Lineage Graph
  • Getting Started
  • Documentation

Pipeline Architecture

(Diagram: pipeline architecture)

Key Findings

  • RFM Segmentation: Champions (12% of customers) generate 3x higher average revenue ($3,333 per customer); 135K high-value "At Risk" customers identified for retention campaigns
  • Churn: 88% of early customers did not make a repeat purchase within 90 days
  • Conversion Funnel: 6.1% view-to-purchase rate, driven by 88% drop-off before cart and 49% cart abandonment (see the consistency check below)
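
The three funnel figures are mutually consistent: an 88% pre-cart drop-off leaves 12% of viewers reaching the cart, 49% cart abandonment means 51% of those purchase, and 0.12 × 0.51 ≈ 6.1%. In plain Python:

# Consistency check of the funnel figures above.
view_to_cart = 1 - 0.88        # 88% drop off before cart
cart_to_purchase = 1 - 0.49    # 49% cart abandonment
print(f"view-to-purchase ≈ {view_to_cart * cart_to_purchase:.1%}")  # 6.1%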

Tech Stack

  • Data Warehouse: BigQuery - serverless, scales to petabytes, native partitioning/clustering
  • Transformation: dbt Core - version-controlled SQL, built-in testing, lineage tracking
  • Orchestration: Dagster - asset-based paradigm, first-class dbt integration (sketched after this list), rich built-in observability
  • Infrastructure: Google Cloud Platform - seamless BigQuery integration, cost-effective compute
  • Infrastructure as Code: Terraform - declarative, reproducible infrastructure with state management
  • CI/CD: GitHub Actions - native repo integration, matrix builds for parallel testing
  • Visualization: Tableau - handles large datasets, flexible for both operational and strategic dashboards
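
To make the Dagster/dbt pairing concrete, the sketch below shows the usual dagster-dbt pattern, in which every dbt model is loaded as a Dagster asset. The manifest path and asset name are assumptions about project layout, not this repo's actual code.

# Minimal dagster-dbt sketch: each dbt model becomes a Dagster asset.
# The manifest path below is an assumed project layout.
from pathlib import Path
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_MANIFEST = Path("dbt-project/target/manifest.json")

@dbt_assets(manifest=DBT_MANIFEST)
def estore_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Runs `dbt build` and streams per-model events back to Dagster.
    yield from dbt.cli(["build"], context=context).stream()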

Data Models

Staging Layer

  • stg_events - Cleaned event data

Dimension Tables

  • dim_users - User-level metrics (LTV, churn status, purchase history)
  • dim_products - Product attributes and category hierarchy
  • dim_categories - Category taxonomy
  • dim_user_rfm - RFM scores and customer segments

Fact Tables

  • fct_events - Event-level facts with purchase/cart/view flags
  • fct_sessions - Session-level aggregations with conversion funnel flags (see the sketch below)
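
The session rollup is straightforward because REES46 events carry user_session and event_type columns. A rough pandas stand-in for the fct_sessions shape follows; the real transformation is dbt SQL, so treat this as an assumption about its logic:

# Illustrative session-level rollup with funnel flags (pandas stand-in
# for the dbt SQL; REES46-style columns assumed).
import pandas as pd

def build_sessions(events: pd.DataFrame) -> pd.DataFrame:
    df = events.assign(
        viewed=events["event_type"].eq("view"),
        carted=events["event_type"].eq("cart"),
        purchased=events["event_type"].eq("purchase"),
    )
    return df.groupby("user_session").agg(
        session_start=("event_time", "min"),
        session_end=("event_time", "max"),
        viewed=("viewed", "max"),
        carted=("carted", "max"),
        purchased=("purchased", "max"),
    )

From a table like this, view-to-purchase conversion is simply the share of viewing sessions that also purchased.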

Metrics

  • metrics_conversion_rates - Daily/overall conversion metrics
  • metrics_churn - Churn rates by cohort (a toy version follows this list)
  • metrics_rfm_segments - Aggregated segment-level metrics
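
A toy version of the churn metric, using the definition from Key Findings (no repeat purchase within 90 days of the first) and grouping into monthly cohorts; the column names are assumed, and the project computes this in dbt SQL:

# Toy churn-by-cohort calculation (illustrative only).
# Assumes a purchases frame with user_id and event_time columns.
import pandas as pd

def churn_by_cohort(purchases: pd.DataFrame) -> pd.Series:
    first = purchases.groupby("user_id")["event_time"].min().rename("first_purchase")
    df = purchases.join(first, on="user_id")
    in_window = (df["event_time"] > df["first_purchase"]) & (
        df["event_time"] <= df["first_purchase"] + pd.Timedelta(days=90)
    )
    repeaters = set(df.loc[in_window, "user_id"])
    churned = pd.Series(~first.index.isin(repeaters), index=first.index)
    # Cohort = month of first purchase; value = share churned.
    return churned.groupby(first.dt.to_period("M")).mean()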

Snapshots (SCD Type 2)

  • snap_user_rfm - Tracks changes to RFM segments over time (a toy version of the SCD2 update follows)
  • snap_user_status - Tracks changes to user activity/churn status
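
dbt snapshots implement SCD Type 2 by closing out the current row and appending a new one whenever a tracked value changes. Below is a toy version of that update for snap_user_rfm; dbt's default dbt_valid_from/dbt_valid_to columns are assumed, and this is an illustration rather than how dbt actually executes snapshots.

# Toy SCD Type 2 update mirroring dbt snapshot behavior (illustrative only).
import pandas as pd

def scd2_update(snapshot: pd.DataFrame, current: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    is_open = snapshot["dbt_valid_to"].isna()
    latest = snapshot.loc[is_open].set_index("user_id")["segment"]
    incoming = current.set_index("user_id")["segment"]
    # Users whose segment changed (or who are entirely new).
    changed = incoming[incoming.ne(latest.reindex(incoming.index))]
    # Close out the superseded rows...
    snapshot.loc[is_open & snapshot["user_id"].isin(changed.index), "dbt_valid_to"] = now
    # ...and append fresh open-ended rows.
    new_rows = pd.DataFrame({
        "user_id": changed.index,
        "segment": changed.values,
        "dbt_valid_from": now,
        "dbt_valid_to": pd.NaT,
    })
    return pd.concat([snapshot, new_rows], ignore_index=True)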

Lineage Graph

(Screenshot: Dagster asset lineage graph)

Getting Started

Prerequisites

  • Python 3.9 - 3.13
  • Terraform >= 1.0
  • gcloud CLI authenticated with your GCP project
  • GCP Project with BigQuery and Cloud Storage APIs enabled
  • Service Account with roles: BigQuery Admin, Storage Admin

Infrastructure Setup

Provision the required GCP resources using Terraform:

cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project ID, bucket name, and Slack token
terraform init
terraform apply

This provisions a GCP VM with Dagster running as systemd services. Access the Dagster UI via SSH tunnel:

gcloud compute ssh terraform-instance --zone=us-central1-a -- -L 3000:localhost:3000
# Then open http://localhost:3000 in your browser

See terraform/README.md for detailed setup instructions.

dbt Configuration

Configure dbt to connect to your BigQuery instance:

cd dbt-project
cp profiles.yml.example profiles.yml
# Edit profiles.yml with your GCP project ID and service account key path

Running the Pipeline

  1. Upload raw data CSV files to the GCS bucket
  2. Dagster sensors automatically detect new files and trigger data loads (a minimal sensor sketch follows these steps)
  3. Transformations run automatically after successful loads
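
A minimal sketch of what such a sensor can look like; the bucket name, job name, and cursor scheme are assumptions, not this repo's implementation:

# Illustrative Dagster sensor: fire one run per newly seen GCS object.
# BUCKET and job_name are assumed; the cursor tracks already-seen files.
from dagster import RunRequest, SensorEvaluationContext, sensor
from google.cloud import storage

BUCKET = "estore-raw-events"  # assumed bucket name

@sensor(job_name="load_raw_events")
def gcs_new_file_sensor(context: SensorEvaluationContext):
    seen = set((context.cursor or "").split(",")) - {""}
    client = storage.Client()
    names = [b.name for b in client.list_blobs(BUCKET) if b.name.endswith(".csv")]
    for name in names:
        if name not in seen:
            # run_key de-duplicates: Dagster skips keys it has already run.
            yield RunRequest(run_key=name, tags={"gcs_path": name})
    context.update_cursor(",".join(sorted(seen | set(names))))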

Documentation

  • Analysis & Visualizations - Dashboard screenshots and detailed insights
  • Orchestration - Dagster sensors, jobs, and scheduling patterns
  • dbt Docs - Run cd dbt-project && dbt docs generate && dbt docs serve to view model documentation and lineage locally