From 2833b4679e3d1dc8767477bb17ff07db65f581ad Mon Sep 17 00:00:00 2001
From: dmytro kovalskyi
Date: Wed, 25 Mar 2026 16:57:29 +0200
Subject: [PATCH] adding onboarding.md

---
 ONBOARDING.md | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 ONBOARDING.md

diff --git a/ONBOARDING.md b/ONBOARDING.md
new file mode 100644
index 0000000..023ac89
--- /dev/null
+++ b/ONBOARDING.md
@@ -0,0 +1,204 @@
# Onboarding Guide: Local Coffee Shop Analytics Showcase

This document helps new team members become productive quickly on the `showcase_local_coffee_shop` project.

## 1. Project Purpose

This showcase demonstrates a full analytics lifecycle:
- generate synthetic operational data,
- upload it to GCS and load it into BigQuery,
- transform it with dbt,
- expose analytics-ready datasets for BI dashboards.

Primary business questions covered:
- sales trends and store performance,
- customer behavior and loyalty segmentation,
- product mix and basket composition.

## 2. System Architecture

```mermaid
flowchart LR
    A[Data Generation\nPython Script or Cloud Run Job] --> B[GCS Bucket\ncsv_sources/]
    B --> C[Cloud Function\nload_to_bq]
    C --> D[BigQuery Raw Tables\ncustomers/orders/order_details/products/stores]
    D --> E[dbt Models\nDimensions + Facts + Data Marts]
    E --> F[Looker Studio\nSales + Customer Dashboards]

    S1[Cloud Scheduler\ncoffee-job-schedule] --> T[Cloud Function\ntrigger-job]
    T --> A
    S2[Cloud Scheduler\nload-bq-daily] --> C
```

## 3. Repository Map

```text
showcase_local_coffee_shop/
  dataset_generation/      # Local synthetic data generation script
  dataset_upload/          # Local upload script to GCS
  dbt_models/              # dbt project for transformations
  google_cloud_run/        # Cloud Run jobs/functions for automation
    generate_and_store/
    load_to_bq/
    trigger_cloud_run_job/
```

## 4. Day-1 Setup Checklist

1. Install Python 3.9+.
2. Create and activate a virtual environment.
3. Install dependencies from the root `requirements.txt`.
4. Install and authenticate the Google Cloud SDK (`gcloud auth application-default login`).
5. Verify BigQuery and GCS access in the target GCP project.
6. Configure the dbt profile in `dbt_models/profiles.yml`:
   - `project`
   - `dataset`
   - `keyfile` path (or preferred auth method)
7. Run a local end-to-end dry run (generation -> upload -> dbt).

Recommended local setup commands:

```bash
cd path/to/data-analyst-portfolio   # adjust to where you cloned the repo
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## 5. Local Development Workflow

### 5.1 Generate CSV Data Locally

```bash
cd showcase_local_coffee_shop/dataset_generation
python generate_data.py
```

Expected output: CSV files for customers, orders, order details, products, and stores.

### 5.2 Upload CSVs to GCS

Before running, review the values in `dataset_upload/upload_csv_to_gcs.py`:
- `BUCKET_NAME`
- `folder_path`
- `GCS_FOLDER`

```bash
cd showcase_local_coffee_shop/dataset_upload
python upload_csv_to_gcs.py
```

### 5.3 Run dbt Transformations

```bash
cd showcase_local_coffee_shop/dbt_models
dbt debug
dbt run
dbt test
```

## 6. Cloud Automation Flow

The production-style pipeline uses Cloud Scheduler + Cloud Functions + Cloud Run.

```mermaid
sequenceDiagram
    participant CS1 as Cloud Scheduler (coffee-job-schedule)
    participant CF1 as Cloud Function (trigger-job)
    participant CR as Cloud Run Job (coffee-data-job)
    participant GCS as GCS bucket
    participant CS2 as Cloud Scheduler (load-bq-daily)
    participant CF2 as Cloud Function (load-bq-from-csv)
    participant BQ as BigQuery

    CS1->>CF1: HTTP trigger
    CF1->>CR: Run job via Cloud Run API
    CR->>GCS: Write fresh CSV files
    CS2->>CF2: HTTP trigger (delayed schedule)
    CF2->>BQ: Load CSVs to raw tables (WRITE_TRUNCATE)
```

## 7. dbt Modeling Overview

The dbt project builds dimensional and reporting layers directly from the source tables.

```mermaid
flowchart TB
    S1[source.customers] --> D1[dim_customer]
    S2[source.orders] --> D2[dim_order]
    S3[source.products] --> D3[dim_product]
    S4[source.stores] --> D4[dim_store]
    S2 --> F1[fact_order_details]
    S5[source.order_details] --> F1
    D1 --> F1
    D2 --> F1
    D3 --> F1
    D4 --> F1
    F1 --> M1[flat_order_details]
    F1 --> M2[flat_customer_metrics]
```

## 8. Operational Notes

- The BigQuery load currently uses `WRITE_TRUNCATE` in `google_cloud_run/load_to_bq/main.py`, so each run fully replaces the raw tables.
- Scheduler ordering matters: generation must complete before the load starts.
- Some existing scripts contain machine-specific absolute paths; update them for your machine.

## 9. Token Footprint Evaluation

Token sizing helps estimate the AI context and prompt budget for each area.

Method used:
- measured total bytes per component,
- estimated tokens with `tokens ~= bytes / 4` (a rough heuristic for English text and code).

### 9.1 Core Components

| Component | Files | Lines | Bytes | Approx Tokens |
|---|---:|---:|---:|---:|
| `showcase_local_coffee_shop/dataset_generation` | 1 | 285 | 10,760 | 2,690 |
| `showcase_local_coffee_shop/dataset_upload` | 1 | 25 | 829 | 207 |
| `showcase_local_coffee_shop/dbt_models` | 20 | 353 | 11,082 | 2,770 |
| `showcase_local_coffee_shop/google_cloud_run` | 9 | 274 | 10,536 | 2,634 |

Approximate total for these implementation components: about `8,301` tokens.

### 9.2 Main Documentation + Root Config

| Path | Lines | Bytes | Approx Tokens |
|---|---:|---:|---:|
| `README.md` | 85 | 3,683 | 920 |
| `requirements.txt` | 6 | 314 | 78 |
| `showcase_local_coffee_shop/README.md` | 268 | 11,315 | 2,828 |

Approximate total for these docs/config files: about `3,826` tokens.

### 9.3 Practical Prompting Guidance

- If you include all core implementation components and the main docs together, budget roughly `12,100` tokens.
- Keep 20-30% headroom for instructions and model responses.
- For focused tasks, include only one subsystem:
  - data generation + upload: about `2,900` tokens,
  - dbt-only work: about `2,770` tokens,
  - cloud-run automation-only work: about `2,634` tokens.

## 10. Suggested 30-60-90 Minute Onboarding Plan

- First 30 minutes:
  - read this guide and `showcase_local_coffee_shop/README.md`,
  - confirm local Python and GCP auth.
- Next 60 minutes:
  - run local generation/upload,
  - validate raw tables in BigQuery.
- Next 90 minutes:
  - run `dbt run` + `dbt test`,
  - inspect the final marts and one dashboard.

## 11. Definition of Done for New Team Members

A new team member is considered onboarded when they can:
1. Generate and upload synthetic data.
2. Load raw tables to BigQuery (manually or through the cloud path).
3. Run dbt models successfully.
4. Explain the source -> dimensional/fact -> mart flow.
5. Troubleshoot one failed step using logs.
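## 12. Appendix: Re-measuring Token Footprints

The byte-to-token heuristic from section 9 can be sketched as a small script so the tables can be refreshed after code changes. This is a minimal illustration, not project code: the function names and the file-extension filter are assumptions, and integer division (truncation) is used because it reproduces the table values above.

```python
import os


def approx_tokens(num_bytes: int) -> int:
    # Section 9 heuristic: tokens ~= bytes / 4, truncated to an integer.
    return num_bytes // 4


def component_footprint(path, extensions=(".py", ".sql", ".yml", ".md", ".txt")):
    """Walk a component directory and report files, lines, bytes, approx tokens."""
    files = lines = total_bytes = 0
    for root, _dirs, names in os.walk(path):
        for name in names:
            if not name.endswith(extensions):
                continue
            full = os.path.join(root, name)
            files += 1
            total_bytes += os.path.getsize(full)
            with open(full, encoding="utf-8", errors="replace") as fh:
                lines += sum(1 for _ in fh)
    return {
        "files": files,
        "lines": lines,
        "bytes": total_bytes,
        "approx_tokens": approx_tokens(total_bytes),
    }
```

For example, `approx_tokens(10760)` reproduces the `2,690` figure for `dataset_generation` in table 9.1; run `component_footprint("showcase_local_coffee_shop/dbt_models")` from the repository root to re-measure a component.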