Commit c4e509d

feat: update
1 parent 24afda6 commit c4e509d

17 files changed

Lines changed: 617 additions & 354 deletions

web/content/blog/01-intro-data-lakehouse.md

Lines changed: 52 additions & 17 deletions

@@ -17,7 +17,7 @@ description: "Learn why lakehouse architecture combines the best of data lakes a
 ## Prerequisites
 
 - None. Start here.
-- Optional: [Part 2: Getting Started—Setup Guide](02-setup-guide.md) if you want to run services now.
+- Optional: [Part 2: Getting Started - Setup Guide](02-setup-guide.md) if you want to run services now.
 
 ## The Problem We're Solving
 
@@ -26,10 +26,10 @@ Traditional data pipelines have a fundamental problem: **they force you to choos
 Either you have:
 
 - **A Data Lake**: cheap, flexible storage but chaotic and hard to query
-- **A Data Warehouse**: organized, fast queries but rigid and expensive
+- **A Data Warehouse**: organised, fast queries but rigid and expensive
 
 Phlo solves this by combining the best of both worlds into a **lakehouse**.
-If you want hands-on setup next, jump to [Part 2: Getting Started—Setup Guide](02-setup-guide.md).
+If you want hands-on setup next, jump to [Part 2: Getting Started - Setup Guide](02-setup-guide.md).
 
 ## The Three Eras of Data Architecture
 
@@ -43,7 +43,7 @@ If you want hands-on setup next, jump to [Part 2: Getting Started—Setup Guide]
 
 - Store raw data cheaply in object storage
 - Flexible schema
-- Problem: "Swamp" syndrome—data is disorganized, hard to query, poor governance
+- Problem: "Swamp" syndrome - data is disorganised, hard to query, poor governance
 
 ### Era 3: The Data Lakehouse (2020s+)
 
@@ -94,7 +94,7 @@ graph TB
 
 ### 1. Apache Iceberg (Table Format)
 
-Imagine you're storing data in a filing cabinet. A table format is the **file organization system** that lets you:
+Imagine you're storing data in a filing cabinet. A table format is the **file organisation system** that lets you:
 
 ```
 Instead of:
@@ -217,23 +217,58 @@ def publish_marts() -> None:
 
 ## The Data Flow in Phlo
 
-```mermaid
-flowchart TD
-    A[Nightscout API] -->|DLT + PyIceberg| B[S3 Staging - MinIO]
-    B -->|Merge with dedup| C[raw.glucose_entries]
-    C -->|dbt + Trino| D[bronze.stg_entries]
-    D --> E[silver.fct_readings]
-    E --> F[gold.dim_date]
-    F -->|Trino to Postgres| G[marts.mrt_glucose_overview]
-    G -->|SQL queries| H[Superset Dashboard]
+```
+1. INGEST
+   ┌─────────────────────┐
+   │  Nightscout API     │
+   │  (glucose data)     │
+   └──────────┬──────────┘
+
+              ↓ (DLT + PyIceberg)
+   ┌─────────────────────────────────┐
+   │  S3 Staging (MinIO)             │  ← Temporary parquet files
+   └──────────┬──────────────────────┘
+
+              ↓ (Merge with dedup)
+   ┌──────────────────────────────────────┐
+   │  Iceberg Table: raw.glucose_entries  │  ← Immutable, ACID
+   │  Branch: main (production)           │
+   └──────────┬───────────────────────────┘
+
+2. TRANSFORM
+              ↓ (dbt + Trino)
+   ┌──────────────────────────────────────┐
+   │  Iceberg Table: bronze.stg_entries   │  ← Type conversions
+   └──────────┬───────────────────────────┘
+
+              ↓
+   ┌──────────────────────────────────────┐
+   │  Iceberg Table: silver.fct_readings  │  ← Business logic
+   └──────────┬───────────────────────────┘
+
+              ↓
+   ┌──────────────────────────────────────┐
+   │  Iceberg Table: gold.dim_date        │  ← Dimensions
+   └──────────┬───────────────────────────┘
+
+3. PUBLISH
+              ↓ (Trino → Postgres)
+   ┌───────────────────────────────────────┐
+   │  Postgres: marts.mrt_glucose_overview │  ← Fast for BI
+   └──────────┬────────────────────────────┘
+
+              ↓ (SQL queries)
+   ┌───────────────────────────────────────┐
+   │  Superset Dashboard                   │  ← Visualisation
+   └───────────────────────────────────────┘
 ```
 
 ## Why This Matters (Real Benefits)
 
 | Problem        | Traditional               | Phlo Solution                |
 | -------------- | ------------------------- | ---------------------------- |
 | Data costs     | High (warehouse fees)     | Low (S3 storage)             |
-| Query speed    | Fast                      | Fast (Trino optimization)    |
+| Query speed    | Fast                      | Fast (Trino optimisation)    |
 | Schema changes | Painful rewrites          | Easy evolution               |
 | Governance     | Manual processes          | Git-like branching           |
 | Vendor lock-in | Yes (Snowflake, Redshift) | No (open formats)            |
@@ -300,7 +335,7 @@ See [Troubleshooting Guide](../operations/troubleshooting.md) for deeper diagnos
 
 ## See Also
 
-See also: [Part 2: Getting Started—Setup Guide](02-setup-guide.md), [Part 3: Apache Iceberg Explained](03-apache-iceberg-explained.md), [Part 4: Project Nessie Versioning](04-project-nessie-versioning.md). Reference: [Architecture Overview](../reference/architecture.md).
+See also: [Part 2: Getting Started - Setup Guide](02-setup-guide.md), [Part 3: Apache Iceberg Explained](03-apache-iceberg-explained.md), [Part 4: Project Nessie Versioning](04-project-nessie-versioning.md). Reference: [Architecture Overview](../reference/architecture.md).
 
 ## Summary
 
@@ -317,4 +352,4 @@ Ready to build? In Part 2, we'll:
 - Start all services with one command
 - Run your first data pipeline
 
-**Next**: [Part 2: Getting Started—Setup Guide](02-setup-guide.md)
+**Next**: [Part 2: Getting Started - Setup Guide](02-setup-guide.md)
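Reviewer's note on this file: the new data-flow diagram names a "merge with dedup" step from S3 staging into `raw.glucose_entries`. The semantics of that step can be sketched in plain Python; this is a toy model by the editor, not the actual DLT/PyIceberg implementation, and the `id`/`glucose` field names are invented for illustration:

```python
# Toy model of the "merge with dedup" step: staged rows are merged into
# the existing table, and rows sharing a primary key are deduplicated,
# with the newly staged version winning.
def merge_with_dedup(existing, staged, key="id"):
    merged = {row[key]: row for row in existing}  # current table state
    for row in staged:                            # staged rows win on conflict
        merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

table = [{"id": 1, "glucose": 110}, {"id": 2, "glucose": 95}]
staging = [{"id": 2, "glucose": 97}, {"id": 3, "glucose": 102}]
print(merge_with_dedup(table, staging))
```

Re-running an ingest with overlapping data is then idempotent, which is the property the diagram's annotation is getting at.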

web/content/blog/02-setup-guide.md

Lines changed: 25 additions & 18 deletions

@@ -3,14 +3,14 @@ title: "Getting Started with Phlo — Setup Guide"
 description: "Install Phlo, bootstrap a new project, start the service stack, and run your first data pipeline in under 20 minutes."
 ---
 
-# Part 2: Getting Started with Phlo—Setup Guide
+# Part 2: Getting Started with Phlo - Setup Guide
 
 > Prerequisite: Read [Part 1: What is a Data Lakehouse?](01-intro-data-lakehouse.md) for core concepts.
 
 ## What You'll Learn
 
 - Bootstrap a new Phlo project with `phlo init`
-- Initialize and start the service stack
+- Initialise and start the service stack
 - Ingest sample data and materialize assets
 - Verify results in Dagster and Observatory
 
@@ -54,7 +54,7 @@ description: "Install Phlo, bootstrap a new project, start the service stack, an
 
 If you have less than 4GB RAM, you can start a minimal setup (Postgres + MinIO only) and add services gradually.
 
-## Step 1: Initialize Your Project
+## Step 1: Initialise Your Project
 
 ```bash
 # Create a new Phlo project
@@ -69,7 +69,7 @@ cd my-lakehouse
 ```
 
 
-Then initialize infra (generates `.phlo/.env` and `.phlo/.env.local`):
+Then initialise infra (generates `.phlo/.env` and `.phlo/.env.local`):
 
 ```bash
 phlo services init
@@ -281,12 +281,16 @@ Open **Dagster** at http://localhost:3000
 
 You should see the asset graph:
 
-```mermaid
-flowchart TD
-    A[glucose_entries] --> B[stg_glucose_entries]
-    B --> C[fct_glucose_readings]
-    C --> D[fct_daily_glucose_metrics]
-    D --> E[postgres_marts]
+```
+glucose_entries
+   ↓
+stg_glucose_entries (dbt)
+   ↓
+fct_glucose_readings (dbt)
+   ↓
+fct_daily_glucose_metrics
+   ↓
+postgres_marts
 ```
 
 Click on `glucose_entries` → Click **Materialize this asset**
@@ -340,11 +344,14 @@ This will:
 
 Watch it propagate through the graph:
 
-```mermaid
-flowchart TD
-    A[glucose_entries ✓] --> B[stg_glucose_entries ⏳]
-    B --> C[fct_glucose_readings ⏳]
-    C --> D[postgres_marts ⏳]
+```
+glucose_entries [SUCCESS]
+   ↓
+stg_glucose_entries ⏳ (running)
+   ↓
+fct_glucose_readings ⏳ (waiting)
+   ↓
+postgres_marts ⏳ (waiting)
 ```
 
 ### 5d: Check Results
@@ -436,7 +443,7 @@ LIMIT 24
 5. Click **Update Chart**
 6. Click **Save Chart**
 
-Congratulations! You've visualized real glucose data from a lakehouse.
+Congratulations! You've visualised real glucose data from a lakehouse.
 
 ## Hands-On Exercise: Re-run the Pipeline
 
@@ -544,9 +551,9 @@ You've successfully:
 - Ran transformations
 - Created a dashboard
 
-In Part 3, we'll dive deep into **Apache Iceberg**—the magic that makes this lakehouse work.
+In Part 3, we'll cover **Apache Iceberg** and how it manages table metadata, snapshots, and schema changes.
 
 ## Next Steps
 
-- Continue with [Part 3: Apache Iceberg—The Table Format That Changed Everything](03-apache-iceberg-explained.md).
+- Continue with [Part 3: Apache Iceberg - The Table Format That Changed Everything](03-apache-iceberg-explained.md).
- Jump to ingestion specifics in [Part 5: Data Ingestion](05-data-ingestion.md).
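Reviewer's note on this file: the ASCII asset graphs added here are linear chains, and the "running/waiting" propagation in step 5b is just a walk of the dependency graph in topological order. A short sketch with the stdlib, using the asset names from the diff (the scheduling itself is Dagster's job; this only models the ordering):

```python
from graphlib import TopologicalSorter

# Dependency graph from the asset diagram: each asset maps to the set
# of assets it depends on (its upstream).
deps = {
    "stg_glucose_entries": {"glucose_entries"},
    "fct_glucose_readings": {"stg_glucose_entries"},
    "fct_daily_glucose_metrics": {"fct_glucose_readings"},
    "postgres_marts": {"fct_daily_glucose_metrics"},
}

# A materialisation run processes assets upstream-first, which is why a
# change to glucose_entries propagates all the way to postgres_marts.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

For this chain the order is fully determined; in a wider DAG, `static_order` would still guarantee only that every asset appears after its dependencies.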

web/content/blog/03-apache-iceberg-explained.md

Lines changed: 8 additions & 8 deletions

@@ -3,7 +3,7 @@ title: "Apache Iceberg — The Table Format That Changed Everything"
 description: "Understand how Apache Iceberg enables ACID transactions, time travel, and schema evolution on your data lake."
 ---
 
-# Part 3: Apache Iceberg—The Table Format That Changed Everything
+# Part 3: Apache Iceberg - The Table Format That Changed Everything
 
 > Prerequisite: Read [Part 1: What is a Data Lakehouse?](01-intro-data-lakehouse.md) for lakehouse context.
 
@@ -17,9 +17,9 @@ description: "Understand how Apache Iceberg enables ACID transactions, time trav
 ## Prerequisites
 
 - [Part 1: What is a Data Lakehouse?](01-intro-data-lakehouse.md)
-- Optional: [Part 2: Getting Started—Setup Guide](02-setup-guide.md) to run the examples locally.
+- Optional: [Part 2: Getting Started - Setup Guide](02-setup-guide.md) to run the examples locally.
 
-In Part 1, we mentioned Iceberg as the magic ingredient. Let's understand _why_ it's such a game-changer.
+In Part 1, we introduced Iceberg as the table layer. Here's why it matters in day-to-day pipelines.
 For Git-like versioning on top of Iceberg, see [Part 4: Project Nessie Versioning](04-project-nessie-versioning.md).
 
 ## The Problem With Traditional Parquet
@@ -43,7 +43,7 @@ s3://lake/glucose-data/
 - Schema changes require rewriting all files
 - Queries must scan ALL files (no partition pruning)
 - Concurrent writes = conflicting files
-- No time travel—data is gone when you delete it
+- No time travel - data is gone when you delete it
 
 ## What Iceberg Provides
 
@@ -90,7 +90,7 @@ flowchart TB
 
 ### 1. Snapshots (Immutable Versions)
 
-Each write creates a new **snapshot**—a complete, immutable view of the table at that moment:
+Each write creates a new **snapshot** - a complete, immutable view of the table at that moment:
 
 ```python
 # In Python, using PyIceberg
@@ -137,7 +137,7 @@ Snapshot 1234567892:
 └── data/year=2024/month=10/day=02/00004.parquet → rows 101-200
 ```
 
-Why manifests? Query optimization:
+Why manifests? Query optimisation:
 
 - Scanner reads manifest, not S3 listing
 - Knows exact file count before scanning
@@ -233,7 +233,7 @@ FROM iceberg.raw.glucose_entries; -- Current
 - 🐛 Data quality issue today? Check what you ingested yesterday
 - Audit trail: see exactly what changed and when
 - Reproducibility: re-run yesterday's analysis with yesterday's data
-- ↩️ No "undo" button needed—just query the previous snapshot
+- No "undo" button needed - just query the previous snapshot
 
 ## ACID Transactions
 
@@ -500,5 +500,5 @@ Phlo uses Iceberg to ensure:
 
 ## Next Steps
 
-- Continue with [Part 4: Project Nessie—Git-Like Versioning for Data](04-project-nessie-versioning.md).
+- Continue with [Part 4: Project Nessie - Git-Like Versioning for Data](04-project-nessie-versioning.md).
- See how Iceberg powers dbt models in [Part 6: dbt Transformations](06-dbt-transformations.md).
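Reviewer's note on this file: several hunks touch the snapshot model (the PyIceberg example, the manifest listing, time travel). The core idea, that every write produces a new immutable version while old versions stay queryable, fits in a few lines of Python. This is the editor's toy model of the concept, not PyIceberg's actual API:

```python
# Toy model of Iceberg-style snapshots: each append creates a new,
# immutable snapshot of the whole table; earlier snapshots remain
# readable, which is what makes time travel possible.
class Table:
    def __init__(self):
        self.snapshots = []  # list of immutable row tuples, oldest first

    def append(self, rows):
        current = self.snapshots[-1] if self.snapshots else ()
        self.snapshots.append(current + tuple(rows))  # new snapshot

    def scan(self, snapshot_id=None):
        # Default scan reads the latest snapshot; passing an id "time
        # travels" to an earlier version.
        if not self.snapshots:
            return ()
        idx = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[idx]

t = Table()
t.append([("2024-10-01", 110)])
t.append([("2024-10-02", 95)])
print(t.scan())               # current view: both rows
print(t.scan(snapshot_id=0))  # time travel: first snapshot only
```

Real Iceberg avoids copying data by having each snapshot point at manifests of shared data files; the toy copies tuples only to keep the immutability visible.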

web/content/blog/04-project-nessie-versioning.md

Lines changed: 32 additions & 23 deletions

@@ -3,7 +3,7 @@ title: "Project Nessie — Git-Like Versioning for Data"
 description: "Add branching, merging, and tagging to your data catalog with Project Nessie for safe experimentation and auditable changes."
 ---
 
-# Part 4: Project Nessie—Git-Like Versioning for Data
+# Part 4: Project Nessie - Git-Like Versioning for Data
 
 > Prerequisite: Read [Part 3: Apache Iceberg Explained](03-apache-iceberg-explained.md) for table format basics.
 
@@ -17,7 +17,7 @@ description: "Add branching, merging, and tagging to your data catalog with Proj
 ## Prerequisites
 
 - [Part 3: Apache Iceberg Explained](03-apache-iceberg-explained.md)
-- Optional: [Part 2: Getting Started—Setup Guide](02-setup-guide.md) to run commands locally.
+- Optional: [Part 2: Getting Started - Setup Guide](02-setup-guide.md) to run commands locally.
 
 Iceberg gave us time travel. Now let's add **branching**, **merging**, and **tags** to our data with Project Nessie.
 For governance workflows that build on Nessie history, see [Part 10: Metadata & Governance](10-metadata-governance.md).
@@ -38,9 +38,12 @@ git merge # Promote to main
 
 Nessie brings this same workflow to **data**:
 
-```mermaid
-flowchart LR
-    A[main branch - stable, validated] -->|merge when ready| B[dev branch - experimental, testing]
+```
+main branch (production)        dev branch (development)
+      │                               │
+      │ ← stable, validated           │ ← experimental, testing
+      │                               │
+      └──────── merge when ready ─────┘
 ```
 
 ### Nessie Branching Flow (Diagram)
@@ -65,23 +68,29 @@ sequenceDiagram
 
 Without versioning, data work looks like:
 
-```mermaid
-flowchart TD
-    A[Production Data] --> B[Dev transforms it]
-    B --> C[Oops! Broke something]
-    C --> D[Production Data CORRUPTED]
-    D --> E[Lost today's data!]
+```
+Production Data
+      ↓
+(Dev transforms it)
+      ↓
+(Oops! Broke something)
+      ↓
+Production Data is CORRUPTED
+      ↓
+(Back up from last night? Lost today's data!)
 ```
 
 With Nessie:
 
-```mermaid
-flowchart TD
-    A[main - production] --> B[dev branch]
-    B --> C[Test transformations]
-    C --> D[Validate quality]
-    D -->|If good, merge| A
-    D -->|If bad, delete branch| E[main unchanged]
+```
+main (production)
+  │
+  ├─ dev (development)
+  │    └─ (Test transformations)
+  │         └─ (Validate quality)
+  │              └─ (If bad, delete branch, main unchanged)
+  │
+  └─ (If good, merge dev → main atomically)
 ```
 
 ## Core Nessie Concepts
@@ -132,7 +141,7 @@ main:
 dev (branched from Commit B):
 ├── Commit B': Quality fixes (inherited)
 ├── Commit D: New transformations
-└── Commit E: Schema optimizations (HEAD)
+└── Commit E: Schema optimisations (HEAD)
 
 Merge dev → main:
 ├── Commit A: Initial data load
@@ -404,7 +413,7 @@ SELECT
 In dbt, select the target to use the appropriate catalog:
 
 ```yaml
-# workflows/transforms/dbt/profiles.yml
+# workflows/transforms/dbt/profiles/profiles.yml
 
 phlo:
   outputs:
@@ -600,8 +609,8 @@ $ dbt run --select fct_glucose_readings
 $ phlo materialize fct_glucose_readings --partition 2024-01-15
 
 # 4. Validate changes
-$ phlo contract validate glucose_readings
-$ phlo quality run silver.fct_glucose_readings
+$ phlo validate-schema workflows/schemas/glucose.py
+$ phlo catalog describe silver.fct_glucose_readings
 
 # 5. Compare to main
 $ phlo branch diff main feature/add-a1c-calculation
@@ -801,4 +810,4 @@ See also: [Part 3: Apache Iceberg Explained](03-apache-iceberg-explained.md), [P
 - Learn how data is transformed on top of Nessie in [Part 6: dbt Transformations](06-dbt-transformations.md).
 - See governance workflows that build on branches in [Part 10: Metadata & Governance](10-metadata-governance.md).
 
-**Next**: [Part 5: Data Ingestion—Getting Data Into the Lakehouse](05-data-ingestion.md)
+**Next**: [Part 5: Data Ingestion - Getting Data Into the Lakehouse](05-data-ingestion.md)
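Reviewer's note on this file: the reworked diagrams describe the branch-and-merge semantics the post attributes to Nessie, namely that dev work is invisible to main until a merge atomically moves main's pointer. That reference-swap idea can be modelled in a few lines; this is an editor's toy sketch with an invented API, not the Nessie client (real Nessie versions commits of table metadata, not strings):

```python
# Toy model of Nessie-style branching: branches are named pointers to
# immutable commit chains, and a merge atomically moves a pointer.
class Catalog:
    def __init__(self):
        self.branches = {"main": ()}  # branch name -> tuple of commits

    def create_branch(self, name, source="main"):
        # Cheap: the new branch shares the source's history.
        self.branches[name] = self.branches[source]

    def commit(self, branch, change):
        # Commits are append-only; other branches are unaffected.
        self.branches[branch] = self.branches[branch] + (change,)

    def merge(self, source, target="main"):
        # An atomic pointer swap: target now sees all of source's work.
        self.branches[target] = self.branches[source]

cat = Catalog()
cat.commit("main", "load glucose_entries")
cat.create_branch("dev")
cat.commit("dev", "add fct_readings")  # main is untouched here
print(cat.branches["main"])
cat.merge("dev")                       # promote dev to main
print(cat.branches["main"])
```

The "if bad, delete branch, main unchanged" path in the diagram falls out for free: dropping the `dev` pointer discards its commits without ever touching `main`.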
