data-immo is a high-performance ETL pipeline built in Rust to efficiently extract, transform, and load real estate transaction data from the DVF+ API into a Lakehouse (Dremio).
The project is designed around performance, reliability, and scalability: Rust for compute-heavy work and API calls, DuckDB for in-process SQL, dbt for testing and modeling, and Dremio as the Lakehouse layer.
```mermaid
flowchart LR
A[**API** DVF+] -->|Extraction| B[**Rust**]
B -->|Transformation| C[**Rust**]
C -->|Loading| D[(**DuckDB**)]
D -->|Saving cleaned data| E{**dbt**}
E -->|Validation<br>& loading| F[(**Dremio**)]
subgraph Extraction [Extract]
A
end
subgraph TransformationRust [Transform]
B
C
end
subgraph LT [Load & Transform]
D
end
subgraph VD [Validate Data quality]
E
end
subgraph LakeHouse [LakeHouse]
F
end
subgraph Docker [Docker Container]
LakeHouse
end
style A fill:#000091,stroke:#ffffff,color:#ffffff,stroke-width:1px
style B fill:#955a34,stroke:#000000,color:#000000,stroke-width:1px
style C fill:#955a34,stroke:#000000,color:#000000,stroke-width:1px
style D fill:#fef242,stroke:#000000,color:#000000,stroke-width:1px
style E fill:#fc7053,stroke:#000000,color:#000000,stroke-width:1px
style F fill:#31d3db,stroke:#ffffff,color:#ffffff,stroke-width:1px
style Docker fill:#099cec
```

- **Data Extraction**
  - Fetches real estate transaction data from the DVF+ API.
  - Handles API rate limiting, retries, and efficient pagination using Rust’s concurrency model (see the extraction sketch after this list).
- **Data Transformation**
  - Uses DuckDB to transform the staged Parquet data into structured, queryable formats (see the DuckDB sketch after this list).
  - Rust handles additional transformations, data enrichment, and performance-critical operations such as I/O.
- **Data Validation & Loading**
  - dbt is used to validate, test, and model the data (see the validation sketch after this list).
  - The cleaned and validated data is loaded into Dremio, enabling a Lakehouse architecture.
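
A minimal sketch of what the paginated extraction loop can look like, assuming `reqwest`, `tokio`, `futures`, and `serde` as dependencies. The endpoint URL, query parameter, and `Transaction` fields are placeholders, not the actual DVF+ contract; concurrency is capped with `buffer_unordered` and transient failures are retried with exponential backoff.

```rust
use std::time::Duration;

use futures::stream::{self, StreamExt};
use serde::Deserialize;

#[derive(Deserialize)]
struct Transaction {
    // Illustrative fields only; the real DVF+ schema is richer.
    id_mutation: String,
    valeur_fonciere: Option<f64>,
}

// Fetch one page, retrying transient failures with exponential backoff.
async fn fetch_page(client: &reqwest::Client, page: u32) -> reqwest::Result<Vec<Transaction>> {
    let mut delay = Duration::from_millis(500);
    loop {
        let resp = client
            .get("https://example.org/dvf/transactions") // hypothetical endpoint
            .query(&[("page", page)])
            .send()
            .await
            .and_then(|r| r.error_for_status());
        match resp {
            Ok(r) => return r.json().await,
            Err(_) if delay < Duration::from_secs(8) => {
                // Back off and retry; real code would also honor Retry-After.
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            Err(e) => return Err(e),
        }
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()?;

    // Crude rate limiting: at most 4 requests in flight at a time.
    // Results arrive in completion order, not page order.
    let pages: Vec<_> = stream::iter(1u32..=10)
        .map(|p| fetch_page(&client, p))
        .buffer_unordered(4)
        .collect()
        .await;

    // Failed pages are silently skipped here; real code would surface them.
    let rows: usize = pages.into_iter().flatten().map(|p| p.len()).sum();
    println!("fetched {rows} transactions");
    Ok(())
}
```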
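The DuckDB step can run entirely in-process from Rust. Below is a hedged sketch assuming the `duckdb` crate (whose API mirrors `rusqlite`); the file paths and column names are made up for illustration.

```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // Read the staged raw Parquet, clean and reshape it, and write the
    // result back out as Parquet, all inside DuckDB.
    conn.execute_batch(
        r"
        CREATE TABLE mutations AS
        SELECT id_mutation,
               CAST(valeur_fonciere AS DOUBLE) AS valeur_fonciere,
               CAST(date_mutation AS DATE)     AS date_mutation
        FROM read_parquet('staging/raw_*.parquet')
        WHERE valeur_fonciere IS NOT NULL;

        COPY mutations TO 'clean/mutations.parquet' (FORMAT PARQUET);
        ",
    )?;

    // Quick sanity check on the transformed table.
    let rows: i64 = conn.query_row("SELECT count(*) FROM mutations", [], |r| r.get(0))?;
    println!("transformed {rows} rows");
    Ok(())
}
```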
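dbt tests such as `not_null` ultimately compile to SQL assertions. dbt itself is configured in YAML and SQL, so the snippet below is not how dbt is invoked; it is only a language-consistent illustration of the kind of check the validation layer expresses, again via the `duckdb` crate and assumed file names.

```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // Equivalent of a dbt `not_null` test: count offending rows and
    // fail the pipeline if any are found.
    let nulls: i64 = conn.query_row(
        "SELECT count(*)
         FROM read_parquet('clean/mutations.parquet')
         WHERE id_mutation IS NULL",
        [],
        |row| row.get(0),
    )?;
    assert_eq!(nulls, 0, "id_mutation must never be NULL");

    // Equivalent of a dbt range / accepted-values style test.
    let bad_prices: i64 = conn.query_row(
        "SELECT count(*)
         FROM read_parquet('clean/mutations.parquet')
         WHERE valeur_fonciere <= 0",
        [],
        |row| row.get(0),
    )?;
    assert_eq!(bad_prices, 0, "valeur_fonciere must be positive");

    println!("all data quality checks passed");
    Ok(())
}
```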
- Rust → Core language for API calls, transformations, and performance optimization.
- DuckDB → In-process SQL engine for fast transformations of optimized Parquet datasets.
- dbt → Data modeling, testing, and validation layer.
- Dremio → Lakehouse platform for analytics and querying.
- Extract: Retrieve raw transaction data from DVF+ API.
- Stage: Store raw data as Parquet (see the staging sketch after this list).
- Transform: Apply transformations using DuckDB and Rust.
- Validate & Model: Use dbt to ensure data quality and prepare final schemas.
- Load: Push validated datasets into Dremio for downstream analytics.
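
For the Stage step, one option is to write the fetched records straight to Parquet with the `arrow` and `parquet` crates. The two-column schema and file path below are illustrative assumptions, not the project's actual layout.

```rust
use std::{fs, fs::File, sync::Arc};

use arrow::array::{ArrayRef, Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative two-column schema; the real DVF+ layout is wider.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id_mutation", DataType::Utf8, false),
        Field::new("valeur_fonciere", DataType::Float64, true),
    ]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["2024-1", "2024-2"])) as ArrayRef,
            Arc::new(Float64Array::from(vec![Some(250_000.0), None])) as ArrayRef,
        ],
    )?;

    fs::create_dir_all("staging")?;
    let file = File::create("staging/raw_0001.parquet")?;

    // Default writer properties; production code would tune compression
    // and row-group size here.
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    println!("staged {} rows", batch.num_rows());
    Ok(())
}
```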