# Dream Stack for Data Engineering
This document outlines a curated stack for a modern data engineering workflow. The tools are categorized by their functionality, focusing on popular and powerful options for each part of the data lifecycle.
---
## **ORCHESTRATION**
- [Apache Airflow](https://airflow.apache.org/)
- [Dagster](https://dagster.io/)
- [Prefect](https://prefect.io/)
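All three orchestrators share the same core idea: tasks form a directed acyclic graph (DAG), and each task runs only after its upstream dependencies have succeeded. A minimal, library-free Python sketch of that scheduling loop (the task names are made up for illustration):

```python
from graphlib import TopologicalSorter  # stdlib DAG helper (Python 3.9+)

# Hypothetical pipeline: extract feeds both a validation and a transform step.
dag = {
    "extract": set(),            # no upstream dependencies
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"transform", "validate"},
}

def run_pipeline(dag, tasks):
    """Run callables in dependency order, like a tiny orchestrator."""
    order = list(TopologicalSorter(dag).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name]()
    return order, results

tasks = {name: (lambda n=name: f"{n} done") for name in dag}
order, results = run_pipeline(dag, tasks)
print(order)  # "extract" always runs first, "load" always last
```

Real orchestrators add retries, scheduling, and distributed execution on top of exactly this dependency-ordering core.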
---
## **METADATA MANAGEMENT**
- [OpenMetadata](https://open-metadata.org/)
- [Amundsen](https://www.amundsen.io/)
- [DataHub](https://datahubproject.io/)
---
## **TRANSFORMATION**
- [dbt (Data Build Tool)](https://www.getdbt.com/)
- [SQLMesh](https://sqlmesh.com/)
- [Apache Spark](https://spark.apache.org/)
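dbt and SQLMesh both express transformations as SQL models selected from upstream tables. A rough illustration of that pattern, using sqlite3 in place of a real warehouse (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 1250, 'paid'), (2, 300, 'refunded'), (3, 990, 'paid');
""")

# A dbt-style "staging model": cleaning and renaming in one SELECT.
conn.execute("""
    CREATE VIEW stg_orders AS
    SELECT id AS order_id,
           amount_cents / 100.0 AS amount_usd,
           status
    FROM raw_orders
    WHERE status = 'paid'
""")

rows = conn.execute(
    "SELECT order_id, amount_usd FROM stg_orders ORDER BY order_id"
).fetchall()
print(rows)  # only the 'paid' orders, with amounts converted to dollars
```

In dbt this SELECT would live in its own `.sql` file and reference upstream models instead of hard-coded table names.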
---
## **DATA INTEGRATION**
- [Sling](https://slingdata.io/)
- [Airbyte](https://airbyte.io/)
- [Fivetran](https://fivetran.com/)
- [Meltano](https://meltano.com/)
- [dlt (dltHub)](https://dlthub.com/) – Personal favorite
---
## **DATA STORAGE**
- Relational Databases:
  - [PostgreSQL](https://www.postgresql.org/)
  - [MySQL](https://www.mysql.com/)
- Data Warehouses:
  - [Snowflake](https://www.snowflake.com/)
  - [Google BigQuery](https://cloud.google.com/bigquery)
- Lakehouse Storage:
  - [Databricks](https://www.databricks.com/)
  - [Apache Hudi](https://hudi.apache.org/)
  - [Delta Lake](https://delta.io/)
- Object Storage:
  - [MinIO](https://min.io/) – An open-source, high-performance, distributed object store compatible with the Amazon S3 API, often used for storing unstructured data.
    - Edit: MinIO appears to no longer be actively maintained; see https://github.com/minio/minio/commit/27742d469462e1561c776f88ca7a1f26816d69e2
- **Query Engines**:
  - [Trino](https://trino.io/) (formerly PrestoSQL) – A fast, scalable distributed SQL query engine for analytics, well suited to querying large datasets across multiple sources.
---
## **STREAMING & REAL-TIME PROCESSING**
- [Apache Kafka](https://kafka.apache.org/)
- [Apache Flink](https://flink.apache.org/)
- [Redpanda](https://redpanda.com/)
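Kafka and Redpanda are log-based message brokers: producers append records to a topic, and consumers read them independently and in order. A toy stand-in using only the standard library (a single queue here plays the role of a one-partition topic; real brokers add partitions, replication, and durable storage):

```python
import queue
import threading

topic = queue.Queue()  # stands in for a single-partition topic

def producer(n):
    for i in range(n):
        topic.put({"key": i, "value": f"event-{i}"})
    topic.put(None)  # sentinel: end of stream

consumed = []

def consumer():
    while (record := topic.get()) is not None:
        consumed.append(record["value"])

t_prod = threading.Thread(target=producer, args=(3,))
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(consumed)  # events arrive in the order they were produced
```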
---
## **DATA CATALOG & DISCOVERY**
- [Alation](https://www.alation.com/)
- [Collibra](https://www.collibra.com/)
- [Monte Carlo](https://www.montecarlodata.com/) (Data Observability)
---
## **WORKFLOW AUTOMATION**
- [Apache NiFi](https://nifi.apache.org/)
- [Luigi](https://github.com/spotify/luigi)
- [Argo Workflows](https://argoproj.github.io/)
---
## **MONITORING & OBSERVABILITY**
- [Great Expectations](https://greatexpectations.io/) (Data Quality)
- [WhyLabs](https://whylabs.ai/) (ML/AI Monitoring)
- [Prometheus + Grafana](https://prometheus.io/)
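Great Expectations works by declaring checks ("expectations") against a dataset and reporting which pass. A pure-Python sketch of the idea (the function names below are invented for illustration, not the Great Expectations API):

```python
def expect_no_nulls(rows, column):
    """Every row has a non-null value in the given column."""
    return all(r.get(column) is not None for r in rows)

def expect_values_between(rows, column, lo, hi):
    """Every value in the column falls within [lo, hi]."""
    return all(lo <= r[column] <= hi for r in rows)

data = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 29},
    {"user_id": 3, "age": 41},
]

report = {
    "user_id not null": expect_no_nulls(data, "user_id"),
    "age in [0, 120]": expect_values_between(data, "age", 0, 120),
}
print(report)  # each expectation maps to pass/fail
```

A real data-quality tool would run such checks on every pipeline load and alert on failures rather than just printing a dict.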
---
## **DATA VISUALIZATION**
- [Tableau](https://www.tableau.com/)
- [Power BI](https://powerbi.microsoft.com/)
- [Metabase](https://www.metabase.com/)
---
## **MACHINE LEARNING**
- **Frameworks**:
- [TensorFlow](https://www.tensorflow.org/)
- [PyTorch](https://pytorch.org/)
- [Scikit-learn](https://scikit-learn.org/stable/)
- [XGBoost](https://xgboost.readthedocs.io/en/stable/)
- [LightGBM](https://lightgbm.readthedocs.io/en/latest/)
- **ML Operations & Experiment Tracking**:
- [MLflow](https://mlflow.org/)
- [Kubeflow](https://www.kubeflow.org/)
- [Weights & Biases](https://www.wandb.com/)
- **AutoML**:
- [H2O.ai](https://www.h2o.ai/)
- [Google AutoML](https://cloud.google.com/automl)
- [Auto-sklearn](https://automl.github.io/auto-sklearn/)
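Experiment trackers like MLflow and Weights & Biases record each run's parameters and metrics so runs can be compared and the best configuration recovered later. A minimal stand-in in plain Python (not the MLflow API; the hyperparameters and scores are made up):

```python
runs = []

def log_run(params, metric):
    """Record one training run's configuration and its evaluation score."""
    runs.append({"params": params, "metric": metric})

# Hypothetical hyperparameter sweep results.
log_run({"lr": 0.1, "depth": 3}, metric=0.81)
log_run({"lr": 0.01, "depth": 5}, metric=0.88)
log_run({"lr": 0.001, "depth": 5}, metric=0.84)

best = max(runs, key=lambda r: r["metric"])
print(best["params"])  # the configuration with the highest metric
```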
---
## **Special mentions**
- [Bytebase](https://www.bytebase.com/docs/introduction/what-is-bytebase/) – A database schema change and version control tool for DevOps teams.
- [lakeFS](https://lakefs.io/) – A data version control platform that brings Git-like semantics to data lakes.
---
## **Good Combos**
Here are some "good combos" of tools to consider when starting a new project:
- **Airflow + dbt + PostgreSQL**
Ideal for traditional ETL workflows. Airflow handles orchestration, dbt manages data transformations, and PostgreSQL stores transactional data.
- **Prefect + SQLMesh + Snowflake**
Great for modern data pipelines with a focus on orchestration (Prefect), transformation (SQLMesh), and scalable analytics (Snowflake).
- **Apache Kafka + Spark + Databricks**
Perfect for real-time streaming data and large-scale data processing. Kafka is used for real-time messaging, Spark for distributed processing, and Databricks for managing the entire pipeline.
- **Airbyte + Snowflake + Tableau**
A solid choice for integrating data from various sources with Airbyte, storing it in Snowflake, and visualizing it using Tableau.
- **MLflow + PyTorch + PostgreSQL**
A combination focused on machine learning workflows. MLflow handles experiment tracking and model management, PyTorch powers deep learning, and PostgreSQL stores metadata and results.
- **Fivetran + BigQuery + Looker**
A strong choice for businesses looking for a fully managed pipeline. Fivetran handles integrations, BigQuery stores your data in the cloud, and Looker provides analytics and reporting.
- **Trino + Snowflake + Tableau**
A great combo for querying and analyzing large datasets across multiple data sources. Trino allows fast SQL querying on various data storage systems, Snowflake acts as a scalable data warehouse, and Tableau provides insightful visualizations.
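To make the first combo concrete, here is the extract-load-transform shape it implies, sketched end to end with sqlite3 standing in for PostgreSQL, plain functions standing in for Airflow tasks, and a SQL view standing in for a dbt model (all names are illustrative):

```python
import sqlite3

def extract():
    # In practice this step would pull rows from an API or source system.
    return [("2024-01-01", 120), ("2024-01-02", 80), ("2024-01-02", 50)]

def load(conn, rows):
    conn.execute("CREATE TABLE IF NOT EXISTS raw_sales (day TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)

def transform(conn):
    # The dbt-style step: an aggregate model built from the raw table.
    conn.execute("""
        CREATE VIEW daily_sales AS
        SELECT day, SUM(amount) AS total FROM raw_sales GROUP BY day
    """)

conn = sqlite3.connect(":memory:")
load(conn, extract())   # an orchestrator like Airflow would sequence these tasks
transform(conn)
totals = dict(conn.execute("SELECT day, total FROM daily_sales"))
print(totals)  # one total per day
```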
---
This stack is modular and can be customized based on specific use cases and organizational needs. Explore and adapt these tools to create a streamlined, efficient data engineering pipeline!