A data lake architecture built only with FOSS tools. It was part of my bachelor's thesis and is no longer being worked on.
This repository has been sanitized for public sharing. All proprietary company references, specific server names, IP addresses, and internal domain names have been replaced with generic placeholders (e.g., .example.com). This ensures the repository can be safely shared and used as a reference without exposing internal infrastructure details.
- Company names and references
- Specific hostnames and domain names (.corpintra.net → .example.com)
- IP addresses and server names
- Internal registry URLs
- Proxy configurations
- Certificate names in Kubernetes secrets

Placeholders used:
- Domain: .example.com
- Hostnames: your-host.example.com, your-registry.example.com, etc.
- IPs: generic placeholders like 192.168.1.100
- Registry: your-registry.example.com/your-org/your-repo

When deploying, replace these placeholders with your actual infrastructure details.
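
The replacement step can be sketched in Python as below. This is only an illustrative helper, not a script shipped with the repository: the function names `replace_placeholders` and `rewrite_tree` are hypothetical, and every value on the right-hand side of `REPLACEMENTS` (such as `registry.internal.test`) is a made-up example that you would swap for your own infrastructure details. Placeholders are applied longest first so that the full registry path is rewritten before the bare `.example.com` suffix.

```python
from pathlib import Path

# Keys are the placeholders used in this repository.
# Values are made-up examples (assumptions); replace them with your own details.
REPLACEMENTS = {
    "your-registry.example.com/your-org/your-repo": "registry.internal.test/data/lake",
    "your-registry.example.com": "registry.internal.test",
    "your-host.example.com": "node01.internal.test",
    "192.168.1.100": "10.0.0.10",
    ".example.com": ".internal.test",
}


def replace_placeholders(text: str, replacements: dict[str, str]) -> str:
    """Return text with every placeholder replaced, longest placeholder first."""
    for placeholder in sorted(replacements, key=len, reverse=True):
        text = text.replace(placeholder, replacements[placeholder])
    return text


def rewrite_tree(root: Path, replacements: dict[str, str]) -> list[Path]:
    """Rewrite text files under root in place and return the files that changed."""
    changed = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the Git metadata folder.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            original = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # not valid UTF-8, most likely a binary file
        updated = replace_placeholders(original, replacements)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `rewrite_tree(Path("."), REPLACEMENTS)` from the repository root would rewrite the files in place, so it is safest to run it on a clean checkout and review the result with `git diff` before deploying.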
This repository is maintained by Matheus Pullig Soranço de Carvalho.
This repository is licensed under the GNU Affero General Public License v3 (AGPL-3.0).
However, the included projects have their own licenses:
- Apache Airflow: Apache License 2.0
- Apache Bigtop: Apache License 2.0
- Apache Ranger: Apache License 2.0
- Trino: Apache License 2.0
- dbt: Apache License 2.0
- Kubernetes: Apache License 2.0
- Apache Livy: Apache License 2.0
Note: AGPL-3.0 and Apache 2.0 impose different obligations. AGPL-3.0 is a strong copyleft license that requires derivative works to be released under the same license, while Apache 2.0 is permissive. Compatibility runs in one direction only: Apache 2.0 code may be incorporated into an AGPL-3.0 work, but AGPL-3.0 code cannot be relicensed under Apache 2.0. Since this repository only aggregates unmodified FOSS projects and provides configuration and setup instructions, it should not create derivative works of them. To avoid any ambiguity, consider using this repository for reference only and comply with each project's individual license terms.