A highly curated list of useful resources, repositories, blogs, and books for data engineering.
- 🤖 Agentic Coding
- 📚 Books
- 📰 Blogs
- 🔄 Data Platform Tools
- 🧱 Databricks
- ⚙️ DevOps & CI/CD
- 📑 Handbooks & Guides
- 🔐 Data Privacy & Governance
- 📊 Reports
- 🧪 Testing
- 💡 Useful Code Snippets
- 💸 FinOps & Cost Management
Agentic coding patterns, tools, and resources for building AI-driven and autonomous systems.
Open-source repositories and frameworks for agentic coding and AI applications.
- Dropped - Open-source iOS project and resource for learning about agentic coding patterns.
- Generative AI for Beginners - A 21-lesson course by Microsoft teaching the fundamentals of building Generative AI applications, including hands-on code samples in Python and TypeScript.
- GitHub Copilot Vibe Coding Workshop - Self-paced workshop for building applications using GitHub Copilot Agent Mode, with multi-language samples and containerization.
- Graphiti - Framework for building real-time, temporally-aware knowledge graphs for AI agents, supporting dynamic data integration and hybrid search.
- MarkItDown - Python tool for converting files and office documents to Markdown, designed for LLM and text analysis pipelines.
- Awesome Copilot - Curated list of resources, tools, and projects related to GitHub Copilot and its ecosystem. A great starting point for exploring Copilot-powered development. 🧑💻🤖
Guides and tools for prompt engineering and memory management in LLM and agentic workflows.
- Cline Memory Bank Documentation - Official documentation on where and how memory bank files are stored and managed in Cline.
- OpenAI Cookbook: GPT-4 Prompting Guide - Practical guide and examples for effective prompting with GPT-4, from the OpenAI Cookbook.
Recommended books on data engineering, architecture, and software best practices.
- Building Medallion Architectures - Comprehensive guide to building medallion data architectures.
- Deciphering Data Architectures - Explains modern data architecture patterns and best practices.
- Designing Data-Intensive Applications - In-depth exploration of data systems, scalability, and reliability.
- Fundamentals of Data - Essential concepts and principles for working with data.
- The Pragmatic Programmer - Classic book on software engineering best practices.
- The Staff Engineer's Path - Book on the role, responsibilities, and career path of staff engineers in modern software organizations.
Blogs and publications covering data engineering, analytics, and technology trends.
- Confessions of a Data Guy - Blog on real-world data engineering.
- Data Engineering Blog (Simon Späti) - Technical blog by Simon Späti.
- Data Engineering Weekly - Weekly blog on data engineering trends and best practices.
- dltHub Blog - Blog from dltHub, covering data loading and transformation topics.
- DuckDB Blog - Official blog for DuckDB, an in-process SQL OLAP database management system.
- dbt Developer Hub Blog - Blog for dbt (data build tool) developers, featuring updates, tutorials, and best practices.
- Marvelous MLOps - Medium - Medium publication focused on MLOps topics and best practices.
- MotherDuck Blog - Blog for MotherDuck, a DuckDB-based analytics platform.
- The GitHub Blog - Official GitHub company blog.
- Thoughtworks Insights - Thoughtworks' insights and technology trends blog.
- Visual Studio Code Blog - Official Visual Studio Code blog, covering code editing and development tips.
- Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data - Deep dive into how Discord scaled dbt to process petabytes of data, including custom solutions for multi-developer workflows, performance, and CI/CD guardrails.
Open-source tools and frameworks for building and managing data platforms.
- Apache DataFusion Comet - Apache Spark accelerator built on the Apache DataFusion query engine.
- Apache Polaris - Open-source catalog for Apache Iceberg, originally contributed by Snowflake.
- astronomer/astronomer-cosmos - Tools and integrations for running dbt projects in Apache Airflow.
- dbt-checkpoint - Pre-commit hooks for linting and checking dbt projects.
- dbt-coves - Workflow automation for dbt projects.
- dbt-labs/dbt-core - Core dbt framework for analytics engineering.
- dlt-hub/dlt - Data loading and transformation library for Python.
- duckdb/dbt-duckdb - dbt adapter for DuckDB, enabling analytics engineering workflows on DuckDB databases.
- DuckLake - Open lakehouse format from the DuckDB team that keeps table metadata in a SQL database and data in Parquet files.
- Koheesio - Python framework from Nike for building data pipelines.
- Lakehouse Engine - Lakehouse engine for scalable analytics by Adidas.
- VSCode dbt Power User - VSCode extension for enhanced dbt development.
Resources, tools, and blogs for working with Databricks and the modern data stack.
Official and community Databricks-related repositories and tools.
- databricks/cli - Official Databricks CLI for automation and scripting.
- databricks/databricks-vscode - Visual Studio Code extension for Databricks development.
- databricks/dbt-databricks - dbt adapter for Databricks, enabling analytics engineering workflows on Databricks.
- databrickslabs/discoverx - Data discovery and cataloging tool for Databricks.
- databrickslabs/dqx - Data quality framework for Databricks and Spark.
- databricks/terraform-provider-databricks - Terraform provider for Databricks infrastructure automation.
- Terraform Databricks SRA - Terraform modules and examples for implementing the Databricks Security Reference Architecture (SRA).
- UnityCatalog - Open-source implementation of Unity Catalog for data governance.
Helpful links, code snippets, and resources for Databricks users.
- DBSQL SME Resources - Resources and tools for Databricks SQL subject matter experts.
- Databricks Playground - A collection of Databricks notebooks and resources for experimenting and learning.
- Retrying dbt Runs in Databricks Workflows - Databricks blog post on strategies for retrying dbt runs in workflows.
Blogs and publications focused on Databricks and its ecosystem.
- Databricks AI - Medium - Medium publication for AI topics on Databricks.
- Databricks Blog - Official Databricks company blog.
- Databricks Community Blog - Technical articles and updates from the Databricks community.
- Databricks DBSQL SME - Medium - Medium publication on Databricks SQL engineering topics, written by subject matter experts.
- Databricks Platform SME - Medium - Medium publication for Databricks Platform subject matter experts.
- Databricks SQL SME - Medium - Medium publication by Databricks SQL subject matter experts.
- Databricks UC SME - Medium - Medium publication focused on Databricks Unity Catalog subject matter expertise.
DevOps tools, CI/CD automation, and infrastructure resources for data engineering.
- Commitizen CLI - Tool for creating conventional commit messages and automating releases.
- Copier - Project templating tool for generating and maintaining codebases.
- Inspect Docker Images - Tool for inspecting Docker images for vulnerabilities and metadata.
- VSCode with Podman Desktop - Guide to integrating VSCode with Podman Desktop for container development.
Comprehensive handbooks, guides, and documentation frameworks for data and engineering teams.
- Diátaxis - Framework for organizing technical documentation by user needs, focusing on tutorials, how-to guides, reference, and explanation.
- GitLab Enterprise Data Handbook - GitLab's official handbook for enterprise data management and governance.
- Modern Data Engineering Playbook - Comprehensive guide on modern data engineering practices and principles.
- Kimball Dimensional Modeling Techniques - 📊 Classic reference PDF from the Kimball Group summarizing dimensional modeling techniques for data warehousing and business intelligence projects.
- Python Patterns Guide - 🐍 Comprehensive guide to Python programming patterns, including design patterns, idioms, and best practices for writing clean, maintainable Python code.
Resources and tools for data privacy, security, and synthetic data generation.
- DataContract CLI - CLI tool for managing data contracts in data engineering workflows.
- DataHub Project - Metadata platform for the modern data stack.
- Presidio - Open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) in text, images, and structured data.
- SDV: Synthetic Data Vault - Python library for generating synthetic tabular data using machine learning, with tools for evaluation, anonymization, and quality reporting.
Industry reports, playbooks, and technology trend analyses for data engineering.
- Looking Glass 2025 - Thoughtworks' long-term technology trend report exploring 90+ trends and their business impact, with strategic recommendations.
Testing tools, frameworks, and best practices for data and software engineering.
- Inline Snapshot - Tool for inline snapshot testing in Python.
- Practical Test Pyramid - Ham Vocke's article on martinfowler.com about the test pyramid and testing strategies.
- Pytest Basics - Examples and explanations for getting started with pytest.
- Test Desiderata - Kent Beck's list of desirable properties of tests, with guidance on the trade-offs between them.
Handy code snippets and example repositories for data engineering tasks.
- Building Medallion Architectures Book Repo - Companion repository for the book, with useful code snippets.
- The Hitchhiker's Guide to dbt - A comprehensive guide and resource collection for working with dbt (data build tool).
Open-source tools and resources for cloud financial operations, cost management, and FinOps best practices.
- FinOps Toolkit - Microsoft open-source toolkit for automating and extending FinOps capabilities in the Microsoft Cloud, including starter kits, automation scripts, and best practices.