A highly curated list of awesome things related to data engineering and development

BenSchr/awesome-data-engineering

Awesome Data Engineering Resources

A highly curated list of useful resources, repositories, blogs, and books for data engineering.

🤖 Agentic Coding

Agentic coding patterns, tools, and resources for building AI-driven and autonomous systems.

📚 Repos

Open-source repositories and frameworks for agentic coding and AI applications.

  • Dropped - Open-source iOS project that doubles as a learning resource for agentic coding patterns.

  • Generative AI for Beginners - A 21-lesson course by Microsoft teaching the fundamentals of building Generative AI applications, including hands-on code samples in Python and TypeScript.

  • GitHub Copilot Vibe Coding Workshop - Self-paced workshop for building applications using GitHub Copilot Agent Mode, with multi-language samples and containerization.

  • Graphiti - Framework for building real-time, temporally-aware knowledge graphs for AI agents, supporting dynamic data integration and hybrid search.

  • MarkItDown - Python tool for converting files and office documents to Markdown, designed for LLM and text analysis pipelines.

  • Awesome Copilot - Curated list of resources, tools, and projects related to GitHub Copilot and its ecosystem. A great starting point for exploring Copilot-powered development. 🧑‍💻🤖

🧠 Prompt Engineering & Memory Bank

Guides and tools for prompt engineering and memory management in LLM and agentic workflows.

📚 Books

Recommended books on data engineering, architecture, and software best practices.

📰 Blogs

Blogs and publications covering data engineering, analytics, and technology trends.

🔄 Data Platform Tools

Open-source tools and frameworks for building and managing data platforms.

🧱 Databricks

Resources, tools, and blogs for working with Databricks and the modern data stack.

📚 Repos

Official and community Databricks-related repositories and tools.

💡 Useful Links & Snippets

Helpful links, code snippets, and resources for Databricks users.

📰 Blogs

Blogs and publications focused on Databricks and its ecosystem.

⚙️ DevOps & CI/CD

DevOps tools, CI/CD automation, and infrastructure resources for data engineering.

  • Commitizen CLI - Tool for creating conventional commit messages and automating releases.
  • Copier - Project templating tool for generating and maintaining codebases.
  • Inspect Docker Images - Tool for inspecting Docker images for vulnerabilities and metadata.
  • VSCode with Podman Desktop - Guide to integrating VSCode with Podman Desktop for container development.

📑 Handbooks & Guides

Comprehensive handbooks, guides, and documentation frameworks for data and engineering teams.

  • Diátaxis - Framework for organizing technical documentation by user needs, focusing on tutorials, how-to guides, reference, and explanation.
  • GitLab Enterprise Data Handbook - GitLab's official handbook for enterprise data management and governance.
  • Modern Data Engineering Playbook - Comprehensive guide on modern data engineering practices and principles.
  • Kimball Dimensional Modeling Techniques - 📊 Classic reference PDF from the Kimball Group summarizing dimensional modeling techniques for data warehousing and business intelligence projects.
  • Python Patterns Guide - 🐍 Comprehensive guide to Python programming patterns, including design patterns, idioms, and best practices for writing clean, maintainable Python code.

🔐 Data Privacy & Governance

Resources and tools for data privacy, security, and synthetic data generation.

  • DataContract CLI - CLI tool for managing data contracts in data engineering workflows.
  • DataHub Project - Metadata platform for the modern data stack.
  • Presidio - Open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) in text, images, and structured data.
  • SDV: Synthetic Data Vault - Python library for generating synthetic tabular data using machine learning, with tools for evaluation, anonymization, and quality reporting.

📊 Reports

Industry reports, playbooks, and technology trend analyses for data engineering.

  • Looking Glass 2025 - Thoughtworks' long-term technology trend report exploring 90+ trends and their business impact, with strategic recommendations.

🧪 Testing

Testing tools, frameworks, and best practices for data and software engineering.

💡 Useful Code Snippets

Handy code snippets and example repositories for data engineering tasks.

💸 FinOps & Cost Management

Open-source tools and resources for cloud financial operations, cost management, and FinOps best practices.

  • FinOps Toolkit - Microsoft open-source toolkit for automating and extending FinOps capabilities in the Microsoft Cloud, including starter kits, automation scripts, and best practices.
