MDPI-AG/case-study-data-engineering-process

Case Study: Data Engineering (Process Mining)


⚠️ PLEASE DO NOT FORK THIS REPO, AS OTHERS MAY SEE YOUR CODE. INSTEAD, USE THE "USE THIS TEMPLATE" BUTTON TO CREATE YOUR OWN REPOSITORY.


Targeted Workflow

Stored raw data (parquet) → Preprocess → Store processed data → Load to PostgreSQL → Transform with dbt → Analyze

Getting Started

Python environment

The environment can be initialized either using pip or uv.

1. Using pip

A requirements.txt file is provided in order to install the required packages using pip. It is recommended to use a virtual environment to avoid conflicts with other projects.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Using uv

The case study also provides pyproject.toml and uv.lock files to initialize the environment with uv.

uv sync
source .venv/bin/activate

Environment variables

A template containing all the required environment variables is provided in .env_template. Copy it to a .env file, then source that file with the following commands:

set -a
source .env
set +a
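Once the file has been sourced, the variables are visible to any Python process started from the same shell. A minimal sketch of a pre-flight check follows; the variable names are hypothetical placeholders (the real names are the ones listed in .env_template), and the helper `missing_env_vars` is not part of the case study.

```python
import os

# Hypothetical names for illustration only; replace them with the
# variables actually listed in .env_template.
REQUIRED_VARS = ("POSTGRES_HOST", "POSTGRES_PORT", "POSTGRES_USER", "POSTGRES_DB")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` at the start of a script and stopping when the returned list is non-empty surfaces a forgotten `source .env` early, before any database connection is attempted.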

Data

The exported data resembles our production system data. It has been scrambled, and any resemblance to actual MDPI data is purely coincidental.

The raw dataset includes two files:

  • log_export.parquet contains an extract of event logs.
  • db_export.parquet contains an extract of a database table.

1. Event logs

Each process instance is identified by a manuscript_id that is unique to each manuscript. A single event log record looks like the following:

{
  "manuscript_id": "00318206b58d24b8b53361ed3fa120a3",
  "event_type": "submit_manuscript",
  "timestamp": 1728894511
}
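The timestamp in this record looks like a Unix epoch value. A minimal sketch of parsing such a record, assuming the value is in seconds and should be read as UTC (the README does not state the unit or the time zone, and `parse_event` is a hypothetical helper):

```python
import json
from datetime import datetime, timezone

raw_record = (
    '{"manuscript_id": "00318206b58d24b8b53361ed3fa120a3", '
    '"event_type": "submit_manuscript", "timestamp": 1728894511}'
)


def parse_event(record: str) -> dict:
    """Parse one JSON event and convert its timestamp to an aware datetime."""
    event = json.loads(record)
    # Assumption: the timestamp is seconds since the Unix epoch, in UTC.
    event["timestamp"] = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return event


event = parse_event(raw_record)
print(event["timestamp"].isoformat())  # 2024-10-14T08:28:31+00:00
```

Keeping the datetime timezone-aware avoids ambiguity later, when events are ordered or durations between events are computed.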

2. Database export

The raw dataset also contains an export of the metadata linked to event logs. Data from both files can be joined using the manuscript_id column.
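The join on manuscript_id can be sketched in plain Python as a left join that keeps every event and attaches metadata where it exists. The `journal` column and the sample rows are invented for illustration, since the README does not list the metadata columns, and `join_events_with_metadata` is a hypothetical helper.

```python
def join_events_with_metadata(events, metadata):
    """Left-join events to metadata rows on manuscript_id."""
    # Index metadata once so each event lookup is a dictionary access.
    metadata_by_id = {row["manuscript_id"]: row for row in metadata}
    joined = []
    for event in events:
        # Events without matching metadata are kept with event fields only.
        meta = metadata_by_id.get(event["manuscript_id"], {})
        joined.append({**meta, **event})
    return joined


# Invented sample rows; "journal" is a hypothetical metadata column.
sample_events = [
    {"manuscript_id": "a1", "event_type": "submit_manuscript", "timestamp": 1},
    {"manuscript_id": "b2", "event_type": "submit_manuscript", "timestamp": 2},
]
sample_metadata = [{"manuscript_id": "a1", "journal": "example-journal"}]

joined_rows = join_events_with_metadata(sample_events, sample_metadata)
```

With a DataFrame library, the equivalent operation is a left merge on the manuscript_id column; a left join is chosen here so that events are never silently dropped when metadata is missing.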

dbt

The third part of the assessment involves developing dbt models to provide analytical insights. The case study already provides the skeleton of a default dbt project.

If you are not familiar with dbt, you can check their sandbox project on GitHub to get started. You can also check the dbt documentation for more information.

Dependencies

The project already provides a list of suggested dependencies (either in pyproject.toml or requirements.txt) that suffice to complete the assessment.

About

This is a case study for Data Engineers in Process Mining.
