Case Study: Data Engineering (Process Mining)

⚠️ PLEASE DO NOT FORK THIS REPO, AS OTHERS MAY SEE YOUR CODE. INSTEAD, USE THE "USE THIS TEMPLATE" BUTTON TO CREATE YOUR OWN REPOSITORY.

Targeted Workflow

Stored raw data (parquet) → Preprocess → Store processed data → Load to PostgreSQL → Transform with dbt → Analyze

Getting Started

Python environment

The environment can be initialized using either pip or uv.

1. Using pip

A requirements.txt file is provided for installing the required packages with pip. Using a virtual environment is recommended to avoid conflicts with other projects.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Using uv

The case study also provides pyproject.toml and uv.lock files to initialize the environment with uv.

uv sync
source .venv/bin/activate

Environment variables

A template containing all required environment variables is provided in .env_template. Copy it to a .env file, then export its variables into the current shell with the following commands:

set -a
source .env
set +a
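For Python scripts, a small check can fail fast when a variable was not exported. The following is a minimal sketch, not part of the provided project: the helper `missing_env_vars` and the variable names `POSTGRES_HOST`, `POSTGRES_USER`, and `POSTGRES_DB` are hypothetical placeholders, and the real names are the ones listed in .env_template.

```python
import os
from typing import Iterable, Mapping


def missing_env_vars(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Hypothetical variable names; replace them with those in .env_template.
example_env = {"POSTGRES_HOST": "localhost", "POSTGRES_USER": ""}
missing = missing_env_vars(
    ["POSTGRES_HOST", "POSTGRES_USER", "POSTGRES_DB"], example_env
)
print(missing)  # ['POSTGRES_USER', 'POSTGRES_DB']
```

Calling `missing_env_vars(names)` without the second argument checks the real process environment.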

Data

The exported data resembles our production system data. It has been scrambled, and any resemblance to actual MDPI data is purely coincidental.

The raw dataset includes two files:

log_export.parquet contains an extract of event logs.
db_export.parquet contains an extract of a database table.

1. Event logs

Each process instance is identified by a manuscript_id that is unique to each manuscript. A record in the event logs looks like the following:

{
  "manuscript_id": "00318206b58d24b8b53361ed3fa120a3",
  "event_type": "submit_manuscript",
  "timestamp": 1728894511
}
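The timestamp field looks like a Unix epoch value in seconds. That reading, and the choice of UTC, is an assumption rather than something the dataset description states. Under that assumption, a record could be parsed with the standard library roughly as follows (`parse_event` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def parse_event(record: dict) -> dict:
    """Return a copy of `record` with `timestamp` as a timezone-aware UTC datetime.

    Assumes `timestamp` holds Unix epoch seconds.
    """
    parsed = dict(record)
    parsed["timestamp"] = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return parsed


event = parse_event(
    {
        "manuscript_id": "00318206b58d24b8b53361ed3fa120a3",
        "event_type": "submit_manuscript",
        "timestamp": 1728894511,
    }
)
print(event["timestamp"].isoformat())  # 2024-10-14T08:28:31+00:00
```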

2. Database export

The raw dataset also contains an export of the metadata linked to the event logs. Data from both files can be joined using the manuscript_id column.
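A join on manuscript_id might look like the sketch below. It uses Python's built-in sqlite3 module with in-memory tables purely for illustration. The table names `events` and `manuscripts`, the `journal` column, and the `review_manuscript` event type are hypothetical, since the actual columns of db_export.parquet are not listed here.

```python
import sqlite3


def join_events_with_metadata(events, manuscripts):
    """Left-join event rows to manuscript metadata on manuscript_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (manuscript_id TEXT, event_type TEXT, timestamp INTEGER);
        CREATE TABLE manuscripts (manuscript_id TEXT PRIMARY KEY, journal TEXT);
        """
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    conn.executemany("INSERT INTO manuscripts VALUES (?, ?)", manuscripts)
    rows = conn.execute(
        """
        SELECT e.manuscript_id, e.event_type, e.timestamp, m.journal
        FROM events AS e
        LEFT JOIN manuscripts AS m ON m.manuscript_id = e.manuscript_id
        ORDER BY e.manuscript_id, e.timestamp
        """
    ).fetchall()
    conn.close()
    return rows


# Made-up rows; "b2" has no metadata, so its journal comes back as None.
joined = join_events_with_metadata(
    events=[
        ("a1", "submit_manuscript", 100),
        ("a1", "review_manuscript", 200),
        ("b2", "submit_manuscript", 150),
    ],
    manuscripts=[("a1", "journal_x")],
)
for row in joined:
    print(row)
```

A left join keeps events whose manuscript has no metadata row, which makes gaps between the two files visible instead of silently dropping them.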

dbt

The third part of the assessment involves developing dbt models that provide analytical insights. The case study already provides the skeleton of a default dbt project.

If you are not familiar with dbt, you can check their sandbox project on GitHub to get started. You can also check the dbt documentation for more information.
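As a rough illustration of the kind of aggregate an analytical model could compute, the sketch below summarizes events per manuscript. It is not the expected solution: it runs the SQL through Python's sqlite3 module rather than dbt and PostgreSQL, and the `events` table name, the output column names, and the `review_manuscript` event type are assumptions made for the example.

```python
import sqlite3

# One row per manuscript: event count, first and last event, elapsed seconds.
SUMMARY_SQL = """
    SELECT
        manuscript_id,
        COUNT(*) AS event_count,
        MIN(timestamp) AS first_event_at,
        MAX(timestamp) AS last_event_at,
        MAX(timestamp) - MIN(timestamp) AS duration_seconds
    FROM events
    GROUP BY manuscript_id
    ORDER BY manuscript_id
"""


def summarize(events):
    """Return one summary tuple per manuscript_id for the given event rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (manuscript_id TEXT, event_type TEXT, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    rows = conn.execute(SUMMARY_SQL).fetchall()
    conn.close()
    return rows


summary = summarize(
    [
        ("a1", "submit_manuscript", 100),
        ("a1", "review_manuscript", 160),
        ("b2", "submit_manuscript", 50),
    ]
)
print(summary)  # [('a1', 2, 100, 160, 60), ('b2', 1, 50, 50, 0)]
```

In a dbt project, a query of this shape would typically live in a .sql model file and select from a source or staging model instead of a hand-built table.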

Dependencies

The project already provides a list of suggested dependencies (either in pyproject.toml or requirements.txt) that suffice to complete the assessment.

