Stored raw data (parquet) → Preprocess → Store processed data → Load to PostgreSQL → Transform with dbt → Analyze
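To make the flow concrete, here is a minimal sketch of the first pipeline stages in Python, assuming pandas and SQLAlchemy are available (check the provided dependency lists). The input file name comes from the raw dataset described below; the `DATABASE_URL` variable, the `processed/` directory, and the `event_logs` table name are hypothetical placeholders to adapt to your own setup.

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# 1. Read the stored raw data (file name from the dataset description below).
logs = pd.read_parquet("log_export.parquet")

# 2. Preprocess: placeholder step, converts unix seconds to datetimes.
#    The actual preprocessing is part of the assessment.
logs["timestamp"] = pd.to_datetime(logs["timestamp"], unit="s")

# 3. Store the processed data (hypothetical output directory).
os.makedirs("processed", exist_ok=True)
logs.to_parquet("processed/log_export.parquet")

# 4. Load into PostgreSQL so dbt can transform it.
#    DATABASE_URL and event_logs are hypothetical names; use the
#    variables defined in your .env file instead.
engine = create_engine(os.environ["DATABASE_URL"])
logs.to_sql("event_logs", engine, if_exists="replace", index=False)
```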
The environment can be initialized either with pip or with uv.

A `requirements.txt` file is provided to install the required packages with pip. Using a virtual environment is recommended to avoid conflicts with other projects:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The case study also provides a `pyproject.toml` and a `uv.lock` file to initialize the environment with uv:

```bash
uv sync
source .venv/bin/activate
```

A template containing all the required environment variables is provided in `.env_template`. Copy it to a `.env` file and source it with the following commands:
```bash
set -a
source .env
set +a
```
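If you prefer loading the variables from Python rather than from the shell, the python-dotenv package offers the same behavior (this assumes python-dotenv is part of your environment; check the dependency lists):

```python
import os

from dotenv import load_dotenv

# Read the .env file in the current directory into os.environ.
load_dotenv()

# DATABASE_URL is a hypothetical variable name; use the names
# defined in .env_template instead.
print(os.environ["DATABASE_URL"])
```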
Exported data resembles our production system data (it has been scrambled, and any resemblance to actual MDPI data is purely coincidental). The raw dataset includes two files:

- `log_export.parquet` contains an extract of event logs.
- `db_export.parquet` contains an extract of a database table.
Each process instance is identified by a `manuscript_id` that is unique to each manuscript. An event log entry looks like the following:

```json
{
  "manuscript_id": "00318206b58d24b8b53361ed3fa120a3",
  "event_type": "submit_manuscript",
  "timestamp": 1728894511
}
```

The raw dataset also contains an export of the metadata linked to the event logs. Data from both files can be joined using the `manuscript_id` column.
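For instance, here is a minimal pandas sketch of that join, assuming both parquet files sit in the working directory and that `timestamp` holds unix seconds (as the sample event above suggests):

```python
import pandas as pd

# Load both raw exports.
logs = pd.read_parquet("log_export.parquet")
metadata = pd.read_parquet("db_export.parquet")

# Make the event timestamps human-readable (unix seconds -> datetime).
logs["timestamp"] = pd.to_datetime(logs["timestamp"], unit="s")

# Attach the manuscript metadata to every event.
events = logs.merge(metadata, on="manuscript_id", how="left")
print(events.head())
```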
The third part of the assessment covers the development of dbt models to provide analytical insights. The case study already provides the skeleton of a default dbt project.

If you are not familiar with dbt, you can check their sandbox project on GitHub to get started, and the dbt documentation for more information.

The project already provides a list of suggested dependencies (either in `pyproject.toml` or `requirements.txt`) that is sufficient to complete the assessment.