41 commits
7141861
Trying to add favicon... try 1
LienBosmans Sep 27, 2025
a02a012
Trying to add favicon ... try 2
LienBosmans Sep 27, 2025
39a37e5
Trying to add favicon ... try 3
LienBosmans Sep 27, 2025
19a2e5a
Trying to add logo to homepage ... try 1
LienBosmans Sep 27, 2025
d371aad
Trying to add logo to homepage ... try 2
LienBosmans Sep 27, 2025
88fa7a7
Trying to add logo to homepage ... try 3
LienBosmans Sep 27, 2025
eaf7167
Trying to add logo to homepage ... try 4
LienBosmans Sep 27, 2025
e5c7d2a
Playing around with size of logo
LienBosmans Sep 27, 2025
1696114
Playing around with size logo ... try 2
LienBosmans Sep 27, 2025
6775650
Playing around with size logo ... try 3 (quick & dirty)
LienBosmans Sep 27, 2025
878cd74
Playing around with size logo ... try 4 (quick & dirty)
LienBosmans Sep 27, 2025
b5ab607
I give up, size is fine
LienBosmans Sep 27, 2025
a4e7681
Does this work?
LienBosmans Sep 27, 2025
bfc498c
created assets folder
LienBosmans Sep 27, 2025
07cebe0
Trying to adjust the theme
LienBosmans Sep 27, 2025
d3e872c
Fixed broken logo link, tried to adjust logo height again, and added …
LienBosmans Sep 27, 2025
c575edf
Didn't like the resize
LienBosmans Sep 27, 2025
e783933
Switched description and logo
LienBosmans Sep 27, 2025
a3a6b27
Maybe max-height can fix things?
LienBosmans Sep 27, 2025
15b65f6
Link to demo
LienBosmans Sep 27, 2025
78e6eab
Whoops, forgot about proportions
LienBosmans Sep 27, 2025
f3bfcb6
More whitspace under logo
LienBosmans Sep 27, 2025
3cd81cd
Whoops, fixed proportions wrong
LienBosmans Sep 27, 2025
47e0794
Moved links to middle section
LienBosmans Sep 27, 2025
b15b9eb
Whoops, shouldn't have moved the links there. Added a link to PyPi.
LienBosmans Sep 27, 2025
b8f146f
Is this how you add extra whitespace after logo?
LienBosmans Sep 27, 2025
35b45d9
Add link to contributing guide
LienBosmans Sep 27, 2025
95a6fee
Remove extra margin at wrong spot
LienBosmans Sep 27, 2025
47be556
Switched links
LienBosmans Sep 27, 2025
c685584
Small changes to links
LienBosmans Sep 27, 2025
6df3beb
Restructure documentation
LienBosmans Sep 28, 2025
951c149
Added link to paper, removed how to view DuckDB files since that's in…
LienBosmans Sep 28, 2025
f0098d3
Removed enter
LienBosmans Sep 28, 2025
c7c792f
Split get_github_log documentation in function and output part
LienBosmans Sep 28, 2025
b9ceae1
Switched the links around, again
LienBosmans Sep 28, 2025
d93f61b
Explain how documentation is structured
LienBosmans Sep 28, 2025
dd8979f
Fixed wrong link
LienBosmans Sep 28, 2025
bd30f5a
Referred to how to view DuckDB files
LienBosmans Sep 28, 2025
7d47d81
Fixed link
LienBosmans Sep 28, 2025
65eca51
Fixed broken links
LienBosmans Sep 28, 2025
e61aea4
Fixed broken links
LienBosmans Sep 28, 2025
114 changes: 15 additions & 99 deletions CONTRIBUTING.md
Expand Up @@ -33,49 +33,51 @@ Do you have an idea to extend PyStack't with extra functionality? Awesome! Pleas
- We try to limit the number of external dependencies.
- For performant data transformations, please use [DuckDB](https://duckdb.org/docs/stable/) (SQL) or [Polars](https://docs.pola.rs/) (DataFrame).
- For (interactive) visualizations, please use [Matplotlib](https://matplotlib.org/) or [Dash](https://dash.plotly.com/).
- Every PyStack't function reads from or writes to a DuckDB database file that uses the [Stack't relational schema](#stackt-relational-schema).
- Every PyStack't function reads from or writes to a DuckDB database file that uses the [Stack't relational schema](/docs/content/explained/pystackt_design.md).
- New functionality must also be compatible with DuckDB files with the Stack't relational schema.
- If the Stack't relational schema does not fit your use-case, and you want to propose an improvement, please reach out directly to [Lien Bosmans](mailto:lienbosmans@live.com).


### Data extractors

Example: GitHub extractor [ [code](/src/pystackt/extractors/github/) | [docs](/docs/extract/get_github_log.md) ]
Example: GitHub extractor [ [code](/src/pystackt/extractors/github/) | [docs](/docs/content/reference/extract/get_github_log.md) ]

What is expected:
1. Choose a publicly available data source that contains real-life event data.
1. Figure out how the source data is structured, how the API works, ...
1. Map the data to the [Stack't relational schema](#stackt-relational-schema).
1. Map the data to the [Stack't relational schema](/docs/content/explained/pystackt_design.md).
1. Clean up your code. Save it in a new subfolder of [/src/pystackt/extractors/](/src/pystackt/extractors/).
- Re-use existing functionality when possible.
- Write modular functions.
- Include error handling.
- Use doc strings and in-line comments.
1. Test your code.
1. Write end-user documentation. Add it as a markdown file in the folder [/docs/extract/](/docs/extract/). The documentation should include
- code snippet with example
- table that explains all parameters of the function
- explanation on how to generate credentials to connect to the data source (if relevant)
- description of which data is extracted
- (link to) explanation of how the extracted data is allowed to be used
1. Write reference documentation. Add it as markdown files in the folder [/docs/content/reference/extract/](/docs/content/reference/extract/).
    - The function documentation should include
        - code snippet with example
        - table that explains all parameters of the function
        - explanation on how to generate credentials to connect to the data source (if relevant)
        - (link to) explanation of how the extracted data is allowed to be used
    - The output data documentation should include
        - description of which data is extracted


### Data exporters

Example: OCEL 2.0 [ [code](/src/pystackt/exporters/ocel2/) | [docs](/docs/export/export_to_ocel2.md) ]
Example: OCEL 2.0 [ [code](/src/pystackt/exporters/ocel2/) | [docs](/docs/content/reference/export/export_to_ocel2.md) ]

Please note that the exported data format should be **object-centric** and **supported by at least one tool** (software, application, Python package, script, ...) that is open-source (*preferred*) or offers a free license for developers / students / personal use.

What is expected:
1. Choose an object-centric event data format.
1. Map the [Stack't relational schema](#stackt-relational-schema) to your chosen data format.
1. Map the [Stack't relational schema](/docs/content/explained/pystackt_design.md) to your chosen data format.
1. Clean up your code. Save it in a new subfolder of [/src/pystackt/exporters/](/src/pystackt/exporters/).
- Re-use existing functionality when possible.
- Write modular functions.
- Include error handling.
- Use doc strings and in-line comments.
1. Test your code.
1. Write end-user documentation. Add it as a markdown file in the folder [/docs/export/](/docs/export/). The documentation should include
1. Write end-user documentation. Add it as a markdown file in the folder [/docs/content/reference/export/](/docs/content/reference/export/). The documentation should include
- code snippet with example
- table that explains all parameters of the function
- overview of any information loss that happens when exporting to this format
Expand All @@ -86,7 +88,7 @@ What is expected:
Data preparation is definitely more than simply extracting and exporting data, so we also welcome additional functionality that supports activities like data exploration, data cleaning, data filtering, ...

The previously discussed items still apply:
1. Start from the [Stack't relational schema](#stackt-relational-schema) in a DuckDB file.
1. Start from the [Stack't relational schema](/docs/content/explained/pystackt_design.md) in a DuckDB file.
- If the Stack't relational schema does not work with the application you have in mind, include a function to prepare the data first. ([example](/src/pystackt/exploration/graph/data_prep/))
1. Clean up your code. Document your code. Test your code.
1. Write end-user documentation.
Expand All @@ -106,89 +108,3 @@ Simply create a pull request (PR)! Some good practices to consider:
- documentation of one function + code improvements of another function
- Write meaningful commit messages.
- Don't combine independent changes in the same commit.


## Stack't relational schema

The Stack't relational schema describes how to store object-centric event data in a relational database using a fixed set of tables and table columns. This absence of any schema changes makes the format well-suited to act as a central data hub, enabling the modular design of PyStack't.

An overview of the tables and columns is included in this document. For more information on the design choices and the proof-of-concept implementation [Stack't](https://github.com/LienBosmans/stack-t), we recommend reading the paper [Dynamic and Scalable Data Preparation for Object-Centric Process Mining](https://arxiv.org/abs/2410.00596).

![PyStack't has a modular design.](/docs/pystackt_architecture.png)

**Event-related tables**. To maintain flexibility and support dynamic changes, event types and their attribute definitions are stored in rows rather than being defined by table and column names. This approach enables the use of the exact same tables across all processes, reducing the impact of schema modifications. Changing an event type involves updating foreign keys rather than moving data to different tables, and attributes can be added or removed without altering the schema.
- Table `event_types` contains an entry for each unique event type. \
Columns:
- `id` is the primary key.
- `description` should be human-readable.
- Table `event_attributes` stores entries for each unique event attribute. \
Columns:
- `id` is the primary key.
- `event_type_id` is a foreign key referencing table `event_types`.
- `description` should be human-readable.
- `datatype` (integer, varchar, timestamp, ...) of the attribute.
- Table `events` records details for each event. \
Columns:
- `id` is the primary key.
- `event_type_id` is a foreign key referencing table `event_types`.
- `timestamp`, preferably using UTC time zone.
- `description` should be human-readable.
- Table `event_attribute_values` stores all attribute values for different events. This setup decouples events and their attributes by storing each attribute value in a new row, facilitating support for late-arriving data points. \
Columns:
- `id` is the primary key.
- `event_id` is a foreign key referencing table `events`.
- `event_attribute_id` is a foreign key referencing table `event_attributes`.
- `attribute_value` is the value of the attribute. This value should match the datatype of the attribute.
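
The row-based layout of the event-related tables can be sketched as follows. This is a minimal illustration, not PyStack't's own code: it uses Python's built-in `sqlite3` as a stand-in for a DuckDB file so that it runs without extra dependencies, and the column types, ids, and sample rows are assumptions made for the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a DuckDB file.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event_types (
    id TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE event_attributes (
    id TEXT PRIMARY KEY,
    event_type_id TEXT REFERENCES event_types(id),
    description TEXT,
    datatype TEXT
);
CREATE TABLE events (
    id TEXT PRIMARY KEY,
    event_type_id TEXT REFERENCES event_types(id),
    timestamp TEXT,
    description TEXT
);
CREATE TABLE event_attribute_values (
    id TEXT PRIMARY KEY,
    event_id TEXT REFERENCES events(id),
    event_attribute_id TEXT REFERENCES event_attributes(id),
    attribute_value TEXT
);
""")

# Hypothetical sample data: one event type, one attribute, one event.
con.execute("INSERT INTO event_types VALUES ('et1', 'issue opened')")
con.execute("INSERT INTO event_attributes VALUES ('ea1', 'et1', 'title', 'varchar')")
con.execute("INSERT INTO events VALUES ('e1', 'et1', '2025-01-01T10:00:00Z', 'issue 1 opened')")
con.execute("INSERT INTO event_attribute_values VALUES ('eav1', 'e1', 'ea1', 'Fix typo')")

# A new attribute that arrives later only adds rows; no ALTER TABLE is needed.
con.execute("INSERT INTO event_attributes VALUES ('ea2', 'et1', 'label', 'varchar')")
con.execute("INSERT INTO event_attribute_values VALUES ('eav2', 'e1', 'ea2', 'docs')")

# Join the rows back together to read the attributes of each event.
event_rows = con.execute("""
    SELECT e.description, ea.description, eav.attribute_value
    FROM event_attribute_values AS eav
    JOIN events AS e ON e.id = eav.event_id
    JOIN event_attributes AS ea ON ea.id = eav.event_attribute_id
    ORDER BY ea.description
""").fetchall()
print(event_rows)
```

The same statements should carry over to DuckDB with minor type adjustments (for example a `TIMESTAMP` column instead of text), but that has not been verified here.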

**Object-related tables** also leverage row-based storage to manage attributes independently. This approach reduces the number of duplicate or NULL values significantly when attributes are updated asynchronously and frequently.
- Table `object_types` records entries for each unique object type.\
Columns:
- `id` is the primary key.
- `description` should be human-readable.
- Table `object_attributes` contains entries for each unique object attribute. \
Columns:
- `id` is the primary key.
- `object_type_id` is a foreign key referencing table `object_types`.
- `description` should be human-readable.
- `datatype` (integer, varchar, timestamp, ...) of the attribute.
- Table `objects` stores details for each object.\
Columns:
- `id` is the primary key.
- `object_type_id` is a foreign key referencing table `object_types`.
- `description` should be human-readable.
- Table `object_attribute_values` records attribute values for objects.\
Columns:
- `id` is the primary key.
- `object_id` is a foreign key referencing table `objects`.
- `object_attribute_id` is a foreign key referencing table `object_attributes`.
- `timestamp` indicates when the attribute was updated. Timestamps are preferably stored using the UTC time zone.
- `attribute_value` is the updated value of the attribute. This value should match the datatype of the attribute.
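
Because every attribute update is stored as its own timestamped row, the state of an object at a given moment is the most recent value per attribute up to that moment. The sketch below illustrates this lookup. It is an assumption-laden example rather than PyStack't functionality: `sqlite3` stands in for DuckDB, `attribute_state` is a hypothetical helper, the attribute ids are simplified to readable names instead of foreign keys, and timestamps are ISO 8601 strings in UTC so that string comparison matches chronological order.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a DuckDB file.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE object_attribute_values (
    id TEXT PRIMARY KEY,
    object_id TEXT,
    object_attribute_id TEXT,
    timestamp TEXT,
    attribute_value TEXT
)
""")

# Hypothetical updates: attributes change independently of each other.
con.executemany(
    "INSERT INTO object_attribute_values VALUES (?, ?, ?, ?, ?)",
    [
        ("oav1", "o1", "status", "2025-01-01T10:00:00Z", "open"),
        ("oav2", "o1", "assignee", "2025-01-02T09:00:00Z", "alice"),
        ("oav3", "o1", "status", "2025-01-03T15:30:00Z", "closed"),
    ],
)


def attribute_state(con, object_id, as_of):
    """Return {attribute id: latest value} for one object at time `as_of`."""
    rows = con.execute(
        """
        SELECT v.object_attribute_id, v.attribute_value
        FROM object_attribute_values AS v
        WHERE v.object_id = ?
          AND v.timestamp = (
              SELECT MAX(w.timestamp)
              FROM object_attribute_values AS w
              WHERE w.object_id = v.object_id
                AND w.object_attribute_id = v.object_attribute_id
                AND w.timestamp <= ?
          )
        """,
        (object_id, as_of),
    ).fetchall()
    return dict(rows)


state_early = attribute_state(con, "o1", "2025-01-02T12:00:00Z")
state_late = attribute_state(con, "o1", "2025-01-04T00:00:00Z")
print(state_early, state_late)
```

Note that only three rows are stored for three updates; a wide table with one column per attribute would need to repeat or NULL the unchanged values on every update.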

**Relation-related tables** serve as bridging tables to manage the different many-to-many relations between events and objects. The qualifier definitions are stored separately to minimize the impact of renaming them in case of changing business requirements.
- Table `relation_qualifiers` stores qualifier definitions. In cases where relation qualifiers are not available in the source data, a dummy qualifier can be introduced.\
Columns:
- `id` is the primary key.
- `description` should be human-readable.
- `datatype` (integer, varchar, timestamp, ...) of the qualifier.
- Table `object_to_object` stores (dynamic) relations between objects.\
Columns:
- `id` is the primary key.
- `source_object_id` is a foreign key referencing table `objects`.
- `target_object_id` is a foreign key referencing table `objects`.
- `timestamp` indicates when the relationship became active. To signify the end of an object-to-object relationship, a NULL value is used for the qualifier value, rather than an end timestamp. This design choice facilitates append-only data ingestion. Timestamps are preferably stored using the UTC time zone.
- `qualifier_id` is a foreign key referencing table `relation_qualifiers`.
- `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
- Table `event_to_object` stores relations between events and objects.\
Columns:
- `id` is the primary key.
- `event_id` is a foreign key referencing table `events`.
- `object_id` is a foreign key referencing table `objects`.
- `qualifier_id` is a foreign key referencing table `relation_qualifiers`.
- `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
- Table `event_to_object_attribute_value` stores relations between events and changes to object attributes.\
Columns:
- `id` is the primary key.
- `event_id` is a foreign key referencing table `events`.
- `object_attribute_value_id` is a foreign key referencing table `object_attribute_values`.
- `qualifier_id` is a foreign key referencing table `relation_qualifiers`.
- `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
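
The append-only handling of object-to-object relations, where a row with a NULL qualifier value marks the end of a relation, can be illustrated with a small query. This is a sketch under several assumptions: `sqlite3` stands in for DuckDB, `active_relations` is a hypothetical helper, and a relation is identified here by the combination of source object, target object, and qualifier, which the schema description does not state explicitly.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a DuckDB file.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE object_to_object (
    id TEXT PRIMARY KEY,
    source_object_id TEXT,
    target_object_id TEXT,
    timestamp TEXT,
    qualifier_id TEXT,
    qualifier_value TEXT
)
""")

# Hypothetical rows; ending a relation appends a row instead of updating one.
con.executemany(
    "INSERT INTO object_to_object VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("r1", "user_alice", "issue_1", "2025-01-01T10:00:00Z", "q1", "assigned to"),
        ("r2", "user_bob", "issue_1", "2025-01-01T11:00:00Z", "q1", "assigned to"),
        ("r3", "user_alice", "issue_1", "2025-01-02T08:00:00Z", "q1", None),  # relation ended
    ],
)


def active_relations(con, as_of):
    """Return relations whose latest row at `as_of` has a non-NULL qualifier value."""
    return con.execute(
        """
        SELECT r.source_object_id, r.target_object_id, r.qualifier_value
        FROM object_to_object AS r
        WHERE r.qualifier_value IS NOT NULL
          AND r.timestamp = (
              SELECT MAX(s.timestamp)
              FROM object_to_object AS s
              WHERE s.source_object_id = r.source_object_id
                AND s.target_object_id = r.target_object_id
                AND s.qualifier_id = r.qualifier_id
                AND s.timestamp <= ?
          )
        ORDER BY r.source_object_id
        """,
        (as_of,),
    ).fetchall()


before_end = active_relations(con, "2025-01-01T12:00:00Z")
after_end = active_relations(con, "2025-01-03T00:00:00Z")
print(before_end, after_end)
```

Since no existing row is ever modified, the full history stays queryable: the same table answers both "who was assigned on January 1st" and "who is assigned now".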
23 changes: 5 additions & 18 deletions README.md
Expand Up @@ -9,28 +9,15 @@ PyStack't is published on [PyPi](https://pypi.org/project/pystackt/) and can be
pip install pystackt
```

## [📖 Documentation](https://lienbosmans.github.io/pystackt/)
## 📖 Documentation

- [Extensive documentation](https://lienbosmans.github.io/pystackt/) is available via GitHub pages.
- A [demo video on YouTube](https://youtu.be/AS8wI90wRM8) can walk you through the different functionalities.

## 🔍 Viewing Data
PyStack't creates **DuckDB database files**. From DuckDB version 1.2.1 onwards, you can explore them using the [**UI extension**](https://duckdb.org/docs/stable/extensions/ui.html). The code below starts the UI and opens `http://localhost:4213` in your default browser.

```python
import duckdb

with duckdb.connect("./stackt.duckdb") as quack:
    quack.sql("CALL start_ui()")
    input("Press Enter to close the connection...")
```

Alternatively, you can use a database manager. You can follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install **DBeaver** for easy access.

- Our BPM 2025 demo paper [PyStack't: Real-Life Data for Object-Centric Process Mining](https://ceur-ws.org/Vol-4032/paper-28.pdf) is available on CEUR.

## 📝 Examples

### ⛏️🐙 Extract object-centric event log from GitHub repo ([`get_github_log`](https://lienbosmans.github.io/pystackt/extract/get_github_log.html))
### ⛏️🐙 Extract object-centric event log from GitHub repo ([`get_github_log`](https://lienbosmans.github.io/pystackt/content/reference/extract/get_github_log.html))
```python
from pystackt import *

Expand All @@ -44,7 +31,7 @@ get_github_log(
)
```

### 📈 Interactive data exploration ([`start_visualization_app`](https://lienbosmans.github.io/pystackt/exploration/interactive_data_visualization_app.html))
### 📈 Interactive data exploration ([`start_visualization_app`](https://lienbosmans.github.io/pystackt/content/reference/exploration/interactive_data_visualization_app.html))

```python
from pystackt import *
Expand All @@ -61,7 +48,7 @@ start_visualization_app(
)
```

### 📤 Export to OCEL 2.0 ([`export_to_ocel2`](https://lienbosmans.github.io/pystackt/export/export_to_ocel2.html))
### 📤 Export to OCEL 2.0 ([`export_to_ocel2`](https://lienbosmans.github.io/pystackt/content/reference/export/export_to_ocel2.html))
```python
from pystackt import *

64 changes: 21 additions & 43 deletions docs/README.md
@@ -1,52 +1,30 @@
# PyStack't Documentation

PyStack't (`pip install pystackt`) is a Python package that supports data preparation for object-centric process mining. It covers extraction of object-centric event data, storage of that data, (visual) data exploration, and export to OCED formats.
PyStack't is a Python package that supports data preparation for object-centric process mining. It covers extraction of object-centric event data, storage of that data, (visual) data exploration, and export to popular OCED formats.

[Source code](https://github.com/LienBosmans/pystackt) | [PyPi](https://pypi.org/project/pystackt/) | [Contributing Guide](https://github.com/LienBosmans/pystackt/blob/main/CONTRIBUTING.md)
The documentation is structured in four different parts:
- [Tutorials](#-tutorials-start-here): hands-on lessons for beginners
- [Reference material](#-reference-material): technical descriptions
- [How-to guides](#-how-to-guides): practical directions
- [Behind-the-scenes](#-behind-the-scenes): context and background

## 📚 Tutorials (start here)

## Data Storage
- [Extracting your first object-centric event log from a GitHub repository](content/tutorials/tutorial_extracting_OCED.md)

PyStack't uses the Stack't relational schema to store object-centric event data. This schema was created specifically to support the data preparation stage, taking into account data engineering best practices. For more information on the design of Stack't, we recommend the paper [Dynamic and Scalable Data Preparation for Object-Centric Process Mining](https://arxiv.org/abs/2410.00596).
## 📖 Reference material
### Functions
- [⛏️ get_github_log](content/reference/extract/get_github_log.md)
- [📤 export_to_ocel2](content/reference/export/export_to_ocel2.md)
- [📤 export_to_promg](content/reference/export/export_to_promg.md)
- [📈 create_statistics_views](content/reference/exploration/create_statistics_views.md)
- [📈 interactive data visualization app](content/reference/exploration/interactive_data_visualization_app.md)

![PyStack't has a modular design.](/docs/pystackt_architecture.png)
### Output data
- [🗺️ Overview of `get_github_log` output](content/reference/extract/github_OCED.md)

While any relational database can be used to store data in the Stack't relational schema, PyStack't uses [DuckDB](https://duckdb.org/) because it's open-source, fast and simple to use. (Think SQLite but for analytical workloads.)
## ❓ How-to guides
- [How to view DuckDB files?](content/howto/view_duckdb_files.md)

From DuckDB version 1.2.1 onwards, you can explore them using the [**UI extension**](https://duckdb.org/docs/stable/extensions/ui.html). The code below starts the UI and opens `http://localhost:4213` in your default browser.

```python
import duckdb

with duckdb.connect("./stackt.duckdb") as quack:
    quack.sql("CALL start_ui()")
    input("Press Enter to close the connection...")
```

Alternatively, you can use a database manager. You can follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install **DBeaver** for easy access.


## Data extraction

Extracting data from different systems is an important part of data preparation. While PyStack't does not include all functionality that a data stack offers (incremental ingests, scheduling refreshes, monitoring data pipelines...), it aims to provide simple-to-use methods to get real-life data for your object-centric process mining adventures.

### ⛏️ List of data extraction functionality
- [`get_github_log`](extract/get_github_log.md)


## Data export

The Stack't relational schema is intended as an intermediate storage hub. PyStack't provides export functionality to export the data to specific OCED formats that can be used by process mining applications and algorithms. This decoupled set-up has as main advantage that any future data source can be exported to all supported data formats, and any future OCED format can be combined with existing data extraction functionality.

### 📤 List of data export functionality
- [`export_to_ocel2`](export/export_to_ocel2.md)
- [`export_to_promg`](export/export_to_promg.md)


## Data exploration

Dispersing process data across multiple tables makes exploring object-centric event data less straightforward compared to traditional process mining. PyStack't aims to bridge this gap by providing dedicated data exploration functionality. Notably, the latest release includes an interactive data exploration app that runs locally and works out-of-the-box with any OCED data structured in the Stack't relational schema.

### 📈 List of data exploration functionality
- [`create_statistics_views`](exploration/create_statistics_views.md)
- [`interactive data visualization app`](exploration/interactive_data_visualization_app.md)
## 💡 Behind-the-scenes
- [About the design of PyStack't](content/explained/pystackt_design.md)
17 changes: 17 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,17 @@
theme: jekyll-theme-minimal

title: PyStack't Documentation
description: "Real-life data for object-centric process mining"
logo: /assets/images/pystackt_logo_black_circle_small.png

paper_url: https://ceur-ws.org/Vol-4032/paper-28.pdf
paper_title: "L. Bosmans, J. Peeperkorn, J. De Smedt, PyStack’t: Real-Life Data for Object-Centric Process Mining"

demo_url: https://www.youtube.com/watch?v=AS8wI90wRM8&feature=youtu.be
demo_title: "PyStack't Demo BPM 2025"

pypi_url: https://pypi.org/project/pystackt/
pypi_title: "pip install pystackt"

contributing_url: https://github.com/LienBosmans/pystackt/blob/main/CONTRIBUTING.md
contributing_title: "Contributing guide"