diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1aed50c..d7c1470 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -33,49 +33,51 @@ Do you have an idea to extend PyStack't with extra functionality? Awesome! Pleas
- We try to limit the number of external dependencies.
- For performant data transformations, please use [DuckDB](https://duckdb.org/docs/stable/) (SQL) or [Polars](https://docs.pola.rs/) (DataFrame).
- For (interactive) visualizations, please use [Matplotlib](https://matplotlib.org/) or [Dash](https://dash.plotly.com/).
-- Every PyStack't function reads from or writes to a DuckDB database file that uses the [Stack't relational schema](#stackt-relational-schema).
+- Every PyStack't function reads from or writes to a DuckDB database file that uses the [Stack't relational schema](/docs/content/explained/pystackt_design.md).
- New functionality must also be compatible with DuckDB files with the Stack't relational schema.
- If the Stack't relational schema does not fit your use-case, and you want to propose an improvement, please reach out directly to [Lien Bosmans](mailto:lienbosmans@live.com).
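The read-from/write-to convention above can be sketched in a few lines. The sketch below uses the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without dependencies (the same SQL works in DuckDB); the function name `count_events_per_type` and the sample rows are hypothetical — only the fixed Stack't tables are assumed:

```python
import sqlite3  # stand-in for duckdb; with DuckDB installed: duckdb.connect("stackt.duckdb")

def count_events_per_type(con):
    """Hypothetical PyStack't-style function: it assumes only the fixed
    Stack't tables (here event_types and events), nothing source-specific."""
    return con.execute(
        """
        SELECT et.description, COUNT(ev.id)
        FROM event_types AS et
        LEFT JOIN events AS ev ON ev.event_type_id = et.id
        GROUP BY et.description
        ORDER BY et.description
        """
    ).fetchall()

# Minimal in-memory database with the two tables the function reads.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event_types (id INTEGER PRIMARY KEY, description VARCHAR)")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, event_type_id INTEGER, "
            "timestamp TIMESTAMP, description VARCHAR)")
con.execute("INSERT INTO event_types VALUES (1, 'commit'), (2, 'issue opened')")
con.execute("INSERT INTO events VALUES (10, 1, '2024-01-01 12:00:00', 'initial commit')")
print(count_events_per_type(con))  # [('commit', 1), ('issue opened', 0)]
```

Because every function assumes only the fixed schema, it automatically composes with any extractor that writes these tables.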
### Data extractors
-Example: GitHub extractor [ [code](/src/pystackt/extractors/github/) | [docs](/docs/extract/get_github_log.md) ]
+Example: GitHub extractor [ [code](/src/pystackt/extractors/github/) | [docs](/docs/content/reference/extract/get_github_log.md) ]
What is expected:
1. Choose a publicly available data source that contains real-life event data.
1. Figure out how the source data is structured, how the API works, ...
-1. Map the data to the [Stack't relational schema](#stackt-relational-schema).
+1. Map the data to the [Stack't relational schema](/docs/content/explained/pystackt_design.md).
1. Clean up your code. Save it in a new subfolder of [/src/pystackt/extractors/](/src/pystackt/extractors/).
- Re-use existing functionality when possible.
- Write modular functions.
- Include error handling.
- Use doc strings and in-line comments.
1. Test your code.
-1. Write end-user documentation. Add it as a markdown file in the folder [/docs/extract/](/docs/extract/). The documentation should include
- - code snippet with example
- - table that explains all parameters of the function
- - explanation on how to generate credentials to connect to the data source (if relevant)
- - description of which data is extracted
- - (link to) explanation of how the extracted data is allowed to be used
+1. Write reference documentation. Add it as markdown files in the folder [/docs/content/reference/extract/](/docs/content/reference/extract/).
+ - The function documentation should include
+ - code snippet with example
+ - table that explains all parameters of the function
+ - explanation on how to generate credentials to connect to the data source (if relevant)
+ - (link to) explanation of how the extracted data is allowed to be used
+ - The output data documentation should include
+ - description of which data is extracted
### Data exporters
-Example: OCEL 2.0 [ [code](/src/pystackt/exporters/ocel2/) | [docs](/docs/export/export_to_ocel2.md) ]
+Example: OCEL 2.0 [ [code](/src/pystackt/exporters/ocel2/) | [docs](/docs/content/reference/export/export_to_ocel2.md) ]
Please note that the exported data format should be **object-centric** and **supported by at least one tool** (software, application, Python package, script, ...) that is open-source (*preferred*) or offers a free license for developers / students / personal use.
What is expected:
1. Choose an object-centric event data format.
-1. Map the [Stack't relational schema](#stackt-relational-schema) to your chosen data format.
+1. Map the [Stack't relational schema](/docs/content/explained/pystackt_design.md) to your chosen data format.
1. Clean up your code. Save it in a new subfolder of [/src/pystackt/exporters/](/src/pystackt/exporters/).
- Re-use existing functionality when possible.
- Write modular functions.
- Include error handling.
- Use doc strings and in-line comments.
1. Test your code.
-1. Write end-user documentation. Add it as a markdown file in the folder [/docs/export/](/docs/export/). The documentation should include
+1. Write end-user documentation. Add it as a markdown file in the folder [/docs/content/reference/export/](/docs/content/reference/export/). The documentation should include
- code snippet with example
- table that explains all parameters of the function
- overview of any information loss that happens when exporting to this format
@@ -86,7 +88,7 @@ What is expected:
Data preparation is definitely more than simply extracting and exporting data, so we also welcome additional functionality that supports activities like data exploration, data cleaning, data filtering, ...
The previously discussed items still apply:
-1. Start from the [Stack't relational schema](#stackt-relational-schema) in a DuckDB file.
+1. Start from the [Stack't relational schema](/docs/content/explained/pystackt_design.md) in a DuckDB file.
- If the Stack't relational schema does not work with the application you have in mind, include a function to prepare the data first. ([example](/src/pystackt/exploration/graph/data_prep/))
1. Clean up your code. Document your code. Test your code.
1. Write end-user documentation.
@@ -106,89 +108,3 @@ Simply create a pull request (PR)! Some good practices to consider:
- documentation of one function + code improvements of another function
- Write meaningful commit messages.
- Don't combine independent changes in the same commit.
-
-
-## Stack't relational schema
-
-The Stack't relational schema describes how to store object-centric event data in a relational database using a fixed set of tables and table columns. This absence of any schema changes makes the format well-suited to act as a central data hub, enabling the modular design of PyStack't.
-
-An overview of the tables and columns is included in this document. For more information on the design choices and the proof-of-concept implementation [Stack't](https://github.com/LienBosmans/stack-t), we recommend reading the paper [Dynamic and Scalable Data Preparation for Object-Centric Process Mining](https://arxiv.org/abs/2410.00596).
-
-
-
-**Event-related tables**. To maintain flexibility and support dynamic changes, event types and their attribute definitions are stored in rows rather than being defined by table and column names. This approach enables the use of the exact same tables across all processes, reducing the impact of schema modifications. Changing an event type involves updating foreign keys rather than moving data to different tables, and attributes can be added or removed without altering the schema.
-- Table `event_types` contains an entry for each unique event type. \
- Columns:
- - `id` is the primary key.
- - `description` should be human-readable.
-- Table `event_attributes` stores entries for each unique event attribute. \
- Columns:
- - `id` is the primary key.
- - `event_type_id` is a foreign key referencing table `event_types`.
- - `description` should be human-readable.
- - `datatype` of the attribute (integer, varchar, timestamp, ...) of the attribute.
-- Table `events` records details for each event. \
- Columns:
- - `id` is the primary key.
- - `event_type_id` is a foreign key referencing table `event_types`.
- - `timestamp`, preferably using UTC time zone.
- - `description` should be human-readable.
-- Table `event_attribute_values` stores all attribute values for different events. This setup decouples events and their attributes by storing each attribute value in a new row, facilitating support for late-arriving data points. \
- Columns:
- - `id` is the primary key.
- - `event_id` is a foreign key referencing table `events`.
- - `event_attribute_id` is a foreign key referencing table `event_attributes`.
- - `attribute_value` is the value of the attribute. This value should match the datatype of the attribute.
-
-**Object-related tables** also leverage row-based storage to manage attributes independently. This approach reduces the number of duplicate or NULL values significantly when attributes are updated asynchronously and frequently.
-- Table `object_types` records entries for each unique object type.\
- Columns:
- - `id` is the primary key.
- - `description` should be human-readable
-- Table `object_attributes` contains entries for each unique object attribute. \
- Columns:
- - `id` is the primary key.
- - `object_type_id` is a foreign key referencing table `object_types`.
- - `description` should be human-readable.
- - `datatype` (integer, varchar, timestamp, ...) of the attribute.
-- Table `object` stores details for each object.\
- Columns:
- - `id` is the primary key.
- - `object_type_id` is a foreign key referencing table `object_types`.
- - `description` should be human-readable.
-- Table `object_attribute_values` records attribute values for objects.\
- Columns:
- - `id` is the primary key.
- - `object_id` is a foreign key referencing table `objects`.
- - `object_attribute_id` is a foreign key referencing table `object_attributes`.
- - `timestamp` indicates when the attribute was updated. Timestamps are preferably stored using the UTC time zone.
- - `attribute_value` is the updated value of the attribute. This value should match the datatype of the attribute.
-
-**Relation-related tables** serve as bridging tables to manage the different many-to-many relations between events and objects. The qualifier definitions are stored separately to minimize the impact of renaming them in case of changing business requirements
-- Table `relation_qualifiers` stores qualifier definitions. In cases where relation qualifiers are not available in the source data, a dummy qualifier can be introduced.\
- Columns
- - `id` is the primary key.
- - `description` should be human-readable.
- - `datatype` (integer, varchar, timestamp, ...) of the attribute.
-- Table `object_to_object` stores (dynamic) relations between objects.\
- Columns:
- - `id` is the primary key.
- - `source_object_id` is a foreign key referencing table `objects`.
- - `target_object_id` is a foreign key referencing table `objects`.
- - `timestamp` indicates when the relationship became active. To signify the end of an object-to-object relationship, a NULL value is used for the qualifier value, rather than an end timestamp. This design choice facilitates append-only data ingestion. Timestamps are preferably stored using the UTC time zone.
- - `qualifier_id` is a foreign key referencing table `qualifiers`.
- - `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
-- Table `event_to_object` stores relations between events and objects.\
- Columns:
- - `id` is the primary key.
- - `event_id` is a foreign key referencing table `events`.
- - `object_id` is a foreign key referencing table `objects`.
- - `qualifier_id` is a foreign key referencing table `qualifiers`.
- - `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
-- Table `event_to_object_attribute_value` stores relations between events and changes to object attributes.\
- Columns:
- - `id` is the primary key.
- - `event_id` is a foreign key referencing table `events`.
- - `object_attribute_value_id` is a foreign key referencing table `object_attribute_values`.
- - `qualifier_id` is a foreign key referencing table `qualifiers`.
- - `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
diff --git a/README.md b/README.md
index 8607971..ebc34e0 100644
--- a/README.md
+++ b/README.md
@@ -9,28 +9,15 @@ PyStack't is published on [PyPi](https://pypi.org/project/pystackt/) and can be
pip install pystackt
```
-## [📖 Documentation](https://lienbosmans.github.io/pystackt/)
+## 📖 Documentation
- [Extensive documentation](https://lienbosmans.github.io/pystackt/) is available via GitHub pages.
- A [demo video on YouTube](https://youtu.be/AS8wI90wRM8) can walk you through the different functionalities.
-
-## 🔍 Viewing Data
-PyStack't creates **DuckDB database files**. From DuckDB version 1.2.1 onwards, you can explore them using the [**UI extension**](https://duckdb.org/docs/stable/extensions/ui.html). Below code will load the UI by navigating to `http://localhost:4213` in your default browser.
-
-```python
-import duckdb
-
-with duckdb.connect("./stackt.duckdb") as quack:
- quack.sql("CALL start_ui()")
- input("Press Enter to close the connection...")
-```
-
-Alternatively, you can use a database manager. You can follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install **DBeaver** for easy access.
-
+- Our BPM 2025 demo paper [PyStack't: Real-Life Data for Object-Centric Process Mining](https://ceur-ws.org/Vol-4032/paper-28.pdf) is available on CEUR.
## 📝 Examples
-### ⛏️🐙 Extract object-centric event log from GitHub repo ([`get_github_log`](https://lienbosmans.github.io/pystackt/extract/get_github_log.html))
+### ⛏️🐙 Extract object-centric event log from GitHub repo ([`get_github_log`](https://lienbosmans.github.io/pystackt/content/reference/extract/get_github_log.html))
```python
from pystackt import *
@@ -44,7 +31,7 @@ get_github_log(
)
```
-### 📈 Interactive data exploration ([`start_visualization_app`](https://lienbosmans.github.io/pystackt/exploration/interactive_data_visualization_app.html))
+### 📈 Interactive data exploration ([`start_visualization_app`](https://lienbosmans.github.io/pystackt/content/reference/exploration/interactive_data_visualization_app.html))
```python
from pystackt import *
@@ -61,7 +48,7 @@ start_visualization_app(
)
```
-### 📤 Export to OCEL 2.0 ([`export_to_ocel2`](https://lienbosmans.github.io/pystackt/export/export_to_ocel2.html))
+### 📤 Export to OCEL 2.0 ([`export_to_ocel2`](https://lienbosmans.github.io/pystackt/content/reference/export/export_to_ocel2.html))
```python
from pystackt import *
diff --git a/docs/README.md b/docs/README.md
index 05fa10e..0458931 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,52 +1,30 @@
# PyStack't Documentation
-PyStack't (`pip install pystackt`) is a Python package that supports data preparation for object-centric process mining. It covers extraction of object-centric event data, storage of that data, (visual) data exploration, and export to OCED formats.
+PyStack't is a Python package that supports data preparation for object-centric process mining. It covers extraction of object-centric event data, storage of that data, (visual) data exploration, and export to popular OCED formats.
-[Source code](https://github.com/LienBosmans/pystackt) | [PyPi](https://pypi.org/project/pystackt/) | [Contributing Guide](https://github.com/LienBosmans/pystackt/blob/main/CONTRIBUTING.md)
+The documentation is structured in four parts:
+- [Tutorials](#-tutorials-start-here): hands-on lessons for beginners
+- [Reference material](#-reference-material): technical descriptions
+- [How-to guides](#-how-to-guides): practical directions
+- [Behind-the-scenes](#-behind-the-scenes): context and background
+## 📚 Tutorials (start here)
-## Data Storage
+- [Extracting your first object-centric event log from a GitHub repository](content/tutorials/tutorial_extracting_OCED.md)
-PyStack't uses the Stack't relational schema to store object-centric event data. This schema was created specifically to support the data preparation stage, taking into account data engineering best practices. For more information on the design of Stack't, we recommend the paper [Dynamic and Scalable Data Preparation for Object-Centric Process Mining](https://arxiv.org/abs/2410.00596).
+## 📖 Reference material
+### Functions
+- [⛏️ get_github_log](content/reference/extract/get_github_log.md)
+- [📤 export_to_ocel2](content/reference/export/export_to_ocel2.md)
+- [📤 export_to_promg](content/reference/export/export_to_promg.md)
+- [📈 create_statistics_views](content/reference/exploration/create_statistics_views.md)
+- [📈 interactive data visualization app](content/reference/exploration/interactive_data_visualization_app.md)
-
+### Output data
+- [🗺️ Overview of `get_github_log` output](content/reference/extract/github_OCED.md)
-While any relational database can be used to store data in the Stack't relational schema, PyStack't uses [DuckDB](https://duckdb.org/) because it's open-source, fast and simple to use. (Think SQLite but for analytical workloads.)
+## ❓ How-to guides
+- [How to view DuckDB files?](content/howto/view_duckdb_files.md)
-From DuckDB version 1.2.1 onwards, you can explore them using the [**UI extension**](https://duckdb.org/docs/stable/extensions/ui.html). Below code will load the UI by navigating to `http://localhost:4213` in your default browser.
-
-```python
-import duckdb
-
-with duckdb.connect("./stackt.duckdb") as quack:
- quack.sql("CALL start_ui()")
- input("Press Enter to close the connection...")
-```
-
-Alternatively, you can use a database manager. You can follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install **DBeaver** for easy access.
-
-
-## Data extraction
-
-Extracting data from different systems is an important part of data preparation. While PyStack't does not include all functionality that a data stack offers (incremental ingests, scheduling refreshes, monitoring data pipelines...), it aims to provide simple-to-use methods to get real-life data for your object-centric process mining adventures.
-
-### ⛏️ List of data extraction functionality
-- [`get_github_log`](extract/get_github_log.md)
-
-
-## Data export
-
-The Stack't relational schema is intended as an intermediate storage hub. PyStack't provides export functionality to export the data to specific OCED formats that can be used by process mining applications and algorithms. This decoupled set-up has as main advantage that any future data source can be exported to all supported data formats, and any future OCED format can be combined with existing data extraction functionality.
-
-### 📤 List of data export functionality
-- [`export_to_ocel2`](export/export_to_ocel2.md)
-- [`export_to_promg`](export/export_to_promg.md)
-
-
-## Data exploration
-
-Dispersing process data across multiple tables makes exploring object-centric event data less straightforward compared to traditional process mining. PyStack't aims to bridge this gap by providing dedicated data exploration functionality. Notably, the latest release includes an interactive data exploration app that runs locally and works out-of-the-box with any OCED data structured in the Stack't relational schema.
-
-### 📈 List of data exploration functionality
-- [`create_statistics_views`](exploration/create_statistics_views.md)
-- [`interactive data visualization app`](exploration/interactive_data_visualization_app.md)
+## 💡 Behind-the-scenes
+- [About the design of PyStack't](content/explained/pystackt_design.md)
diff --git a/docs/_config.yml b/docs/_config.yml
new file mode 100644
index 0000000..65221e7
--- /dev/null
+++ b/docs/_config.yml
@@ -0,0 +1,17 @@
+theme: jekyll-theme-minimal
+
+title: PyStack't Documentation
+description: "Real-life data for object-centric process mining"
+logo: /assets/images/pystackt_logo_black_circle_small.png
+
+paper_url: https://ceur-ws.org/Vol-4032/paper-28.pdf
+paper_title: "L. Bosmans, J. Peeperkorn, J. De Smedt, PyStack’t: Real-Life Data for Object-Centric Process Mining"
+
+demo_url: https://www.youtube.com/watch?v=AS8wI90wRM8&feature=youtu.be
+demo_title: "PyStack't Demo BPM 2025"
+
+pypi_url: https://pypi.org/project/pystackt/
+pypi_title: "pip install pystackt"
+
+contributing_url: https://github.com/LienBosmans/pystackt/blob/main/CONTRIBUTING.md
+contributing_title: "Contributing guide"
diff --git a/docs/_includes/head-custom.html b/docs/_includes/head-custom.html
new file mode 100644
index 0000000..df51bab
--- /dev/null
+++ b/docs/_includes/head-custom.html
@@ -0,0 +1,9 @@
+
+
+
+{% include head-custom-google-analytics.html %}
+
+
+
+
+
diff --git a/docs/_layouts/default.html b/docs/_layouts/default.html
new file mode 100644
index 0000000..6a186ba
--- /dev/null
+++ b/docs/_layouts/default.html
@@ -0,0 +1,51 @@
+
+
+
+
+
+
+
+{% seo %}
+
+
+ {% include head-custom.html %}
+
+
+
+
+
+
diff --git a/docs/assets/css/style.scss b/docs/assets/css/style.scss
new file mode 100644
index 0000000..5742f93
--- /dev/null
+++ b/docs/assets/css/style.scss
@@ -0,0 +1,4 @@
+---
+---
+
+@import "{{ site.theme }}";
diff --git a/docs/assets/images/favicon.ico b/docs/assets/images/favicon.ico
new file mode 100644
index 0000000..740d8b9
Binary files /dev/null and b/docs/assets/images/favicon.ico differ
diff --git a/docs/pystackt_architecture.png b/docs/assets/images/pystackt_architecture.png
similarity index 100%
rename from docs/pystackt_architecture.png
rename to docs/assets/images/pystackt_architecture.png
diff --git a/docs/assets/images/pystackt_logo_black_circle_small.png b/docs/assets/images/pystackt_logo_black_circle_small.png
new file mode 100644
index 0000000..acb2778
Binary files /dev/null and b/docs/assets/images/pystackt_logo_black_circle_small.png differ
diff --git a/docs/assets/images/pystackt_logo_black_circle_small.svg b/docs/assets/images/pystackt_logo_black_circle_small.svg
new file mode 100644
index 0000000..1f73ead
--- /dev/null
+++ b/docs/assets/images/pystackt_logo_black_circle_small.svg
@@ -0,0 +1,2 @@
+
\ No newline at end of file
diff --git a/docs/content/explained/pystackt_design.md b/docs/content/explained/pystackt_design.md
new file mode 100644
index 0000000..39d5ed6
--- /dev/null
+++ b/docs/content/explained/pystackt_design.md
@@ -0,0 +1,106 @@
+# PyStack't design
+
+PyStack't has a modular design, enabled by the use of a fixed relational schema to store OCED data. Since the tables and columns are always the same, independent of object and event types, it's easy to plug in new data sources and other functionality. Any new functionality is automatically compatible with existing code, which is a big win for maintainability.
+
+
+
+## Data Storage
+
+PyStack't uses the Stack't relational schema to store object-centric event data. This schema was created specifically to support the data preparation stage, taking into account data engineering best practices.
+While any relational database can be used to store data in the Stack't relational schema, PyStack't uses [DuckDB](https://duckdb.org/) because it's open-source, fast and simple to use. (Think SQLite but for analytical workloads.)
+
+### Stack't relational schema
+
+The Stack't relational schema describes how to store object-centric event data in a relational database using a fixed set of tables and table columns. This absence of any schema changes makes the format well-suited to act as a central data hub, enabling the modular design of PyStack't.
+
+An overview of the tables and columns is included below. For more information on the design choices and the proof-of-concept implementation [Stack't](https://github.com/LienBosmans/stack-t), we recommend reading the paper [Dynamic and Scalable Data Preparation for Object-Centric Process Mining](https://arxiv.org/abs/2410.00596).
+
+**Event-related tables**. To maintain flexibility and support dynamic changes, event types and their attribute definitions are stored in rows rather than being defined by table and column names. This approach enables the use of the exact same tables across all processes, reducing the impact of schema modifications. Changing an event type involves updating foreign keys rather than moving data to different tables, and attributes can be added or removed without altering the schema.
+- Table `event_types` contains an entry for each unique event type. \
+ Columns:
+ - `id` is the primary key.
+ - `description` should be human-readable.
+- Table `event_attributes` stores entries for each unique event attribute. \
+ Columns:
+ - `id` is the primary key.
+ - `event_type_id` is a foreign key referencing table `event_types`.
+ - `description` should be human-readable.
+ - `datatype` (integer, varchar, timestamp, ...) of the attribute.
+- Table `events` records details for each event. \
+ Columns:
+ - `id` is the primary key.
+ - `event_type_id` is a foreign key referencing table `event_types`.
+ - `timestamp`, preferably using UTC time zone.
+ - `description` should be human-readable.
+- Table `event_attribute_values` stores all attribute values for different events. This setup decouples events and their attributes by storing each attribute value in a new row, facilitating support for late-arriving data points. \
+ Columns:
+ - `id` is the primary key.
+ - `event_id` is a foreign key referencing table `events`.
+ - `event_attribute_id` is a foreign key referencing table `event_attributes`.
+ - `attribute_value` is the value of the attribute. This value should match the datatype of the attribute.
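A minimal sketch of the event-related tables, using the standard-library `sqlite3` module as a stand-in for DuckDB (the same DDL and queries run in DuckDB); the column datatypes and sample rows are illustrative assumptions:

```python
import sqlite3  # stdlib stand-in; the same statements run in DuckDB

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE event_types (id INTEGER PRIMARY KEY, description VARCHAR);
    CREATE TABLE event_attributes (id INTEGER PRIMARY KEY,
        event_type_id INTEGER REFERENCES event_types(id),
        description VARCHAR, datatype VARCHAR);
    CREATE TABLE events (id INTEGER PRIMARY KEY,
        event_type_id INTEGER REFERENCES event_types(id),
        timestamp TIMESTAMP, description VARCHAR);
    CREATE TABLE event_attribute_values (id INTEGER PRIMARY KEY,
        event_id INTEGER REFERENCES events(id),
        event_attribute_id INTEGER REFERENCES event_attributes(id),
        attribute_value VARCHAR);
""")
# A new event type or attribute is just an INSERT -- no ALTER TABLE needed.
con.execute("INSERT INTO event_types VALUES (1, 'commit')")
con.execute("INSERT INTO event_attributes VALUES (1, 1, 'author', 'varchar')")
con.execute("INSERT INTO events VALUES (1, 1, '2024-01-01 12:00:00', 'initial commit')")
con.execute("INSERT INTO event_attribute_values VALUES (1, 1, 1, 'alice')")
row = con.execute("""
    SELECT e.description, a.description, v.attribute_value
    FROM event_attribute_values AS v
    JOIN events AS e ON v.event_id = e.id
    JOIN event_attributes AS a ON v.event_attribute_id = a.id
""").fetchone()
print(row)  # ('initial commit', 'author', 'alice')
```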
+
+**Object-related tables** also leverage row-based storage to manage attributes independently. This approach reduces the number of duplicate or NULL values significantly when attributes are updated asynchronously and frequently.
+- Table `object_types` records entries for each unique object type.\
+ Columns:
+ - `id` is the primary key.
+ - `description` should be human-readable.
+- Table `object_attributes` contains entries for each unique object attribute. \
+ Columns:
+ - `id` is the primary key.
+ - `object_type_id` is a foreign key referencing table `object_types`.
+ - `description` should be human-readable.
+ - `datatype` (integer, varchar, timestamp, ...) of the attribute.
+- Table `objects` stores details for each object.\
+ Columns:
+ - `id` is the primary key.
+ - `object_type_id` is a foreign key referencing table `object_types`.
+ - `description` should be human-readable.
+- Table `object_attribute_values` records attribute values for objects.\
+ Columns:
+ - `id` is the primary key.
+ - `object_id` is a foreign key referencing table `objects`.
+ - `object_attribute_id` is a foreign key referencing table `object_attributes`.
+ - `timestamp` indicates when the attribute was updated. Timestamps are preferably stored using the UTC time zone.
+ - `attribute_value` is the updated value of the attribute. This value should match the datatype of the attribute.
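Because attribute updates are appended as new rows, the current state of an object is reconstructed by picking the most recent value per attribute. A sketch of that query (again with the standard-library `sqlite3` standing in for DuckDB; the ids and values are made up):

```python
import sqlite3  # stand-in for DuckDB; sample ids and values are hypothetical

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE object_attribute_values (
    id INTEGER PRIMARY KEY, object_id INTEGER, object_attribute_id INTEGER,
    timestamp TIMESTAMP, attribute_value VARCHAR)""")
# Two appended updates to the same attribute of the same object.
con.execute("INSERT INTO object_attribute_values VALUES (1, 7, 3, '2024-01-01 09:00:00', 'open')")
con.execute("INSERT INTO object_attribute_values VALUES (2, 7, 3, '2024-02-01 09:00:00', 'closed')")
# Current state = most recent value per (object, attribute) pair.
latest = con.execute("""
    SELECT object_id, object_attribute_id, attribute_value
    FROM object_attribute_values AS v
    WHERE v.timestamp = (SELECT MAX(w.timestamp) FROM object_attribute_values AS w
                         WHERE w.object_id = v.object_id
                           AND w.object_attribute_id = v.object_attribute_id)
""").fetchall()
print(latest)  # [(7, 3, 'closed')]
```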
+
+**Relation-related tables** serve as bridging tables to manage the different many-to-many relations between events and objects. The qualifier definitions are stored separately to minimize the impact of renaming them in case of changing business requirements.
+- Table `relation_qualifiers` stores qualifier definitions. In cases where relation qualifiers are not available in the source data, a dummy qualifier can be introduced.\
+ Columns:
+ - `id` is the primary key.
+ - `description` should be human-readable.
+ - `datatype` (integer, varchar, timestamp, ...) of the attribute.
+- Table `object_to_object` stores (dynamic) relations between objects.\
+ Columns:
+ - `id` is the primary key.
+ - `source_object_id` is a foreign key referencing table `objects`.
+ - `target_object_id` is a foreign key referencing table `objects`.
+ - `timestamp` indicates when the relationship became active. To signify the end of an object-to-object relationship, a NULL value is used for the qualifier value, rather than an end timestamp. This design choice facilitates append-only data ingestion. Timestamps are preferably stored using the UTC time zone.
+ - `qualifier_id` is a foreign key referencing table `relation_qualifiers`.
+ - `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
+- Table `event_to_object` stores relations between events and objects.\
+ Columns:
+ - `id` is the primary key.
+ - `event_id` is a foreign key referencing table `events`.
+ - `object_id` is a foreign key referencing table `objects`.
+ - `qualifier_id` is a foreign key referencing table `relation_qualifiers`.
+ - `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
+- Table `event_to_object_attribute_value` stores relations between events and changes to object attributes.\
+ Columns:
+ - `id` is the primary key.
+ - `event_id` is a foreign key referencing table `events`.
+ - `object_attribute_value_id` is a foreign key referencing table `object_attribute_values`.
+ - `qualifier_id` is a foreign key referencing table `relation_qualifiers`.
+ - `qualifier_value` provides additional relationship details. This value should match the datatype of the qualifier.
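The bridging tables above can be sketched as follows, including the dummy-qualifier trick for sources without qualifiers (`sqlite3` as a dependency-free stand-in for DuckDB; all sample rows are hypothetical):

```python
import sqlite3  # dependency-free stand-in for DuckDB; all rows are hypothetical

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, event_type_id INTEGER,
        timestamp TIMESTAMP, description VARCHAR);
    CREATE TABLE objects (id INTEGER PRIMARY KEY, object_type_id INTEGER, description VARCHAR);
    CREATE TABLE relation_qualifiers (id INTEGER PRIMARY KEY, description VARCHAR, datatype VARCHAR);
    CREATE TABLE event_to_object (id INTEGER PRIMARY KEY, event_id INTEGER, object_id INTEGER,
        qualifier_id INTEGER, qualifier_value VARCHAR);
""")
con.execute("INSERT INTO events VALUES (1, 1, '2024-01-01 12:00:00', 'initial commit')")
con.execute("INSERT INTO objects VALUES (1, 1, 'alice')")
# The source data has no qualifiers, so a dummy qualifier is introduced.
con.execute("INSERT INTO relation_qualifiers VALUES (1, 'dummy', 'varchar')")
con.execute("INSERT INTO event_to_object VALUES (1, 1, 1, 1, NULL)")
row = con.execute("""
    SELECT e.description, o.description, q.description
    FROM event_to_object AS eo
    JOIN events AS e ON eo.event_id = e.id
    JOIN objects AS o ON eo.object_id = o.id
    JOIN relation_qualifiers AS q ON eo.qualifier_id = q.id
""").fetchone()
print(row)  # ('initial commit', 'alice', 'dummy')
```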
+
+
+## Data extraction
+
+Extracting data from different systems is an important part of data preparation. While PyStack't does not include all functionality that a data stack offers (incremental ingests, scheduling refreshes, monitoring data pipelines...), it aims to provide simple-to-use methods to get real-life data for your object-centric process mining adventures.
+
+## Data export
+
+The Stack't relational schema is intended as an intermediate storage hub. PyStack't provides export functionality to export the data to specific OCED formats that can be used by process mining applications and algorithms. This decoupled set-up has as main advantage that any future data source can be exported to all supported data formats, and any future OCED format can be combined with existing data extraction functionality.
+
+## Data exploration
+
+Dispersing process data across multiple tables makes exploring object-centric event data less straightforward compared to traditional process mining. PyStack't aims to bridge this gap by providing dedicated data exploration functionality. Notably, it includes an interactive data exploration app that runs locally and works out-of-the-box with any OCED data structured in the Stack't relational schema.
diff --git a/docs/content/howto/view_duckdb_files.md b/docs/content/howto/view_duckdb_files.md
new file mode 100644
index 0000000..fd2730e
--- /dev/null
+++ b/docs/content/howto/view_duckdb_files.md
@@ -0,0 +1,23 @@
+# How to view DuckDB files?
+
+When working with PyStack’t, event data is stored in a DuckDB file (e.g. `stackt.duckdb`) using the Stack’t relational schema.
+If you want to inspect the contents of such a file — for example to check tables, run queries, or verify data extraction — you have a few options.
+
+## Option 1: Use the DuckDB UI extension
+
+From DuckDB version 1.2.1 onwards, a [**UI extension**](https://duckdb.org/docs/stable/extensions/ui.html) is included. You can launch the extension by running the Python code below.
+
+```python
+import duckdb
+
+with duckdb.connect("./stackt.duckdb") as quack:
+ quack.sql("CALL start_ui()")
+ input("Press Enter to close the connection...")
+```
+
+This opens a web UI at http://localhost:4213 where you can browse tables and run SQL queries interactively.
+Press Enter in your terminal to stop the session when you’re done.
+
+## Option 2: Use a database manager
+
+Follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install [DBeaver](https://dbeaver.io/about/) and use it to connect to your DuckDB file.
diff --git a/docs/exploration/app_screenshot.png b/docs/content/reference/exploration/app_screenshot.png
similarity index 100%
rename from docs/exploration/app_screenshot.png
rename to docs/content/reference/exploration/app_screenshot.png
diff --git a/docs/exploration/create_statistics_views.md b/docs/content/reference/exploration/create_statistics_views.md
similarity index 78%
rename from docs/exploration/create_statistics_views.md
rename to docs/content/reference/exploration/create_statistics_views.md
index c7cd89d..f0360aa 100644
--- a/docs/exploration/create_statistics_views.md
+++ b/docs/content/reference/exploration/create_statistics_views.md
@@ -41,14 +41,5 @@ Two SQL views are created in the schema defined by `schema_out`. They contain so
If `schema_out` already exists, it will be cleared before the new views are created.
### 🔍 Viewing Data
-PyStack't creates **DuckDB database files**. From DuckDB version 1.2.1 onwards, you can explore them using the [**UI extension**](https://duckdb.org/docs/stable/extensions/ui.html). Below code will load the UI by navigating to `http://localhost:4213` in your default browser.
-```python
-import duckdb
-
-with duckdb.connect("./stackt.duckdb") as quack:
- quack.sql("CALL start_ui()")
- input("Press Enter to close the connection...")
-```
-
-Alternatively, you can use a database manager. You can follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install **DBeaver** for easy access.
+See [How to view DuckDB files](../../howto/view_duckdb_files.md) for ways to inspect the resulting views.
diff --git a/docs/exploration/interactive_data_visualization_app.md b/docs/content/reference/exploration/interactive_data_visualization_app.md
similarity index 100%
rename from docs/exploration/interactive_data_visualization_app.md
rename to docs/content/reference/exploration/interactive_data_visualization_app.md
diff --git a/docs/export/export_to_ocel2.md b/docs/content/reference/export/export_to_ocel2.md
similarity index 100%
rename from docs/export/export_to_ocel2.md
rename to docs/content/reference/export/export_to_ocel2.md
diff --git a/docs/export/export_to_promg.md b/docs/content/reference/export/export_to_promg.md
similarity index 96%
rename from docs/export/export_to_promg.md
rename to docs/content/reference/export/export_to_promg.md
index b7676af..0ca4efc 100644
--- a/docs/export/export_to_promg.md
+++ b/docs/content/reference/export/export_to_promg.md
@@ -38,4 +38,4 @@ Afterwards all tables in `schema_out` will be copied as CSV files to a new folde
### ℹ️ About PromG
- PromG is an open-source tool that enables process analytics through Event Knowledge Graphs (EKG).
-- Documentation, including tutorials for getting started, can be found here: https://promg-dev.github.io/promg-core/.
+- Documentation, including tutorials for getting started, can be found in the [PromG documentation](https://promg-dev.github.io/promg-core/).
diff --git a/docs/content/reference/extract/get_github_log.md b/docs/content/reference/extract/get_github_log.md
new file mode 100644
index 0000000..ecc4b95
--- /dev/null
+++ b/docs/content/reference/extract/get_github_log.md
@@ -0,0 +1,47 @@
+# ⛏️🐙 `get_github_log`: Extracting object-centric event logs from GitHub
+
+## 📝 Example
+```python
+from pystackt import *
+
+get_github_log(
+ GITHUB_ACCESS_TOKEN="insert_your_github_access_token_here",
+ repo_owner="LienBosmans",
+ repo_name="stack-t",
+ max_issues=None,
+ save_after_num_issues=1000,
+ quack_db="./stackt.duckdb",
+ schema="main"
+)
+```
+
+| Parameter | Type | Description |
+|------------------------|----------|---------------|
+| `GITHUB_ACCESS_TOKEN`  | `str`    | GitHub personal access token used for API authentication. |
+| `repo_owner` | `str` | Owner of the GitHub repository from which to extract issue activity data. |
+| `repo_name` | `str` | Name of the repository from which to extract issue activity data. |
+| `max_issues` | `int` | Limits the number of issues to extract. If `None`, data for all closed issues will be collected. |
+| `save_after_num_issues`| `int` | Enables intermediate saving after processing this many issues. Prevents data loss from interruptions. Defaults to 5000. |
+| `quack_db` | `str` | Path to the DuckDB file where data will be stored using the Stack’t relational schema. A new file is created if it doesn't exist yet. |
+| `schema` | `str` | Name of the schema in the DuckDB database file where data will be written. If schema already exists, it will be cleared first. |
+
+
+### Generating a GitHub Access Token (`GITHUB_ACCESS_TOKEN`)
+To generate a GitHub access token, go to [GitHub Developer Settings](https://github.com/settings/tokens), click **"Generate new token (classic)"**, and proceed without selecting any scopes (leave all checkboxes unchecked). Copy the token and store it securely, as it won’t be shown again.
+
+### GitHub Repository (`repo_owner`/`repo_name`)
+These values are used to identify the GitHub repository from which the activity data will be extracted. You don't need to own or fork the repository. Permission to view the repository is sufficient, which is always the case for public repositories. If you want to extract data from a private repository, you'll need to create a different access token with a scope that includes read-only access to that repo.
+
+### Maximum number of issues to return (`max_issues`)
+Because of API rate limits, fetching all activity data from a repo can take a long time. This parameter controls the number of issues for which data will be collected. Setting it to a positive integer `n` will return data from the `n` most recent issues with status "closed". Setting it to `None` will return the activity data for all closed issues.
+
+### Intermediate save functionality (`save_after_num_issues`)
+To mitigate the risk of forced system restarts and GitHub API outages, intermediate save functionality was added. Setting it to a reasonable number such as 500 or 1000 does not noticeably impact total extraction time and avoids losing hours of progress when something goes wrong. The default value is 5000.
+
+### Output database file (`quack_db`) and schema (`schema`)
+The extracted data will be stored in a DuckDB database using the Stack't relational schema. The tables will be stored in the given `schema` (default `main`). If the file does not exist yet, it will be created. Existing tables in `schema` will be overwritten.
+
+To explore the data, you can use the DuckDB UI extension or a database manager; see [How to view DuckDB files](../../howto/view_duckdb_files.md).
+
+## 📜 Data Usage Policies
+Please ensure that you extract and use the data in **compliance with GitHub policies**, including [Information Usage Restrictions](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions) and [API Terms](https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#h-api-terms).
diff --git a/docs/extract/get_github_log.md b/docs/content/reference/extract/github_OCED.md
similarity index 54%
rename from docs/extract/get_github_log.md
rename to docs/content/reference/extract/github_OCED.md
index 163c0cf..f2100b2 100644
--- a/docs/extract/get_github_log.md
+++ b/docs/content/reference/extract/github_OCED.md
@@ -1,51 +1,6 @@
-# ⛏️🐙 Extracting object-centric event logs from GitHub (`get_github_log`)
+# 🗺️ Overview of `get_github_log` output
-## 📝 Example
-```python
-from pystackt import *
-
-get_github_log(
- GITHUB_ACCESS_TOKEN="insert_your_github_access_token_here",
- repo_owner="LienBosmans",
- repo_name="stack-t",
- max_issues=None,
- save_after_num_issues=1000,
- quack_db="./stackt.duckdb",
- schema="main"
-)
-```
-
-| Parameter | Type | Description |
-|------------------------|----------|---------------|
-| `GITHUB_ACCESS_TOKEN` | `str` | Personal GitHub personal access token for API authentication. |
-| `repo_owner` | `str` | Owner of the GitHub repository from which to extract issue activity data. |
-| `repo_name` | `str` | Name of the repository from which to extract issue activity data. |
-| `max_issues` | `int` | Limits the number of issues to extract. If `None`, data for all closed issues will be collected. |
-| `save_after_num_issues`| `int` | Enables intermediate saving after processing this many issues. Prevents data loss from interruptions. Defaults to 5000. |
-| `quack_db` | `str` | Path to the DuckDB file where data will be stored using the Stack’t relational schema. A new file is created if it doesn't exist yet. |
-| `schema` | `str` | Name of the schema in the DuckDB database file where data will be written. If schema already exists, it will be cleared first. |
-
-
-### Generating a GitHub Access Token (`GITHUB_ACCESS_TOKEN`)
-To generate a GitHub access token, go to [GitHub Developer Settings](https://github.com/settings/tokens), click **"Generate new token (classic)"**, and proceed without selecting any scopes (leave all checkboxes unchecked). Copy the token and store it securely, as it won’t be shown again.
-
-### GithHub Repository (`repo_owner`/`repo_name`)
-These values are used to identify the GitHub repository from which the activity data will be extracted. You don't need to own or fork the repository. Permission to view the repository is sufficient, which is always the case for public repositories. If you want to extract data from a private repository, you'll need to create a different access token with a scope that includes read-only access to that repo.
-
-### Maximum number of issues to return (`max_issues`)
-Because of API rate limits, fetching all activity data from a repo can take a long time. This parameter controls the number of issues for which data will be collected. Setting it to a positive integer `n` will return data from the `n` most recent issues with status "closed". Setting it to `None` will return the activity data for all closed issues.
-
-### Intermediate save functionality (`save_after_num_issues`)
-To mitigate the risk of forced system restarts and GitHub API outages, intermediate save functionality was added. Setting it to a reasonable number such as 500 or 1000 does not impact total extraction times and greatly decreases potential disappointment. The default value is 5000.
-
-### Output database file (`quack_db`) and schema (`schema`)
-The extracted data will be stored in a DuckDB database using the Stack't relational schema. The tables will be stored in the given `schema`; the default schema is main. If the file does not exist yet, a new file will be created. Existing tables in `schema` will be overwritten.
-
-To explore the data, you'll need a database manager. You can follow this [DuckDB guide](https://duckdb.org/docs/guides/sql_editors/dbeaver.html) to download and install **DBeaver** for easy access.
-
-## 🗺️ Extracted Data Overview
-
-### Object Types
+## Object Types
| Object Type | Description |
|---------------|-------------------------------------------|
@@ -56,9 +11,7 @@ To explore the data, you'll need a database manager. You can follow this [DuckDB
> **Note:** Commits are often not linked to a GitHub user account. In such cases, the committer's name is used instead of an unique user ID, which may result in multiple user objects for the same person.
-### Object Attributes
-
-### Object Attributes
+## Object Attributes
| Object | Attribute | Description |
|---------|---------------------|-------------------------------------------------------|
@@ -77,24 +30,22 @@ To explore the data, you'll need a database manager. You can follow this [DuckDB
| Commit | **commit_message** | Commit message describing the changes. |
| Commit | **url** | URL to view the commit on GitHub. |
-
-
> **Note:** "Committers" do not have user attributes.
-### Event Types
+## Event Types
| Event Type | Description |
|---------------|-------------------|
| **All GitHub timeline events** (except `line-commented`) | [See GithHub documentation for more details.](https://docs.github.com/en/rest/using-the-rest-api/issue-event-types) |
| **Created** | For new issues. |
-### Event Attributes
+## Event Attributes
| Event Type | Attribute | Description |
|-------------------|---------------------------|---------------------------------------|
| Timeline event | **author_association** | Only when available in API response. |
-### Event-to-Object Relations
+## Event-to-Object Relations
| Event | Related Object | Relation | Description |
|-----------------------------------------------|-------------------|---------------------------|-------------------------------------------------------|
@@ -108,7 +59,7 @@ To explore the data, you'll need a database manager. You can follow this [DuckDB
> **Note:** actor is not available for the event type `committed`.
-### Object-to-Object Relations
+## Object-to-Object Relations
| Object | Related Object | Relation | Description |
|-----------|-------------------|---------------------------|-------------------------------------------------------|
@@ -118,6 +69,3 @@ To explore the data, you'll need a database manager. You can follow this [DuckDB
For implementation details, you can check the documentation of the code itself in the [GitHub repository](https://github.com/LienBosmans/pystackt/tree/main/src/pystackt/extractors/github).
-
-## 📜 Data Usage Policies
-Please ensure that you extract and use the data in **compliance with GitHub policies**, including [Information Usage Restrictions](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions) and [API Terms](https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#h-api-terms).
diff --git a/docs/content/tutorials/tutorial_extracting_OCED.md b/docs/content/tutorials/tutorial_extracting_OCED.md
new file mode 100644
index 0000000..3e80e25
--- /dev/null
+++ b/docs/content/tutorials/tutorial_extracting_OCED.md
@@ -0,0 +1,105 @@
+# Tutorial: Extracting your first object-centric event log from a GitHub repository
+
+In this tutorial, you’ll extract an **object-centric event log** from a real GitHub repository. You don’t need to know much about process mining or GitHub yet — we’ll walk through it step by step.
+
+By the end, you will:
+- Install and use **PyStack’t**, a Python library for extracting object-centric event data.
+- Create a GitHub personal access token.
+- Extract event data from a repository.
+- Export the data into the OCEL 2.0 format.
+- Load the resulting log into Ocelot, a tool for analyzing object-centric logs.
+
+Let’s get started!
+
+# Prerequisites
+
+Before we begin, make sure you have the following:
+
+- **Python 3.12 or higher** installed on your computer.
+ You can check this by running `python --version` in your terminal.
+ If you don’t have it yet, download and install it from [python.org](https://www.python.org/downloads/).
+
+- A **GitHub account**.
+ If you don’t already have one, you can create it for free at [github.com](https://github.com/).
+ We’ll need it to generate an access token later.
+
+With these in place, you’re ready to proceed.
+
+# Install PyStack't
+
+First, we need to install the library we’ll be using: **PyStack’t**.
+Open a terminal (or command prompt) and run:
+
+```sh
+pip install pystackt
+```
+
+This will download and install the package from PyPI.
+If you see errors about pip not being found, try running `python -m pip install pystackt` instead.
+
+Once installed, you can test it worked by running:
+
+```sh
+python -c "import pystackt; print('PyStackt is installed!')"
+```
+
+If you see the confirmation message, you’re ready to move on.
+
+# Generate GitHub access token
+
+PyStack’t needs permission to read from GitHub repositories. For this, we use a personal access token.
+1. Log in to your [GitHub account](https://github.com/).
+1. Go to [GitHub Developer Settings](https://github.com/settings/tokens).
+1. Click "Generate new token (classic)".
+1. Don’t select any scopes (leave all checkboxes unchecked).
+1. Generate the token and copy it. GitHub will only show it once, so store it somewhere safe. We’ll use this token in the next step.
+ - If you lose your token, you can always generate a new one.
+ - If you accidentally share your token, for example by committing it to Git, it's good practice to delete it and generate a new one.
+
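One way to keep the token out of your code (and out of Git) is to read it from an environment variable. A minimal sketch; the variable name `GITHUB_ACCESS_TOKEN` is just a convention here, not something PyStack't requires:

```python
import os

# Set the variable in your terminal before running your script, e.g.
#   export GITHUB_ACCESS_TOKEN=ghp_...   (Linux/macOS)
#   set GITHUB_ACCESS_TOKEN=ghp_...      (Windows)
GITHUB_ACCESS_TOKEN = os.environ.get("GITHUB_ACCESS_TOKEN", "")
```

You can then pass `GITHUB_ACCESS_TOKEN` to `get_github_log` in the next step instead of pasting the token into your script.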
+# Extract object-centric event data from GitHub repository
+
+Now let’s extract event data from a real repository.
+
+Open a Python file (for example `extract_log.py`) or a Jupyter Notebook, and paste in the following code:
+
+```python
+from pystackt import *
+
+get_github_log(
+ GITHUB_ACCESS_TOKEN="insert_your_github_access_token_here",
+ repo_owner="LienBosmans",
+ repo_name="stack-t",
+ max_issues=None, # None returns all issues, can also be set to an integer to extract a limited data set
+ quack_db="./stackt.duckdb",
+ schema="main"
+)
+```
+
+👉 Replace `insert_your_github_access_token_here` with the token you generated earlier.
+
+When you run this code, PyStack’t will connect to GitHub, fetch issue and pull request data from the [stack-t repository](https://github.com/LienBosmans/stack-t), and store it locally in a DuckDB database file called `stackt.duckdb`.
+
+# Export to OCEL 2.0
+
+Now that the raw data is stored, let’s export it into the OCEL 2.0 format. This is a common format for object-centric event logs.
+
+Add the following code to your script or notebook:
+
+```python
+export_to_ocel2(
+ quack_db="./stackt.duckdb",
+ schema_in="main",
+ schema_out="ocel2",
+ sqlite_db="./ocel2_stackt.sqlite"
+)
+```
+
+This will create a new SQLite database file called `ocel2_stackt.sqlite` that contains the event log in OCEL 2.0 format.
+You now have a portable log that can be used in other analysis tools.
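Since an OCEL 2.0 log is a plain SQLite database, you can confirm the export worked with Python's built-in `sqlite3` module. A minimal sketch, assuming the `ocel2_stackt.sqlite` path from the step above:

```python
import sqlite3

# List the tables inside the exported OCEL 2.0 file
with sqlite3.connect("./ocel2_stackt.sqlite") as con:
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in tables:
        print(name)
```

If the table list is non-empty, the export succeeded and the file is ready for Ocelot.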
+
+# Load OCEL 2.0 log in Ocelot
+
+Finally, let’s open the log in the analysis tool Ocelot.
+1. Go to [https://ocelot.pm/](https://ocelot.pm/)
+1. Drag and drop `ocel2_stackt.sqlite` in the Event Log Import window.
+1. You should now see the object-centric event log extracted from the GitHub repository! 🎉