# OCFLSuite

OCFLSuite is a research-oriented Python package and collection of scripts for designing, generating, and running experiments in Clustered Federated Learning (CFL). The repository contains dataset generation code, simulation drivers, utilities for aggregation and clustering, model templates, and experiment explanation tools used in related research.
This README documents how to install, run, and extend the project, and describes the main repository layout.
- Language: Python
- Supported Python: as declared in `pyproject.toml` (currently `>=3.11, <4.0`). Use a Python interpreter matching the project metadata, or update `pyproject.toml` if you require a different minor version.
- Main components: dataset generation, simulation drivers, clustering/aggregation utilities, explanation generation
- Installation and environment setup (Python version, virtual env)
- Generating datasets (location and format notes)
- Running simulations (example commands and where to find scripts)
- Explanation / postprocessing framework
- Project layout (what each folder contains)
- Contributing, testing and license
## Prerequisites
- Python: the project was tested with Python 3.10.x, but `pyproject.toml` currently declares `>=3.11, <4.0` (see the version note in the Poetry section below)
- Git
Recommended: create and use a dedicated virtual environment (venv, conda, or pyenv) or use Poetry to manage the project environment and dependencies.
### Using a plain venv (optional)

```powershell
# create and activate a venv (PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# confirm Python version
python --version
```

## Poetry-based installation (recommended)
This project includes a `pyproject.toml`, so the recommended way to manage the environment is via Poetry. Poetry will create an isolated virtual environment and install the dependencies declared in `pyproject.toml`.

- Install Poetry (follow https://python-poetry.org/docs/ for the latest instructions). Example install command (PowerShell):

```powershell
# install Poetry (using pipx)
pipx install poetry
```

- Ensure Poetry uses a compatible Python interpreter (matching `pyproject.toml`). You can point Poetry to a specific local Python executable, or to a supported minor version if available:

```powershell
# point poetry to an existing Python interpreter (example)
poetry env use C:\Python311\python.exe

# or use a version specifier if installed (example)
poetry env use 3.11
```

- Install dependencies and create the virtual environment:

```powershell
poetry install
```

- Run commands inside the Poetry environment:

```powershell
poetry run python -c "import src; print('src import OK')"
poetry run python experiments\clustering_simulation.py
```

Note: if you deliberately want to use Python 3.10 but `pyproject.toml` requires >=3.11, either update `pyproject.toml` to accept 3.10, or install and use a Python interpreter matching the current `pyproject.toml` constraint.
## Generating datasets

Datasets generated by this project are stored under `experiments/datasets/` and follow a consistent layout per dataset (MNIST, FMNIST, CIFAR10, PATHMNIST, BLOODMNIST, ...). Each dataset has subfolders for split type (`nonoverlaping` / `overlaping`), balancing (`balanced` / `imbalanced`), and client count (`15` / `30`).
Generation scripts are located inside the corresponding dataset folders. Example path for a split generator:
`experiments/datasets/<DATASET>/<split_type>/<balance>/<num_clients>/data_generation_split.py`
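The placeholders in the path above can be expanded programmatically. A minimal sketch (`split_script_path` is a helper name invented here; the folder names follow the layout described above):

```python
from pathlib import Path

def split_script_path(dataset: str, split_type: str, balance: str, num_clients: int) -> Path:
    """Build the path to a dataset's generation script, following the
    experiments/datasets/<DATASET>/<split_type>/<balance>/<num_clients>/ layout."""
    return (Path("experiments") / "datasets" / dataset / split_type
            / balance / str(num_clients) / "data_generation_split.py")

# Example: the FMNIST generator for a non-overlapping, balanced, 15-client split
print(split_script_path("FMNIST", "nonoverlaping", "balanced", 15).as_posix())
```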
When you run a data generation script it creates:
- A full dataset in HuggingFace dataset format
- A cached (Apache Arrow) format for faster loading during simulations
- A blueprint CSV file (used to inspect client partitions)
### Important notes
- Simulations use the cached (Arrow) format for performance. If you move a cached dataset between machines (or generate on a different OS), the cached format may not be portable — in such cases transfer the full HuggingFace dataset instead.
- Blueprints are CSVs that list the partition information and are useful for quick inspection and reproducibility.
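Since blueprints are plain CSVs, they can be inspected with the standard library alone. A minimal sketch, assuming an illustrative `client_id` column (the actual blueprint schema is defined by the generation scripts):

```python
import csv
import io
from collections import Counter

def summarise_blueprint(csv_file, client_column: str = "client_id") -> Counter:
    """Count how many samples each client holds according to a blueprint CSV.
    `client_column` is an assumed column name; adjust to the real schema."""
    reader = csv.DictReader(csv_file)
    return Counter(row[client_column] for row in reader)

# Tiny in-memory example standing in for a real blueprint file
demo = io.StringIO("client_id,sample_id,label\n0,17,3\n0,42,1\n1,7,9\n")
counts = summarise_blueprint(demo)
print(counts)  # Counter({'0': 2, '1': 1})
```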
### Example: generate a dataset (PowerShell)

```powershell
cd experiments/datasets/FMNIST/nonoverlaping/balanced/15
python data_generation_split.py
```

## Running simulations

Simulation drivers are located under `experiments/` (for example: clustering, temperature, centralised tests). Each dataset folder also contains example scripts for centralised experiments.
### Common simulation scripts (examples)

- `experiments/clustering_simulation.py`: runs the federated clustering experiments used in the paper
- `experiments/temperature/simulation_script.py`: temperature-based simulations (see folder for specifics)
- `experiments/centralised_tests/<DATASET>/simulation_script.py`: centralised baselines and tests
### Typical usage (PowerShell)

```powershell
cd experiments
python clustering_simulation.py
```

### Configuration

- Most simulation scripts contain an `if __name__ == '__main__':` block listing datasets, number of clients, and other hyper-parameters. Edit those lists to control which experiments are run, or import the main functions and call them programmatically.
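Calling a driver programmatically might look like the sketch below. `run_experiment` and its parameters are hypothetical names chosen for illustration, not the actual API of `clustering_simulation.py`; check each script's `if __name__ == '__main__':` block for the real entry points.

```python
# Hypothetical driver function standing in for whatever
# clustering_simulation.py actually exposes in its __main__ block.
def run_experiment(dataset: str, num_clients: int, rounds: int = 10) -> dict:
    """Illustrative stand-in: a real driver would load data, run the
    federated loop, and write outputs under experiments/results/."""
    return {"dataset": dataset, "clients": num_clients, "rounds": rounds}

# Sweep over the same kinds of lists the scripts keep in __main__
results = [run_experiment(ds, n)
           for ds in ("MNIST", "FMNIST")
           for n in (15, 30)]
print(len(results))  # 4 configurations
```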
### Logs and outputs

- Numerical results and intermediate outputs are saved under `experiments/results/` (subfolders for clustering, temperature, etc.).
## Explanation / postprocessing framework

The explanation framework is located at `experiments/model_explanations/scripted_experiments_framework.py`, with helper utilities in `experiments/model_explanations/`.
Before running the explanation scripts, set the following global mounts inside `scripted_experiments_framework.py`:

- `DATASET_MOUNT`: root directory containing datasets (default `experiments/datasets`)
- `MODEL_MOUNT`: root directory containing stored models used for explanations
- `RESULTS_MOUNT`: directory with numerical results (e.g., client-cluster attribution)
- `OUTPUT_MOUNT`: where the generated explanations and plots will be written
This decoupling allows you to store large datasets/models on another drive or network mount and still run the framework locally.
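Inside the script, the mounts are plain module-level paths. A minimal sketch (the variable names follow the list above; every example path here is a placeholder for your own layout):

```python
from pathlib import Path

# Defaults can keep everything inside the repository ...
DATASET_MOUNT = Path("experiments/datasets")
RESULTS_MOUNT = Path("experiments/results")

# ... while large artefacts live on another drive or network mount
MODEL_MOUNT = Path("D:/cfl_models")     # placeholder external location
OUTPUT_MOUNT = Path("D:/cfl_outputs")   # placeholder external location

# The framework can then derive per-dataset paths from the mounts, e.g.:
fmnist_root = DATASET_MOUNT / "FMNIST"
print(fmnist_root.as_posix())
```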
### Run the framework (example)

```powershell
cd experiments/model_explanations
python scripted_experiments_framework.py
```

## Project layout

- `experiments/`: dataset generation scripts, simulation drivers, and result folders
  - `datasets/`: scripts and generated datasets (by dataset name and split)
  - `model_explanations/`: explanation generation utilities and frameworks
  - `centralised_tests/`: centralised experiment scripts for baselines
  - `results/`: generated numeric and visual results
- `src/`: main library source code
  - `aggregators/`: aggregation strategies and implementations
  - `data_structures/`: data structures used by the simulations
  - `files/`: IO utilities, handlers, and logging helpers
  - `model/`: federated model wrapper
  - `net_templates/`: model templates (MNIST, FMNIST, ResNet adjustments)
  - `node/`: federated node implementation
  - `operations/`: evaluation routines and orchestration helpers
  - `simulation/`: core simulation loop and helpers
  - `utils/`: misc utilities (computations, splitters, animation)
- `pyproject.toml`: project metadata and dependency declarations
- `README.md`: this file
### File examples

- Blueprint CSVs: `experiments/datasets/<DATASET>/.../<N>/<DATASET>_<N>_dataset_blueprint.csv`
- Dataset pointers (pickled caches or arrows): `*_dataset_pointers`
The `src/` package contains reusable components used by the simulation drivers and experiment scripts. Brief descriptions of the subpackages and key files:
- `src/aggregators/`
  - `aggregator.py`: base aggregator interfaces and shared utilities
  - `fedopt_aggregator.py`: FedOpt-style aggregation implementations
  - `distances.py`: distance functions used by clustering or similarity measures
  - `temperature.py`: utilities and algorithms for temperature-based aggregation
- `src/data_structures/`
  - `cluster_sturcutre.py`: data structures for storing cluster metadata and client-cluster attributions
- `src/files/`
  - `archive.py`: archival helpers for moving or compressing outputs
  - `handlers.py`: file handling utilities for datasets, pointers, and blueprints
  - `loggers.py`: logging wrappers used across experiments
- `src/model/`
  - `federated_model.py`: model wrapper used by nodes and orchestrators (training / evaluation helpers)
- `src/net_templates/`
  - `mnist_model.py`, `fmnist_model.py`: lightweight model templates for MNIST/FMNIST experiments
  - `resnet_adjusted.py`: ResNet modifications for CIFAR / larger experiments
- `src/node/`
  - `federated_node.py`: node-level logic: local training, gradient computation, and communication hooks
- `src/operations/`
  - `evaluations.py`: evaluation metrics and reporting utilities
  - `orchestrations.py`: high-level orchestration helpers used by simulation drivers
- `src/simulation/`
  - `simulation.py`: core simulation loop, experiment harness, and utilities used by drivers
- `src/utils/`
  - `computations.py`: numeric helpers and small math utilities
  - `splitters.py`: data splitting utilities used for creating client partitions
  - `select_gradients.py`: helpers for selecting/filtering gradients for privacy or compression experiments
  - `animation.py`: small utilities for visualising results or producing animations
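To make the splitter role concrete, here is a generic sketch of a non-overlapping, balanced client partitioner. It is not the implementation in `src/utils/splitters.py`; it only illustrates the kind of partition the generation scripts produce.

```python
import random

def split_nonoverlapping_balanced(num_samples: int, num_clients: int, seed: int = 0):
    """Generic sketch of a non-overlapping, balanced client split:
    shuffle all sample indices, then deal them out round-robin so each
    index belongs to exactly one client and counts differ by at most 1."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[c::num_clients] for c in range(num_clients)]

parts = split_nonoverlapping_balanced(100, 15)
print([len(p) for p in parts])  # sizes differ by at most one (7s and 6s)
```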
This structure keeps experiment drivers lean and reusable. For most research adaptations, modify either a simulation script under `experiments/` or extend/replace components in `src/`.
There are two main ways to modify experiments:

1. Direct substitution (recommended for quick experiments): edit the simulation scripts under `experiments/` (e.g., `clustering_simulation.py`) to change datasets, client counts, models, or hyperparameters.
2. Source modification (advanced): modify the library code under `src/` to implement new aggregation methods, clustering algorithms, or model templates. This path favours reusability and is intended for advanced customization.
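Extending the aggregators is a typical source modification. The sketch below implements plain coordinate-wise averaging (FedAvg-style); the class and method names are illustrative and do not match the actual interfaces in `src/aggregators/aggregator.py`, which new aggregators should subclass instead.

```python
from typing import Dict, List

# Illustrative aggregator: the real base class lives in
# src/aggregators/aggregator.py and defines its own interface.
class MeanAggregator:
    """Coordinate-wise (FedAvg-style) averaging of client model weights,
    where each model is a dict mapping parameter name -> list of floats."""

    def aggregate(self, client_weights: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
        n = len(client_weights)
        return {
            key: [sum(w[key][i] for w in client_weights) / n
                  for i in range(len(client_weights[0][key]))]
            for key in client_weights[0]
        }

agg = MeanAggregator()
merged = agg.aggregate([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
print(merged)  # {'w': [2.0, 3.0]}
```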
## Contributing

Contributions are welcome. Please follow these guidelines:
- Fork the repository and create a feature branch.
- Keep changes focused and add small tests where appropriate.
- Open a pull request with a clear description and the motivation for the change.
If you are planning to add large datasets or pre-trained models, consider adding them via an external mount and documenting the mount locations in `scripted_experiments_framework.py` rather than committing large binary files to git.
## License

This project includes a LICENSE file in the repository root. Refer to it for license terms.
If you have questions about running experiments, dataset generation, or using the explanation framework, open an issue in the repository with reproduction steps and relevant logs.
## Reproducing paper experiments

To reproduce experiments from the paper:

- Use Python 3.10 and install the dependencies used during development.
- Generate datasets using the provided generation scripts under `experiments/datasets/`.
- Run the simulation drivers (for example `clustering_simulation.py`) with the lists of datasets and client cardinalities configured in the script.
- Point the explanation framework's mounts at the dataset/model/result folders (see `experiments/model_explanations/scripted_experiments_framework.py`).
Good luck, and feel free to request any targeted examples or helper scripts to speed up reproducible runs.