OCFLSuite

OCFLSuite is a research-oriented Python package and collection of scripts for designing, generating, and running experiments in Clustered Federated Learning (CFL). The repository contains dataset generation code, simulation drivers, utilities for aggregation and clustering, model templates, and experiment explanation tools used in related research.

This README documents how to install, run, and extend the project, and describes the main repository layout.

Quick facts

  • Language: Python
  • Supported Python: as declared in pyproject.toml (currently >=3.11, <4.0); use a Python interpreter matching the project metadata, or update pyproject.toml if you need a different minor version
  • Main components: dataset generation, simulation drivers, clustering/aggregation utilities, explanation generation

Checklist (what this README covers)

  • Installation and environment setup (Python version, virtual env)
  • Generating datasets (location and format notes)
  • Running simulations (example commands and where to find scripts)
  • Explanation / postprocessing framework
  • Project layout (what each folder contains)
  • Contributing, testing and license

Installation

Prerequisites

  • Python matching the pyproject.toml constraint (>=3.11, <4.0); note that the project was originally tested with Python 3.10.x — see the compatibility note under the Poetry instructions below
  • Git

Recommended: create and use a dedicated virtual environment (venv, conda, or pyenv) or use Poetry to manage the project environment and dependencies.

Using a plain venv (optional)

# create and activate a venv (PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# confirm Python version
python --version

Poetry-based installation (recommended)

This project includes a pyproject.toml, so the recommended way to manage the environment is via Poetry. Poetry will create an isolated virtual environment and install dependencies declared in pyproject.toml.

  1. Install Poetry (follow https://python-poetry.org/docs/ for the latest instructions). Example install command (PowerShell):

# install Poetry (using pipx)
pipx install poetry

  2. Ensure Poetry uses a compatible Python interpreter (matching pyproject.toml). You can point Poetry at a specific local Python executable, or at a supported minor version if one is installed:

# point Poetry to an existing Python interpreter (example)
poetry env use C:\Python311\python.exe
# or use a version specifier if installed (example)
poetry env use 3.11

  3. Install dependencies and create the virtual environment:

poetry install

  4. Run commands inside the Poetry environment:

poetry run python -c "import src; print('src import OK')"
poetry run python experiments\clustering_simulation.py

Note: If you deliberately want to use Python 3.10 but pyproject.toml requires >=3.11, either update pyproject.toml to accept 3.10, or install and use a Python interpreter matching the current pyproject.toml constraint.


Dataset generation

Datasets generated by this project are stored under experiments/datasets/ and follow a consistent layout per dataset (MNIST, FMNIST, CIFAR10, PATHMNIST, BLOODMNIST, ...). Each dataset has subfolders for split type (nonoverlaping / overlaping), balancing (balanced / imbalanced) and client counts (15 / 30).

Generation scripts are located inside the corresponding dataset folders. Example path for a split generator:

experiments/datasets/<DATASET>/<split_type>/<balance>/<num_clients>/data_generation_split.py
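For scripting, the layout above can be assembled with pathlib. This is a small illustrative helper, not part of the package; the dataset, split, and balance names passed in must match the actual folder names in your checkout:

```python
from pathlib import Path

def split_script_path(dataset: str, split_type: str, balance: str,
                      num_clients: int,
                      root: str = "experiments/datasets") -> Path:
    """Build the path to a dataset's data_generation_split.py script."""
    return (Path(root) / dataset / split_type / balance
            / str(num_clients) / "data_generation_split.py")

# Example: the FMNIST generator used later in this README
p = split_script_path("FMNIST", "nonoverlaping", "balanced", 15)
print(p.as_posix())
# experiments/datasets/FMNIST/nonoverlaping/balanced/15/data_generation_split.py
```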

When you run a data generation script it creates:

  • A full dataset in HuggingFace dataset format
  • A cached (Apache Arrow) format for faster loading during simulations
  • A blueprint CSV file (used to inspect client partitions)
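Blueprint CSVs can be inspected with the standard library. The column names used below (client_id, label, count) are assumptions for illustration only; open a generated *_dataset_blueprint.csv and check its header row for the real schema:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical blueprint contents; the real columns may differ.
sample = StringIO(
    "client_id,label,count\n"
    "0,3,120\n"
    "0,7,80\n"
    "1,1,200\n"
)

rows = list(csv.DictReader(sample))
per_client = Counter()
for row in rows:
    per_client[row["client_id"]] += int(row["count"])

# Total samples assigned to each client partition
print(dict(per_client))  # {'0': 200, '1': 200}
```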

Important notes

  • Simulations use the cached (Arrow) format for performance. If you move a cached dataset between machines (or generate on a different OS), the cached format may not be portable — in such cases transfer the full HuggingFace dataset instead.
  • Blueprints are CSVs that list the partition information and are useful for quick inspection and reproducibility.

Example: generate a dataset (PowerShell)

cd experiments/datasets/FMNIST/nonoverlaping/balanced/15
python data_generation_split.py

Running simulations

Simulation drivers are located under experiments/ (for example: clustering, temperature, centralised tests). Each dataset folder also contains example scripts for centralised experiments.

Common simulation scripts (examples)

  • experiments/clustering_simulation.py — runs the federated clustering experiments used in the paper
  • experiments/temperature/simulation_script.py — temperature-based simulations (see folder for specifics)
  • experiments/centralised_tests/<DATASET>/simulation_script.py — centralised baselines and tests

Typical usage (PowerShell)

cd experiments
python clustering_simulation.py

Configuration

  • Most simulation scripts contain an if __name__ == '__main__': block listing datasets, number of clients, and other hyper-parameters. Edit those lists to control which experiments are run, or import the main functions and call them programmatically.
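The configuration pattern described above can be sketched as follows. The names DATASETS, CLIENT_COUNTS, and run_experiment are placeholders for illustration; the real scripts define their own lists and entry functions:

```python
# Schematic of a driver's __main__ block; edit the lists to control
# which experiments are run, as described above.
DATASETS = ["MNIST", "FMNIST", "CIFAR10"]
CLIENT_COUNTS = [15, 30]

def run_experiment(dataset: str, num_clients: int) -> str:
    # In a real script this would launch a simulation; here we just
    # return a label so the loop's behaviour is visible.
    return f"{dataset}-{num_clients}"

if __name__ == "__main__":
    runs = [run_experiment(d, n) for d in DATASETS for n in CLIENT_COUNTS]
    print(runs)
```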

Logs and outputs

  • Numerical results and intermediate outputs are saved under experiments/results/ (subfolders for clustering, temperature, etc.).

Explanation generation (model explanations)

The explanation framework is located at experiments/model_explanations/scripted_experiments_framework.py and helper utilities in experiments/model_explanations/.

Before running the explanation scripts set the following global mounts inside scripted_experiments_framework.py:

  • DATASET_MOUNT — root directory containing datasets (default experiments/datasets)
  • MODEL_MOUNT — root directory containing stored models used for explanations
  • RESULTS_MOUNT — directory with numerical results (e.g., client-cluster attribution)
  • OUTPUT_MOUNT — where the generated explanations and plots will be written

This decoupling allows you to store large datasets/models on another drive or network mount and still run the framework locally.
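Before launching a long run, it can help to verify the mounts exist. The mount names below match the globals listed above, but the default paths for MODEL_MOUNT and OUTPUT_MOUNT are assumptions, and the validation helper is a sketch, not part of the framework:

```python
from pathlib import Path

# Mount locations as described above; adjust to your machine.
MOUNTS = {
    "DATASET_MOUNT": Path("experiments/datasets"),
    "MODEL_MOUNT": Path("experiments/models"),        # assumed default
    "RESULTS_MOUNT": Path("experiments/results"),
    "OUTPUT_MOUNT": Path("experiments/explanations"),  # assumed default
}

def missing_mounts(mounts: dict[str, Path]) -> list[str]:
    """Return the names of mounts that do not exist on disk."""
    return [name for name, path in mounts.items() if not path.exists()]

print(missing_mounts(MOUNTS))
```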

Run the framework (example)

cd experiments/model_explanations
python scripted_experiments_framework.py

Project layout (high level)

  • experiments/ — dataset generation scripts, simulation drivers, and result folders
    • datasets/ — scripts and generated datasets (by dataset name and split)
    • model_explanations/ — explanation generation utilities and frameworks
    • centralised_tests/ — centralised experiment scripts for baselines
    • results/ — generated numeric and visual results
  • src/ — main library source code
    • aggregators/ — aggregation strategies and implementations
    • data_structures/ — data structures used by the simulations
    • files/ — IO utilities, handlers, and logging helpers
    • model/ — federated model wrapper
    • net_templates/ — model templates (MNIST, FMNIST, ResNet adjustments)
    • node/ — federated node implementation
    • operations/ — evaluation routines and orchestration helpers
    • simulation/ — core simulation loop and helpers
    • utils/ — misc utilities (computations, splitters, animation)
  • pyproject.toml — project metadata and dependency declarations
  • README.md — this file

File examples

  • Blueprint CSVs: experiments/datasets/<DATASET>/.../<N>/<DATASET>_<N>_dataset_blueprint.csv
  • Dataset pointers (pickled caches or Arrow pointers): *_dataset_pointers

Source code (src/) structure

The src/ package contains reusable components used by the simulation drivers and experiment scripts. Brief description of subpackages and key files:

  • src/aggregators/

    • aggregator.py — base aggregator interfaces and shared utilities
    • fedopt_aggregator.py — FedOpt-style aggregation implementations
    • distances.py — distance functions used by clustering or similarity measures
    • temperature.py — utilities and algorithms for temperature-based aggregation
  • src/data_structures/

    • cluster_sturcutre.py — data structures for storing cluster metadata and client-cluster attributions
  • src/files/

    • archive.py — archival helpers for moving or compressing outputs
    • handlers.py — file handling utilities for datasets, pointers, and blueprints
    • loggers.py — logging wrappers used across experiments
  • src/model/

    • federated_model.py — model wrapper used by nodes and orchestrators (training / evaluation helpers)
  • src/net_templates/

    • mnist_model.py, fmnist_model.py — lightweight model templates for MNIST/FMNIST experiments
    • resnet_adjusted.py — ResNet modifications for CIFAR / larger experiments
  • src/node/

    • federated_node.py — node-level logic: local training, gradient computation, and communication hooks
  • src/operations/

    • evaluations.py — evaluation metrics and reporting utilities
    • orchestrations.py — high-level orchestration helpers used by simulation drivers
  • src/simulation/

    • simulation.py — core simulation loop, experiment harness and utilities used by drivers
  • src/utils/

    • computations.py — numeric helpers and small math utilities
    • splitters.py — data splitting utilities used for creating client partitions
    • select_gradients.py — helpers for selecting/filtering gradients for privacy or compression experiments
    • animation.py — small utilities for visualising results or producing animations

This structure keeps experiment drivers lean and reusable. For most research adaptations, modify either a simulation script under experiments/ or extend/replace components in src/.


How code is organized for experiments

There are two main modes to modify experiments:

  1. Direct substitution (recommended for quick experiments)

    • Edit the simulation script under experiments/ (e.g., clustering_simulation.py) to change datasets, client counts, models, or hyperparameters.
  2. Source modification (advanced)

    • Modify the library code under src/ to implement new aggregation methods, clustering algorithms, or model templates. This is for advanced customization and reusability.
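As a schematic of the second mode, a custom aggregation strategy might look like the FedAvg-style sketch below. This is a self-contained illustration using plain lists of floats; the actual base-class API in src/aggregators/aggregator.py is not reproduced here and will differ:

```python
# Self-contained sketch of a custom aggregation strategy: a weighted
# average of client parameter vectors (FedAvg-style).
def fedavg(updates: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of client parameter vectors."""
    total = sum(weights)
    dim = len(updates[0])
    return [
        sum(w * u[i] for u, w in zip(updates, weights)) / total
        for i in range(dim)
    ]

# Two clients with 100 and 300 local samples respectively
print(fedavg([[1.0, 2.0], [3.0, 4.0]], weights=[100, 300]))  # [2.5, 3.5]
```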

Contributing

Contributions are welcome. Please follow these guidelines:

  1. Fork the repository and create a feature branch.
  2. Keep changes focused and add small tests where appropriate.
  3. Open a pull request with a clear description and the motivation for the change.

If you are planning to add large datasets or pre-trained models, consider adding them via an external mount and documenting the mount locations in scripted_experiments_framework.py rather than committing large binary files to git.


License

This project includes a LICENSE file in the repository root. Refer to it for license terms.


Contact / Support

If you have questions about running experiments, dataset generation, or using the explanation framework, open an issue in the repository with reproduction steps and relevant logs.


A note on reproducibility

To reproduce experiments from the paper:

  1. Use Python 3.10 (the version the experiments were originally developed with; note the pyproject.toml constraint discussed under Installation) and install the dependencies declared in pyproject.toml.
  2. Generate datasets using the provided generation scripts under experiments/datasets/.
  3. Run the simulation drivers (for example clustering_simulation.py) with the lists of datasets and client cardinalities configured in the script.
  4. Use the explanation framework and point its mounts at dataset/model/result folders (see experiments/model_explanations/scripted_experiments_framework.py).

Good luck, and feel free to request any targeted examples or helper scripts to speed up reproducible runs.
