OctopusCL

Introduction

OctopusCL is a framework for building and experimenting with multimodal models in continual learning scenarios. It is composed of the following components:

  • Dataset Manager: A tool for managing datasets in a centralized repository called the Dataset Repository.
  • Experiment Manager: A tool for running and tracking experiments in a centralized registry called the Experiment Registry.
  • Model Manager: Not available yet.

Installation

You can install OctopusCL with pip:

pip install octopuscl-lib

Usage

This section provides the instructions and examples needed to use OctopusCL effectively, covering dataset building and management as well as experiment building, execution, and tracking.

Environments

OctopusCL can run in three different environments:

  • Development: The local environment for developing, debugging, and testing new code.
  • Staging: A mirror of the production environment, used to test new code before deploying it to production and to provide a final check that everything behaves as expected in a production-like setting.
  • Production: The live environment where the final code is deployed.

Dataset Manager

The Dataset Manager handles uploading datasets to and downloading them from the Dataset Repository. It is accessible through the octopuscl/scripts/run_dataset_manager.py script (see the Managing datasets section below for details on its arguments).

Concepts

  • Dataset: A collection of data used to train and evaluate AI models. It is composed of the following parts:
    • Schema: Ensures that the structure and format of the data are consistent. Beyond basic information such as the dataset name and description, the schema specifies the inputs, outputs, and metadata fields that AI models will use to train and make predictions (see the sketch after this list).
    • Examples: The actual data used to train and evaluate AI models. Examples must adhere to the dataset schema.
    • Files: Optional files referenced by the examples. They can be images, audio files, documents, or any other type of file that AI models need to process.
    • Splits: Optional pre-defined splits that determine how examples are distributed across experiences and partitions (training, test, validation).
  • Dataset Repository: A centralized storage solution, hosted on Amazon S3, where all datasets are stored. It serves as the backbone for dataset management, providing a scalable and secure location for storing and retrieving datasets.
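
For intuition, here is a minimal, hypothetical schema.json. Every field name below is an assumption derived from the description above (name, description, inputs, outputs, metadata); the authoritative format is documented in docs/datasets.md.

    # Hypothetical sketch only; see docs/datasets.md for the real schema format.
    cat > my_dataset/schema.json <<'EOF'
    {
      "name": "my_dataset",
      "description": "An illustrative multimodal dataset",
      "inputs": [{"name": "image", "type": "image_file"}],
      "outputs": [{"name": "label", "type": "category"}],
      "metadata": [{"name": "source", "type": "text"}]
    }
    EOF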

Building datasets

A dataset must be stored in a dedicated directory that contains the following files and directories:

  • schema.json: The JSON file with the dataset schema.
  • examples.csv or examples.db: The CSV file or SQLite database containing the examples of the dataset.
  • files (optional): The directory that contains all the files referenced by the examples.
  • splits (optional): A directory containing pre-defined splits that determine how examples are distributed across experiences and partitions (training, test, validation).

See docs/datasets.md for detailed instructions on how to build datasets.
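
For example, a dataset directory could be laid out as follows (the dataset name and file names are hypothetical):

    my_dataset/
    ├── schema.json
    ├── examples.csv
    ├── files/
    │   ├── image_001.png
    │   └── report_001.pdf
    └── splits/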

Managing datasets

To upload or download a dataset, run octopuscl/scripts/run_dataset_manager.py with the following arguments:

  • -e, --environment (required): The environment where the dataset will be uploaded or downloaded. It can be development, staging, or production.
  • -a, --action (required): Either upload or download.
  • -l, --local_path (required): Path to the local file or directory.
  • -d, --dataset (optional): Name of the dataset. Required when downloading.
  • -r, --remote_path (optional): Path to the remote file or directory. Required when downloading. Ensure it ends with a / for directories.

Examples:

  • Uploading a dataset.

    cd octopuscl/scripts
    ./run_dataset_manager.py -e production -a upload -l /path/to/local/dataset/
  • Downloading a dataset.

    cd octopuscl/scripts
    ./run_dataset_manager.py -e production -a download -l /local/path/ -d dataset_name -r /remote/path/

Experiment Manager

The Experiment Manager allows for running and tracking experiments. It is accessible through the octopuscl/scripts/run_experiments.py script (see the Running experiments section below for details on its arguments).

Concepts

The Experiment Manager is built upon the following concepts:

  • Experiment: The process of evaluating a set of AI models, pipelines, or workflows under specific conditions. An experiment should define a clear objective shared by all the executions belonging to it.
    • Datasets: The set of datasets used in the experiment.
    • Splitter: The method used to split the datasets into training, validation, and test sets.
    • Metrics: The metrics used to evaluate the models.
    • Artifacts: The artifacts generated after training or evaluating a model (e.g., training curves, ROC curves, etc.).
    • Trials: The set of ML pipelines and workflows that will be run on each dataset. A trial is defined by the following parts:
      • Pipeline: The sequence of steps executed to train and evaluate the model.
        • Model: The AI model.
        • Transformations: The sequence of transformations applied to the data before training or evaluating the model.
      • Data loaders: The method used to load data from the datasets. Data loaders handle batching, shuffling, loading parallelization, etc.
    • Runs: The actual executions of trials. The number of runs in a trial depends on the splitting strategy chosen for the experiment (e.g., in a 5-fold cross-validation, there will be 5 runs for each trial).
  • Experiment plan: The set of experiments to be conducted.

Building experiments

An experiment is defined in a YAML file that must follow the structure described in docs/experiments.md. An experiment plan corresponds to a dedicated directory containing the YAML files that define its experiments.
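
As a purely illustrative sketch, an experiment definition might look like the following. Every key name here is an assumption based on the concepts listed above; the authoritative structure is described in docs/experiments.md.

    # Hypothetical experiment definition; real key names are given in docs/experiments.md.
    cat > experiment_plan/image_classification.yaml <<'EOF'
    name: image_classification_baseline
    datasets: [my_dataset]
    splitter: 5_fold_cross_validation
    metrics: [accuracy, f1]
    trials:
      - pipeline:
          model: baseline_cnn
          transformations: [resize, normalize]
        data_loaders:
          batch_size: 32
          shuffle: true
    EOF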

In addition to the definition and configuration of the experiments, many functionalities and components can be customized, including AI models, transformations, metrics, and artifacts, among others. See docs/customization.md for detailed instructions on how to implement custom classes.

Running experiments

To run experiments, execute octopuscl/scripts/run_experiments.py with the following arguments:

  • -e, --environment (required): The environment where the experiments will be run. It can be development, staging, or production.
  • -d, --directory (required): Path to the directory that contains the YAML files defining the experiments.
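
Example (the experiment plan directory path is illustrative):

    cd octopuscl/scripts
    ./run_experiments.py -e development -d /path/to/experiment/plan/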

In staging and production environments, trials can be run either locally or on AWS EC2.

Tracking experiments

The Experiment Manager delegates experiment tracking to MLflow, which provides a web-based UI called MLflow Tracking UI.
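
If you track runs against a local MLflow backend, you can open the Tracking UI with the standard MLflow CLI (the backend store path below is an assumption; point it at wherever your runs are logged):

    # Launch the MLflow Tracking UI (served on http://localhost:5000 by default)
    mlflow ui --backend-store-uri ./mlruns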

Requirements

General

Regardless of the environment in which you are running the trials, you will need to set specific environment variables (see octopuscl.env) for that environment.
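
Assuming octopuscl.env follows the usual KEY=value format of env files, one way to load its variables into your current shell is:

    # Export every variable defined in octopuscl.env (assumes plain KEY=value lines)
    set -a
    source octopuscl.env
    set +a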

Development Environment

If you need to download or upload datasets from or to the Dataset Repository, you must have:

  • AWS setup (AWS CLI, keys, users, roles, policies, S3 bucket)

Staging & Production Environments

In the staging and production environments, the following requirements must be met:

  • AWS setup (AWS CLI, keys, users, roles, policies, S3 bucket)
  • Prebuilt Docker image with OctopusCL installed, accessible via AWS ECR
  • Publicly accessible MLflow tracking server
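
For instance, authenticating Docker against ECR and pulling the prebuilt image typically looks like this (the account ID, region, repository, and tag are placeholders, not values defined by OctopusCL):

    # Log Docker in to your ECR registry (placeholders throughout)
    aws ecr get-login-password --region <region> | \
        docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

    # Pull the prebuilt OctopusCL image
    docker pull <account-id>.dkr.ecr.<region>.amazonaws.com/octopuscl:latest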

If you are running trials locally, you must also have:

Contributions

See contributing guidelines.

Maintainers

OctopusCL is maintained by the following individuals (in alphabetical order):
