Skip to content

ICICLE-ai/ArrayMorph

Repository files navigation

ArrayMorph

Build Status License: MIT

ArrayMorph is a software to manage array data stored on cloud object storage efficiently. It supports both HDF5 C++ API and h5py API. The data returned by h5py API is numpy arrays. By using h5py API, users can access array data stored on the cloud and feed the read data into machine learning pipelines seamlessly.

Tag: CI4AI


How-To Guides

Install dependencies

It is recommended to use Conda (and conda-forge) for managing dependencies.

  1. Install Miniconda
  2. Install conda-build for installing local conda packages
  3. Create and activate environment with dependencies:
    conda create -n arraymorph conda-forge::gxx=9
    conda activate arraymorph
    conda install -n arraymorph cmake conda-forge::hdf5=1.14.2 conda-forge::aws-sdk-cpp conda-forge::azure-storage-blobs-cpp conda-forge::h5py

Install ArrayMorph via ArrayMorph local conda package

git clone https://github.com/ICICLE-ai/arraymorph.git
cd arraymorph/arraymorph_channel
conda index .
conda install -n arraymorph arraymorph -c file://$(pwd) -c conda-forge

Install ArryMorph from source code

Build ArrayMorph

git clone https://github.com/ICICLE-ai/arraymorph.git
cd arraymorph/arraymorph
cmake -B ./build -S . -DCMAKE_PREFIX_PATH=$CONDA_PREFIX
cd build
make

Enable VOL plugin:

export HDF5_PLUGIN_PATH=/path/to/arraymorph/arraymorph/build/src
export HDF5_VOL_CONNECTOR=arraymorph

Configure Environment for Cloud Access

AWS Configuration:

export STORAGE_PLATFORM=S3
export BUCKET_NAME=XXXXXX
export AWS_ACCESS_KEY_ID=XXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXX
export AWS_REGION=us-east-2  # or your bucket's region

Azure Configuration:

export STORAGE_PLATFORM=Azure
export BUCKET_NAME=XXXXXX
export AZURE_STORAGE_CONNECTION_STRING=XXXXXX

Tutorials

Run a simple example: Writing and Reading HDF5 files from Cloud

Prerequisites:

  • AWS or Azure cloud account with credentials
  • S3 bucket or Azure container
  • ArrayMorph dependencies installed

Steps:

  1. Activate conda environment

    conda activate arraymorph
  2. Write sample HDF5 data to the cloud

    cd examples/python
    python3 write.py
  3. Read data back from cloud HDF5 file

    cd examples/python
    python3 read.py

Explanation

How ArrayMorph Works

ArrayMorph plugs into the HDF5 stack using a VOL (Virtual Object Layer) plugin that intercepts file operations and routes them to cloud object storage instead of local files. This allows existing HDF5 APIs (both C++ and h5py in Python) to operate on cloud-based data seamlessly, enabling transparent cloud access for scientific or ML pipelines.

It supports:

  • Cloud backends: AWS S3 and Azure Blob
  • File formats: Current binary data stream (we plan to extend to other formats like jpg in the future)
  • Languages: C++ and Python (via h5py compatibility)

The system is designed to be efficient in latency-sensitive scenarios and aims to integrate well with large-scale distributed training and inference.


References


Acknowledgements

This project is supported by:

National Science Foundation (NSF) funded AI institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE) (OAC 2112606)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors 3

  •  
  •  
  •