ArrayMorph is a tool for efficiently managing array data stored on cloud object storage. It supports both the HDF5 C++ API and the h5py API; data returned through the h5py API are NumPy arrays, so users can read array data from the cloud and feed it directly into machine learning pipelines.
Tag: CI4AI
It is recommended to use Conda (and conda-forge) for managing dependencies.
- Install Miniconda
- Install conda-build for installing local conda packages
- Create and activate an environment with the dependencies:

  ```
  conda create -n arraymorph conda-forge::gxx=9
  conda activate arraymorph
  conda install -n arraymorph cmake conda-forge::hdf5=1.14.2 conda-forge::aws-sdk-cpp conda-forge::azure-storage-blobs-cpp conda-forge::h5py
  ```
Install the ArrayMorph conda package from the repository's local channel:

```
git clone https://github.com/ICICLE-ai/arraymorph.git
cd arraymorph/arraymorph_channel
conda index .
conda install -n arraymorph arraymorph -c file://$(pwd) -c conda-forge
```

Build the ArrayMorph VOL plugin from source:

```
git clone https://github.com/ICICLE-ai/arraymorph.git
cd arraymorph/arraymorph
cmake -B ./build -S . -DCMAKE_PREFIX_PATH=$CONDA_PREFIX
cd build
make
```

Point HDF5 at the built plugin and select the ArrayMorph VOL connector:

```
export HDF5_PLUGIN_PATH=/path/to/arraymorph/arraymorph/build/src
export HDF5_VOL_CONNECTOR=arraymorph
```

For AWS S3, set:

```
export STORAGE_PLATFORM=S3
export BUCKET_NAME=XXXXXX
export AWS_ACCESS_KEY_ID=XXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXX
export AWS_REGION=us-east-2  # or your bucket's region
```

For Azure Blob Storage, set:

```
export STORAGE_PLATFORM=Azure
export BUCKET_NAME=XXXXXX
export AZURE_STORAGE_CONNECTION_STRING=XXXXXX
```

To run the example below you need:

- AWS or Azure cloud account with credentials
- S3 bucket or Azure container
- ArrayMorph dependencies installed
- Activate the conda environment:

  ```
  conda activate arraymorph
  ```

- Write sample HDF5 data to the cloud:

  ```
  cd examples/python
  python3 write.py
  ```

- Read the data back from the cloud HDF5 file (a sketch of the pattern both scripts follow appears after this list):

  ```
  cd examples/python
  python3 read.py
  ```
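For orientation, here is a minimal sketch of the write-then-read pattern the example scripts follow; the file name, dataset name, and array shape are illustrative placeholders, not the actual contents of write.py and read.py:

```python
import h5py
import numpy as np

# With HDF5_VOL_CONNECTOR=arraymorph and the storage variables set as
# above, h5py.File() is routed to the configured bucket/container rather
# than the local filesystem. File and dataset names are placeholders.

# Write: create a file and store a 2-D array as an HDF5 dataset.
with h5py.File("sample.h5", "w") as f:
    f.create_dataset("data", data=np.arange(100, dtype=np.float64).reshape(10, 10))

# Read: open the file and slice the dataset; the result is a NumPy array.
with h5py.File("sample.h5", "r") as f:
    block = f["data"][2:5, :]        # only the requested hyperslab is read
    print(block.shape, block.dtype)  # (3, 10) float64
```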
ArrayMorph plugs into the HDF5 stack as a Virtual Object Layer (VOL) connector that intercepts file operations and routes them to cloud object storage instead of local files. Existing HDF5 applications (C++, or Python via h5py) therefore operate on cloud-resident data without code changes, enabling transparent cloud access for scientific and ML pipelines.
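To illustrate that transparency, a minimal sketch: the storage target is selected entirely by environment variables, which the HDF5 library reads at initialization, so in Python they must be set before h5py is imported. All values and names below are placeholders:

```python
# Configure ArrayMorph before the HDF5 library is loaded; these values
# are placeholders and would normally be exported in the shell instead.
import os

os.environ["HDF5_PLUGIN_PATH"] = "/path/to/arraymorph/arraymorph/build/src"
os.environ["HDF5_VOL_CONNECTOR"] = "arraymorph"
os.environ["STORAGE_PLATFORM"] = "S3"
os.environ["BUCKET_NAME"] = "my-bucket"  # placeholder bucket name

import h5py  # imported only after the environment is configured

# From here on this is ordinary h5py code with nothing ArrayMorph-specific.
with h5py.File("experiment.h5", "r") as f:  # placeholder file name
    data = f["dataset"][:]                  # placeholder dataset name
```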
ArrayMorph supports:
- Cloud backends: AWS S3 and Azure Blob
- File formats: raw binary data streams at present (support for additional formats such as JPEG is planned)
- Languages: C++ and Python (via h5py compatibility)
The system is designed to be efficient in latency-sensitive scenarios and aims to integrate well with large-scale distributed training and inference.
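For example, a cloud-resident dataset can be streamed into a training loop batch by batch. In this hedged sketch the file name, dataset name, and batch size are placeholders, and the batching loop is illustrative rather than an ArrayMorph API:

```python
import h5py

BATCH = 32  # illustrative batch size

# Stream minibatches from a cloud-backed HDF5 dataset; each slice comes
# back as a NumPy array that any NumPy-consuming framework can accept.
with h5py.File("training_data.h5", "r") as f:
    features = f["features"]  # placeholder dataset name
    for start in range(0, features.shape[0], BATCH):
        batch = features[start:start + BATCH]  # one minibatch as a NumPy array
        # model.train_step(batch)              # hypothetical training hook
```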
This project is supported by:
National Science Foundation (NSF) funded AI Institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE), award OAC-2112606.