This tutorial demonstrates how to set up centralized storage for ML resources, avoiding redundant downloads and optimizing storage usage across your organization.
This tutorial addresses two common challenges in shared computing environments:
- Dataset Redundancy: Multiple users downloading the same large datasets repeatedly
- Model Storage Overhead: Each user maintaining separate copies of popular pretrained models
By centralizing these resources, you can significantly reduce network bandwidth usage, save storage space, and accelerate research workflows.
The first half of this tutorial demonstrates how to download and store common AI datasets in a centralized location accessible to all users. This eliminates the need for each researcher to download datasets individually, saving time and resources.
We provide a custom Python module called makerspace_ds_mgr that simplifies working with shared datasets. This module:
- Scans and organizes datasets by format (ARRAYRECORD, TFRECORD, PARQUET, PYTORCH) and use case (IMAGE, TEXT, MEDICAL, VIDEO, etc.)
- Provides a simple query interface to locate datasets
- Returns absolute paths to dataset directories for easy integration with existing code
- Supports flexible directory structures adaptable to your organization's needs
Example usage:

```python
from makerspace_ds_mgr import DatasetMgr
# Initialize with your shared dataset directory
mgr = DatasetMgr(base_dir="/path/to/shared/datasets")
# View all available datasets
mgr.show_datasets()
# Get the path to a specific dataset
dataset_path = mgr.query_dataset("imagenette2")
```

The second half of this tutorial focuses on hosting popular pretrained models from the HuggingFace Hub. By maintaining centralized copies of commonly used models, you eliminate the need for each user to download these large files individually.
Important Note: The shared models are intended for inference and evaluation. If you need to fine-tune or continue training a model, you should download a local copy to avoid conflicts with other users.
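If you do need your own writable copy for fine-tuning, a minimal sketch looks like the following (it uses the snapshot_download helper introduced below; the username and target directory are purely illustrative):

```python
from huggingface_hub import snapshot_download

# Download a private, writable copy for fine-tuning
# (the target directory is only an example; use any path you own)
personal_copy = snapshot_download(
    repo_id="bert-base-uncased",
    local_dir="/home/alice/models/bert-base-uncased",
)
print(f"Fine-tune from: {personal_copy}")
```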
The HuggingFace Hub is an online platform that serves as a central repository for sharing and discovering machine learning resources. Think of it like GitHub, but specialized for models, datasets, and ML applications. Users can:
- Upload and share their own models or datasets
- Access thousands of pretrained models and datasets shared by the community
- Track versions and collaborate with other researchers
- Discover state-of-the-art models for various tasks
We use the snapshot_download function from the huggingface_hub Python package to download and store models locally. This function:
- Downloads a complete snapshot of a repository (model or dataset) hosted on the HuggingFace Hub
- Creates an immutable copy of all repository contents at a specific commit or revision
- Stores files locally for consistent, fast access without requiring internet connectivity
- Ensures reproducibility by capturing the exact state of the model at download time
Example:

```python
from huggingface_hub import snapshot_download
local_path = snapshot_download(
    repo_id="microsoft/resnet-50",
    local_dir="/shared/models/resnet-50",
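    # Optionally pass revision="<commit-or-tag>" to pin an exact repository state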
    local_dir_use_symlinks=False
)
```

Using the stored model snapshots is straightforward:
- Select the model you want to use from your shared directory
- Locate its snapshot directory (e.g., /shared/models/resnet-50)
- Visit the model's HuggingFace page (e.g., https://huggingface.co/microsoft/resnet-50)
- Review the usage example on the model card
- Modify the code to point to your local snapshot instead of downloading from the hub
The key is to use the local path and set local_files_only=True where applicable:

```python
from transformers import AutoImageProcessor, ResNetForImageClassification
# Load from shared directory instead of downloading
processor = AutoImageProcessor.from_pretrained(
"/shared/models/resnet-50",
local_files_only=True
)
model = ResNetForImageClassification.from_pretrained(
    "/shared/models/resnet-50",
    local_files_only=True
)
```

This tutorial includes two complete examples:
Example 1: ResNet-50 image classification
- Demonstrates loading the Microsoft ResNet-50 model from a shared directory
- Shows how to use the shared Imagenette2 dataset
- Includes inference on both dataset images and custom images (see the sketch below)
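A rough sketch of the inference step (assuming the shared paths used above and the Pillow package; the image filename is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ResNetForImageClassification

# Load processor and model from the shared snapshot (no network access needed)
processor = AutoImageProcessor.from_pretrained("/shared/models/resnet-50", local_files_only=True)
model = ResNetForImageClassification.from_pretrained("/shared/models/resnet-50", local_files_only=True)

# "church.jpg" is a placeholder; any RGB image works
image = Image.open("church.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```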
Example 2: BERT masked language modeling
- Illustrates using the BERT base uncased model for masked language modeling
- Shows the fill-mask pipeline with locally stored models (see the sketch below)
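A minimal sketch of the fill-mask pipeline against a shared copy (the path /shared/models/bert-base-uncased is an assumption, mirroring the ResNet example above):

```python
from transformers import pipeline

# Build a fill-mask pipeline from the shared snapshot
# (/shared/models/bert-base-uncased is an assumed location)
fill_mask = pipeline("fill-mask", model="/shared/models/bert-base-uncased")

# BERT uses [MASK] as its mask token
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```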
Prerequisites:
- Python 3.7+
- Virtual environment (recommended)
```bash
# Create and activate a virtual environment
python -m venv my_env
source my_env/bin/activate # On Windows: my_env\Scripts\activate
# Install required packages
pip install -r requirements.txt
```
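As a quick sanity check (assuming requirements.txt includes at least the transformers and huggingface_hub packages used throughout this tutorial):

```python
# Verify the core packages are importable and print their versions
import huggingface_hub
import transformers

print("huggingface_hub", huggingface_hub.__version__)
print("transformers", transformers.__version__)
```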