AI Makerspace Shared Resources Tutorial

This tutorial demonstrates how to set up centralized storage for ML resources, avoiding redundant downloads and optimizing storage usage across your organization.

Overview

This tutorial addresses two common challenges in shared computing environments:

  1. Dataset Redundancy: Multiple users downloading the same large datasets repeatedly
  2. Model Storage Overhead: Each user maintaining separate copies of popular pretrained models

By centralizing these resources, you can significantly reduce network bandwidth usage, save storage space, and accelerate research workflows.

Part 1: Shared Datasets

The first half of this tutorial demonstrates how to download and store common AI datasets in a centralized location accessible to all users. This eliminates the need for each researcher to download datasets individually, saving time and resources.
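
For example, an administrator can populate the shared area once and every other user simply reads from it. The sketch below uses the Hugging Face datasets library; the dataset name and target path are illustrative rather than part of this repository:

from datasets import load_dataset

# Download once into the shared area; later calls with the same cache_dir
# reuse the cached files instead of re-downloading them.
shared_dir = "/path/to/shared/datasets"  # illustrative path
dataset = load_dataset("imdb", cache_dir=shared_dir)
print(dataset)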

The makerspace_ds_mgr Module

We provide a custom Python module called makerspace_ds_mgr that simplifies working with shared datasets. This module:

  • Scans and organizes datasets by format (ARRAYRECORD, TFRECORD, PARQUET, PYTORCH) and use case (IMAGE, TEXT, MEDICAL, VIDEO, etc.)
  • Provides a simple query interface to locate datasets
  • Returns absolute paths to dataset directories for easy integration with existing code
  • Supports flexible directory structures adaptable to your organization's needs

Example usage:

from makerspace_ds_mgr import DatasetMgr

# Initialize with your shared dataset directory
mgr = DatasetMgr(base_dir="/path/to/shared/datasets")

# View all available datasets
mgr.show_datasets()

# Get the path to a specific dataset
dataset_path = mgr.query_dataset("imagenette2")
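
The returned path can then be plugged into existing data-loading code. As a minimal sketch, assuming imagenette2 is stored in the usual image-folder layout with a train/ subdirectory, it could be consumed with torchvision:

from torchvision import datasets, transforms

# dataset_path comes from mgr.query_dataset("imagenette2") above; the
# train/ subfolder layout is an assumption about how the dataset is stored.
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder(f"{dataset_path}/train", transform=transform)
print(len(train_set), "training images")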

Part 2: Shared Models

The second half focuses on hosting popular pretrained models from the HuggingFace Hub. By maintaining centralized copies of commonly-used models, you eliminate the need for each user to download these large files individually.

Important Note: The shared models are intended for inference and evaluation. If you need to fine-tune or continue training a model, you should download a local copy to avoid conflicts with other users.

Understanding HuggingFace Hub

The HuggingFace Hub is an online platform that serves as a central repository for sharing and discovering machine learning resources. Think of it like GitHub, but specialized for models, datasets, and ML applications. Users can:

  • Upload and share their own models or datasets
  • Access thousands of pretrained models and datasets shared by the community
  • Track versions and collaborate with other researchers
  • Discover state-of-the-art models for various tasks
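
If you prefer to explore the Hub programmatically rather than through the website, the huggingface_hub client can search it. A small sketch; the filter and sort values are just examples:

from huggingface_hub import HfApi

# List a few of the most-downloaded image-classification models as an example search.
api = HfApi()
for model in api.list_models(filter="image-classification", sort="downloads", limit=5):
    print(model.id)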

Downloading Models with snapshot_download

We use the snapshot_download function from the huggingface_hub Python package to download and store models locally. This function:

  • Downloads a complete snapshot of a repository (model or dataset) hosted on the HuggingFace Hub
  • Creates an immutable copy of all repository contents at a specific commit or revision
  • Stores files locally for consistent, fast access without requiring internet connectivity
  • Ensures reproducibility by capturing the exact state of the model at download time

Example:

from huggingface_hub import snapshot_download

# Download the full repository snapshot into the shared directory.
# local_dir_use_symlinks=False stores real file copies rather than symlinks
# into the per-user cache, so every user can read them directly.
local_path = snapshot_download(
    repo_id="microsoft/resnet-50",
    local_dir="/shared/models/resnet-50",
    local_dir_use_symlinks=False
)
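
Because the snapshot captures a specific revision, you can also pin an exact commit or tag to keep results reproducible across users. A small sketch; the revision value shown is illustrative:

from huggingface_hub import snapshot_download

# Pin the snapshot to a branch, tag, or commit hash ("main" is just a placeholder;
# use a full commit hash to freeze the exact state of the model).
local_path = snapshot_download(
    repo_id="microsoft/resnet-50",
    revision="main",
    local_dir="/shared/models/resnet-50",
)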

Using Stored Models

Using the stored model snapshots is straightforward:

  1. Select the model you want to use from your shared directory
  2. Locate its snapshot directory (e.g., /shared/models/resnet-50)
  3. Visit the model's HuggingFace page (e.g., https://huggingface.co/microsoft/resnet-50)
  4. Review the usage example on the model card
  5. Modify the code to point to your local snapshot instead of downloading from the hub

The key is to use the local path and set local_files_only=True where applicable:

from transformers import AutoImageProcessor, ResNetForImageClassification

# Load from shared directory instead of downloading
processor = AutoImageProcessor.from_pretrained(
    "/shared/models/resnet-50", 
    local_files_only=True
)
model = ResNetForImageClassification.from_pretrained(
    "/shared/models/resnet-50",
    local_files_only=True
)

Tutorial Examples

This tutorial includes two complete examples:

1. Image Classification with ResNet-50

  • Demonstrates loading the Microsoft ResNet-50 model from a shared directory
  • Shows how to use the shared Imagenette2 dataset
  • Includes inference on both dataset images and custom images
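
A rough sketch of that workflow, assuming the shared snapshot lives under /shared/models/resnet-50 and the image path points somewhere into the shared imagenette2 dataset (both paths are assumptions):

import torch
from PIL import Image
from transformers import AutoImageProcessor, ResNetForImageClassification

# Both paths are illustrative: a shared model snapshot and an image from the shared dataset.
model_dir = "/shared/models/resnet-50"
image = Image.open("/path/to/shared/datasets/imagenette2/val/sample.jpg")

processor = AutoImageProcessor.from_pretrained(model_dir, local_files_only=True)
model = ResNetForImageClassification.from_pretrained(model_dir, local_files_only=True)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])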

2. Text Processing with BERT

  • Illustrates using the BERT base uncased model for masked language modeling
  • Shows the fill-mask pipeline with locally stored models
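
A minimal sketch of that pipeline, assuming the BERT snapshot is stored under a shared path such as /shared/models/bert-base-uncased:

from transformers import pipeline

# The model path is an assumption; point it at wherever the shared snapshot lives.
unmasker = pipeline("fill-mask", model="/shared/models/bert-base-uncased")
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))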

Prerequisites

  1. Python 3.7+
  2. Virtual environment (recommended)

Installation

# Create and activate a virtual environment
python -m venv my_env
source my_env/bin/activate  # On Windows: my_env\Scripts\activate

# Install required packages
pip install -r requirements.txt
