Determine / document which systems we intend to support training / executing models on #154

@hughes036

We have already had considerable trouble running the example training on the systems we have access to (our MacBooks and Ubuntu workstations). The table below lists what we have attempted so far, the result, and the blockers:

| System | Result | Blocker | Details | Solution | Related Issue |
| --- | --- | --- | --- | --- | --- |
| macOS Monterey 12.6 Intel i7 | Fails | C++ compile error (at runtime) | fatal error: 'omp.h' file not found. | | AllenCell/cyto-dl#184 |
| macOS Ventura 13.3.1 Intel i7 | Fails | C++ compile error (at runtime) | libomp.dylib not found. | brew install libomp, then export DYLD_LIBRARY_PATH=/usr/local/opt/libomp/lib:/usr/local/lib | |
| macOS Monterey 12.4 Apple M1 | Fails | C++ compile error (at runtime) | fatal error: 'omp.h' file not found. | | AllenCell/cyto-dl#184 |
| Ubuntu 16 | Fails | GPU driver runtime error | RuntimeError: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver. | Update GPU driver from nvidia.com OR install PyTorch version compiled with current CUDA driver. | |
| Ubuntu 20 (EC2) | Succeeds | . | . | . | . |
| Slurm (CPU) | . | . | . | . | . |
| Slurm (GPU) | . | . | . | . | . |
| AWS cluster (GPU) | . | . | . | . | . |
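
To fill in the remaining rows consistently, it may help to record the same facts on every machine before attempting a run. The following is a minimal sketch, not part of the repository: `describe_system`, `find_libomp`, and `cuda_status` are hypothetical helper names. The Intel Homebrew prefix is taken from the DYLD_LIBRARY_PATH workaround in the table, while the `/opt/homebrew` prefix for Apple Silicon is an assumption based on Homebrew's default install location.

```python
import platform
from pathlib import Path

# Assumed Homebrew prefixes for libomp. The first comes from the
# DYLD_LIBRARY_PATH workaround in the table; the second is an assumption
# (Homebrew's default prefix on Apple Silicon).
LIBOMP_PREFIXES = [Path("/usr/local/opt/libomp"), Path("/opt/homebrew/opt/libomp")]


def find_libomp(prefixes=LIBOMP_PREFIXES):
    """Return the first omp.h and libomp.dylib found under the given prefixes."""
    found = {"omp.h": None, "libomp.dylib": None}
    for prefix in prefixes:
        candidates = {
            "omp.h": prefix / "include" / "omp.h",
            "libomp.dylib": prefix / "lib" / "libomp.dylib",
        }
        for name, path in candidates.items():
            if found[name] is None and path.is_file():
                found[name] = str(path)
    return found


def cuda_status():
    """Report whether PyTorch can see a usable CUDA device."""
    try:
        import torch
    except Exception as exc:  # a missing or broken install is worth reporting too
        return f"torch import failed: {exc}"
    return "available" if torch.cuda.is_available() else "not available"


def describe_system():
    """Collect the facts that matter for a row in the support table."""
    info = {
        "os": platform.system(),
        "os_version": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "cuda": cuda_status(),
    }
    if info["os"] == "Darwin":
        # The OpenMP failures in the table were only seen on macOS.
        info.update(find_libomp())
    return info


if __name__ == "__main__":
    for key, value in describe_system().items():
        print(f"{key}: {value}")
```

Running the script on each candidate system and pasting the output into the issue would give every row the same set of details.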

In all cases, the setup steps were:

  • Create a fresh venv based on Python 3.8 or 3.9
  • Upgrade pip
  • pip install wheel
  • pip install boto3
  • pip install -e .
  • pip install -r requirements/requirements.txt
  • python scripts/download_test_data.py
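
The steps above could be captured in a small script so that every system is set up identically. This is a sketch under stated assumptions, not the project's own tooling: `setup_commands` and `run_setup` are hypothetical names, the venv directory name is arbitrary, the requirements step is assumed to mean `pip install -r requirements/requirements.txt`, and the download step is assumed to mean `python scripts/download_test_data.py`. The `bin/python` layout assumes macOS or Linux.

```python
import shlex
import subprocess
from pathlib import Path


def setup_commands(venv_dir="venv"):
    """Return the setup steps as argument lists, in the order listed above."""
    # Use the venv's own interpreter so no shell activation is needed.
    py = str(Path(venv_dir) / "bin" / "python")
    return [
        ["python3", "-m", "venv", venv_dir],
        [py, "-m", "pip", "install", "--upgrade", "pip"],
        [py, "-m", "pip", "install", "wheel"],
        [py, "-m", "pip", "install", "boto3"],
        [py, "-m", "pip", "install", "-e", "."],
        # Assumption: the requirements file is installed with -r.
        [py, "-m", "pip", "install", "-r", "requirements/requirements.txt"],
        # Assumption: the download script lives in the scripts/ directory.
        [py, "scripts/download_test_data.py"],
    ]


def run_setup(venv_dir="venv", dry_run=True):
    """Print each step; execute it as well unless dry_run is True."""
    for cmd in setup_commands(venv_dir):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_setup(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to review them before running anything on a new machine.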

The experiment was then run with: python aics_im2im/train.py experiment=im2im/segmentation.yaml trainer=cpu
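
When repeating this run on further systems, the failure could be mapped to one of the blockers already seen. The sketch below is illustrative only: `classify_blocker`, `run_experiment`, and the labels are hypothetical, and the patterns are built from the error text quoted in the table. The exact wording of the libomp.dylib error is not quoted in full there, so that pattern is an assumption.

```python
import re
import subprocess

# Patterns derived from the error messages in the table above.
BLOCKER_PATTERNS = [
    (re.compile(r"'omp\.h' file not found"), "missing OpenMP header (omp.h)"),
    # Assumption: the full libomp error message mentions the library file name.
    (re.compile(r"libomp\.dylib"), "missing OpenMP runtime library (libomp.dylib)"),
    (re.compile(r"NVIDIA driver on your system is too old"),
     "GPU driver too old for installed PyTorch"),
]

# The experiment command as given above.
TRAIN_COMMAND = [
    "python",
    "aics_im2im/train.py",
    "experiment=im2im/segmentation.yaml",
    "trainer=cpu",
]


def classify_blocker(output):
    """Return a label for the first known blocker found in the output, else None."""
    for pattern, label in BLOCKER_PATTERNS:
        if pattern.search(output):
            return label
    return None


def run_experiment(cmd=TRAIN_COMMAND):
    """Run the training command and return (exit code, blocker label or None)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, classify_blocker(result.stdout + result.stderr)
```

A non-zero exit code with a `None` label would indicate a blocker not yet listed in the table.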
