Benecoder/pygrid-gensiscloud

PyGrid Distributor for Genesis Cloud

This project automates the distribution of a PyGrid training job across multiple Genesis Cloud compute instances. The training protocol is specified on a local desktop. The distributor then launches a PyGrid network distributed across multiple Genesis Cloud instances. Each instance hosts one worker node, and one instance also runs the gateway.

Warning

Executing this script will start a series of instances, the runtime of which will count against your account credits. Make sure to shut down any idle computing power so as not to accumulate accidental costs. This is solely a proof of concept and, as stated in the license, none of the functionality is guaranteed.

How-to using a Jupyter notebook:

  1. Set up a Genesis Cloud Account
  2. Generate your SSH Key and Developer API token.

  3. Create a security group called pygrid that allows inbound TCP connections on ports 3000 and 5000

  4. Clone this repository to your local machine

     git clone https://github.com/Benecoder/distributor.git
     cd distributor
    
  5. Install the required dependencies. It is recommended to do so inside a virtual environment of your choice. For compatibility down the road, make sure that you are using Python 3.7. Here is how you would do that using Anaconda:

     conda create -n pygrid_env python=3.7
     conda activate pygrid_env
     pip install -r requirements.txt
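
Since the setup above depends on Python 3.7, a quick sanity check inside the activated environment can catch a wrong interpreter early. This is a minimal sketch and not part of the repository; `is_supported` and `REQUIRED` are hypothetical names.

```python
import sys

# Version named in the installation step above.
REQUIRED = (3, 7)


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor version matches REQUIRED."""
    return tuple(version_info[:2]) == REQUIRED


# Example use inside the activated environment:
#     if not is_supported():
#         raise SystemExit("Please activate the Python 3.7 environment first.")
```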
    

As an example, a Jupyter notebook is provided in model.ipynb. To run the example, install JupyterLab (or Jupyter Notebook) and place your Genesis Cloud API token and the name of your SSH key in environment variables.

conda install jupyter notebook
export GC_API_TOKEN="your-api-token"
export GC_SSH_KEY_NAME="your-ssh-key-name"
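
On the Python side, these two variables can be read and validated before any instance is started. The following is a minimal sketch under the assumption that the values are simply taken from the environment; `load_credentials` is a hypothetical helper and may differ from how the repository reads them.

```python
import os

# Variable names as exported in the shell commands above.
REQUIRED_VARS = ("GC_API_TOKEN", "GC_SSH_KEY_NAME")


def load_credentials(env=os.environ):
    """Return (api_token, ssh_key_name); raise if either variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["GC_API_TOKEN"], env["GC_SSH_KEY_NAME"]


# Example use in a notebook cell:
#     api_token, ssh_key_name = load_credentials()
```

Failing early with a clear message avoids starting billable instances with incomplete credentials.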

Under the hood:

This is how the network is built up: using the developer API, build_docker_image.py creates a new Ubuntu 18 instance and installs the GPU drivers. All of the PyGrid software that is used is obtained by pulling the latest Docker containers and starting a Redis server using docker-compose and the correct docker-compose file. To make sure Docker, docker-compose and the correct GPU drivers are installed, the first instance needs to be built using the base_image_cloud_init.yml file. This cloud-init file makes sure the correct software is installed and, once that is the case, touches the /home/ubuntu/installation_finished file.
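
The marker file makes it possible to wait for the installation to finish by polling. The sketch below illustrates the idea and is not the repository's implementation: `wait_for_marker` and `marker_exists_over_ssh` are hypothetical helpers, and the SSH check assumes key-based login as the ubuntu user works from the local machine.

```python
import subprocess
import time

# Path touched by the cloud-init file once installation is complete.
MARKER = "/home/ubuntu/installation_finished"


def marker_exists_over_ssh(host):
    """Assumption: `ssh ubuntu@host` works without a password prompt."""
    result = subprocess.run(
        ["ssh", "ubuntu@" + host, "test", "-e", MARKER],
        capture_output=True,
    )
    return result.returncode == 0


def wait_for_marker(marker_exists, timeout=1800.0, interval=15.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call marker_exists() repeatedly until it returns True.

    Returns True once the marker is seen, or False if `timeout` seconds
    pass first. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        if marker_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example use (the IP address is a placeholder):
#     ready = wait_for_marker(lambda: marker_exists_over_ssh("203.0.113.10"))
```

A timeout is included so that a failed installation does not leave the script waiting while the instance keeps consuming credits.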

On the master node it then starts a Redis server based on the information in the docker-compose.yml file. This houses one gateway node and one worker, called brutus.

As soon as main.py receives the IP address of the gateway, it starts an additional series of workers. These connect to the gateway using the private network between the instances.
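
Once the instances are running, one way to confirm that the pygrid security group is in effect is to test whether ports 3000 and 5000 accept TCP connections. This is a minimal sketch and not part of the repository; `open_ports` is a hypothetical helper, and the source does not state which service listens on which port.

```python
import socket

# Ports opened for inbound TCP in the "pygrid" security group.
PYGRID_PORTS = (3000, 5000)


def open_ports(host, ports=PYGRID_PORTS, timeout=3.0):
    """Return the list of ports on `host` that accept a TCP connection."""
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable


# Example use (the IP address is a placeholder):
#     print(open_ports("203.0.113.10"))
```

Note that this checks reachability from the local machine only; the workers themselves talk to the gateway over the private network between the instances.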
