From 107e9eb4fc0e61a01a332d3fd7bd3125880feba9 Mon Sep 17 00:00:00 2001 From: motiwari Date: Fri, 4 Jun 2021 01:42:02 -0700 Subject: [PATCH 1/9] Large rewrite and automation of elasticlusterjob-tutorial --- .../elasticlusterjob-tutorial.md | 190 ++++++++++++++++++ elasticluster_tutorial/sample_cj_config | 4 + elasticluster_tutorial/sample_cj_ssh_config | 23 +++ .../sample_elasticluster_config | 65 ++++++ elasticluster_tutorial/setup.sh | 69 +++++++ 5 files changed, 351 insertions(+) create mode 100644 elasticluster_tutorial/elasticlusterjob-tutorial.md create mode 100644 elasticluster_tutorial/sample_cj_config create mode 100644 elasticluster_tutorial/sample_cj_ssh_config create mode 100644 elasticluster_tutorial/sample_elasticluster_config create mode 100644 elasticluster_tutorial/setup.sh diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md new file mode 100644 index 0000000..e4efa13 --- /dev/null +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -0,0 +1,190 @@ +# ElastiCluster and Clusterjob Tutorial +Painless computing requires one to find tools to help with the job at hand. This class starts with two older but stable tools used in statistics research today -- [Elasticluster](https://elasticluster.readthedocs.io/en/latest/) and [ClusterJob](https://clusterjob.org). The other pattern we will observe is that painless computing depends upon running many virtual operating system images. In this class we will mostly use the 2018 LTS version of Ubuntu Linux. (LTS means it has Long Term Support. In other words, there should be few to no surprises for the unwary scientist. This is a Good Thing™.) Ubuntu Linux is one of several options that are widely deployed on most clouds. We do not believe that there is much to be gained from exposing you to pointless variation in OS choice. Hence, we will always launch our jobs from a virtual image on your laptop/desktop computer. 
This allows you to experiment without potentially misconfiguring your host machine. Stanford recommends that you use [VirtualBox from Oracle Systems](https://www.virtualbox.org). Because "… a foolish consistency is the hobgoblin of little minds …" applies with a fearsome regularity to computer systems, we recommend that you use the server version of the [Ubuntu OS](https://releases.ubuntu.com/18.04/). It allows you to become familiar with the computing environment upon which your experiments will run. + +## The "Big Picture" +Your experiment will run on a cluster of computers defined by you, either on the Stanford Sherlock Cluster or on the Google Compute Engine. Our initial experiment is based upon a simple research problem created by Mahsa Lofti to calculate a phase transition. It will use four standard-sized machines to do the experimental calculation and a frontend/coordination system. In future work, we can start to require access to high performance GPUs or other hardware. As you add specialized hardware, the cost and competition for access increase. + +One piece of advice: read through this tutorial once before trying the commands yourself. Like any construction process, knowing where you're going and how you're going to get there really helps you along the way. + +This tutorial is broken up into several steps: +- Step 0: Sign up for Google Cloud and ClusterJob +- Step 1: Install VirtualBox and create a blank VM +- Step 2: Install Ubuntu 18.04 LTS on the VM +- Step 3: Enable SSH on your Ubuntu VM +- Step 4: Install elasticluster and clusterjob +- Step 5: Test elasticluster +- Step 6: Test clusterjob +- Step 7: Create a high memory cluster and use Clusterjob to start the phase transition code. + - Wait approximately 3 hours for the job to complete + - Use CJ status commands to see the state of the calculation. + - SSH into the compute nodes to see how much CPU is being used. 
+- Step 8: Gather all of the computed results and share them with your instructors. + +# Step 0: Sign up for Google Cloud and ClusterJob + +## Step 0.a: Set up Google Cloud credentials. +Any unique email address can get a [$300 credit toward Google Compute Engine time](https://cloud.google.com/free). Google provides a good overview of their system for technical computing [here](https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing). This part of the tutorial has been cribbed from the Google authored tutorial to run "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)". Basically, you are going to create an project and credentials using the Google Compute Engine dashboard. Those credentials will be used by Elasticluster to instantiate the cluster. Because the Stanford cluster, Sherlock, uses SLURM to configure the nodes, we will use SLURM on GCE too. We are going to follow the modernized instructions from the "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)" page. +- Sign up for a [free Google Cloud account](https://cloud.google.com/free). +- Start at [GCP Dashboard](https://console.cloud.google.com/) +- Create a project from "IAM & Admin" menu choose "Create a Project". + - This project name is typically a combination of two random words and a number, e.g. "`superb-garden-303018`". Take note of this project name. +- Make sure that billing is [configured for your account](https://cloud.google.com/billing/docs/how-to/modify-project). +- Enable Compute Engine API and Firebase API in "[APIs & Services](https://console.cloud.google.com/apis)" +- Enable Project Credentials in the [Credentials menu](https://console.cloud.google.com/apis/credentials) on the "[APIs & Services](https://console.cloud.google.com/apis)" dashboard. + - Select your Google Cloud project. + - Click Create credentials. + - Click "OAuth client ID". + - In the "Create client ID" page, for Application type, select "`Desktop`". 
+ - Take note of the "Client ID" and "Client Secret". These will be required later for Elasticluster to create the cluster. As these credentials control access to your account, treat them with care; your budget may be at risk. (You can get them again if you lose them. Do not share them with classmates.) + - Example Client ID: "`308342824695-eimtr7e8bqo7lotqlumj5mfmta1co8o4.apps.googleusercontent.com`" + - Example Client Secret: "`_IdXWkmrunCuSLmPhL0ouaeV`" + - Choose a preferred server zone for Google Cloud from [this list](https://cloud.google.com/compute/docs/regions-zones). + +At the end of this process, you should have taken note of five different variables: your Google Cloud username (typically your email address), your project's ID number, client ID, client secret, and preferred server zone. + +## Step 0.b: Set up ClusterJob credentials + +Sign up for a ClusterJob account here using a .edu email address. Once you've verified your account and logged in, view your ClusterJob key. At the end of this process, you should have taken note of two more variables: your ClusterJob ID and key. + +# Step 1: Install VirtualBox and create a blank VM + +- Download and install VirtualBox >=6.1.22 from https://www.virtualbox.org/wiki/Downloads + +First, we create a blank VM with no OS installed: +- Open VirtualBox >> Tools >> New +- Choose a name and folder for the VM, set Type: Linux and Version: Oracle (64-bit) +- Set Memory Size to 8192 MB +- Choose "Create a virtual hard disk now" +- Choose VDI (VirtualBox Disk Image) +- Choose Dynamically allocated +- Choose the file location and size (12 GB is sufficient) + +# Step 2: Install Ubuntu 18.04 LTS on the VM + +Now, we install Ubuntu on the blank VM: +- Download the Ubuntu 18.04.5 LTS (Bionic Beaver) Server install image from https://releases.ubuntu.com/18.04/ +- Right click the new VM and click "Settings" +- Click on "Storage", and under "Controller: IDE" click on "Adds optical drive." 
(the blue circle with the green plus on it) +- Click "Add" again and choose the Ubuntu .iso downloaded above +- Press "OK" and Start the VM + +The VM should begin booting up from the Ubuntu installation image and begin prompting you for installation choices. If you need to navigate away from the VM, press one of your modifier keys (command on Mac) to release the mouse from the VM. + +When installing Ubuntu: +- Choose your language of choice +- Choose your keyboard configuration +- Choose the default network connections +- Leave proxy blank +- Choose default Ubuntu mirror +- Choose "Use an entire disk" and "Set up this disk as an LVM group". It is not necessary to encrypt the LVM group with LUKS. +- Choose the default storage configuration +- Press continue +- Set up your personal name, your server's name (machine name), your username on that machine, and the password for your username +- Check "Install OpenSSH server", do not import SSH identity + +Afterwards, the Ubuntu installation should complete in 5-10 minutes. Then choose "Reboot" and press enter. + +# Step 2.A: Miscellaneous Bugs on Ubuntu + +- If you're running into issues with random characters (like newlines) being inserted in your shell session, run the command `setterm -repeat off` +- If you're having trouble running `apt`, run the following command in your VM to set your nameserver in `/etc/resolv.conf`(some ISP's forwarding rules don't support DNS on your VM): `echo "sudo sed -i 's/nameserver.*/nameserver 8.8.8.8/g' /etc/resolv.conf" >> ~/.bashrc && source ~/.bashrc` + +# Step 3: Enable SSH on your Ubuntu VM + +Ubuntu Virtualboxes have known issues with clipboard sharing between host and guest. This makes it extremely difficult to do any development on the VM. 
As such, we need to enable SSH on our VM and we'll use that for development: + +- Shut down the Ubuntu VM if it's running +- In VirtualBox, go to your VM's Settings >> Network >> Adapter 1 >> Advanced >> Port Forwarding +- Add a new port-forwarding rule with Name: SSH, Protocol: TCP, Host IP: 127.0.0.1, Host Port: 2222, Guest IP: blank, Guest Port: 22. + +After starting the Ubuntu VM again, you should be able to ssh into it directly from your HOST machine (laptop) via `ssh YOUR_USERNAME@127.0.0.1 -p 2222` + +# Step 4: Install elasticluster and clusterjob + +SSH into your running VM from a terminal on your HOST (laptop), then run the following commands, replacing the necessary Google and Clusterjob variables that were determined when you signed up for them: + +``` +git clone https://github.com/motiwari/stats285.github.io.git + +echo "export GCE_USERNAME=" >> ~/.bashrc +echo "export GCE_ZONE=" >> ~/.bashrc +echo "export GCE_PROJECT_ID=" >> ~/.bashrc +echo "export GCE_CLIENT_ID=" >> ~/.bashrc +echo "export GCE_CLIENT_SECRET=" >> ~/.bashrc + +echo "export CJID=" >> ~/.bashrc +echo "export CJKEY=" >> ~/.bashrc + +source ~/.bashrc + +cd stats285.github.io/elasticluster_tutorial/ +chmod +x setup.sh +./setup.sh +``` + +# Step 5: Test elasticluster + +At this point, all the dependencies for elasticluster and clusterjob have been installed. To create a small memory cluster and establish communication to each node, run: +``elasticluster start gce``` +The `start` command provisions the nodes using Compute Engine and will take between 20-30 minutes. It configures the nodes by using the Ansible playbooks included in the Elasticluster source. Setup can take some time, depending on configuration. 
You will know when configuration is done when the output stops and you see the ending banner containing: "`Your cluster is ready!`" It is required practice that you update your `gcloud` keys after bringing up a new cluster using: +``` +gcloud compute config-ssh +``` +You can then log in to the frontend node using: +``` +elasticluster ssh gce +``` +Or any of the nodes using: +``` +ssh gce-frontend001.$GCE_ZONE.$GCE_PROJECT_ID +ssh gce-compute001.$GCE_ZONE.$GCE_PROJECT_ID +ssh gce-compute002.$GCE_ZONE.$GCE_PROJECT_ID +ssh gce-compute003.$GCE_ZONE.$GCE_PROJECT_ID +ssh gce-compute004.$GCE_ZONE.$GCE_PROJECT_ID +``` + +These node names are important and they are created from information in your config file. Each node name starts with your cluster, role, and number, e.g. "`gce-frontend001`" or "`gce-high-mem-compute002`", followed by a zone/region designator, e.g. "`us-central1-a`". Finally, your project ID, e.g. "`superb-garden-303018`", is appended to make a fully qualified node name. The node name of the frontend will be needed for ClusterJob. + +One destroys a cluster, equally unsurprisingly, with a "`stop`" command: +``` +elasticluster stop gce +``` + +# Step 6: Test clusterjob +Now that we have a compute cluster, it is time to use ClusterJob to perform a calculation on it. Like all research software, ClusterJob has [basic documentation](https://clusterjob.org/documentation/). This is augmented by a draft chapter of a [Data Science book by Hatef Monajemi](https://monajemi.github.io/datascience/pages/elasticluster-clusterjob-model). This tutorial is a distillation of these other works in the very pragmatic context of running a simple example for this class. Most users borrow an existing set of configuration files and call it a day. As we expand a cluster's hardware to include GPUs, the configuration files will evolve. Those extensions will be discussed in class. + +To test your ClusterJob installation: +``` +elasticluster start gce +# After the cluster is ready. 
+gcloud compute config-ssh + +# Run simpleExample.py +cd ~/CJ_install/example/Python/ +yes | cj run simpleExample.py gce -m "Python." +cj state +``` + +# Step 7: Run Phase Transition Code +Now we are going to run the phase transition code. The course CAs will describe the details of the code and what it is calculating in class. This tutorial will show you how to run it. First, get the code: +``` +cd ~ +git clone https://github.com/stats285/ExamplePhaseTransition ~/ExamplePhaseTransition +cd ~/ExamplePhaseTransition/ +``` +Now we are going to execute this task in parallel on the gce cluster and include the dependent code. +``` +cj parrun main_func.py gce -dep Dependents -m "Phase Transition" +``` +Now that it is running, you can check the state of the code with: +``` +cj state +``` + +# Step 8: Gather all of the computed results and share them with your instructors. +When the job has completed, after about 3 hours, you will need to get your results from the cluster by first `reducing` them and then `getting` them onto your local Ubuntu image. Because you may have many different jobs running, you will need to tell CJ which job to reduce and get. The `cj state` command also tells you the `PID`, or process identifier, so that you can reduce the right data. In the example below, `ff1cf89ab2f4c51800a900704dda041f637ca620` is a sample `PID`; yours will be different. +``` +cj reduce final_results.txt ff1cf89ab2f4c51800a900704dda041f637ca620 +cj get ff1cf89ab2f4c51800a900704dda041f637ca620 +``` +Now you get the scientific joy of determining what you just calculated and what it all means. Mazeltov. The CA will reveal all. Please copy your shell results to Stanford's Canvas system to get credit for performing this tutorial. 
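The reduce-then-get sequence from Step 8 can be wrapped in a small shell function so that the `PID` is typed only once. This is a minimal sketch, not part of ClusterJob or of this repository: the function name `cj_collect` and the `CJ_CMD` override are hypothetical, and the default command assumes ClusterJob lives in `~/CJ_install`, as in the `cj` alias that `setup.sh` writes to `~/.bashrc`. Only `cj reduce` and `cj get` themselves come from the tutorial.

```shell
# Hypothetical helper; not part of ClusterJob or of this repository.
# CJ_CMD is an assumed override. It defaults to the command behind the `cj`
# alias that setup.sh adds to ~/.bashrc (aliases are not expanded when the
# command name comes from a variable, so the full command is spelled out).
# Set CJ_CMD=echo for a dry run that only prints the commands.
cj_collect() {
    pid="${1:-}"
    results_file="${2:-final_results.txt}"
    cj_cmd="${CJ_CMD:-perl $HOME/CJ_install/src/CJ.pl}"

    if [ -z "$pid" ]; then
        echo "usage: cj_collect PID [RESULTS_FILE]" >&2
        return 1
    fi

    # $cj_cmd is left unquoted on purpose so that "perl /path/CJ.pl" splits
    # into two words (this assumes $HOME contains no spaces).
    $cj_cmd reduce "$results_file" "$pid" && $cj_cmd get "$pid"
}

# Dry run: print the two commands instead of contacting the cluster.
CJ_CMD=echo cj_collect ff1cf89ab2f4c51800a900704dda041f637ca620
```

Because `get` is chained with `&&`, it runs only if `reduce` succeeds. Drop the `CJ_CMD=echo` prefix, and substitute the `PID` reported by `cj state`, to run against a real job.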
diff --git a/elasticluster_tutorial/sample_cj_config b/elasticluster_tutorial/sample_cj_config new file mode 100644 index 0000000..48fbfa4 --- /dev/null +++ b/elasticluster_tutorial/sample_cj_config @@ -0,0 +1,4 @@ +CJID +CJKEY +SYNC_TYPE manual +SYNC_INTERVAL 300 diff --git a/elasticluster_tutorial/sample_cj_ssh_config b/elasticluster_tutorial/sample_cj_ssh_config new file mode 100644 index 0000000..bf945d7 --- /dev/null +++ b/elasticluster_tutorial/sample_cj_ssh_config @@ -0,0 +1,23 @@ +[gce] +host gce-frontend001.. +user +Bqs SLURM +Repo /home//CJRepo_Remote +MAT matlab/R2019a +MATlib ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi +Python python/3.8.8 +Pythonlib IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge +Alloc --time UNLIMITED +[gce] + +[gce-high-mem] +host gce-high-mem-frontend001.. +user +Bqs SLURM +Repo /home//CJRepo_Remote +MAT matlab/R2019a +MATlib ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi +Python python/3.8.8 +Pythonlib IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge +Alloc --time UNLIMITED +[gce-high-mem] diff --git a/elasticluster_tutorial/sample_elasticluster_config b/elasticluster_tutorial/sample_elasticluster_config new file mode 100644 index 0000000..ed06d32 --- /dev/null +++ b/elasticluster_tutorial/sample_elasticluster_config @@ -0,0 +1,65 @@ +[cloud/google] +provider=google +gce_client_id= +gce_client_secret= +gce_project_id= +noauth_local_webserver=True +zone= + +[login/google] +image_user= +image_user_sudo=root +image_sudo=True +user_key_name=elasticluster +user_key_private=~/.ssh/google_compute_engine +user_key_public=~/.ssh/google_compute_engine.pub + +[setup/ansible] +ansible_forks=20 +ansible_timeout=200 + +[setup/ansible-slurm] +provider=ansible +frontend_groups=slurm_master +compute_groups=slurm_worker + +# allow restart of compute nodes +compute_var_allow_reboot=yes +global_var_slurm_taskplugin=task/cgroup 
+global_var_slurm_proctracktype=proctrack/cgroup +global_var_slurm_jobacctgathertype=jobacct_gather/cgroup + +[cluster/gce] +cloud=google +login=google +setup=ansible-slurm +security_group=default +image_id=ubuntu-1804-bionic-v20210315a +flavor=n1-standard-4 +frontend_nodes=1 +compute_nodes=4 +ssh_to=frontend +boot_disk_type=pd-ssd +boot_disk_size=50 + +[cluster/gce/frontend] +boot_disk_size=100 + +########## + +[cluster/gce-high-mem] +cloud=google +login=google +setup=ansible-slurm +security_group=default +image_id=ubuntu-1804-bionic-v20210315a +flavor=n1-standard-4 +frontend_nodes=1 +compute_nodes=2 +ssh_to=frontend +boot_disk_type=pd-standard +boot_disk_size=100 + +[cluster/gce-high-mem/compute] +flavor=n2-highmem-4 +boot_disk_size=50 diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh new file mode 100644 index 0000000..9e68a0c --- /dev/null +++ b/elasticluster_tutorial/setup.sh @@ -0,0 +1,69 @@ +sudo apt update +sudo apt upgrade -y +sudo apt install gcc g++ git libc6-dev libffi-dev libssl-dev python3-dev virtualenv + +# Initialize gcloud. This is the only part of the script that requires interaction. +echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list +sudo apt-get install apt-transport-https ca-certificates gnupg +curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - +sudo apt-get update && sudo apt-get install google-cloud-sdk + +gcloud init +gcloud compute config-ssh + +# Create elasticluster virtual environment. +cd ~ +virtualenv --python=python3 elasticluster +echo ". elasticluster/bin/activate" >> ~/.bashrc +source ~/.bashrc +pip3 install --upgrade 'pip>=9.0.0' +cd elasticluster/ + +# Install elasticluster. +git clone https://github.com/gc3-uzh-ch/elasticluster.git src +cd src +pip install -e . 
+ +##### Setup Elasticluster config +cd ~ +mkdir .elasticluster +cp ~/stats285.github.io/elasticluster_tutorial/sample_elasticluster_config ~/.elasticluster/config +sed -i "s//$GCE_CLIENT_ID/g" ~/.elasticluster/config +sed -i "s//$GCE_CLIENT_SECRET/g" ~/.elasticluster/config +sed -i "s//$GCE_PROJECT_ID/g" ~/.elasticluster/config +sed -i "s//$GCE_ZONE/g" ~/.elasticluster/config + +# Install perl package management prerequisites. +sudo apt install build-essential +sudo apt-get install libnet-ssleay-perl # required for cpan -i Net::SSLeay +sudo apt-get install libcrypt-ssleay-perl # required for cpan -i Net::SSLeay +sudo cpan install CPAN + +# Install ClusterJob prerequisites. +sudo cpan -i DateTime Time::Local Time::Piece +sudo cpan -i JSON JSON::XS JSON::PP +sudo cpan -i Data::Dumper Data::UUID +sudo cpan -i FindBin File::chdir File::Basename File::Spec +sudo cpan -i Net::SSLeay IO::Socket::INET IO::Socket::SSL +sudo cpan -i Getopt::Declare Term::ReadLine Digest::SHA +sudo cpan -i Moo HTTP::Thin HTTP::Request::Common URI + +# Install ClusterJob +cd ~ +git clone https://github.com/adonoho/clusterjob.git ~/CJ_install +echo "alias cj='perl ~/CJ_install/src/CJ.pl'" >> ~/.bashrc +source ~/.bashrc + +# Do CJ config and SSH ######################### +cp ~/stats285.github.io/elasticluster_tutorial/sample_cj_config ~/CJ_install/cj_config +sed -i "s//$CJID/g" ~/CJ_install/cj_config +sed -i "s//$CJKEY/g" ~/CJ_install/cj_config + +cp ~/stats285.github.io/elasticluster_tutorial/sample_cj_ssh_config ~/CJ_install/ssh_config +sed -i "s//$GCE_PROJECT_ID/g" ~/CJ_install/ssh_config +sed -i "s//$GCE_ZONE/g" ~/CJ_install/ssh_config +sed -i "s//$GCE_USERNAME/g" ~/CJ_install/ssh_config + +# Initialize CJ +cj init +cj who From 2e0d24aed452d031d2148d256de3a92fd39a42b0 Mon Sep 17 00:00:00 2001 From: motiwari Date: Fri, 4 Jun 2021 02:03:22 -0700 Subject: [PATCH 2/9] Nits, and adding information about OAuth consent in GCloud --- .../elasticlusterjob-tutorial.md | 20 +++++++++---------- 1 file 
changed, 10 insertions(+), 10 deletions(-) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index e4efa13..ab7f857 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -26,15 +26,15 @@ This tutorial is broken up into several steps: Any unique email address can get a [$300 credit toward Google Compute Engine time](https://cloud.google.com/free). Google provides a good overview of their system for technical computing [here](https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing). This part of the tutorial has been cribbed from the Google authored tutorial to run "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)". Basically, you are going to create an project and credentials using the Google Compute Engine dashboard. Those credentials will be used by Elasticluster to instantiate the cluster. Because the Stanford cluster, Sherlock, uses SLURM to configure the nodes, we will use SLURM on GCE too. We are going to follow the modernized instructions from the "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)" page. - Sign up for a [free Google Cloud account](https://cloud.google.com/free). - Start at [GCP Dashboard](https://console.cloud.google.com/) -- Create a project from "IAM & Admin" menu choose "Create a Project". - - This project name is typically a combination of two random words and a number, e.g. "`superb-garden-303018`". Take note of this project name. +- Select the 3 bars in the top left >> IAM & Admin >> Create a Project. Choose a project name and take note of it. - Make sure that billing is [configured for your account](https://cloud.google.com/billing/docs/how-to/modify-project). 
-- Enable Compute Engine API and Firebase API in "[APIs & Services](https://console.cloud.google.com/apis)" -- Enable Project Credentials in the [Credentials menu](https://console.cloud.google.com/apis/credentials) on the "[APIs & Services](https://console.cloud.google.com/apis)" dashboard. - - Select your Google Cloud project. +- In the search bar at the top, search for "Compute Engine API" and enable it. Similarly, enable "Cloud Storage for Firebase API" +- Navigate to the [APIs & Services](https://console.cloud.google.com/apis) dashboard and click "Credentials". + - Ensure your Google Cloud project is selected at the top. - Click Create credentials. - Click "OAuth client ID". - - In the "Create client ID" page, for Application type, select "`Desktop`". + - If you need to configure the OAuth Consent Screen, choose "External", any App Name, and your email for "User Support Email" and Developer Contact Information. + - In the "Create client ID" page, for Application type, select "`Desktop app`". - Take note of the "Client ID" and "Client Secret". These will be required later for Elasticluster to create the cluster. As these credentials control access to your account, treat them with care; your budget may be at risk. (You can get them again if you lose them. Do not share them with classmates.) - Example Client ID: "`308342824695-eimtr7e8bqo7lotqlumj5mfmta1co8o4.apps.googleusercontent.com`" - Example Client Secret: "`_IdXWkmrunCuSLmPhL0ouaeV`" @@ -63,12 +63,12 @@ First, we create a blank VM with no OS installed: Now, we install Ubuntu on the blank VM: - Download the Ubuntu 18.04.5 LTS (Bionic Beaver) Server install image from https://releases.ubuntu.com/18.04/ -- Right click the new VM and click "Settings" -- Click on "Storage", and under "Controller: IDE" click on "Adds optical drive." 
(the blue circle with the green plus on it) +- Right click the new VM in VirtualBox and click "Settings" +- Click on "Storage", and next to "Controller: IDE" click on "Adds optical drive." (the blue circle with the green plus on it) - Click "Add" again and choose the Ubuntu .iso downloaded above - Press "OK" and Start the VM -The VM should begin booting up from the Ubuntu installation image and begin prompting you for installation choices. If you need to navigate away from the VM, press one of your modifier keys (command on Mac) to release the mouse from the VM. +The VM should begin booting up from the Ubuntu installation image and prompt you for installation choices. If you need to navigate away from the VM, press one of your modifier keys (command on Mac) to release the mouse from the VM. When installing Ubuntu: - Choose your language of choice @@ -86,7 +86,7 @@ Afterwards, the Ubuntu installation should complete in 5-10 minutes. Then choose # Step 2.A: Miscellaneous Bugs on Ubuntu -- If you're running into issues with random characters (like newlines) being inserted in your shell session, run the command `setterm -repeat off` +- If you're running into issues with random characters (like newlines) being inserted in your shell session, start the VM and run the command `setterm -repeat off` - If you're having trouble running `apt`, run the following command in your VM to set your nameserver in `/etc/resolv.conf`(some ISP's forwarding rules don't support DNS on your VM): `echo "sudo sed -i 's/nameserver.*/nameserver 8.8.8.8/g' /etc/resolv.conf" >> ~/.bashrc && source ~/.bashrc` # Step 3: Enable SSH on your Ubuntu VM From 90206cfb09649538b69b87ead0ba57a095a938be Mon Sep 17 00:00:00 2001 From: motiwari Date: Fri, 4 Jun 2021 02:37:31 -0700 Subject: [PATCH 3/9] adding automatic yes to installation prompts, further nits --- elasticluster_tutorial/elasticlusterjob-tutorial.md | 7 ++++--- elasticluster_tutorial/setup.sh | 12 ++++++------ 2 files changed, 10 insertions(+), 
9 deletions(-) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index ab7f857..989a0a9 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -97,15 +97,13 @@ Ubuntu Virtualboxes have known issues with clipboard sharing between host and gu - In VirtualBox, go to your VM's Settings >> Network >> Adapter 1 >> Advanced >> Port Forwarding - Add a new port-forwarding rule with Name: SSH, Protocol: TCP, Host IP: 127.0.0.1, Host Port: 2222, Guest IP: blank, Guest Port: 22. -After starting the Ubuntu VM again, you should be able to ssh into it directly from your HOST machine (laptop) via `ssh YOUR_USERNAME@127.0.0.1 -p 2222` +After starting the Ubuntu VM again, you should be able to ssh into it directly from your HOST machine (laptop) via `ssh YOUR_UBUNTU_USERNAME@127.0.0.1 -p 2222` # Step 4: Install elasticluster and clusterjob SSH into your running VM from a terminal on your HOST (laptop), then run the following commands, replacing the necessary Google and Clusterjob variables that were determined when you signed up for them: ``` -git clone https://github.com/motiwari/stats285.github.io.git - echo "export GCE_USERNAME=" >> ~/.bashrc echo "export GCE_ZONE=" >> ~/.bashrc echo "export GCE_PROJECT_ID=" >> ~/.bashrc @@ -117,11 +115,14 @@ echo "export CJKEY=" >> ~/.bashrc source ~/.bashrc +git clone https://github.com/motiwari/stats285.github.io.git cd stats285.github.io/elasticluster_tutorial/ chmod +x setup.sh ./setup.sh ``` +The script `setup.sh` will require initial user interaction to pass a Google authentication challenge; everything afterwards is automatic. It will take some time to complete (~20 minutes). + # Step 5: Test elasticluster At this point, all the dependencies for elasticluster and clusterjob have been installed. 
To create a small memory cluster and establish communication to each node, run: diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh index 9e68a0c..fb93deb 100644 --- a/elasticluster_tutorial/setup.sh +++ b/elasticluster_tutorial/setup.sh @@ -1,12 +1,12 @@ sudo apt update sudo apt upgrade -y -sudo apt install gcc g++ git libc6-dev libffi-dev libssl-dev python3-dev virtualenv +sudo apt install gcc g++ git libc6-dev libffi-dev libssl-dev python3-dev virtualenv | yes # Initialize gcloud. This is the only part of the script that requires interaction. echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list sudo apt-get install apt-transport-https ca-certificates gnupg curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - -sudo apt-get update && sudo apt-get install google-cloud-sdk +sudo apt-get update && sudo apt-get install google-cloud-sdk | yes gcloud init gcloud compute config-ssh @@ -34,9 +34,9 @@ sed -i "s//$GCE_PROJECT_ID/g" ~/.elasticluster/config sed -i "s//$GCE_ZONE/g" ~/.elasticluster/config # Install perl package management prerequisites. -sudo apt install build-essential -sudo apt-get install libnet-ssleay-perl # required for cpan -i Net::SSLeay -sudo apt-get install libcrypt-ssleay-perl # required for cpan -i Net::SSLeay +sudo apt install build-essential | yes +sudo apt-get install libnet-ssleay-perl | yes # required for cpan -i Net::SSLeay +sudo apt-get install libcrypt-ssleay-perl | yes # required for cpan -i Net::SSLeay sudo cpan install CPAN # Install ClusterJob prerequisites. 
@@ -52,7 +52,7 @@ sudo cpan -i Moo HTTP::Thin HTTP::Request::Common URI cd ~ git clone https://github.com/adonoho/clusterjob.git ~/CJ_install echo "alias cj='perl ~/CJ_install/src/CJ.pl'" >> ~/.bashrc -source ~/.bashrc +cd ~ && source ~/.bashrc # Do CJ config and SSH ######################### cp ~/stats285.github.io/elasticluster_tutorial/sample_cj_config ~/CJ_install/cj_config From 12bade69cfc8ba5c55fbb9ba45f2c7ea8437ea49 Mon Sep 17 00:00:00 2001 From: motiwari Date: Sun, 6 Jun 2021 10:43:24 -0700 Subject: [PATCH 4/9] Changing CJ_KEY to newline, using -y instead of |yes, avoiding local paths --- elasticluster_tutorial/sample_cj_config | 3 ++- elasticluster_tutorial/setup.sh | 8 ++++---- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/elasticluster_tutorial/sample_cj_config b/elasticluster_tutorial/sample_cj_config index 48fbfa4..4b6be48 100644 --- a/elasticluster_tutorial/sample_cj_config +++ b/elasticluster_tutorial/sample_cj_config @@ -1,4 +1,5 @@ CJID -CJKEY +CJKEY + SYNC_TYPE manual SYNC_INTERVAL 300 diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh index fb93deb..8466964 100644 --- a/elasticluster_tutorial/setup.sh +++ b/elasticluster_tutorial/setup.sh @@ -1,12 +1,12 @@ -sudo apt update +sudo apt update -y sudo apt upgrade -y -sudo apt install gcc g++ git libc6-dev libffi-dev libssl-dev python3-dev virtualenv | yes +sudo apt install -y gcc g++ git libc6-dev libffi-dev libssl-dev python3-dev virtualenv # Initialize gcloud. This is the only part of the script that requires interaction. 
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list sudo apt-get install apt-transport-https ca-certificates gnupg curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - -sudo apt-get update && sudo apt-get install google-cloud-sdk | yes +sudo apt-get update -y && sudo apt-get install -y google-cloud-sdk gcloud init gcloud compute config-ssh @@ -14,7 +14,7 @@ gcloud compute config-ssh # Create elasticluster virtual environment. cd ~ virtualenv --python=python3 elasticluster -echo ". elasticluster/bin/activate" >> ~/.bashrc +echo "source ~/elasticluster/bin/activate" >> ~/.bashrc source ~/.bashrc pip3 install --upgrade 'pip>=9.0.0' cd elasticluster/ From f507a5928a1d3eadcfc20ed607c7261459aafb05 Mon Sep 17 00:00:00 2001 From: motiwari Date: Wed, 16 Jun 2021 14:37:56 -0700 Subject: [PATCH 5/9] Some nits, automatically providing -y to apt, improving sed --- elasticluster_tutorial/elasticlusterjob-tutorial.md | 2 +- elasticluster_tutorial/setup.sh | 11 ++++++----- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index 989a0a9..0e61742 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -126,7 +126,7 @@ The script `setup.sh` will require initial user interaction to pass a Google aut # Step 5: Test elasticluster At this point, all the dependencies for elasticluster and clusterjob have been installed. To create a small memory cluster and establish communication to each node, run: -``elasticluster start gce``` +```elasticluster start gce``` The `start` command provisions the nodes using Compute Engine and will take between 20-30 minutes. 
It configures the nodes by using the Ansible playbooks included in the Elasticluster source. Setup can take some time, depending on configuration. You will know when configuration is done when the output stops and you see the ending banner containing: "`Your cluster is ready!`" It is required practice that you update your `gcloud` keys after bringing up a new cluster using: ``` gcloud compute config-ssh diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh index 8466964..081ce09 100644 --- a/elasticluster_tutorial/setup.sh +++ b/elasticluster_tutorial/setup.sh @@ -4,7 +4,7 @@ sudo apt install -y gcc g++ git libc6-dev libffi-dev libssl-dev python3-dev virt # Initialize gcloud. This is the only part of the script that requires interaction. echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list -sudo apt-get install apt-transport-https ca-certificates gnupg +sudo apt-get install -y apt-transport-https ca-certificates gnupg curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - sudo apt-get update -y && sudo apt-get install -y google-cloud-sdk @@ -28,16 +28,17 @@ pip install -e . cd ~ mkdir .elasticluster cp ~/stats285.github.io/elasticluster_tutorial/sample_elasticluster_config ~/.elasticluster/config +sed -i "s//$GCE_USERNAME/g" ~/.elasticluster/config sed -i "s//$GCE_CLIENT_ID/g" ~/.elasticluster/config sed -i "s//$GCE_CLIENT_SECRET/g" ~/.elasticluster/config sed -i "s//$GCE_PROJECT_ID/g" ~/.elasticluster/config sed -i "s//$GCE_ZONE/g" ~/.elasticluster/config # Install perl package management prerequisites. 
-sudo apt install build-essential | yes -sudo apt-get install libnet-ssleay-perl | yes # required for cpan -i Net::SSLeay -sudo apt-get install libcrypt-ssleay-perl | yes # required for cpan -i Net::SSLeay -sudo cpan install CPAN +sudo apt install -y build-essential +sudo apt-get install -y libnet-ssleay-perl # required for cpan -i Net::SSLeay +sudo apt-get install -y libcrypt-ssleay-perl # required for cpan -i Net::SSLeay +sudo cpan install -y CPAN # Install ClusterJob prerequisites. sudo cpan -i DateTime Time::Local Time::Piece From d17728b343d61bf1822e2dc3f3d7d0ee81c685d6 Mon Sep 17 00:00:00 2001 From: motiwari Date: Wed, 16 Jun 2021 14:47:47 -0700 Subject: [PATCH 6/9] Making updates for Assignment 2 --- .../elasticlusterjob-tutorial.md | 2 +- elasticluster_tutorial/sample_cj_ssh_config | 15 +++++++++++++++ .../sample_elasticluster_config | 19 +++++++++++++++++++ elasticluster_tutorial/setup.sh | 5 +++++ 4 files changed, 40 insertions(+), 1 deletion(-) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index 0e61742..c9caa6a 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -23,7 +23,7 @@ This tutorial is broken up into several steps: # Step 0: Sign up for Google Cloud and ClusterJob ## Step 0.a: Set up Google Cloud credentials. -Any unique email address can get a [$300 credit toward Google Compute Engine time](https://cloud.google.com/free). Google provides a good overview of their system for technical computing [here](https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing). This part of the tutorial has been cribbed from the Google authored tutorial to run "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)". Basically, you are going to create an project and credentials using the Google Compute Engine dashboard. 
Those credentials will be used by Elasticluster to instantiate the cluster. Because the Stanford cluster, Sherlock, uses SLURM to configure the nodes, we will use SLURM on GCE too. We are going to follow the modernized instructions from the "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)" page. +Any unique email address can get a [$300 credit toward Google Compute Engine time](https://cloud.google.com/free). We highly recommend signing up with your stanford.edu email address to enable easy GPU quota increases in Assignment 2. Google provides a good overview of their system for technical computing [here](https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing). This part of the tutorial has been cribbed from the Google authored tutorial to run "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)". Basically, you are going to create an project and credentials using the Google Compute Engine dashboard. Those credentials will be used by Elasticluster to instantiate the cluster. Because the Stanford cluster, Sherlock, uses SLURM to configure the nodes, we will use SLURM on GCE too. We are going to follow the modernized instructions from the "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)" page. - Sign up for a [free Google Cloud account](https://cloud.google.com/free). - Start at [GCP Dashboard](https://console.cloud.google.com/) - Select the 3 bars in the top left >> IAM & Admin >> Create a Project. Choose a project name and take note of it. 
diff --git a/elasticluster_tutorial/sample_cj_ssh_config b/elasticluster_tutorial/sample_cj_ssh_config index bf945d7..5ed5119 100644 --- a/elasticluster_tutorial/sample_cj_ssh_config +++ b/elasticluster_tutorial/sample_cj_ssh_config @@ -21,3 +21,18 @@ Python python/3.8.8 Pythonlib IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge Alloc --time UNLIMITED [gce-high-mem] + +###### For Assignment 2 +[gce-gpu] +host gce-gpu-frontend001.. +user +Bqs SLURM +Repo /home//CJRepo_Remote +MAT matlab/R2019a +MATlib ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi +Python python/3.8.8 +Pythonlib pandas:requests:pytorch:torchvision:matplotlib +R R +Rlib ggplot2 +Alloc --time UNLIMITED +[gce-gpu] diff --git a/elasticluster_tutorial/sample_elasticluster_config b/elasticluster_tutorial/sample_elasticluster_config index ed06d32..64c2688 100644 --- a/elasticluster_tutorial/sample_elasticluster_config +++ b/elasticluster_tutorial/sample_elasticluster_config @@ -63,3 +63,22 @@ boot_disk_size=100 [cluster/gce-high-mem/compute] flavor=n2-highmem-4 boot_disk_size=50 + + +############### For Assignment 2 +[cluster/gce-gpu] +cloud=google +login=google +setup=ansible-slurm +security_group=default +image_id=ubuntu-1804-bionic-v20210315a +flavor=n1-standard-2 +frontend_nodes=1 +compute_nodes=2 +ssh_to=frontend +boot_disk_type=pd-standard +boot_disk_size=100 + +[cluster/gce-gpu/compute] +flavor=n1-highmem-4 +boot_disk_size=50 diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh index 081ce09..3dbe528 100644 --- a/elasticluster_tutorial/setup.sh +++ b/elasticluster_tutorial/setup.sh @@ -55,6 +55,11 @@ git clone https://github.com/adonoho/clusterjob.git ~/CJ_install echo "alias cj='perl ~/CJ_install/src/CJ.pl'" >> ~/.bashrc cd ~ && source ~/.bashrc +# Download Alpha for Assignment 2 +cd ~ +git clone https://github.com/stats285/Alpha +cd ~ + # Do CJ config and SSH ######################### cp 
~/stats285.github.io/elasticluster_tutorial/sample_cj_config ~/CJ_install/cj_config sed -i "s//$CJID/g" ~/CJ_install/cj_config From 03e932df04b53dcc02a11119f85244d2aed9649f Mon Sep 17 00:00:00 2001 From: motiwari Date: Wed, 16 Jun 2021 14:54:06 -0700 Subject: [PATCH 7/9] Removing recommendation to use stanford.edu email address because it causes lots of permissions problems --- elasticluster_tutorial/elasticlusterjob-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index c9caa6a..0e61742 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -23,7 +23,7 @@ This tutorial is broken up into several steps: # Step 0: Sign up for Google Cloud and ClusterJob ## Step 0.a: Set up Google Cloud credentials. -Any unique email address can get a [$300 credit toward Google Compute Engine time](https://cloud.google.com/free). We highly recommend signing up with your stanford.edu email address to enable easy GPU quota increases in Assignment 2. Google provides a good overview of their system for technical computing [here](https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing). This part of the tutorial has been cribbed from the Google authored tutorial to run "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)". Basically, you are going to create an project and credentials using the Google Compute Engine dashboard. Those credentials will be used by Elasticluster to instantiate the cluster. Because the Stanford cluster, Sherlock, uses SLURM to configure the nodes, we will use SLURM on GCE too. We are going to follow the modernized instructions from the "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)" page. 
+Any unique email address can get a [$300 credit toward Google Compute Engine time](https://cloud.google.com/free). Google provides a good overview of their system for technical computing [here](https://cloud.google.com/solutions/using-clusters-for-large-scale-technical-computing). This part of the tutorial has been cribbed from the Google authored tutorial to run "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)". Basically, you are going to create an project and credentials using the Google Compute Engine dashboard. Those credentials will be used by Elasticluster to instantiate the cluster. Because the Stanford cluster, Sherlock, uses SLURM to configure the nodes, we will use SLURM on GCE too. We are going to follow the modernized instructions from the "[R at Scale](https://cloud.google.com/solutions/running-r-at-scale)" page. - Sign up for a [free Google Cloud account](https://cloud.google.com/free). - Start at [GCP Dashboard](https://console.cloud.google.com/) - Select the 3 bars in the top left >> IAM & Admin >> Create a Project. Choose a project name and take note of it. 
From 9a4daa51283a13499b835eac3947f61dab737da5 Mon Sep 17 00:00:00 2001 From: motiwari Date: Wed, 16 Jun 2021 15:01:22 -0700 Subject: [PATCH 8/9] Nits on setup.sh --- elasticluster_tutorial/elasticlusterjob-tutorial.md | 2 ++ elasticluster_tutorial/setup.sh | 2 ++ 2 files changed, 4 insertions(+) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index 0e61742..28d454f 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -72,6 +72,7 @@ The VM should begin booting up from the Ubuntu installation image and prompt you When installing Ubuntu: - Choose your language of choice +- Do not update installer (if prompted) - Choose your keyboard configuration - Choose the default network connections - Leave proxy blank @@ -81,6 +82,7 @@ When installing Ubuntu: - Press continue - Set up your personal name, your server's name (machine name), your username on that machine, and the password for your username - Check "Install OpenSSH server", do not import SSH identity +- Do not install any Featured Snaps (if prompted) Afterwards, the Ubuntu installation should complete in 5-10 minutes. Then choose "Reboot" and press enter. diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh index 3dbe528..340ee75 100644 --- a/elasticluster_tutorial/setup.sh +++ b/elasticluster_tutorial/setup.sh @@ -73,3 +73,5 @@ sed -i "s//$GCE_USERNAME/g" ~/CJ_install/ssh_conf # Initialize CJ cj init cj who + +echo "Setup complete! 
Don't forget to run `elasticluster start ` and `gcloud compute config-ssh`" From 9469f8a01ae4dc25ddf6e87875ce8317e3f9c2aa Mon Sep 17 00:00:00 2001 From: motiwari Date: Tue, 22 Jun 2021 16:20:37 -0400 Subject: [PATCH 9/9] Including troubleshooting elasticluster section, setting ansible_python_interpreter to avoid server errors, including mysql packages, and some nits --- .../elasticlusterjob-tutorial.md | 21 +++++++++++++++++-- elasticluster_tutorial/sample_cj_config | 2 +- elasticluster_tutorial/sample_cj_ssh_config | 6 +++--- .../sample_elasticluster_config | 4 ++++ elasticluster_tutorial/setup.sh | 2 +- 5 files changed, 28 insertions(+), 7 deletions(-) diff --git a/elasticluster_tutorial/elasticlusterjob-tutorial.md b/elasticluster_tutorial/elasticlusterjob-tutorial.md index 28d454f..82b6aa4 100644 --- a/elasticluster_tutorial/elasticlusterjob-tutorial.md +++ b/elasticluster_tutorial/elasticlusterjob-tutorial.md @@ -99,7 +99,7 @@ Ubuntu Virtualboxes have known issues with clipboard sharing between host and gu - In VirtualBox, go to your VM's Settings >> Network >> Adapter 1 >> Advanced >> Port Forwarding - Add a new port-forwarding rule with Name: SSH, Protocol: TCP, Host IP: 127.0.0.1, Host Port: 2222, Guest IP: blank, Guest Port: 22. -After starting the Ubuntu VM again, you should be able to ssh into it directly from your HOST machine (laptop) via `ssh YOUR_UBUNTU_USERNAME@127.0.0.1 -p 2222` +After starting the Ubuntu VM again, you should be able to ssh into it directly from your HOST machine (laptop) via `ssh YOUR_UBUNTU_USERNAME@127.0.0.1 -p 2222`. # Step 4: Install elasticluster and clusterjob @@ -129,7 +129,9 @@ The script `setup.sh` will require initial user interaction to pass a Google aut At this point, all the dependencies for elasticluster and clusterjob have been installed.
To create a small memory cluster and establish communication to each node, run: ```elasticluster start gce``` -The `start` command provisions the nodes using Compute Engine and will take between 20-30 minutes. It configures the nodes by using the Ansible playbooks included in the Elasticluster source. Setup can take some time, depending on configuration. You will know when configuration is done when the output stops and you see the ending banner containing: "`Your cluster is ready!`" It is required practice that you update your `gcloud` keys after bringing up a new cluster using: +The `start` command provisions the nodes using Compute Engine and will take between 20-30 minutes. It configures the nodes by using the Ansible playbooks included in the Elasticluster source. Setup can take some time, depending on configuration. You will know when configuration is done when the output stops and you see the ending banner containing: "`Your cluster is ready!`" + +It is required practice that you update your `gcloud` keys after bringing up a new cluster using: ``` gcloud compute config-ssh ``` @@ -153,6 +155,21 @@ One destroys a cluster, equally unsurprisingly, with a "`stop`" command: elasticluster stop gce ``` +## Step 5.a: Troubleshooting +If the command `elasticluster start gce` produces any errors, ssh into your frontend node and run the following commands: + +``` +# Install slurm-drmaa +sudo add-apt-repository ppa:natefoo/slurm-drmaa +sudo apt-get update +sudo apt-get install slurm-drmaa-dev + +# Install ansible + +``` + +Then exit the frontend node and run `elasticluster setup gce`. + # Step 6: Test clusterjob Now that we have a compute cluster, it is time to perform a calculation using it using ClusterJob. Like all research software, ClusterJob has [basic documentation](https://clusterjob.org/documentation/). This is augmented by a draft chapter of a [Data Science book by Hatef Monajemi](https://monajemi.github.io/datascience/pages/elasticluster-clusterjob-model). 
This tutorial is a distillation of these other works in the very pragmatic context of running a simple example for this class. Most users borrow an existing set of configuration files and call it a day. As we expand a cluster's hardware to include GPUs, the configuration files will evolve. Those extensions will be discussed in class. diff --git a/elasticluster_tutorial/sample_cj_config b/elasticluster_tutorial/sample_cj_config index 4b6be48..e1bcefd 100644 --- a/elasticluster_tutorial/sample_cj_config +++ b/elasticluster_tutorial/sample_cj_config @@ -1,5 +1,5 @@ CJID -CJKEY +CJKEY SYNC_TYPE manual SYNC_INTERVAL 300 diff --git a/elasticluster_tutorial/sample_cj_ssh_config b/elasticluster_tutorial/sample_cj_ssh_config index 5ed5119..ceca2cf 100644 --- a/elasticluster_tutorial/sample_cj_ssh_config +++ b/elasticluster_tutorial/sample_cj_ssh_config @@ -6,7 +6,7 @@ Repo /home//CJRepo_Remote MAT matlab/R2019a MATlib ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi Python python/3.8.8 -Pythonlib IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge +Pythonlib python3-pymysql:PyMySQL:IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge Alloc --time UNLIMITED [gce] @@ -18,7 +18,7 @@ Repo /home//CJRepo_Remote MAT matlab/R2019a MATlib ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi Python python/3.8.8 -Pythonlib IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge +Pythonlib python3-pymysql:PyMySQL:IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge Alloc --time UNLIMITED [gce-high-mem] @@ -31,7 +31,7 @@ Repo /home//CJRepo_Remote MAT matlab/R2019a MATlib ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi Python python/3.8.8 -Pythonlib pandas:requests:pytorch:torchvision:matplotlib +Pythonlib python3-pymysql:PyMySQL:pandas:requests:pytorch:torchvision:matplotlib R R Rlib ggplot2 Alloc --time UNLIMITED diff --git 
a/elasticluster_tutorial/sample_elasticluster_config b/elasticluster_tutorial/sample_elasticluster_config index 64c2688..1a45684 100644 --- a/elasticluster_tutorial/sample_elasticluster_config +++ b/elasticluster_tutorial/sample_elasticluster_config @@ -29,6 +29,10 @@ global_var_slurm_taskplugin=task/cgroup global_var_slurm_proctracktype=proctrack/cgroup global_var_slurm_jobacctgathertype=jobacct_gather/cgroup +# Added by Mo to avoid "Shared connection to xxx.xxx.xxx.xxx closed" errors +# See https://www.tecmint.com/fix-shared-connection-to-x-x-xx-closed-ansible-error/ +ansible_python_interpreter=/usr/bin/python3 + [cluster/gce] cloud=google login=google diff --git a/elasticluster_tutorial/setup.sh b/elasticluster_tutorial/setup.sh index 340ee75..d1e7a8b 100644 --- a/elasticluster_tutorial/setup.sh +++ b/elasticluster_tutorial/setup.sh @@ -74,4 +74,4 @@ sed -i "s//$GCE_USERNAME/g" ~/CJ_install/ssh_conf cj init cj who -echo "Setup complete! Don't forget to run `elasticluster start ` and `gcloud compute config-ssh`" +echo "Setup complete! Don't forget to run elasticluster start cluster_name and gcloud compute config-ssh"
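
Note on the final `setup.sh` hunk: in the earlier version of the message, the command names sat inside backticks within a double-quoted string. The shell treats that as command substitution, so `echo` would have tried to run those commands rather than print them. The sketch below illustrates the difference under that reading; `fake_tool` is a hypothetical stand-in function (not part of the patch), used so that nothing real is executed.

```shell
# Hypothetical stand-in for a real CLI, so the sketch has no side effects.
fake_tool() { printf 'SIDE-EFFECT'; }

# Double quotes: the backticks are command substitution, so fake_tool runs
# and its output is spliced into the string.
expanded="Run `fake_tool` next"

# Single quotes: the backticks are kept literally and nothing is executed.
literal='Run `fake_tool` next'

echo "$expanded"
echo "$literal"
```

Dropping the backticks, as the last patch does, or switching the message to single quotes would both avoid the accidental execution.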