Machine Learning Workflows

Machine learning often requires long-running scripts, and it's not convenient to run these on your laptop. At some point, everyone will want to switch to a setup where you write code on your local machine, but run experiments on a remote computer. This document is intended to help you set this up. Overall, this is a laborious process, and you will need to develop a lot of skills around managing computers. There is no way around this. You will have to google around loads to build these skills independently.

This guide is not complete since nobody has time to write this stuff in a huge amount of detail, and because details change all the time. However, hopefully it will point you in the right direction. It is assumed that you know how to use a terminal, some shell scripting, and git. MIT's Missing Semester of Your CS Education course is a good place to start to learn more about this.

Some guides specifically for Imperial students:

Overall Workflow

  • Write code and debug on small test cases on your local machine.
  • Copy/sync code to your remote machine.
    • The low-tech solution is to use the tool scp (guide).
    • The better solution is to have your coding IDE automatically sync your code. Visual Studio Code and PyCharm can both do this once set up.
  • ssh (guide) into an Imperial machine, and run your script.
    • Make sure you don't run directly in the login shell. If you sever the ssh connection from your laptop, it kills the process! Running inside tmux solves this (guide).
  • Your script should write results to a file, while it's running.
    • Scripts can be killed (e.g. if somebody resets the computer it's running on).
    • You should make sure that your code writes all the results you need to a file, ideally while it's running.
    • Once in a while, you can copy the results back to your local machine, and have a separate script make plots. This is a great way to check up on experiments, even while your main script is still running!
    • This also makes your scripts restartable: if your script does get killed, you can reload intermediate results and pick up where you left off.
    • You should write your experiment outputs to /vol/bitbucket/ (guide).
    • The network filesystem (NFS) is super useful! It's accessible from any computer!
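The "write results while running, and make it restartable" advice above can be sketched in shell; `results.csv`, the parameter sweep, and the hard-coded `accuracy` are placeholders standing in for your real experiment:

```shell
#!/bin/sh
# Sketch of incremental, restartable result-writing.
RESULTS=results.csv

# Write the header only if the file doesn't exist yet.
[ -f "$RESULTS" ] || echo "lr,accuracy" > "$RESULTS"

for lr in 0.1 0.01 0.001; do
    # Restartability: skip settings that already have a result line.
    grep -q "^$lr," "$RESULTS" && continue
    # Stand-in for the real experiment, e.g. accuracy=$(run_experiment --lr "$lr").
    accuracy=0.9
    # Append each result immediately, so a killed run loses nothing finished.
    echo "$lr,$accuracy" >> "$RESULTS"
done
```

Because each result is appended as soon as it exists, you can copy the partial `results.csv` home and plot it while the sweep is still running, and re-running the script after a kill skips finished settings.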

DoC Machines

DoC has lots of machines that you can log into. These are located in the main computer lab. It may be a good idea to go to the computer lab physically, and sit next to the machine that you log into through your laptop. This way, you can see how your remote commands affect the machine while you're also logged into it physically!

See the list of DoC lab workstations.

  • Most machines have several cores. This can be super useful. Remember, lots of CPUs can do a lot in parallel!
  • Some machines have GPUs. I think you can ask CSG to have priority or even sole access to a GPU machine, if you really need it for your project.
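As a tiny illustration of putting several cores to work, `xargs -P` fans independent jobs out over parallel processes; the toy job list and the worker count of 4 here are arbitrary:

```shell
# Run 8 toy jobs, at most 4 at a time; in a real sweep, each job
# would be one experiment setting instead of an echo.
seq 1 8 | xargs -n 1 -P 4 sh -c 'echo "finished job $1"' sh > joblog.txt
```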

Parallel Experiments

If you need to run many similar experiments in parallel (e.g. grid-search over different parameter values), you can consider using cluster functionality. CSG provides a great guide, and seems to offer two interfaces:

  • Condor for CPU-only tasks. This can give you access to hundreds of cores!
  • Slurm for GPU-enabled tasks.
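A Slurm job script is mostly `#SBATCH` directives followed by the commands to run. A minimal sketch might look like the following; the option values, paths, and file names are illustrative guesses, so check CSG's guide for the DoC-specific details:

```shell
#!/bin/bash
#SBATCH --job-name=my-experiment   # name shown in the queue
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --output=slurm-%j.out      # %j is replaced by the job ID
# Submit with: sbatch job.sh
# Then activate your Python environment and run the experiment, e.g.:
#   source /vol/bitbucket/${USER}/myenv/bin/activate
#   python train.py
```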

I find jug a great tool for running many similar experiments in parallel. It works nicely with cluster managers like Condor/Slurm, because it only needs a shared network filesystem (NFS) to coordinate running jobs.
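jug's coordination trick can be imitated in plain shell, which also shows why a shared filesystem is all it needs: workers claim tasks by atomically creating a directory, and `mkdir` succeeds for exactly one of them. The directory and task names below are made up for illustration:

```shell
#!/bin/sh
# Every worker runs this same script. mkdir is atomic, so each task
# is claimed by exactly one worker, even across a shared filesystem.
WORKDIR=./sweep    # in practice this would live under /vol/bitbucket
mkdir -p "$WORKDIR"

for task in a b c d; do
    # Try to claim the task; if another worker got there first, move on.
    mkdir "$WORKDIR/$task.lock" 2>/dev/null || continue
    # Stand-in for the real experiment for this task.
    echo "result for $task" > "$WORKDIR/$task.out"
done
```

Running the same script from several machines against the same shared directory divides the tasks between them without any extra coordination service.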

If you need more resources, you could even look at the Research Computing Service, but an academic may need to grant you access to this.

Please contribute more.

Coding IDE & Remote Sync

How to set up VS Code or PyCharm to sync your code to a remote machine. Please contribute.

Python Environment

It's usually a good idea to use Anaconda to manage your Python environment. It's super easy to set up, and it lives in a single folder: if you mess up your environment, you can just delete it and start from a clean slate. I installed my Anaconda setup in /vol/bitbucket, so my Python setup is uniform across all networked machines.
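For example, keeping the environment on the network filesystem might look like this; the path, environment name, and Python version are illustrative, not exact DoC instructions:

```shell
# Create the environment under /vol/bitbucket so every lab machine
# sees the same Python setup over NFS.
conda create --prefix /vol/bitbucket/${USER}/envs/myproject python=3.11
conda activate /vol/bitbucket/${USER}/envs/myproject
# A broken environment is just a folder, so recovery is one command:
#   rm -rf /vol/bitbucket/${USER}/envs/myproject
```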
