Parallelising
NMMSO supports parallel execution using Python's process-based parallelism model or MPI. Process-based parallelism is suitable if you wish to run pynmmso on a multicore machine (for example, across all the cores of a personal computer or laptop). MPI is ideal when you wish to run pynmmso across multiple nodes in a compute cluster.
NMMSO's parallelism is based on the assumption that the fitness function evaluations dominate the runtime of the algorithm. This is probably not the case for simple mathematical functions of the type used in the examples in this documentation but is very likely to be true for more complex scenarios such as when the fitness function requires executing one or more simulations and comparing the simulations' outputs with experimental data.
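For instance, a fitness function that runs an external simulation and scores its output against experimental data will typically dwarf the rest of the algorithm. The sketch below is purely illustrative: the `my_simulator` executable, its command-line options and the reference data file are hypothetical; only the `fitness`/`get_bounds` interface matches the problem classes used elsewhere in this documentation.

```python
import subprocess

import numpy as np


class SimulationProblem:
    """Hypothetical problem whose fitness requires running an external simulation."""

    def __init__(self, reference_file):
        # Experimental data that each simulation output is compared against.
        self.reference = np.loadtxt(reference_file)

    def fitness(self, params):
        # Running the (hypothetical) simulator dominates the runtime,
        # which is the situation in which parallelism pays off.
        completed = subprocess.run(
            ["my_simulator", "--a", str(params[0]), "--b", str(params[1])],
            capture_output=True, text=True, check=True)
        simulated = np.array([float(v) for v in completed.stdout.split()])
        # Higher fitness means better agreement with the experimental data.
        return -float(np.sum((simulated - self.reference) ** 2))

    def get_bounds(self):
        return [0.0, 0.0], [1.0, 1.0]
```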
When either of the parallel approaches is used, NMMSO executes some of the fitness function evaluations in parallel. There are two points in the algorithm where parallel fitness function evaluations are possible:
- When incrementing the swarms, where there will be a fitness function evaluation for each swarm (up to a configurable maximum number of swarms to evolve; see the `max_evol` parameter of `Nmmso`).
- When merging swarms, where the mid-point between swarms will be evaluated.
Of these two points, the first is likely to be where the algorithm can make the greatest use of parallelism. The more swarms the algorithm maintains, the greater the opportunity for parallelism. The number of swarms the algorithm maintains is unbounded but will closely correspond to the number of optima in the problem.
Parallelism uses worker task processes that are given parameter space values, evaluate the fitness function, and return the fitness value. The controller process runs the NMMSO algorithm and hands out the fitness function evaluations to the worker tasks.
It is possible to use the `ParallelPredictorListener` class to obtain an estimate of the minimum runtime (as a percentage of the sequential runtime) achievable using parallelism with different numbers of workers. This minimum runtime assumes that fitness evaluations totally dominate the execution time and that the overhead of using parallelism is negligible. These are often optimistic assumptions.
The following code uses a `ParallelPredictorListener` to estimate the possible advantages of parallelism for a 2D example problem:
```python
from pynmmso import Nmmso
from pynmmso.listeners import ParallelPredictorListener


class My2DProblem:
    @staticmethod
    def fitness(params):
        x = params[0]
        y = params[1]
        return -x**4 + x**3 + 3 * x**2 - y**4 + y**3 + 3 * y**2

    @staticmethod
    def get_bounds():
        return [-2, -2], [3, 3]


def main():
    number_of_fitness_evaluations = 5000

    nmmso = Nmmso(My2DProblem())
    parallel_predictor = ParallelPredictorListener()
    nmmso.add_listener(parallel_predictor)
    nmmso.run(number_of_fitness_evaluations)

    for mode_result in nmmso.get_result():
        print("Mode at {} has value {}".format(mode_result.location, mode_result.value))

    parallel_predictor.print_summary()


if __name__ == "__main__":
    main()
```

An example of the output produced is shown below. Note that the output may vary for each run.
```
Mode at [-0.90627021 1.65566419] has value 6.292967646772195
Mode at [1.65586885 1.65586886] has value 10.495819650100618
Total number of fitness evaluations: 5002
Maximum number of workers: 5
2 workers, effective evaluations 3422, minimum runtime as percentage of sequential runtime: 68.41%
4 workers, effective evaluations 2822, minimum runtime as percentage of sequential runtime: 56.42%
6 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
8 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
10 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
12 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
16 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
32 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
64 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
128 workers, effective evaluations 2806, minimum runtime as percentage of sequential runtime: 56.10%
```
This output shows that the total number of fitness evaluations is 5,002. The maximum number of workers that would result in any performance gain is 5. Introducing two workers would reduce the run to 3,422 effective evaluations. Fitness evaluations that occur simultaneously on different workers count as a single effective evaluation; for example, if 8 fitness evaluations could be executed in parallel over 2 workers, they would count as only 4 effective evaluations.
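To make that counting rule concrete, the snippet below applies it to a handful of made-up batch sizes (the number of evaluations the algorithm could issue simultaneously at each step); the real batch sizes are gathered by the listener during the run, so treat this only as an illustration of the arithmetic.

```python
import math

# Made-up batch sizes: evaluations that could be issued simultaneously at each step.
batch_sizes = [1, 6, 4, 1, 8, 3]

sequential = sum(batch_sizes)

for workers in (2, 4, 6):
    # A batch of b evaluations shared across w workers takes ceil(b / w) rounds,
    # so it counts as ceil(b / w) effective evaluations.
    effective = sum(math.ceil(b / workers) for b in batch_sizes)
    print("{} workers: {} effective evaluations, {:.2f}% of sequential".format(
        workers, effective, 100 * effective / sequential))
```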
With two workers the minimum runtime is 68.41% of the sequential runtime. With 4 workers this drops a little to 56.42%, but there is little to be gained by increasing to 6 workers or beyond.
Note that in this particular case the fitness function is quick to calculate, so in practice adding 2 workers is unlikely to result in any significant performance gain. In cases where the fitness function requires a lot of computation, however, parallelism may offer some useful advantages.
To use Python's process-based parallelism to implement the workers you must pass an instance of `MultiprocessorFitnessCaller` when creating `Nmmso`. The instance of `MultiprocessorFitnessCaller` is created with an argument that specifies the number of workers to create. In the example below, we use the `with` construct to define an instance of the `MultiprocessorFitnessCaller` class. (If you do not use the `with` construct, you must call `.finish()` on your instance of `MultiprocessorFitnessCaller`.)
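If you prefer not to use the `with` construct, the equivalent pattern is to call `.finish()` explicitly once the run is complete, typically from a `try`/`finally` block so the worker processes are always shut down. A minimal sketch of that pattern, reusing the same simple problem as the example that follows:

```python
from pynmmso import Nmmso
from pynmmso import MultiprocessorFitnessCaller


class MyProblem:
    @staticmethod
    def fitness(params):
        x = params[0]
        return -x**4 + x**3 + 3 * x**2

    @staticmethod
    def get_bounds():
        return [-2], [2]


def main():
    fitness_caller = MultiprocessorFitnessCaller(3)
    try:
        nmmso = Nmmso(MyProblem(), fitness_caller=fitness_caller)
        for mode_result in nmmso.run(1000):
            print("Mode at {} has value {}".format(mode_result.location, mode_result.value))
    finally:
        # Without the with construct, finish() must be called so that the
        # worker processes are shut down.
        fitness_caller.finish()


if __name__ == "__main__":
    main()
```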
The following code runs NMMSO using 3 worker processes:
```python
from pynmmso import Nmmso
from pynmmso import MultiprocessorFitnessCaller


class MyProblem:
    @staticmethod
    def fitness(params):
        x = params[0]
        return -x**4 + x**3 + 3 * x**2

    @staticmethod
    def get_bounds():
        return [-2], [2]


def main():
    number_of_fitness_evaluations = 1000
    num_workers = 3

    with MultiprocessorFitnessCaller(num_workers) as my_multi_processor_fitness_caller:
        nmmso = Nmmso(MyProblem(), fitness_caller=my_multi_processor_fitness_caller)
        my_result = nmmso.run(number_of_fitness_evaluations)

        for mode_result in my_result:
            print("Mode at {} has value {}".format(mode_result.location, mode_result.value))


if __name__ == "__main__":
    main()
```

Note that one of the cores is used for the controlling code, so it is best to set the number of workers to be one less than the number of cores on your machine.
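A common way to follow this advice is to derive the number of workers from the core count at runtime rather than hard-coding it. A minimal sketch of that idea is shown below; note that `os.cpu_count()` reports logical cores, so on machines with hyper-threading you may prefer to base the calculation on the physical core count instead.

```python
import os

from pynmmso import MultiprocessorFitnessCaller

# Leave one core free for the controlling process, but never use fewer
# than one worker (os.cpu_count() may return None on some platforms).
num_workers = max(1, (os.cpu_count() or 2) - 1)

with MultiprocessorFitnessCaller(num_workers) as fitness_caller:
    # Create and run Nmmso here, exactly as in the example above,
    # passing fitness_caller=fitness_caller.
    ...
```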
If you want to run this on an HPC system with a queuing system, such as PBS, you will need to write an appropriate job submission script. This can be very simple, such as:
```bash
#!/bin/bash --login
#PBS -N pynmmso
#PBS -l select=1:ncpus=4
#PBS -l place=pack
#PBS -l walltime=01:00:00
#PBS -A <job code>
cd $PBS_O_WORKDIR
python my_pynmmso.py &> stdout.txt
```

This script will run `my_pynmmso.py` on 4 cores of a single node, for a maximum of one hour (the `-A` flag may not be necessary, depending on the cluster you are using). We recommend that you set the number of cores equal to the `num_workers` value you give to `MultiprocessorFitnessCaller`. For simple multiprocessing, where all jobs are being done on a single node, `select` must always be equal to 1.
When using process-based parallelism the instance of the problem class will be sent to the worker processes. To send the instance to the workers it is serialised using Python's pickle functionality. It is up to you to ensure that your problem class can be pickled.
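A quick way to check this before launching a long run is to round-trip an instance of your problem class through pickle. The sketch below uses a hypothetical problem class that carries a NumPy array, which pickles fine; attributes such as open file handles, database connections or lambdas would cause the check to fail.

```python
import pickle

import numpy as np


class MyDataProblem:
    """Hypothetical problem class that carries data needed by the fitness function."""

    def __init__(self):
        self.data = np.linspace(0.0, 1.0, 100)

    def fitness(self, params):
        x = params[0]
        return -float(np.sum((self.data - x) ** 2))

    def get_bounds(self):
        return [-2], [2]


problem = MyDataProblem()

# If this round trip fails, MultiprocessorFitnessCaller will not be able to
# send the problem instance to its worker processes.
restored = pickle.loads(pickle.dumps(problem))
print(restored.fitness([0.5]))
```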
The `MultiprocessorFitnessCaller` class uses the Python `multiprocess` package rather than the standard `multiprocessing` module. The `multiprocess` package includes improvements over the standard module that, among other things, provide better integration with Jupyter Notebooks on Windows platforms.
To use MPI you must first ensure that MPI is installed on your machine and you must also install mpi4py.
To use MPI-based parallelism to implement the workers you must pass an instance of `MpiFitnessCaller` when creating `Nmmso`. The instance of `MpiFitnessCaller` is created with an argument that is the MPI comm object. In the example below, we create a single instance of the `MpiFitnessCaller` class and use the `with` construct to manage it on the rank 0 process. (If you do not use the `with` construct, you must call `.finish()` on your instance of `MpiFitnessCaller`.)
The `Nmmso` object must be created and used within the rank 0 process, and the `MpiFitnessCaller`'s `run_worker` method must be called for all non-zero rank processes.
The following code runs NMMSO using MPI:
```python
from mpi4py import MPI
from pynmmso import Nmmso
from pynmmso.mpi_fitness_caller import MpiFitnessCaller


class MyProblem:
    @staticmethod
    def fitness(params):
        x = params[0]
        return -x**4 + x**3 + 3 * x**2

    @staticmethod
    def get_bounds():
        return [-2], [3]


def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    my_mpi_fitness_caller = MpiFitnessCaller(comm)

    if 0 == rank:
        number_of_fitness_evaluations = 1000

        with my_mpi_fitness_caller:
            nmmso = Nmmso(MyProblem(), fitness_caller=my_mpi_fitness_caller)
            my_result = nmmso.run(number_of_fitness_evaluations)

            for mode_result in my_result:
                print("Mode at {} has value {}".format(mode_result.location, mode_result.value))
    else:
        my_mpi_fitness_caller.run_worker()


if __name__ == "__main__":
    main()
```

To execute the program use a suitable MPI command such as:
```
mpirun -n 8 python my_nmmso_code.py
```

or

```
mpiexec -n 8 python my_nmmso_code.py
```

or the appropriate approach for running MPI jobs on your compute cluster or HPC resource.
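For completeness, a job submission script for the MPI version might look like the sketch below, written in the same PBS style as the earlier example. The directives and launcher vary between clusters, so treat this only as a starting point and check your site's documentation.

```bash
#!/bin/bash --login
#PBS -N pynmmso_mpi
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l walltime=01:00:00
#PBS -A <job code>
cd $PBS_O_WORKDIR

# Launch 8 MPI processes (2 nodes x 4 processes): rank 0 runs the NMMSO
# controller and the remaining ranks act as fitness workers.
mpiexec -n 8 python my_nmmso_code.py &> stdout.txt
```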