This repository contains the source code for the paper "Job Scheduling for HPC Clusters: Constraint Programming vs. Backfilling Approaches" (presented at DEBS'2024). DOI: 10.1145/3629104.3666038
The simulator is based on "Improving Backfilling by using Machine Learning to predict Running Times" (SC'2015) and "Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling" (SBAC-PAD'2022). Some files from the original repository no longer work properly.
Because the simulator uses pyss (https://code.google.com/archive/p/pyss/), which was published under the GNU GPL v2 license,
the additions are published under the same license.
The CP-based schedulers require CP Optimizer, which must be installed separately. The schedulers use docplex, the Python API of CP Optimizer (see ./pyss/requirements.txt).
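For orientation, here is a minimal docplex CP sketch of the kind of model involved. It is not the paper's actual model; the job data, cluster size, and objective below are illustrative assumptions only.

```python
from docplex.cp.model import CpoModel, interval_var, pulse, end_of, minimize

runtimes = [10, 5, 7]     # predicted runtimes (illustrative data)
processors = [4, 2, 8]    # per-job processor requirements (illustrative)
cluster_size = 8

mdl = CpoModel()
# one interval variable per job, sized by its predicted runtime
tasks = [interval_var(size=rt, name="job%d" % i) for i, rt in enumerate(runtimes)]
# the total processor usage of running jobs may never exceed the cluster size
mdl.add(sum(pulse(t, p) for t, p in zip(tasks, processors)) <= cluster_size)
# one possible objective: minimize the sum of completion times
mdl.add(minimize(sum(end_of(t) for t in tasks)))
sol = mdl.solve(TimeLimit=5)  # requires a local CP Optimizer installation
if sol:
    for t in tasks:
        print(t.get_name(), sol.get_var_solution(t))
```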
A new sort order, "Shortest runtime-area product first" (labeled SRD2F), is added to pyss/schedulers/sorters.py and can be used with any scheduler that uses sorters.
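As an illustration, such a sort order amounts to a sort key. A minimal sketch, assuming pyss-style job attributes; the attribute names and the exact definition of "area" here are assumptions, not the actual sorters.py code:

```python
def srd2f_sort_key(job):
    # "Shortest runtime-area product first": area is taken here as
    # runtime * processors, so the key is runtime^2 * processors.
    runtime = job.user_estimated_run_time          # assumed attribute name
    area = runtime * job.num_required_processors   # assumed attribute name
    return runtime * area

# waiting_jobs.sort(key=srd2f_sort_key)  # smallest product is scheduled first
```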
The same CP-based scheduler can be configured with any of the four objective functions. See the example configuration files in the configs/CPLEX_OF/ directory.
The scheduler is used to implement the CP-BSLD+ and CP-P2SF+ algorithms. See the configuration files configs/CPLEX_OF/BestOfN_BSLD_Clairvoyant.py and configs/CPLEX_OF/BestOfN_P2SF_Clairvoyant.py for details.
pyss/schedulers/cplex_basic_scheduler.py and pyss/schedulers/cplex_bestof2_scheduler.py are earlier versions of the CP-based schedulers. They are not used in the paper.
Some CP-based schedulers can be configured to use checkpointing (which is not true checkpointing but rather journaling). Checkpointing is enabled by setting use_checkpointing to True in the configuration file. See configs/CPLEX_OF/BestOfN_P2SF_Clairvoyant.py for an example.
When checkpointing is enabled, the scheduler saves the state of the solver to a file whose name is the output file name with the extension .checkpointing appended.
When the scheduler is restarted, it checks whether the checkpointing file exists; if it does, it fast-forwards through the previous decisions stored in the checkpointing file.
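A minimal sketch of this journaling idea, with an illustrative record format; the simulator's actual file format and interface may differ:

```python
import json
import os

class DecisionJournal:
    """Append-only journal of scheduling decisions ("checkpointing" above)."""

    def __init__(self, output_file):
        # the journal lives next to the output file, as described above
        self.path = output_file + ".checkpointing"
        self.past = []
        self.replayed = 0
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.past = [json.loads(line) for line in f]

    def next_decision(self, solve):
        """Replay a stored decision if one is left, else solve and record."""
        if self.replayed < len(self.past):
            decision = self.past[self.replayed]   # fast-forward: no CP solve
        else:
            decision = solve()                    # expensive CP solve
            with open(self.path, "a") as f:
                f.write(json.dumps(decision) + "\n")
        self.replayed += 1
        return decision
```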
The major change in the simulator engine concerns the mechanism by which the scheduler is engaged. In the original simulator, scheduling was initiated after every job termination/submission; if several jobs terminated at the same time, the scheduler ran several times as well. Now the scheduler never runs more than once per second. This is achieved by scheduling a RunSchedulerEvent, which has the lowest priority and is not scheduled again until the already scheduled event occurs. This behavior is more consistent with real-life schedulers.
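In outline, the mechanism is a guard flag plus a lowest-priority event. The sketch below is illustrative, not the simulator's actual code:

```python
import heapq

RUN_SCHEDULER_PRIORITY = 99  # lowest priority: fires after same-time job events

class EngineSketch:
    """Event-coalescing sketch (illustrative, not the real engine)."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.events = []              # heap of (timestamp, priority, tag)
        self.run_event_pending = False

    def on_job_event(self, now):
        # Called on every job submission/termination. Instead of invoking
        # the scheduler directly, enqueue one RunSchedulerEvent; it is not
        # enqueued again until the pending one fires, so the scheduler runs
        # at most once per simulated second even if many jobs terminate at
        # the same timestamp.
        if not self.run_event_pending:
            self.run_event_pending = True
            heapq.heappush(self.events, (now, RUN_SCHEDULER_PRIORITY, "run_scheduler"))

    def fire_run_scheduler(self, now):
        # Handler for RunSchedulerEvent: clear the guard, then schedule once.
        self.run_event_pending = False
        self.scheduler.schedule(now)
```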
An EASY scheduler that can be configured to use various "initial" and "backfill" sorting orders.
The sorting orders are defined in schedulers/sorters.py.
An EASY scheduler in which the job queue is sorted largest-area-first before backfilling (the first job is scheduled according to the original order). This scheduler is similar (although in a sense opposite) to EASY++, which is EASY with shortest-job-first backfilling.
The same scheduler can be configured using Customizable EASY; a sketch of the mechanism follows.
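Roughly, one pass of such a customizable EASY scheduler looks as follows. This is an illustrative sketch; the two predicates stand in for the simulator's resource bookkeeping and are not real pyss functions:

```python
def easy_pass(waiting_jobs, initial_key, backfill_key,
              can_start_now, fits_before_reservation):
    """One scheduling pass of a customizable EASY scheduler (sketch)."""
    started = []
    queue = sorted(waiting_jobs, key=initial_key)   # the "initial" order
    # start jobs from the head of the queue while resources allow
    while queue and can_start_now(queue[0]):
        started.append(queue.pop(0))
    if not queue:
        return started
    head = queue[0]  # the first blocked job gets the reservation
    # consider the remaining jobs in the "backfill" order, without
    # delaying the reserved head job
    for job in sorted(queue[1:], key=backfill_key):
        if can_start_now(job) and fits_before_reservation(job, head):
            started.append(job)
    return started
```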
This set of schedulers is built from pyss/schedulers/list_prediction_scheduler.py:
- Largest area first: pyss/schedulers/l_a_f_scheduler.py
- Longest job first: pyss/schedulers/l_j_f_scheduler.py
- Largest resource requirement first: pyss/schedulers/l_r_f_scheduler.py
- Smallest area first: pyss/schedulers/s_a_f_scheduler.py
- Shortest job first: pyss/schedulers/s_j_f_scheduler.py
Resources are reserved for each job so that no job can be delayed by a backfilled lower-priority job. Note that if the predictions of job durations are inaccurate, this guarantee does not hold.
By configuring the sorting order, the scheduler can be modified to be SAF-JustBF, SJF-JustBF, LAF-JustBF, etc.; the sketch below shows the corresponding sort keys.
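For illustration, the variants differ only in the sort key. The attribute names are assumptions, as in the SRD2F sketch above; negating a key turns "smallest first" into "largest first":

```python
SORT_KEYS = {
    "SAF": lambda j: j.user_estimated_run_time * j.num_required_processors,
    "SJF": lambda j: j.user_estimated_run_time,
    "LAF": lambda j: -j.user_estimated_run_time * j.num_required_processors,
    "LJF": lambda j: -j.user_estimated_run_time,
    "LRF": lambda j: -j.num_required_processors,
}
```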
Note that the original conservative backfill (conservative_scheduler.py) is implemented according to
Feitelson, D.G. and Weil, A.M. 1998. Utilization and predictability in scheduling the IBM SP2 with backfilling. Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, 542–546.
When a job finishes, the scheduler does not do a full reschedule but "compresses" the schedule: in _reschedule_jobs(self, current_time), a job may not get an earlier start time due to a conflict with a lower-priority job scheduled in a previous attempt. In pure_b_f_scheduler.py, all jobs are rescheduled at each run of the scheduler.
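A sketch of the compression step, with hypothetical profile helpers (earliest_start, place) standing in for the scheduler's resource-profile bookkeeping:

```python
def compress(scheduled_jobs, profile):
    # scheduled_jobs is in priority order; each job may only move earlier.
    for job in scheduled_jobs:
        start = profile.earliest_start(job)  # first slot that fits the job
        # A job may keep its old start time: a lower-priority job placed in
        # a previous attempt can already occupy the earlier gap.
        job.start_time = min(job.start_time, start)
        profile.place(job)  # extend the profile with the (re)placed job
```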
Some predictors (including pyss/predictors/predictor_clairvoyant.py and pyss/predictors/predictor_reqtime.py) have a predict_multiplier option that alters the prediction by multiplying it by the option's value. This makes it possible to double the predictions or to perform the "sweep" described in the paper.
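The effect, in sketch form; a clairvoyant base prediction is used here, and the pyss predictor interface differs in detail:

```python
class MultiplyingClairvoyantPredictor:
    """Scales a clairvoyant prediction by predict_multiplier (sketch only)."""

    def __init__(self, predict_multiplier=1.0):
        self.predict_multiplier = predict_multiplier

    def predict(self, job):
        # clairvoyant base: the job's true runtime, scaled by the option
        return job.actual_run_time * self.predict_multiplier

# predict_multiplier=2.0 doubles every prediction; sweeping the value over
# a range of multipliers reproduces the "sweep" experiment from the paper.
```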
"Complete" aka "Hierarchy" predictor that uses 16 templates.
It also has an option to increase the prediction by sigma*sigma_factor. Note that the behavior differs between sigma_factor=None and sigma_factor=0: in the latter case a category still requires two or more prior observations in order to be considered.
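Informally, the distinction is the following (a sketch with illustrative names; the actual predictor logic is more involved):

```python
import statistics

def category_prediction(samples, fallback, sigma_factor=None):
    """Prediction for one template/category (illustrative sketch)."""
    if sigma_factor is None:
        # plain mean; presumably a single prior observation already suffices
        return statistics.mean(samples) if samples else fallback
    if len(samples) < 2:
        # even sigma_factor=0 needs >= 2 observations, because a standard
        # deviation must be computable for the category
        return fallback
    return statistics.mean(samples) + statistics.stdev(samples) * sigma_factor
```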
This predictor is similar to predictor_complete, but instead of using 16 hierarchical templates it has only one (the most specific) template. If no record exists, the predictor falls back to the user-requested time limit right away.
It also has an option to increase the prediction by sigma*sigma_factor (unchanged from predictor_complete). Note that the behavior differs between sigma_factor=None and sigma_factor=0: in the latter case a category still requires two or more prior observations in order to be considered.
Survival probability (SD-x%) aka "Top-percent" predictor, described elsewhere.
pyss/predictors/predictor_conditional_percent.py describes a slightly modified version of this approach.
The common script to run an experiment is pyss/run_simulator.py, which expects command line arguments as described in the source file documentation.
Running multiple experiments with different configurations can be performed with bin/run_batch.py and with bin/run_sweep.py, as described in the help messages of the scripts.
Set of configuration files for Example 1: algorithms.
Set of configuration files for Example 2: runtime predictions.
Configurations that were found to best reproduce the results from "Improving Backfilling by using Machine Learning to predict Running Times" (SC'15).
The cleaned and filtered swf files used for the simulations are in data/swf_filtered.
Some files (including pyss/run_valexpe_client.py, pyss/run_valexpe_server.py, and pyss/run_valexpe.py) were not tested and are likely outdated.
Scripts used to analyze the results reside in the analysis directory. Please see the documentation inside the scripts.
The script plot_frequent_job_history.py (not used for the paper) would be better suited to the analysis directory, but is located in pyss because it uses predictors from the simulator.
This repository is home to the scheduling simulator, machine learning tools, and experimental results from the SC'15 submission "Improving Backfilling by using Machine Learning to predict Running Times".
The following files contain metrics for all the so-called heuristic triples, in CSV format.
experiments/data/CEA-curie/sim_analysis/metrics_complete
experiments/data/KTH-SP2/sim_analysis/metrics_complete
experiments/data/CTC-SP2/sim_analysis/metrics_complete
experiments/data/SDSC-SP2/sim_analysis/metrics_complete
experiments/data/SDSC-BLUE/sim_analysis/metrics_complete
experiments/data/Metacentrum2013/sim_analysis/metrics_complete
The scheduling simulator used in this paper is a fork of the pyss open-source scheduler. It is found in the folder:
simulation/pyss/src
Implementations of the NAG algorithm for learning the model are located at:
simulation/pyss/src/predictors/