jrmccombs/pubmed_h2o_scaling


This README describes how to prepare a scalability study workflow.

Test setup and execution

In the examples that follow, the shell variable SCRATCH defines a scratch directory associated with the user.

Input file selection

To prepare a scalability study, first select the XML files to include. Use the linkXmlFiles.bash script to create links to the files to be preprocessed into pickle files, which allows more efficient execution of the pubmed script. In this example, the 64 largest files are selected:

# The path to all of the MEDLINE XML files
ALL_RAW_XML_FILES=$SCRATCH/MEDLINE/raw

mkdir $SCRATCH/medline/raw/64files
./linkXmlFiles.bash $ALL_RAW_XML_FILES $SCRATCH/medline/raw/64files
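For illustration, a minimal sketch of what a selection-and-link step like this might look like. The helper name `link_largest_xml` and its behavior are assumptions, not the actual contents of linkXmlFiles.bash; read that script for the real logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: symlink the n largest XML files from src into dest.
# This mirrors the idea of linkXmlFiles.bash but is not the actual script.
link_largest_xml() {
  local src="$1" dest="$2" n="$3"
  mkdir -p "$dest"
  # List the files largest-first, keep the top n, and symlink each into dest
  ls -S "$src"/*.xml | head -n "$n" | while read -r f; do
    ln -sf "$f" "$dest/$(basename "$f")"
  done
}
```

A call such as `link_largest_xml "$ALL_RAW_XML_FILES" "$SCRATCH/medline/raw/64files" 64` would then produce the linked subset used in the rest of this README.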

Input XML file conversion to pickle file format

The XML files selected for the scalability study can then be converted to pickle file format. Before the conversion can take place, the medline/config/default.cfg file in the medline python package must be updated to specify the destination directory for the pickle files. Set the temp.data.directory value in the configuration file to $SCRATCH/medline/pickled/64files where $SCRATCH should be replaced with the full path of the user's scratch location. The conversion of the selected XML files to pickle format can then be performed with the following command:
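After editing, the relevant line in medline/config/default.cfg should look something like the fragment below. The path shown is a placeholder for the user's actual scratch location, and any surrounding section layout in the file is left as the package ships it.

```
temp.data.directory = /full/path/to/scratch/medline/pickled/64files
```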

./runFileConversion.bash $SCRATCH/medline/raw/64files

Read the comments in the runFileConversion.bash script for more details.

Preparation of performance tests

The configurations and batch submission scripts for conducting the scalability studies can then be generated with the prepareScalingTests.bash script. The script generates a separate configuration directory for each combination of number of nodes, number of threads, and number of clusters to be tested in the scalability study. A separate pubmed configuration file will be generated for each test based on these three parameters. The first argument to the script is the base directory containing the subdirectories for each set of pickled data files; in our examples, this directory is $SCRATCH/medline/pickled. The second argument is the name of the test data set; in our examples, this is the string 64files. The third argument is the default port number H2O will use. The H2O server will attempt to acquire this port number and the next one higher. The default port number is often overridden by the scaleH2OTest.bash script. The tests to be generated by the script are determined by the num_nodes, num_threads, and num_clusters shell script arguments. An example execution of the script is:

num_nodes="01 02 03 04 08 16"
num_threads="01 02 04 08 16"
num_clusters="01000 02000 04000 08000 15000"
./prepareScalingTests.bash $SCRATCH/medline/pickled 64files 54321 \
  ${num_nodes} ${num_threads} ${num_clusters}

Read the comments in the prepareScalingTests.bash script for more details.
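The combinatorial structure described above can be sketched as nested loops, one configuration directory per (nodes, threads, clusters) triple. The function name and the directory naming scheme here are assumptions for illustration; the actual names prepareScalingTests.bash uses may differ.

```shell
#!/usr/bin/env bash
# Illustrative only: create one directory per test combination, mirroring
# the structure prepareScalingTests.bash generates. NUM_NODES, NUM_THREADS,
# and NUM_CLUSTERS are space-separated lists, as in the README examples.
make_test_dirs() {
  local base="$1"
  local nodes threads clusters
  for nodes in $NUM_NODES; do
    for threads in $NUM_THREADS; do
      for clusters in $NUM_CLUSTERS; do
        # Each test gets its own configuration directory
        mkdir -p "$base/n${nodes}_t${threads}_c${clusters}"
      done
    done
  done
}
```

With the example lists above (6 node counts, 5 thread counts, 5 cluster counts), this yields 6 × 5 × 5 = 150 test configurations.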

Launching the performance test batch jobs

Once the configuration directories have been generated for the scalability study, the batch job for each test can be launched with the scaleH2OTest.bash script. The script must first be customized to the user's environment before any batch jobs are launched. Follow the instructions in the script file for setting the PORT_RANGE_START, PORT_RANGE_END, H2O_JAR, SOURCE_DIR, PATH, and PYTHONPATH shell variables. The FEATURE_EXTRACTION_PATH variable should only need to be changed in rare circumstances.
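The customization might look like the following. Every value here is a placeholder example, not a recommendation; the comments in scaleH2OTest.bash are the authoritative guide for each variable.

```shell
# Example values only -- adjust each to your own environment, following
# the instructions in scaleH2OTest.bash. All paths are placeholders.
PORT_RANGE_START=54321                 # first port H2O may try to bind
PORT_RANGE_END=54421                   # last port in the allowed range
H2O_JAR=$HOME/h2o/h2o.jar              # location of the H2O jar file
SOURCE_DIR=$HOME/pubmed_h2o_scaling    # checkout of this repository
PATH=$HOME/python/bin:$PATH            # python interpreter to use
PYTHONPATH=$SOURCE_DIR:$PYTHONPATH     # make the medline package importable
```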

The script runScalingTests.bash is executed to launch the batch job for each performance test. It loops over a subset of the test directories generated by the prepareScalingTests.bash script. The script takes the name of the test data set the scalability study is being performed with, the wall clock time in qsub format, an array of the numbers of nodes, an array of the numbers of threads, and an array of the numbers of clusters to be discovered. An example execution of the script to run the tests created in the previous example is:

num_nodes="01 02 03 04 08 16"
num_threads="01 02 04 08 16"
num_clusters="01000 02000 04000 08000 15000"
./runScalingTests.bash 64files 48:00:00 \
  ${num_nodes} ${num_threads} ${num_clusters}

About

Shell scripts for setting up and launching scalability performance studies of pubmed with H2O
