This README describes how to prepare a scalability study workflow.
In the examples that follow, the shell variable SCRATCH holds the path of the user's scratch directory.
To prepare a scalability study, first select the XML files to include in the
study. Use the linkXmlFiles.bash script to create links to the files that will
be preprocessed into pickle files for more efficient execution of the pubmed
script. In this example the 64 largest files are selected:
# The path to all of the MEDLINE XML files
ALL_RAW_XML_FILES=$SCRATCH/medline/raw
mkdir $SCRATCH/medline/raw/64files
./linkXmlFiles.bash $ALL_RAW_XML_FILES $SCRATCH/medline/raw/64files
The XML files selected for the scalability study can then be converted to pickle
format. Before the conversion can take place, the medline/config/default.cfg
file in the medline Python package must be updated to specify the destination
directory for the pickle files. Set the temp.data.directory value in the
configuration file to $SCRATCH/medline/pickled/64files, where $SCRATCH should
be replaced with the full path of the user's scratch location. The conversion
of the selected XML files to pickle format can then be performed with the
following command:
./runFileConversion.bash $SCRATCH/medline/raw/64files
Read the comments in the runFileConversion.bash script for more details.
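For reference, the configuration change described above might look like the
following, assuming an INI-style key/value layout; only the
temp.data.directory key comes from this README, and the path shown is a
placeholder to be replaced with the user's actual scratch location:

```
# In medline/config/default.cfg -- substitute your own scratch path.
temp.data.directory = /full/path/to/scratch/medline/pickled/64files
```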
The configurations and the batch submission scripts for conducting the
scalability studies can then be generated with the prepareScalingTests.bash
script. The script generates a separate configuration directory for each
combination of number of nodes, number of threads, and number of clusters to be
tested in the scalability study. A separate pubmed configuration file will be
generated for each test based on these three parameters. The first argument to
the script is the base directory containing the subdirectories for each set of
pickled data files. In our examples, this directory is
$SCRATCH/medline/pickled. The second argument is the name of the test data
set. In our examples, this is the string 64files. The third argument is the
default port number H2O will use. The H2O server will attempt to acquire this
port number and the next one higher (for example, with a default port of 54321,
ports 54321 and 54322). The default port number is often overridden by the
scaleH2OTest.bash script. The tests to be generated by the script are
determined by the num_nodes, num_threads, and num_clusters shell script
arguments. An example execution of the script is:
num_nodes="01 02 03 04 08 16"
num_threads="01 02 04 08 16"
num_clusters="01000 02000 04000 08000 15000"
./prepareScalingTests.bash $SCRATCH/medline/pickled 64files 54321 \
${num_nodes} ${num_threads} ${num_clusters}
Read the comments in the prepareScalingTests.bash script for more details.
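To get a sense of the size of the study, the three parameter lists form a
Cartesian product. The following shell sketch (an illustration only, not part
of the repository) counts the test configurations the example above would
generate:

```shell
# Illustration only: enumerate the (nodes, threads, clusters) combinations
# that prepareScalingTests.bash generates a configuration directory for.
num_nodes="01 02 03 04 08 16"
num_threads="01 02 04 08 16"
num_clusters="01000 02000 04000 08000 15000"

count=0
for n in ${num_nodes}; do
  for t in ${num_threads}; do
    for c in ${num_clusters}; do
      count=$((count + 1))    # one test configuration per combination
    done
  done
done
echo "${count} test configurations"   # 6 x 5 x 5 = 150
```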
Once the configuration directories have been generated for the scalability
study, the batch job for each test can be launched with the scaleH2OTest.bash
script. The script must first be customized to the user's environment before
any batch jobs are launched. Follow the instructions in the script file for
setting the PORT_RANGE_START, PORT_RANGE_END, H2O_JAR, SOURCE_DIR, PATH, and
PYTHONPATH shell variables. The FEATURE_EXTRACTION_PATH variable should only
need to be changed in rare circumstances.
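The customized settings might look like the following; every value here is
illustrative (the paths and port numbers are assumptions, not values from the
repository), so follow the instructions in scaleH2OTest.bash rather than
copying these verbatim:

```shell
# Illustrative values only -- substitute values for your own environment.
PORT_RANGE_START=54321             # first port H2O may try (assumed value)
PORT_RANGE_END=54421               # last port H2O may try (assumed value)
H2O_JAR=$HOME/h2o/h2o.jar          # location of the H2O jar (assumed path)
SOURCE_DIR=$HOME/src/medline       # checkout of the medline package (assumed path)
PATH=$SOURCE_DIR/bin:$PATH
PYTHONPATH=$SOURCE_DIR:$PYTHONPATH
```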
The runScalingTests.bash script is executed to launch the batch job for each
performance test. It loops over a subset of the test directories generated by
the prepareScalingTests.bash script. The script takes the name of the test
data set the scalability study is being performed with, the wall clock time in
qsub format, a list of node counts, a list of thread counts, and a list of the
numbers of clusters to be discovered. An example execution of the script to
run the tests created in the previous example is:
num_nodes="01 02 03 04 08 16"
num_threads="01 02 04 08 16"
num_clusters="01000 02000 04000 08000 15000"
./runScalingTests.bash 64files 48:00:00 \
${num_nodes} ${num_threads} ${num_clusters}