
@isehle (Collaborator) commented Oct 9, 2023

PR Description

This PR is specifically for submitting jobs on HTCondor, using the new Python script job_submit.py located in bye_splits/production/submit_scripts. All configuration is done in config.yaml under job. This configuration should contain the usual Condor-related variables (user, proxy, queue, local) as well as a path to the script you wish to run and a list of arguments that the script accepts. There are then configurations for each particle type, listed under their names (plural), which should contain:

  1. submit_dir, i.e. the path where you want your submission-related scripts to be written to and read from;
  2. files, which should contain a path to a .txt file with your various parameters, line-delimited;
  3. files_per_batch, i.e. the number of entries from files that each batch should process.

In the case of the cluster size studies, files points to a .txt file with all clustering radii, line delimited.
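
A minimal sketch of what such a configuration might look like; the key names follow the description above, but the exact nesting and the placeholder values are illustrative assumptions rather than the actual config.yaml in the PR:

    job:
      user: <username>                         # usual Condor-related variables
      proxy: <path_to_proxy>
      queue: <condor_queue>
      local: false
      script: <path_to_script_to_run>
      arguments: [radius, particles, pileup]   # arguments the script accepts
      photons:                                 # one block per particle type (plural)
        submit_dir: <submission_dir>           # where submission scripts are written and read
        files: <path_to_line_delimited_params>.txt   # e.g. the clustering radii
        files_per_batch: 10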
job_submit.py contains three classes:

  1. JobBatches splits the total number of jobs into batches as defined above; if the chosen number cannot be split evenly, the last batch will be shorter than the preceding ones.
  2. CondJobBase writes the individual batch files if needed, as well as the .sh and .sub scripts required for the job(s).
  3. CondJob creates configs/, jobs/, and logs/ sub-directories under the submit_dir set in the configuration, then creates and runs the jobs.

The base name for all files written by job_submit.py will be the same as the basename of the script being called, as set in the configuration. The file contents are stored in buffer lists and checked against existing files with the same name before being written: if the contents of the new file are identical to a previously written file, the script reuses that file rather than writing a new one; otherwise, it increments the version number, writes the new file and uses that. This is handled through a few new functions in common.py, with the main one being conditional_write.
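
For illustration, here is a rough sketch of the reuse-or-version logic just described; it is not the actual conditional_write implementation in common.py, only the idea behind it:

    import os

    def conditional_write_sketch(buffer_lines, base_path):
        """Illustrative only: reuse an existing file whose contents match the buffer,
        otherwise write a new, version-incremented file. Returns the path used."""
        version = 0
        while True:
            candidate = "{}_v{}.txt".format(base_path, version)
            if not os.path.exists(candidate):
                # no file with this version yet: write the buffer and use it
                with open(candidate, "w") as f:
                    f.write("\n".join(buffer_lines) + "\n")
                return candidate
            with open(candidate) as f:
                if f.read().splitlines() == buffer_lines:
                    return candidate  # identical contents: reuse the existing file
            version += 1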

How has this been tested?

Tested against all particle types with both 0 and 200 PU.

Checklist

  • I have performed a self-review of my code
  • My code follows the naming convention of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@isehle isehle requested review from bfonta and mchiusi as code owners October 9, 2023 12:12
@bfonta (Collaborator) left a comment

Nice to have this close to integration! A few general comments:

  • IIRC files was a bad argument name that had to be changed
  • Instead of handling the submission with files, which clutters the environment, it would be cleaner to use other options of the queue ... command (see the docs and the hedged example after this list)
  • Could you please remove the produce.cc file, which is no longer needed?
  • You are missing doc strings everywhere
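
For reference, a hedged sketch of the kind of queue usage being suggested, feeding the line-delimited parameters to Condor directly instead of writing intermediate batch files (file and variable names are illustrative):

    # excerpt of a .sub file: one job per line of radii.txt
    executable = run_cluster.sh
    arguments  = $(radius)
    queue radius from radii.txt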

@isehle (Collaborator, Author) commented Oct 11, 2023

The two commits that I just pushed should address all of the comments, let me know if there's still something I can update.

@bfonta bfonta self-requested a review October 11, 2023 19:23
@bfonta (Collaborator) left a comment

I've added a few more comments.
Regarding the previous review, you did not address the general comments, and importantly the comment on the queue-related command, which is quite central to the PR. It would require a bit of extra work, but would be more robust for future changes and more readable. It would make the code shorter, since IIUC much of the code for checking the files would not be required any longer.

And thanks for documenting the code; it makes a big difference!

@isehle (Collaborator, Author) commented Oct 12, 2023

Latest commits should resolve the rest of the comments.

@bfonta (Collaborator) left a comment

Only minor details. In addition, could you comment on the following:

IIRC files was a bad argument name that had to be changed

If possible, could @mchiusi also have a careful look?

@mchiusi (Collaborator) left a comment

Great job, Isaac!
I've been testing the flexibility of your code by running another script, produce.py, and I've identified some critical points that need to be improved. For the rest, everything is good and well-written.

Comment on lines 56 to 59
arg_keys = "Arguments ="
for arg in self.args.keys():
arg_keys += " $({})".format(arg)
arg_keys += "\n"
Collaborator:

I would make it a bit more flexible. As it stands, if I want to run produce.py, for instance, I will get an error, since that script does not recognise the option --radius which is reported here.
It would be good if only the needed arguments were taken into account.

Collaborator Author:

Well, that's the point of the configuration file; it's up to the user to modify config.yaml with the arguments and values that the script accepts.

Collaborator:

I partially agree with both comments: I would add a check that guarantees the argument in the configuration file is supported by the script, raising an exception otherwise. This would make debugging much quicker.

In addition, I noticed the _write_arg_keys function is called only once, and it modifies the list being passed. When you define a function, you ideally want to make it crystal clear what it is supposed to do. In this case, the only way to know is to inspect it directly. This defeats the improved structuring that splitting the code into self-contained functions usually brings, since you force the reader to search for the function, decreasing the code's readability. In my opinion, given that you only use the function once, it would be better to include it directly where it is now being called.

I also suspect if len(self.args) > 0: is not needed, since Arguments = would be correct syntax even without arguments, but please correct me (I haven't tested it).

Collaborator Author:

Good points, thank you. The functionality of _write_arg_keys has been moved out of a dedicated function and into prepare_multi_job_condor directly. I've written a quick argument checker which looks like:

    def _check_script_arguments(self):
        # The script's --args option prints one accepted argument name per line
        # and exits with a non-zero code (see the shell snippet below), hence
        # the returncode check before parsing stdout.
        script_args = subprocess.run([self.script, "--args"], capture_output=True, text=True)
        if script_args.returncode != 0:
            result = script_args.stdout.strip().split("\n")
        else:
            result = []

        assert result == list(self.args.keys()), "Provided arguments {} do not match accepted arguments {}.".format(list(self.args.keys()), result)

This requires that any passed script has a named argument --args that would print out the names of the arguments:

        --args)
           echo "radius"
           echo "particles"
           echo "pileup"
           exit 1;;

If you have any ideas on a more flexible way of doing this let me know, or if you think the name should be changed.
I'll additionally write a dummy script that takes no arguments and check whether everything works as expected if we drop the if len(self.args) > 0 check.

Collaborator Author:

Adding the assertion that the arguments in config.yaml match the arguments accepted by <script> is proving quite tricky. If either of you has ideas on how this can be done generally and robustly, please let me know; if not, this might deserve a dedicated PR.

Collaborator:

When a script runs from the CLI you can access its required arguments via --help, which means the information can be retrieved. I think this is important for this PR, but we can discuss.

The `PU0` files above were merged and are stored under `/data_CMS/cms/alves/L1HGCAL/`, accessible to LLR users and under `/eos/user/b/bfontana/FPGAs/new_algos/`, accessible to all lxplus and LLR users. The latter is used since it is well interfaced with CERN services. The `PU200` files were merged and stored under `/eos/user/i/iehle/data/PU200/<particle>/`.

<a id="job-submission"></a>
## Job Submission
Collaborator:

You could say explicitly that the Job Submission can be used for multiple purposes, and that it can be a good alternative to the local Skimming procedure presented in this README just a few lines above.

About this, I was trying to run produce.py, but I got the following error while trying to read the input file for the skimming, specified in the config.yaml file:

Error in <TFile::TFile>: file /eos/user/b/bfontana/FPGAs/new_algos/photons_0PU_bc_stc_hadd.root does not exist

Everything works if I run it locally. Do you know if we should add some additional configuration options to avoid this kind of access problem?

Collaborator Author:

Yes yes I'll mention the skimming procedure application too!

As for running it on EOS, have you confirmed that you can access /eos/user/b/bfontana/FPGAs/new_algos/photons_0PU_bc_stc_hadd.root directly from your terminal? I do remember using files on EOS being a major headache, though. I'll look into the problem further, but for now, can you check whether adding /opt/exp_soft/cms/t3/eos-login -username $USER -init to your script fixes the issue? This assumes that your LLR and CERN usernames are the same, which mine are not; if they differ, replace $USER directly.

Collaborator:

Yes, from my terminal I have access to this file. I will try running it with the command you suggested added directly in the script and will let you know.

@bfonta (Collaborator) left a comment

From @mchiusi's comments, I realized there are many tiny details that might go wrong once someone else tries the job submission, so I should also test it on my side.
Would it be ok to include a quick recipe in the README in order to run a dummy test? You could add a dummy script that prints a parameter to a file (the HTCondor *.out and *.err files, for instance), and I would try to run it on my side by defining an array of parameters in config.yaml following your instructions. This would mimic running someone else's script with some parameters of my choice.

@isehle (Collaborator, Author) commented Oct 16, 2023

The latest commit allows you to pass the Python files themselves, e.g. run_cluster.py or produce.py, directly as script. The executable written by write_exec_file() in job_submit.py and set in the .sub file now calls the Python script directly on the correct batches of arguments. This has been tested and works well, but I still need to address a few of the other comments before this is ready to be merged.
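
For illustration, a generated executable along these lines might look roughly like the following; the actual output of write_exec_file(), the paths and the argument names will differ:

    #!/usr/bin/env bash
    # Hypothetical wrapper: call the Python script directly with the batch
    # parameters that HTCondor substitutes into the job's arguments.
    python <path_to_script>/run_cluster.py --radius "$1" --particles "$2" --pileup "$3"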

@isehle (Collaborator, Author) commented Oct 17, 2023

Everything should be accounted for in the latest commit except for the things in the three remaining unresolved conversations. Personally I think these three things (argument verification, EOS files, and multiple iteration params) should wait for a dedicated PR because they're all fairly complicated. I'm happy to discuss more, but if you agree and everything else looks good, please resolve these remaining conversations.

@bfonta (Collaborator) commented Oct 24, 2023

@isehle Regarding argument verification, here goes my proposed solution.

You could run the script separately via the OS and parse its output. This is messy: it requires some non-Python interface (even if driven from Python), and it will be hard to generalize.
The alternative is to do everything within Python; the compromise is to slightly change the way optional arguments are now being defined. I propose we start encapsulating them into a function (a class would be more correct, but it is overkill for our current use-case).

You can find a running example here. You can run python inspect_script.py from the "Shell" tab. If the solution suits you, I propose you add it to the dummy script, to your own script and to produce.py, keeping all other scripts as they are. The reason is that these three scripts would be needed for testing. We can modify the others later on if really needed.
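
In case it helps the discussion, a rough sketch of the kind of encapsulation being proposed (function and option names are illustrative, not the actual code in the linked example):

    import argparse

    # Hypothetical layout: the script registers its options in one function, so
    # job_submit.py can build the same parser and inspect the expected arguments
    # without actually executing the script.
    def add_parser_args(parser):
        parser.add_argument("--radius", type=float, required=True, help="Clustering radius.")
        parser.add_argument("--particles", type=str, required=True, help="Particle type.")
        return parser

    if __name__ == "__main__":
        FLAGS = add_parser_args(argparse.ArgumentParser()).parse_args()
        print(FLAGS)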

@mchiusi (Collaborator) commented Oct 25, 2023

I like Bruno's proposal to encapsulate the arguments of each script in a function so that they can be accessed from outside the script. In this way you could easily verify which arguments are needed to launch a specific script.

I would also suggest adding the possibility to specify command-line arguments when running the job_submit.py script. The final result I have in mind, in the case that one wants to run the produce.py script, is something like this:

python bye_splits/production/submit_scripts/job_submit.py --script produce.py --particles photons

The command-line arguments should take precedence over those specified in config.yaml, which would be used only when an argument is not given on the command line.
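
As a rough illustration of the suggested precedence (the variable names here are made up, not code from the PR):

    # Hypothetical merge: start from the config.yaml values and let any option
    # explicitly given on the command line override them.
    merged_args = dict(config_args)
    merged_args.update({k: v for k, v in cli_args.items() if v is not None})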

Let me know what you think, and if you'd like, we can discuss it further!

@isehle (Collaborator, Author) commented Oct 30, 2023

@bfonta, thanks for the suggestion and head start! I figured it's probably best to start with a class even though I agree that it's overkill for now; it's just going to be easier to extend in the future, I imagine. You can find my approach here, where you can run python run_script.py, which uses the Arguments class in arg_script.py to first verify that the arguments are expected by test_script.py and then add them and run the script.
@mchiusi, I agree that it would be a nice idea to allow command-line arguments at the level of job_submit.py, though I need to think a bit further about the best way to implement them. I have some ideas, but it isn't trivial, so I'll keep you both updated.

@isehle (Collaborator, Author) commented Oct 30, 2023

The class has been updated so that it now takes command-line arguments with precedence over the default arguments, and everything is handled directly by the class itself: you can simply use Arguments.run_script(<arg_dict>) and any replacement, verification, and running will be done behind the scenes. You can see the implementation at the same link as before.

@isehle (Collaborator, Author) commented Oct 31, 2023

The latest commit should take care of all argument adding, combining (default + command line), and verification. I tested all combinations on the dummy script and ran a few combinations with run_cluster.py. As expected, it does indeed require changing the original script(s). Arguments must be defined in a dictionary called arg_dict and added using Arguments.add_args(arg_dict). The keys should be the argument names (including --; this is required for now), and the values should be dictionaries containing the **kwargs accepted by argparse's add_argument().
Example:

arg_dict = {
    "--float_arg": {
        "help": "Dummy float",
        "type": float,
        "required": True
    },
    "--str_arg": {
        "help": "Dummy string.",
        "type": str,
        "required": True,
    },
    # Generic argument will be interpreted as a string
    "--gen_arg": {
        "help": "Dummy generic.",
        "required": False,
        "default": None
    },
    "--flag": {
        "help": "Dummy flag number 2.",
        "action": "store_true",
        "required": False
    }
}

def dummy(pars):
    """Dummy script for testing HT Condor job submission.
    Prints passed argument key/vals and the type of <val>."""

    for key, val in pars.items():
        print("Passed {} for --{}, read as type {}.".format(val, key, type(val)))

if __name__ == "__main__":
    args = Arguments(script=__file__)
    FLAGS = args.add_args(description="A dummy script to submit.", arg_dict=arg_dict)
    dummy(FLAGS)

@isehle (Collaborator, Author) commented Oct 31, 2023

I suggest playing with tests/submission/dummy_run.py to get a feel for how Arguments handles things. A few things to note:

  1. The dictionary in the running script does not need to have any specific name. It must be called arg_dict only in the script that accepts the arguments, so that it can be read by the Arguments class and verified against what is passed from the running script.
  2. Try removing a required argument in the run script argument dictionary.
  3. Try changing arguments from the command line.
  4. Try adding an argument that isn't accepted by the "fundamental" script.
