Job submission #48

Conversation
bfonta
left a comment
Nice to have this close to integration! A few general comments:
- IIRC `files` was a bad argument name that had to be changed
- Instead of handling the submission with files, cluttering the environment, it would be cleaner to use other options for the `queue ...` command (see the docs, and the sketch below)
- Could you please remove the `produce.cc` file, which is no longer needed?
- You are missing doc strings everywhere
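One hypothetical way to follow the `queue` suggestion is to let HTCondor iterate over the parameter file itself instead of materialising per-batch files. A minimal Python sketch that writes such a submit description (all file and variable names are illustrative, not from the PR):

```python
# Hypothetical sketch: HTCondor's "queue <var> from <file>" syntax spawns
# one job per line of the file, exposing each value as $(radius).
submit_lines = [
    "executable = run_job.sh",                       # illustrative executable
    "arguments  = $(radius)",                        # one parameter per job
    "output     = logs/$(ClusterId).$(ProcId).out",
    "error      = logs/$(ClusterId).$(ProcId).err",
    "log        = logs/$(ClusterId).log",
    "queue radius from radii.txt",                   # one job per line of radii.txt
]
with open("job.sub", "w") as sub:
    sub.write("\n".join(submit_lines) + "\n")
```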
The two commits that I just pushed should address all of the comments, let me know if there's still something I can update.
bfonta
left a comment
I've added a few more comments.
Regarding the previous review, you did not address the general comments, and importantly the comment on the queue-related command, which is quite central to the PR. It would require a bit of extra work, but would be more robust for future changes and more readable. It would make the code shorter, since IIUC much of the code for checking the files would not be required any longer.
And thanks for documenting the code; it makes a big difference!
Latest commits should resolve the rest of the comments.
bfonta
left a comment
Only minor details. In addition, could you comment on the following:
IIRC `files` was a bad argument name that had to be changed
If possible, could @mchiusi also have a careful look?
mchiusi
left a comment
Great job, Isaac!
I've been testing the flexibility of your code by running another script, produce.py, and I've identified some critical points that need to be improved. For the rest, everything is good and well-written.
| arg_keys = "Arguments =" | ||
| for arg in self.args.keys(): | ||
| arg_keys += " $({})".format(arg) | ||
| arg_keys += "\n" |
I would make it a bit more flexible. As it stands, if I want to run `produce.py`, for instance, I will get an error, since that script does not recognise the `--radius` option reported here.
It would be good if only the needed arguments were taken into account.
Well, that's the point of the configuration file; it's up to the user to modify `config.yaml` with the arguments and values that the script accepts.
I partially agree with both comments: I would add a check that guarantees the argument in the configuration file is supported by the script, raising an exception otherwise. This would make debugging much quicker.
In addition, I noticed the `_write_arg_keys` function is called only once, and it modifies the list being passed. When you define a function, you ideally want to make it crystal clear what it is supposed to do. In this case, the only way to know is to inspect it directly. This defeats the improved structuring that splitting the code into self-contained functions usually brings, since you force the reader to search for the function, decreasing the code's readability. In my opinion, given that you only use the function once, it would be better to include it directly where it is now being called.
I also suspect `if len(self.args) > 0:` is not needed, since `Arguments =` would be correct syntax even without arguments, but please correct me (I haven't tested it).
Good points, thank you. The functionality of `_write_arg_keys` has been moved out of a dedicated function and into `prepare_multi_job_condor` directly. I've written a quick argument checker which looks like:
import subprocess

def _check_script_arguments(self):
    # Ask the script for its accepted argument names; by convention (see below)
    # the script prints them one per line and exits non-zero.
    script_args = subprocess.run([self.script, "--args"], capture_output=True, text=True)
    if script_args.returncode != 0:
        result = script_args.stdout.strip().split("\n")
    else:
        result = []
    assert result == list(self.args.keys()), \
        "Provided arguments {} do not match accepted arguments {}.".format(list(self.args.keys()), result)
This requires that any passed script has a named argument `--args` that would print out the names of the arguments:
--args)
    echo "radius"
    echo "particles"
    echo "pileup"
    exit 1;;
If you have any ideas on a more flexible way of doing this let me know, or if you think the name should be changed.
I'll additionally write a dummy script that takes no arguments and determine if everything works as expected if we drop the `if len(self.args) > 0` check.
Adding the assertion that the arguments in `config.yaml` match the arguments accepted by `<script>` is proving quite tricky. If either of you has some ideas on how this can be done generally and robustly, please let me know, but if not, this might deserve a dedicated PR.
When a script runs from the CLI you can access its required arguments via `--help`, which means the information can be retrieved. I think this is important for this PR, but we can discuss.
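A minimal sketch of that idea, assuming the target script is an argparse-based CLI whose `--help` output lists its long options (the helper name is hypothetical):

```python
import re
import subprocess

def accepted_arguments(script):
    """Hypothetical helper: scrape the long options a CLI script advertises."""
    out = subprocess.run([script, "--help"], capture_output=True, text=True)
    # argparse prints options such as "--radius RADIUS"; collect the bare names.
    # Note that "--help" itself matches too and may need filtering out.
    return sorted(set(re.findall(r"--([\w-]+)", out.stdout)))

# Usage: compare the result against the argument keys read from config.yaml.
```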
The `PU0` files above were merged and are stored under `/data_CMS/cms/alves/L1HGCAL/`, accessible to LLR users, and under `/eos/user/b/bfontana/FPGAs/new_algos/`, accessible to all lxplus and LLR users. The latter is used since it is well interfaced with CERN services. The `PU200` files were merged and stored under `/eos/user/i/iehle/data/PU200/<particle>/`.

<a id="job-submission"></a>
## Job Submission
You could say explicitly that the job submission can be used for multiple purposes, and that it can be a good alternative to the local skimming procedure presented in this README, just a few lines above.
About this, I was trying to run `produce.py`, but I got the following error while trying to read the input file for the skimming, specified in the `config.yaml` file:
Error in <TFile::TFile>: file /eos/user/b/bfontana/FPGAs/new_algos/photons_0PU_bc_stc_hadd.root does not exist
Everything is fine if I run it locally. Do you know if we should add some additional configuration options to avoid this kind of access problem?
Yes yes I'll mention the skimming procedure application too!
As for running it on EOS, have you confirmed that you have access to `/eos/user/b/bfontana/FPGAs/new_algos/photons_0PU_bc_stc_hadd.root` from inside your terminal? I do remember using files on EOS being a major headache, however. I'll look into the problem further, but for now can you see if adding `/opt/exp_soft/cms/t3/eos-login -username $USER -init` to your script fixes the issue? Obviously this assumes that you have the same LLR/CERN username, which I don't, so if that's not the case, replace `$USER` directly.
Yes, from my terminal I have access to this file. I will try running it with the command you suggested added directly to the script, and I will let you know.
From @mchiusi comments, I realized there are many tiny details that might go wrong once someone else tries the job submission, so I should also test it on my side.
Would it be ok to include a quick recipe in the README for running a dummy test? You could add a dummy script that prints a parameter to a file (HTCondor *.out and *.err files, for instance), and I would try to run it on my side by defining my own array of parameters in `config.yaml` following your instructions. This would mimic running someone else's script with some parameters of my choice.
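For concreteness, a dummy script of the kind requested might be as small as this (the file name and parameter are illustrative, not part of the PR):

```python
#!/usr/bin/env python
# dummy.py (hypothetical): echo the received parameter so that it ends up
# in the job's HTCondor .out file.
import argparse

parser = argparse.ArgumentParser(description="Dummy job for testing submission.")
parser.add_argument("--radius", required=True, help="any test parameter")
args = parser.parse_args()
print("Dummy job ran with radius = {}".format(args.radius))
```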
The latest commit allows you to pass the Python files, i.e.
Everything should be accounted for in the latest commit except for the things in the three remaining unresolved conversations. Personally I think these three things (argument verification, EOS files, and multiple iteration params) should wait for a dedicated PR because they're all fairly complicated. I'm happy to discuss more, but if you agree and everything else looks good, please resolve these remaining conversations.
@isehle Regarding argument verification, here goes my proposed solution. You could run the script separately via the OS and parse its output. This is messy, it requires some non-Python interface (even if within Python), and it will be hard to generalize. You can find a running example here. You can run
I like Bruno's proposal to encapsulate the arguments of each script in a function so that they can be called from outside the script. In this way you could easily verify which arguments are needed to launch a specific script. I would also suggest adding the possibility to specify command-line arguments when running the `job_submit.py` script. The command-line arguments should take precedence over those specified in `config.yaml`. Let me know what you think, and if you'd like, we can discuss it further!
@bfonta, thanks for the suggestion and head start! I figured that it's probably best to start with a class even though I agree that it's overkill for now; it's just going to be easier to extend in the future, I imagine. You can find my approach here, where you can run
The class has been updated so that it will now take command-line arguments with precedence over the default arguments, and everything is handled directly by the class itself, so you can simply use
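A minimal sketch of the precedence behaviour being described, assuming the defaults come from `config.yaml` (the class and attribute names are hypothetical, not the actual implementation):

```python
import argparse

class JobArguments:
    """Hypothetical sketch: merge config.yaml defaults with CLI overrides."""

    def __init__(self, defaults):
        self.defaults = defaults  # e.g. the argument block parsed from config.yaml

    def parse(self, argv=None):
        parser = argparse.ArgumentParser()
        for name, value in self.defaults.items():
            # every configured argument becomes an optional CLI flag
            parser.add_argument("--{}".format(name), default=value)
        # anything given on the command line replaces the config.yaml default
        return vars(parser.parse_args(argv))

# Usage: the command-line value wins over the configured default.
args = JobArguments({"radius": 0.03, "pileup": "PU0"}).parse(["--radius", "0.05"])
# -> {'radius': '0.05', 'pileup': 'PU0'}
```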
The latest commit should take care of all argument adding, combining (default + command line), and verification. Tested all combinations on the dummy script, and ran a few combinations with
I suggest playing with
PR Description
This PR is specifically for submitting jobs on HTCondor, using the new Python script `job_submit.py` located in `bye_splits/production/submit_scripts`. All configuration is done in `config.yaml` under `job`. This configuration should contain the usual Condor-related variables (`user`, `proxy`, `queue`, `local`) as well as a path to the script you wish to run, and a list of arguments that the script accepts. There are then configurations for each particle type listed under their names (plural), which should contain:

- `submit_dir`, i.e. the path where you want your submission-related scripts to be written to and read from;
- `files`, which should contain a path to a `.txt` file with your various parameters, line delimited;
- `files_per_batch`, which is self-explanatory.

In the case of the cluster-size studies, `files` points to a `.txt` file with all clustering radii, line delimited.

`job_submit.py` contains three classes. `JobBatches` splits the total number of jobs into batches as defined above; if the chosen number can't be split evenly, the last batch will be shorter than the preceding ones. The second class, `CondJobBase`, writes the individual batch files if needed, and writes the `.sh` and `.sub` scripts needed for the job(s). Finally, `CondJob` creates `configs/`, `jobs/`, and `logs/` sub-directories under the `submit_dir` in the configuration script, then creates and runs the jobs.

The base name for all files written by `job_submit.py` will be the same as the basename of the `script` being called, as set in the configuration. The file information is stored in buffer lists before being written, and is checked against files with the same name. If the contents of the new file are equivalent to a previously written file, the script will opt to use that file rather than write a new one; otherwise, it will increment the version number, write the new file, and use that. This is handled through a few new functions in `common.py`, with the main function being `conditional_write`.
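Since `conditional_write` is only named above, here is a hypothetical sketch of the reuse-or-version behaviour it is described as implementing; the actual code in `common.py` may differ:

```python
import os

def conditional_write(lines, path):
    """Hypothetical sketch: reuse an identical existing file, else bump the version."""
    content = "\n".join(lines) + "\n"
    base, ext = os.path.splitext(path)
    version = 0
    while True:
        candidate = "{}_v{}{}".format(base, version, ext) if version else path
        if not os.path.exists(candidate):
            with open(candidate, "w") as f:
                f.write(content)           # no matching file: write a new one
            return candidate
        with open(candidate) as f:
            if f.read() == content:        # identical file already exists: reuse it
                return candidate
        version += 1                       # name taken, contents differ: try next version
```

How has this been tested?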
Tested against all particle types with both 0 and 200 PU.