
@isehle (Collaborator) commented Oct 9, 2023

PR Description

This PR is specifically for submitting jobs on HTCondor, using the new Python script job_submit.py located in bye_splits/production/submit_scripts. All configuration is done in config.yaml under job. This configuration should contain the usual Condor-related variables (user, proxy, queue, local) as well as a path to the script you wish to run and a list of arguments that the script accepts. There are then configurations for each particle type, listed under their names (plural), which should contain:

  1. submit_dir, i.e. the path where you want your submission-related scripts to be written to and read from;
  2. files, which should contain a path to a .txt file with your various parameters, line-delimited;
  3. files_per_batch, i.e. the number of entries from files that each batch should process.

In the case of the cluster size studies, files points to a .txt file with all clustering radii, line delimited.
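
A minimal sketch of what such a configuration might look like; the key names follow the description above, but the exact nesting and the placeholder values are illustrative assumptions rather than the actual config.yaml in the PR:

    job:
      user: <username>                         # usual Condor-related variables
      proxy: <path_to_proxy>
      queue: <condor_queue>
      local: false
      script: <path_to_script_to_run>
      arguments: [radius, particles, pileup]   # arguments the script accepts
      photons:                                 # one block per particle type (plural)
        submit_dir: <submission_dir>           # where submission scripts are written and read
        files: <path_to_line_delimited_params>.txt   # e.g. the clustering radii
        files_per_batch: 10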
job_submit.py contains three classes:

  1. JobBatches splits the total number of jobs into batches as defined above; if the chosen number cannot be split evenly, the last batch will be shorter than the preceding ones.
  2. CondJobBase writes the individual batch files if needed, as well as the .sh and .sub scripts required for the job(s).
  3. CondJob creates configs/, jobs/, and logs/ sub-directories under the submit_dir set in the configuration, then creates and runs the jobs.

The base name for all files written by job_submit.py will be the same as the basename of the script being called, as set in the configuration. The file contents are stored in buffer lists and checked against existing files with the same name before being written: if the contents of the new file are identical to a previously written file, the script reuses that file rather than writing a new one; otherwise, it increments the version number, writes the new file and uses that. This is handled through a few new functions in common.py, with the main one being conditional_write.
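
For illustration, here is a rough sketch of the reuse-or-version logic just described; it is not the actual conditional_write implementation in common.py, only the idea behind it:

    import os

    def conditional_write_sketch(buffer_lines, base_path):
        """Illustrative only: reuse an existing file whose contents match the buffer,
        otherwise write a new, version-incremented file. Returns the path used."""
        version = 0
        while True:
            candidate = "{}_v{}.txt".format(base_path, version)
            if not os.path.exists(candidate):
                # no file with this version yet: write the buffer and use it
                with open(candidate, "w") as f:
                    f.write("\n".join(buffer_lines) + "\n")
                return candidate
            with open(candidate) as f:
                if f.read().splitlines() == buffer_lines:
                    return candidate  # identical contents: reuse the existing file
            version += 1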

How has this been tested?

Tested against all particle types with both 0 and 200 PU.

Checklist

  • I have performed a self-review of my code
  • My code follows the naming convention of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@isehle isehle requested review from bfonta and mchiusi as code owners October 9, 2023 12:12
@bfonta (Collaborator) left a comment

Nice to have this close to integration! A few general comments:

  • IIRC files was a bad argument name that had to be changed
  • Instead of handling the submission with files, which clutters the environment, it would be cleaner to use other options of the queue ... command (see the docs and the hedged example after this list)
  • Could you please remove the produce.cc file, which is no longer needed?
  • You are missing doc strings everywhere
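
For reference, a hedged sketch of the kind of queue usage being suggested, feeding the line-delimited parameters to Condor directly instead of writing intermediate batch files (file and variable names are illustrative):

    # excerpt of a .sub file: one job per line of radii.txt
    executable = run_cluster.sh
    arguments  = $(radius)
    queue radius from radii.txt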

@isehle (Collaborator, Author) commented Oct 11, 2023

The two commits that I just pushed should address all of the comments, let me know if there's still something I can update.

@bfonta bfonta self-requested a review October 11, 2023 19:23
@bfonta (Collaborator) left a comment

I've added a few more comments.
Regarding the previous review, you did not address the general comments, and importantly the comment on the queue-related command, which is quite central to the PR. It would require a bit of extra work, but would be more robust for future changes and more readable. It would make the code shorter, since IIUC much of the code for checking the files would not be required any longer.

And thanks for documenting the code; it makes a big difference!

@isehle (Collaborator, Author) commented Oct 12, 2023

Latest commits should resolve the rest of the comments.

@bfonta (Collaborator) left a comment

Only minor details. In addition, could you comment on the following:

IIRC files was a bad argument name that had to be changed

If possible, could @mchiusi also have a careful look?

@mchiusi (Collaborator) left a comment

Great job, Isaac!
I've been testing the flexibility of your code by running another script, produce.py, and I've identified some critical points that need to be improved. For the rest, everything is good and well-written.

Comment on lines 56 to 59
arg_keys = "Arguments ="
for arg in self.args.keys():
arg_keys += " $({})".format(arg)
arg_keys += "\n"
Collaborator:

I would make it a bit more flexible. As it stands, if I want to run produce.py, for instance, I will get an error, since that script does not recognise the option --radius which is reported here.
It would be good if only the needed arguments were taken into account.

Collaborator Author:

Well, that's the point of the configuration file; it's up to the user to modify config.yaml with the arguments and values that the script accepts.

Collaborator:

I partially agree with both comments: I would add a check that guarantees the argument in the configuration file is supported by the script, raising an exception otherwise. This would make debugging much quicker.

In addition, I noticed the _write_arg_keys function is called only once, and it modifies the list being passed. When you define a function, you ideally want to make it crystal clear what it is supposed to do. In this case, the only way to know is to inspect it directly. This defeats the improved structuring that splitting the code into self-contained functions usually brings, since you force the reader to search for the function, decreasing the code's readability. In my opinion, given that you only use the function once, it would be better to include it directly where it is now being called.

I also suspect if len(self.args) > 0: is not needed, since Arguments = would be correct syntax even without arguments, but please correct me (I haven't tested it).

Collaborator Author:

Good points, thank you. The functionality of _write_arg_keys has been moved out of a dedicated function and into prepare_multi_job_condor directly. I've written a quick argument checker which looks like:

    def _check_script_arguments(self):
        # The script's --args option prints one accepted argument name per line
        # and exits with a non-zero code (see the shell snippet below), hence
        # the returncode check before parsing stdout.
        script_args = subprocess.run([self.script, "--args"], capture_output=True, text=True)
        if script_args.returncode != 0:
            result = script_args.stdout.strip().split("\n")
        else:
            result = []

        assert result == list(self.args.keys()), "Provided arguments {} do not match accepted arguments {}.".format(list(self.args.keys()), result)

This requires that any passed script has a named argument --args that would print out the names of the arguments:

        --args)
           echo "radius"
           echo "particles"
           echo "pileup"
           exit 1;;

If you have any ideas on a more flexible way of doing this let me know, or if you think the name should be changed.
I'll additionally write a dummy script that takes no arguments and check whether everything works as expected if we drop the if len(self.args) > 0 check.

Collaborator Author:

Adding the assertion that the arguments in config.yaml match the arguments accepted by <script> is proving quite tricky. If either of you has ideas on how this can be done generally and robustly, please let me know; if not, this might deserve a dedicated PR.

Collaborator:

When a script runs from the CLI you can access its required arguments via --help, which means the information can be retrieved. I think this is important for this PR, but we can discuss.

The `PU0` files above were merged and are stored under `/data_CMS/cms/alves/L1HGCAL/`, accessible to LLR users and under `/eos/user/b/bfontana/FPGAs/new_algos/`, accessible to all lxplus and LLR users. The latter is used since it is well interfaced with CERN services. The `PU200` files were merged and stored under `/eos/user/i/iehle/data/PU200/<particle>/`.

<a id="job-submission"></a>
## Job Submission
Collaborator:

You could say explicitly that the Job Submission can be used for multiple purposes, and that it can be a good alternative to the local Skimming procedure presented in this README just a few lines above.

About this, I was trying to run produce.py, but I got the following error while trying to read the input file for the skimming, specified in the config.yaml file:

Error in <TFile::TFile>: file /eos/user/b/bfontana/FPGAs/new_algos/photons_0PU_bc_stc_hadd.root does not exist

Everything works if I run it locally. Do you know if we should add some additional configuration options to avoid this kind of access problem?

Collaborator Author:

Yes yes I'll mention the skimming procedure application too!

As for running it on EOS, have you confirmed that you can access /eos/user/b/bfontana/FPGAs/new_algos/photons_0PU_bc_stc_hadd.root directly from your terminal? I do remember using files on EOS being a major headache, though. I'll look into the problem further, but for now, can you check whether adding /opt/exp_soft/cms/t3/eos-login -username $USER -init to your script fixes the issue? This assumes that your LLR and CERN usernames are the same, which mine are not; if they differ, replace $USER directly.

Collaborator:

Yes, from my terminal I have access to this file. I will try running it with the command you suggested added directly in the script and will let you know.

@bfonta (Collaborator) left a comment

From @mchiusi's comments, I realized there are many tiny details that might go wrong once someone else tries the job submission, so I should also test it on my side.
Would it be ok to include a quick recipe in the README in order to run a dummy test? You could add a dummy script that prints a parameter to a file (the HTCondor *.out and *.err files, for instance), and I would try to run it on my side by defining an array of parameters in config.yaml following your instructions. This would mimic running someone else's script with some parameters of my choice.

@isehle (Collaborator, Author) commented Oct 16, 2023

The latest commit allows you to pass the Python files themselves, e.g. run_cluster.py or produce.py, directly as script. The executable written by write_exec_file() in job_submit.py and set in the .sub file now calls the Python script directly on the correct batches of arguments. This has been tested and works well, but I still need to address a few of the other comments before this is ready to be merged.
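
For illustration, a generated executable along these lines might look roughly like the following; the actual output of write_exec_file(), the paths and the argument names will differ:

    #!/usr/bin/env bash
    # Hypothetical wrapper: call the Python script directly with the batch
    # parameters that HTCondor substitutes into the job's arguments.
    python <path_to_script>/run_cluster.py --radius "$1" --particles "$2" --pileup "$3"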

@isehle (Collaborator, Author) commented Oct 17, 2023

Everything should be accounted for in the latest commit except for the things in the three remaining unresolved conversations. Personally I think these three things (argument verification, EOS files, and multiple iteration params) should wait for a dedicated PR because they're all fairly complicated. I'm happy to discuss more, but if you agree and everything else looks good, please resolve these remaining conversations.

@bfonta (Collaborator) commented Oct 24, 2023

@isehle Regarding argument verification, here goes my proposed solution.

You could run the script separately via the OS and parse its output. This is messy: it requires some non-Python interface (even if driven from Python), and it will be hard to generalize.
The alternative is to do everything within Python; the compromise is to slightly change the way optional arguments are now being defined. I propose we start encapsulating them into a function (a class would be more correct, but it is overkill for our current use-case).

You can find a running example here. You can run python inspect_script.py from the "Shell" tab. If the solution suits you, I propose you add it to the dummy script, to your own script and to produce.py, keeping all other scripts as they are. The reason is that these three scripts would be needed for testing. We can modify the others later on if really needed.
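
In case it helps the discussion, a rough sketch of the kind of encapsulation being proposed (function and option names are illustrative, not the actual code in the linked example):

    import argparse

    # Hypothetical layout: the script registers its options in one function, so
    # job_submit.py can build the same parser and inspect the expected arguments
    # without actually executing the script.
    def add_parser_args(parser):
        parser.add_argument("--radius", type=float, required=True, help="Clustering radius.")
        parser.add_argument("--particles", type=str, required=True, help="Particle type.")
        return parser

    if __name__ == "__main__":
        FLAGS = add_parser_args(argparse.ArgumentParser()).parse_args()
        print(FLAGS)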

@mchiusi (Collaborator) commented Oct 25, 2023

I like Bruno's proposal to encapsulate the arguments of each script in a function so that they can be accessed from outside the script. In this way you could easily verify which arguments are needed to launch a specific script.

I would also suggest adding the possibility to specify command-line arguments when running the job_submit.py script. The final result I have in mind, in the case that one wants to run the produce.py script, is something like this:

python bye_splits/production/submit_scripts/job_submit.py --script produce.py --particles photons

The command-line arguments should take precedence over those specified in config.yaml, which would be used only when an argument is not given on the command line.
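
As a rough illustration of the suggested precedence (the variable names here are made up, not code from the PR):

    # Hypothetical merge: start from the config.yaml values and let any option
    # explicitly given on the command line override them.
    merged_args = dict(config_args)
    merged_args.update({k: v for k, v in cli_args.items() if v is not None})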

Let me know what you think, and if you'd like, we can discuss it further!

@isehle (Collaborator, Author) commented Oct 30, 2023

@bfonta, thanks for the suggestion and head start! I figured it's probably best to start with a class even though I agree that it's overkill for now; it's just going to be easier to extend in the future, I imagine. You can find my approach here, where you can run python run_script.py, which uses the Arguments class in arg_script.py to first verify that the arguments are expected by test_script.py and then add them and run the script.
@mchiusi, I agree that it would be a nice idea to allow command-line arguments at the level of job_submit.py, though I need to think a bit further about the best way to implement them. I have some ideas, but it isn't trivial, so I'll keep you both updated.

@isehle (Collaborator, Author) commented Oct 30, 2023

The class has been updated so that it now takes command-line arguments with precedence over the default arguments, and everything is handled directly by the class itself: you can simply use Arguments.run_script(<arg_dict>) and any replacement, verification, and running will be done behind the scenes. You can see the implementation at the same link as before.

@isehle (Collaborator, Author) commented Oct 31, 2023

The latest commit should take care of all argument adding, combining (default + command line), and verification. I tested all combinations on the dummy script and ran a few combinations with run_cluster.py. As expected, it does indeed require changing the original script(s). Arguments must be defined in a dictionary called arg_dict and added using Arguments.add_args(arg_dict). The keys should be the argument names (including --; this is required for now), and the values should be dictionaries containing the **kwargs accepted by argparse's add_argument().
Example:

arg_dict = {
    "--float_arg": {
        "help": "Dummy float",
        "type": float,
        "required": True
    },
    "--str_arg": {
        "help": "Dummy string.",
        "type": str,
        "required": True,
    },
    # Generic argument will be interpreted as a string
    "--gen_arg": {
        "help": "Dummy generic.",
        "required": False,
        "default": None
    },
    "--flag": {
        "help": "Dummy flag number 2.",
        "action": "store_true",
        "required": False
    }
}

def dummy(pars):
    """Dummy script for testing HT Condor job submission.
    Prints passed argument key/vals and the type of <val>."""

    for key, val in pars.items():
        print("Passed {} for --{}, read as type {}.".format(val, key, type(val)))

if __name__ == "__main__":
    args = Arguments(script=__file__)
    FLAGS = args.add_args(description="A dummy script to submit.", arg_dict=arg_dict)
    dummy(FLAGS)

@isehle (Collaborator, Author) commented Oct 31, 2023

I suggest playing with tests/submission/dummy_run.py to get a feel for how Arguments handles things. A few things to note:

  1. The dictionary in the running script does not need to have any specific name. It must be called arg_dict only in the script that accepts the arguments, so that it can be read by the Arguments class and verified against what is passed from the running script.
  2. Try removing a required argument in the run script argument dictionary.
  3. Try changing arguments from the command line.
  4. Try adding an argument that isn't accepted by the "fundamental" script.
