-
Notifications
You must be signed in to change notification settings - Fork 25
Profiling #260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profiling #260
Conversation
tristan-f-r
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Almost every comment is a nitpick, except for the test comment - something to verify (not for correctness) that profiling doesn't suddenly error would be reassuring.
|
[Apologies for the force pushes - I was trying to fix the merge conflict's behavior on |
9dba53e to
2435903
Compare
|
One additional note when it comes time to verify this unit of work in the CHTC pool -- the tester should add the following constraint to their submit file (if running everything on one EP): or this line under the This pins jobs to execution points that a) run the minimal required version of condor and b) don't enable another feature (disk enforcement) that doesn't play well with what profiling does to cgroups. |
|
A question out of curiosity: what exactly happens if
How does the created cgroup sibling mess with the way that this |
I don't fully recall -- this is something the HTCondor developer who most heavily works on their cgroup management said would be needed. I suspect it has something to do with the logical volume mount in the outer container hiding the cgroup tree or making it unwritable. |
|
This created a bad merge conflict with #283. I would like to resolve it myself, but I don't know how to test the profiling changes here well enough to guarantee that I didn't miss anything. |
@jhiemstrawisc are you able to check out this merge conflict at some point? |
agitter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a first pass through the code. I would still like to try running it myself and will then comment again.
Since this initial pull request, we now have new pathway reconstruction algorithms to support.
This commit adds the needed bits for the main Python process to create a peer cgroup (linux only) such that when profiling is enabled, the PRM containers are run under this cgroup with the `memory.peak` and `cpu.stat` controllers enabled. Unfortunately we can't just point Python at some PID, because the PRM containers launch various processes without reporting the PIDs back to the originating process. This prevents us from regular inline monitoring.
bdcd4d5 to
29f0001
Compare
Documentation build overview
Show files changed (4 files in total): 📝 4 modified | ➕ 0 added | ➖ 0 deleted
|
|
@agitter one last thing I'd like your input on is where to document what you brought up in the remaining unresolved comment. On a side note, I ran another round of manual tests in CHTC's HTCondor pool to double check nothing broke after cleaning up all the merge conflicts. Looks like everything still works! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RWR and ST_RWR are missing the out_dir param in run_container_and_log, which is causing CI to fail. (I can fix and commit that change to this branch if that's okay.)
|
ResponseNet is also missing @tristan-f-r did you already re-review after Justin resolved merge conflicts or should I? |
|
I'll do another pass 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still works. The merge conflict resolution introduced a double print of the container stdout (see review comment below), but this otherwise still seems fine 👍
| mycgroup = os.path.join("/sys/fs/cgroup", cgroup_rel.lstrip("/")) | ||
| peer_cgroup = os.path.join(os.path.dirname(mycgroup), f"spras-peer-{os.getpid()}") | ||
|
|
||
| # Create the peer cgroup directory | ||
| try: | ||
| os.makedirs(peer_cgroup, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we avoid pathlib here on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this unresolved comment but am likely to approve anyway to get this merged. We may not be consistent with pathlib throughout the code base even though I agree it would be better to be.
…RWR,STRWR These were PRMs whose `run_container` arguments were missed when I was updating everything to pass an output dir around
The removed functions are in `profiling.py` and should have been removed from `containers.py`. This also restores a comment that was removed while fixing merge conflicts.
|
I think I finally cleaned up the various sources of CI failures. On a side note, something's up with This has been going on for awhile, so I started authoring all commits with |
|
|
tristan-f-r
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I would still like to see pathlib usage in profiling.py, but this looks good! 👍)
agitter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not run this yet but all major comments have been addressed.
No description provided.