Develop #376
Conversation
…to add-edu-dev-student-and-prof-contexts
Tycho sessionid
…to add-edu-dev-student-and-prof-contexts
…f-contexts Add eduhelx-dev-student and eduhelx-dev-profesor brands
BUG removing the wordy line
Pr template
fix bug with expiry default constraint
change anon to full to fix serialization bug
Adding resource optimizations
Make 0.5cpu the standard for all apps
Makefile
```diff
 ifeq "${DEBUG}" "true"
-LOG_LEVEL := "debug"
+LOG_LEVEL := DEBUG
```
This shouldn't be changed. This variable is only used in the run command of gunicorn, which accepts the following values for --log-level (the default being 'info'):
- 'debug'
- 'info'
- 'warning'
- 'error'
- 'critical'
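To make the point concrete, here is a minimal sketch (not the project's actual Makefile logic; the helper name is hypothetical) of mapping a DEBUG flag onto one of the values gunicorn actually accepts:

```python
import os

# Values gunicorn accepts for --log-level; gunicorn's default is "info".
VALID_LOG_LEVELS = {"debug", "info", "warning", "error", "critical"}

def gunicorn_log_level(env=os.environ):
    # Hypothetical helper: pick "debug" only when DEBUG is exactly "true",
    # otherwise fall back to gunicorn's default.
    level = "debug" if env.get("DEBUG") == "true" else "info"
    assert level in VALID_LOG_LEVELS
    return level
```

Passing the uppercase string DEBUG here would fail the membership check, which is the reviewer's objection.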
```python
# Dividing resources by 2 for reservations
# This helps in cloud scenarios to reduce costs.
if self.cpus >= 0.5:
```
Should this be less than or equal to, rather than what you have? Currently, if the cpus are set to anything like 2 or 6, the request will be set to 0.5, which seems wrong. I think you just want to set the requests equal to half of the limits for any cpu value over 2, right?
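To contrast the two behaviors being discussed, here is a hedged sketch (function names are illustrative, not tycho's API):

```python
def requests_capped(cpus, cap=0.5):
    """Behavior in this PR as written: any CPU limit at or above the
    cap yields a flat 0.5-CPU request; smaller values pass through."""
    return cap if cpus >= cap else cpus

def requests_halved(cpus):
    """Earlier behavior the reviewer describes: request half the limit."""
    return cpus / 2
```

With a 6-CPU limit, `requests_capped(6)` gives 0.5 while `requests_halved(6)` gives 3.0, which is the gap the thread is debating.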
No, this was a change based on PJ's request. Originally it was half, like you're saying.
Maybe the way this should work is that the 'reservations' value from each docker compose file is used for the requests in a pod, and the limits value is the maximum a user can select when creating the helx app. This would mean someone needs to make sure all the 'reservations' values in each docker compose file are set right (just enough to get each helx app pod going).
0.5 CPU requests might be too much, but that will depend on each app. It's better than what it was, though. For the helx apps running in the cluster last night, that would be around 50 CPUs reserved for the apps.
I think the requests value for CPUs needs to be set to the average number of CPUs the container needs to do its job under normal load (not idle), and the limits should be set to the maximum we think students will ever need for the application to do its work in a reasonable amount of time. We need to make sure containers aren't slow and sluggish when many users are running pods. And if setting requests to that normal-load average pushes the total above the number of CPUs provisioned in ASHE, that's an indication that ASHE is underprovisioned for what we need.
To summarize, here's what I think we should do:
- enable pod-reaper again and make it reap pods after 24 hours of idle time
- tell people to use reasonable values for the requests in helx-apps depending on the image, and set the limits to whatever we think students should never need to use
- tell students to use LIMIT if their queries are taking too long or if the container crashes
- disable pgadmin backups until we fix the sessions bug, which is causing the backups to fail
- if, after doing 1-4, requests are set to the minimum at which applications can still function under normal load and ASHE is still having problems, then we should look into increasing the CPU provisions in ASHE
Figuring out the "idle time" in #1 is what the problem has been in the past, unless someone has figured that trick out. In some of the labs I've done recently there is a timer shown in the main web app, which can be increased or decreased while someone is viewing the page; when the timer runs out, their lab environment is destroyed. Maybe something like that would work better than idle time for the helx apps.
pod-reaper isn't the best or easiest solution. I like the low-hanging fruit here. As far as 0.5 CPU for all apps goes, my hope is that something like this gets us to a ballpark we can refine later as we fine-tune. I'd rather have too much than too little, coming from how greedy we are now with requests.
Ah, I misunderstood how pod-reaper works. We don't have to use that.
I still think the other 4 points I laid out are better than capping requests at 0.5 for every tycho-launched pod, but I'm okay with this change going through if we commit to revisiting it soon after we've made the changes I suggested. Long-term, I think we should probably set 1 CPU as the maximum the requests can be, instead of 0.5.
Are y'all good with that plan?
I'd like to be data driven, based on whether users experience throttling or not. I am of a similar mind as @pj-linebaugh in thinking we will actually give less CPU once we've used this for a while and found no issues. I have no issues with your 2-4 plan, though; that all seems reasonable.
Ultimately, I think the reservations values in the docker compose files for each helx app should be used for their requests values, and the limits values in each docker compose file can be used for the pod limits and/or the maximum value users can choose when launching the app. Maybe don't even give them the option, except for adding a GPU. I guess we can set standard values for a helx app's requests and limits if they aren't specified in the docker compose file.
Each helx app is different, and its resources need to be set accordingly. The same helx app can also need different resources depending on the project: one project might need pgadmin to work with a small dataset, another might need it to work with a huge one.
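The mapping described above can be sketched roughly as follows. This is a hedged illustration, not tycho's implementation; the function name and the 0.5 fallback are assumptions:

```python
DEFAULT_CPU = 0.5  # assumed fallback when the compose file specifies nothing

def pod_cpu_resources(reservation=None, limit=None):
    # Use the compose 'reservations' value for the pod's requests,
    # and the compose 'limits' value for the pod's limits (or the
    # maximum a user may select when launching the app).
    requests = reservation if reservation is not None else DEFAULT_CPU
    limits = limit if limit is not None else max(requests, DEFAULT_CPU)
    return {"requests": {"cpu": requests}, "limits": {"cpu": limits}}
```

Under this scheme, an app whose compose file reserves 0.25 CPUs only requests 0.25, instead of the flat 0.5 that the current change applies to everything.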
```python
name = 'core'

def ready(self):
    import core.signals
```
Is this unfinished?
No, it's importing signals.
core.signals is an empty file
An empty file?
@frostyfan109 What is this empty file about?
Signals were originally going to be used for a feature, I think. We can delete it, but it's basically just boilerplate.
Thanks
fix setting
```python
import uuid
import yaml
import copy
from typing import TypedDict, Optional
```
Are you using these imports?