
Conversation

@joshua-seals (Collaborator)

No description provided.

frostyfan109 and others added 24 commits May 23, 2024 16:06
…to add-edu-dev-student-and-prof-contexts

Add eduhelx-dev-student and eduhelx-dev-profesor brands
BUG removing the wordy line
fix bug with expiry default constraint
change anon to full to fix serialization bug
Make 0.5cpu the standard for all apps
Makefile Outdated

ifeq "${DEBUG}" "true"
-LOG_LEVEL := "debug"
+LOG_LEVEL := DEBUG
Contributor

This shouldn't be changed. This variable is only used in gunicorn's run command, which accepts the following values for --log-level, with the default being 'info':

  • 'debug'
  • 'info'
  • 'warning'
  • 'error'
  • 'critical'
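To make the concern concrete, here is a hypothetical sketch (the function name is illustrative, not from the PR) of mapping the Makefile's DEBUG flag onto a value gunicorn's --log-level will actually accept:

```python
# gunicorn only accepts these lowercase names for --log-level,
# defaulting to "info"; an uppercase DEBUG would not be valid.
VALID_LOG_LEVELS = {"debug", "info", "warning", "error", "critical"}

def resolve_log_level(debug_flag: str) -> str:
    """Hypothetical helper: turn the Makefile DEBUG flag into a
    gunicorn-compatible log level."""
    level = "debug" if debug_flag == "true" else "info"
    assert level in VALID_LOG_LEVELS
    return level
```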


# Dividing resources by 2 for reservations
# This helps in cloud scenarios to reduce costs.
if self.cpus >= 0.5:
Contributor

Should this be less than or equal to, rather than what you have? Currently, if cpus is set to anything like 2 or 6, it will be set to 0.5, which seems wrong. I think you just want to set the requests equal to half of the limits for any CPU value over 2, right?
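The two readings being contrasted here can be sketched as follows (function names are hypothetical; only the first reflects the diff as written):

```python
def requests_clamped(cpus: float) -> float:
    # the diff as written: any app with >= 0.5 CPUs gets its
    # request clamped down to a flat 0.5
    return 0.5 if cpus >= 0.5 else cpus

def requests_halved(cpus: float) -> float:
    # the alternative reading: request half of the limit
    return cpus / 2
```

With cpus set to 6, the first returns 0.5 and the second returns 3.0, which is the discrepancy being pointed out.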

Collaborator Author

No, this was a change based on PJ's request. Originally it was half, like you are saying.

Contributor

Maybe the way this should work is that the 'reservations' value from each docker compose file is used for the requests in a pod and then the limits value would be the maximum value that a user can select when creating the helx app. This would mean that someone needs to make sure that all the 'reservations' values in each docker compose is set right (just enough to get each helx app pod going).
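The suggested mapping could look something like this (a minimal sketch; the key names follow docker compose's deploy.resources schema, and the function itself is hypothetical):

```python
def compose_to_pod_resources(deploy_resources: dict) -> dict:
    """Hypothetical translation of a compose file's
    deploy.resources block into pod resources."""
    return {
        # reservations from the compose file become the pod's requests
        "requests": deploy_resources.get("reservations", {}),
        # limits carry over as the pod's limits / the maximum a user
        # could select when launching the helx app
        "limits": deploy_resources.get("limits", {}),
    }
```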

Contributor

0.5 CPU requests might be too much, but that will depend on each app. Better than what it was, though. For the helx apps running in the cluster last night, that would be around 50 CPUs reserved for the apps.

Contributor

I think the requests value for CPUs needs to be set to the average amount of CPUs the container needs to do its job under normal load (not idle), and the limits should be set to the maximum value we think students will ever need for the application to do its work in a reasonable amount of time. We need to make sure that containers aren't slow and sluggish when a lot of users are running pods. And if we find that the total CPUs requested, with requests set to that normal-load average, is higher than the number of CPUs provisioned in ASHE, that's an indication that ASHE is underprovisioned for what we need.

To summarize, here's what I think we should do:

  1. enable pod-reaper again and make it reap pods after 24 hours of idle time
  2. tell people to use reasonable values for the requests in helx-apps depending on the image, and set the limits to whatever we think students should never need to use
  3. tell students to use LIMIT if their queries are taking too long or if the container crashes
  4. disable pgadmin backups until we fix the sessions bug, which is causing the backups to fail
  5. if, after doing 1-4, requests are set to the minimum that still lets applications function under normal load and ASHE is still having problems, then we should look into increasing the CPU provisions in ASHE

Contributor

Figuring out the "idle time" in #1 is what the problem has been in the past, unless someone has figured that trick out? In some of the labs I've done recently there is a timer that is shown in the main web app, which can be increased or decreased while someone is viewing the page. When the timer runs out their lab environment is destroyed. Maybe something like that would work better than idle time for the helx apps.

Collaborator Author (@joshua-seals, Mar 11, 2025)

pod-reaper isn't the best solution or the easiest. I like the low-hanging fruit here. As far as 0.5 CPU for all: my hope right now is that something like this can get us to a ballpark that we can work out in the future as we fine-tune. I'd rather have too much than too little, going from being super greedy like we are now with requests.

Contributor

Ah, I misunderstood how pod-reaper works. We don't have to use that.

I still think the other 4 points I laid out are better than setting the requests to no more than 0.5 for every tycho-launched pod, but I'm okay with having this change go through if we commit to revisiting it soon after we've made the changes I suggest. Long-term, I think we should probably set the max requests to 1 CPU instead of 0.5.

Are y'all good with that plan?

Collaborator Author

I'd like to be data driven, based on whether users experience throttling or not. I am of similar mind as @pj-linebaugh in thinking we will actually give less cpu once we've used this for a while and find no issues. I have no issues with your 2-4 plan though at all. That all seems reasonable.

Contributor

Ultimately, I think the reservations values in the docker compose files for each helx app should be used for their requests values, and the limits values in each docker compose file can be used for the limits in the pods and/or the maximum value that users can choose when launching the app. Maybe even don't give them the option, except for adding a GPU. I guess we can set standard values for a helx app's requests and limits if they aren't specified in the docker compose file.

Each helx app is different, and their resources need to be set accordingly. The same helx app can also need different resources allocated to it depending on the project: one project might need pgadmin to work with a small dataset, and another might need a pgadmin that works with a huge dataset.

name = 'core'

def ready(self):
    import core.signals
Contributor

Is this unfinished?

Contributor

No, it's importing signals.

Contributor

core.signals is an empty file

Contributor

An empty file?

Collaborator Author

@frostyfan109 What is this empty file about?

Contributor

Signals were originally going to be used for a feature, I think. We can delete it, but it's basically just boilerplate

Collaborator Author

Thanks

fix setting
import uuid
import yaml
import copy
from typing import TypedDict, Optional


Are you using these imports?

@joshua-seals joshua-seals merged commit 590ed00 into master Mar 12, 2025
9 of 10 checks passed


8 participants