@Amdrel
Contributor

@Amdrel Amdrel commented Dec 4, 2025

Description

Phabricator Ticket: https://phabricator.wikimedia.org/T410732

  • The restart button works only for tasks created by the currently logged-in user. Those of us with admin privileges in the UI, like Amdrel and me, should be able to restart failed tasks created by other users.
  • We can't see which encoding instance a task was running on; right now it's a guessing game. The instance should be displayed somewhere.

Normal users have no idea of the current load. The frontend should display, and refresh, how many tasks are currently being processed or waiting ahead of their own, so that they can see when the system is just busy and not stuck. Users sometimes cancel their tasks because they think they're stuck, when someone has simply created many tasks before them.

Changes

  • Users in the sudoers list in Redis (maintainers) can now manage other users' tasks (relogin required for old sessions)
  • Worker statistics are now displayed on the dashboard and are updated ASAP
    • Currently displayed values are as follows:
      • Capacity: The number of busy workers and the number of available workers
      • Utilization: A percentage calculated from the used capacity
      • Pending: The number of tasks that aren't being processed by any worker
    • There is also a new cron job that updates the stats every 5 minutes, querying the current capacity in case workers go up or down. However, as configured in backend.pp, it currently runs on all worker instances although it doesn't need to. Ideally this job would run on only one instance.
    • Stats will not display until after the first cron job run is done
    • If stat writes fail for any reason then they shouldn't impact the workers
    • A distributed lock is used to ensure updates to stats by workers are atomic
  • Running tasks now display the instance that jobs are running on. Old tasks created prior to this change will display 'N/A'.
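
For illustration, here is a minimal sketch of how the three displayed values relate to each other. It is not the code in this PR: the `WorkerStats` class and `format_stats` helper are hypothetical names, and it assumes that "available workers" means the total worker count (busy plus idle).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerStats:
    """Hypothetical snapshot of the values shown on the dashboard."""

    busy: int     # workers currently processing a task
    total: int    # assumed meaning of "available": busy + idle workers
    pending: int  # tasks not yet picked up by any worker

    @property
    def utilization(self) -> float:
        """Used capacity as a percentage; 0.0 when no workers are known."""
        if self.total <= 0:
            return 0.0
        return 100.0 * self.busy / self.total


def format_stats(stats: WorkerStats) -> str:
    """Render the stats as a single dashboard line."""
    return (
        f"Capacity: {stats.busy}/{stats.total} | "
        f"Utilization: {stats.utilization:.0f}% | "
        f"Pending: {stats.pending}"
    )
```

Guarding against a zero worker count matters before the first cron run, when no capacity has been recorded yet.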

@Amdrel
Contributor Author

Amdrel commented Dec 4, 2025

In this change I adjusted the path to the healthcheck cron job since it looked incorrect to me. I can't verify this because I can't SSH into the worker instances. Please let me know if I need to update the cron job definitions.

@don-vip
Collaborator

don-vip commented Dec 4, 2025

Ah yes, I'm checking if I can grant you SSH access now.

@don-vip
Collaborator

don-vip commented Dec 4, 2025

@Amdrel I was now able to add you as a member of:

  • toolforge video2commons project
  • toolforge video2commons-socket-io project
  • toolforge video2commons-test project
  • cloudvps video project

So now you should be able to become video2commons(-io,-test) on toolforge bastion, and ssh to the cloudvps instances. Do you need some guidance to try?

@Amdrel
Contributor Author

Amdrel commented Dec 4, 2025

Thank you! I am able to ssh in and become video2commons on tools-bastion, and I can see instances in Cloud VPS and their logs in the web interface. As for ssh access into Cloud VPS servers, is this done through the tools bastion (tunnel), and if so do I use the same private key?

Edit: I found https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances and I'm currently reading it

@Amdrel
Contributor Author

Amdrel commented Dec 4, 2025

I'm able to ssh into encoding instances now. Thank you!

@Amdrel
Contributor Author

Amdrel commented Dec 4, 2025

I've noticed a misconfiguration with the cron jobs (not just the new one), so please don't merge this for now. I'll update T407375 with more information soon as it pertains to that ticket.

@Amdrel Amdrel changed the title Allow admins to restart tasks and added worker stats to the dashboard [DO NOT MERGE] Allow admins to restart tasks and added worker stats to the dashboard Dec 8, 2025
@Amdrel Amdrel changed the title [DO NOT MERGE] Allow admins to restart tasks and added worker stats to the dashboard Allow admins to restart tasks and added worker stats to the dashboard Dec 15, 2025
@Amdrel
Contributor Author

Amdrel commented Dec 15, 2025

I've added some changes, though I had to squash them to make it easier to merge in the minified JavaScript changes from master, since another patch made changes there.

Since the new job needs to connect directly to workers using Celery, which only the workers have as a dependency, I kept the job there. To prevent all the workers from clobbering each other with inspect requests, I added some random sleep and a timestamp check. I also updated the job to run once an hour.
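
The random sleep plus timestamp check could look roughly like the sketch below. Everything in it is an assumption for illustration, not the code in this PR: a plain mapping stands in for Redis, `inspect_workers` stands in for the Celery inspect call, and the key names, `MIN_INTERVAL`, and `MAX_JITTER` values are made up.

```python
import random
import time
from typing import Callable, MutableMapping, Optional

MIN_INTERVAL = 55 * 60  # assumed: skip if stats were refreshed in the last ~hour
MAX_JITTER = 120        # assumed: upper bound for the random sleep, in seconds


def refresh_stats(
    store: MutableMapping[str, float],
    inspect_workers: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
    rng: Optional[random.Random] = None,
) -> bool:
    """Refresh capacity unless another instance did so recently.

    Returns True if this call performed the refresh, False if it skipped.
    """
    rng = rng or random.Random()
    # Random sleep so instances started by the same cron tick spread out.
    sleep(rng.uniform(0, MAX_JITTER))

    # Timestamp check: a recent refresh by another instance means we skip.
    last = store.get("stats:updated_at", 0.0)
    if now() - last < MIN_INTERVAL:
        return False

    # Record the timestamp before the slow inspect call to narrow the race.
    store["stats:updated_at"] = now()
    store["stats:capacity"] = inspect_workers()
    return True
```

The check-then-set here is not atomic, so two instances that wake at almost the same moment could both refresh; the jitter only makes that unlikely.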

For context, the job is needed only for the initial stats collection and to update the active worker count that's displayed on the frontend. The workers will update the stats as they take on and complete jobs. The workers will need to be restarted for this change, and stats will be a little inaccurate until all of them get the patch.
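
As a rough sketch of the worker-side updates, the snippet below adjusts counters when a task starts or finishes and swallows any failure so the task itself is unaffected. It is illustrative only: a local `threading.Lock` and a plain mapping stand in for the distributed lock and Redis, the function and key names are hypothetical, and it assumes a task leaves the pending count when a worker picks it up.

```python
import logging
import threading
from typing import MutableMapping

log = logging.getLogger(__name__)

# Stand-in for the distributed lock; only protects threads in one process.
_stats_lock = threading.Lock()


def adjust_stats(
    store: MutableMapping[str, int], busy_delta: int, pending_delta: int = 0
) -> bool:
    """Apply counter deltas under the lock; never raise into the caller."""
    try:
        with _stats_lock:
            store["busy"] = max(0, store.get("busy", 0) + busy_delta)
            store["pending"] = max(0, store.get("pending", 0) + pending_delta)
        return True
    except Exception:
        # A failed stats write is logged and ignored so the task continues.
        log.exception("stats update failed; continuing with the task")
        return False


def on_task_started(store: MutableMapping[str, int]) -> bool:
    """Assumed hook: a worker picked up a pending task."""
    return adjust_stats(store, busy_delta=1, pending_delta=-1)


def on_task_finished(store: MutableMapping[str, int]) -> bool:
    """Assumed hook: a worker completed (or failed) its task."""
    return adjust_stats(store, busy_delta=-1)
```

Clamping at zero keeps the counters sane while some workers are still running the old code and not reporting.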
