
Cooldown state rework #455

Open

IAvecilla wants to merge 108 commits into gcs-2 from cooldown-rework

Conversation

@IAvecilla
Contributor

@IAvecilla IAvecilla commented Dec 23, 2025

Depends on #463

This PR contains several changes to client arguments, dependencies, and the general cooldown state logic. Here’s a summary of the changes made in this PR:

  • All clients are now potential checkpointers and must be prepared to perform that task. At the end of an epoch, during the Cooldown state, a subset of clients (1/3 of the total clients, capped at SOLANA_MAX_NUM_CHECKPOINTERS, currently 16) is deterministically selected as checkpointers using a seeded shuffle based on the round’s random seed (see the seeded-shuffle sketch after this list).
  • Cooldown is now an opportunistic state, just like the others. Once a client uploads the entire model to external storage, it sends a transaction that moves the run forward to the next epoch. If no transaction is received, the coordinator waits for the full cooldown_time before moving on.
  • Added cancellation support so that once one client successfully checkpoints, the others can cancel their upload tasks and stop uploading in order to move to the next epoch (see the cancellation sketch after this list).
  • The --hub-repo and --gcs-bucket flags were removed from the training arguments. Checkpoint destinations are now derived from the coordinator configuration, and there is a single repo or bucket that contains all the checkpoints for the run.
  • Added a new client flag, --skip-checkpoint-upload, for testing purposes. This avoids uploading and saving checkpoints locally and should not be used outside testing environments.
  • Updated the google-cloud-storage crate to version 1.6.
  • Added bucket and hub repo permission verification before joining a run to ensure that the client can eventually checkpoint.
  • The run manager now expects the GCP bucket credentials to be present inside the SCRATCH_DIR in order to work correctly when using GCP checkpoints.
  • Updated the psyche book with an explanation of the new cooldown behavior.
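
A minimal sketch of the checkpointer selection described above, assuming a seedable RNG; the helper name, the RNG choice, and the at-least-one floor are illustrative, not necessarily what the PR implements:

use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

const SOLANA_MAX_NUM_CHECKPOINTERS: usize = 16;

/// Deterministically pick this epoch's checkpointers from the client list.
fn select_checkpointers(num_clients: usize, round_seed: u64) -> Vec<usize> {
    // A third of the clients, at least one, capped at the on-chain maximum.
    let num_checkpointers = (num_clients / 3).clamp(1, SOLANA_MAX_NUM_CHECKPOINTERS);
    // Shuffle all client indices with an RNG seeded from the round's random seed,
    // so every client derives the same set without extra coordination.
    let mut indices: Vec<usize> = (0..num_clients).collect();
    let mut rng = ChaCha8Rng::seed_from_u64(round_seed);
    indices.shuffle(&mut rng);
    indices.truncate(num_checkpointers);
    indices
}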
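And a rough sketch of the cancellation side, assuming the upload tasks share a tokio_util CancellationToken; the function and the stand-in do_upload are hypothetical:

use tokio_util::sync::CancellationToken;

/// Stand-in for the real model upload; only here to make the sketch compile.
async fn do_upload() -> anyhow::Result<()> {
    Ok(())
}

/// Run the upload, but bail out as soon as some other checkpointer's
/// transaction has already moved the run to the next epoch.
async fn upload_checkpoint_with_cancel(cancel: CancellationToken) -> anyhow::Result<()> {
    tokio::select! {
        _ = cancel.cancelled() => Ok(()), // another client finished first
        result = do_upload() => result,
    }
}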

@IAvecilla IAvecilla self-assigned this Dec 23, 2025
hub_repo
)
let Model::LLM(LLM { checkpoint, .. }) = &self.coordinator_state.model;
if !self.skip_upload_check {
Contributor

maybe we can invert the check and return early

Contributor Author

We have to execute the rest of the code in that function, so I don’t think we can return early there, unless you meant something else?

Collaborator

considered pulling this out into a fn? this seems like something that could be deduplicated between the centralized and decentralized clients

Comment on lines +211 to +213
if skip_upload {
info!("Skipping checkpoint save and upload (skip_upload flag is set)");
checkpoint_completed.store(true, Ordering::SeqCst);
Contributor

you can return Ok(evals) early and you don't need to have an else branch here
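
For reference, a sketch of the early-return shape being suggested, assuming the surrounding function returns a Result and that evals is already in scope at this point:

if skip_upload {
    info!("Skipping checkpoint save and upload (skip_upload flag is set)");
    checkpoint_completed.store(true, Ordering::SeqCst);
    return Ok(evals);
}
// ...the save/upload path continues here without an `else` branch.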

Comment on lines 16 to 17
//#[error("GCS operation failed: {0}")]
//GcsStorage(#[from] google_cloud_storage::client::Error),
Contributor

should we delete this?

@IAvecilla IAvecilla marked this pull request as ready for review January 23, 2026 19:53

// Use HF_TOKEN from checkpoint_config for Hub uploads
if let Some(ref hub_token) = config.hub_token {
Some(UploadInfo::Hub(HubUploadInfo {
hub_repo: String::new(), // Will be validated when actual checkpoint is received
Collaborator

this feels a little strange to me, carrying a dummy hub repo around. any way to get this as some sort of enum with a filled/unfilled variant, or something?

Contributor Author

Totally true. Since right now the only way to get the checkpoint is through the Coordinator, we have to update the source once the client connects and receives the updates. I refactored it to use an enum that carries the minimum required info at first, and then switches to the other variant once all the necessary information is available.
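
Something like the following shape, where the type, variant names, and fields are illustrative rather than what the code ends up with:

/// Illustrative only: the upload target starts out with just the credentials and
/// is swapped to the resolved variant once the coordinator sends the checkpoint config.
enum HubUploadTarget {
    /// Only the token is known yet; the repo arrives later from the coordinator.
    Pending { hub_token: String },
    /// Everything needed to actually upload.
    Resolved { hub_token: String, hub_repo: String },
}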


let (mut app, allowlist, p2p, state_options) =
build_app(cancel, server_addr, tx_tui_state, args)
build_app(cancel, server_addr, tx_tui_state, args, false)
Collaborator

is_testing arg feels a little strange here, just passing a raw boolean.. but maybe this is fine

Contributor Author

Yeah, I don’t like it either, but it was a quick way to avoid checking credentials and permissions for uploading to external storage. I tried another approach to avoid adding new flags everywhere.

model::Checkpoint::Hub(HubRepo { repo_id, revision })
| model::Checkpoint::P2P(HubRepo { repo_id, revision }) => {
Some(UploadInfo::Hub(HubUploadInfo {
hub_repo: (&repo_id).into(),
Collaborator

weird, does the into() not autoref repo_id?

Contributor Author

With some refactors I made, I think this part is no longer needed. But yeah, I thought the same when using into, and it was kind of suggesting that approach.

"storage.objects.delete",
];

let resource = format!("projects/_/buckets/{}", gcs_info.gcs_bucket);
Collaborator

this seems like a lot of GCS-specific stuff to have in the main app file. have we considered making a tiny wrapper module in some other crate that makes this api a bit more like the hf one? or even a trait to abstract reading / writing checkpoints?

Contributor Author

I moved all the credential-specific logic into a new module, where I defined the UploadInfo and UploadCredentials structs, and we only call validate_credentials from there. Maybe the trait approach is even better, so that’s a good suggestion, I’ll try that. But at least this should be an improvement.
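
As a rough illustration of the trait direction (names and signatures here are hypothetical, not the PR's actual API):

use std::path::Path;

use async_trait::async_trait;

/// Placeholder error type for the sketch.
#[derive(Debug)]
struct CheckpointStoreError(String);

/// Both the Hugging Face Hub and GCS backends would implement this, and the
/// client would only hold a `Box<dyn CheckpointStore>` regardless of backend.
#[async_trait]
trait CheckpointStore: Send + Sync {
    /// Verify up front that the credentials allow writing checkpoints.
    async fn validate_credentials(&self) -> Result<(), CheckpointStoreError>;
    /// Upload the checkpoint directory produced at the end of an epoch.
    async fn upload_checkpoint(&self, local_dir: &Path) -> Result<(), CheckpointStoreError>;
}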


let upload_handle = tokio::task::spawn(async move {

let upload_info = match checkpoint {
Collaborator

this kind of duplication of structure makes me wonder if we shouldn't have a separate enum for, like

enum HostedModel {
    Gcp { /* whatever */ },
    Hub { /* whatever */ },
}

and then a checkpoint would be

enum Checkpoint {
    Hosted(HostedModel),
    P2P(HostedModel),
}

just to remove this weird duplication of p2p / p2pgcs. thoughts?

Contributor Author

The only reason I did this is that adding a new enum containing those variants would change the size of the Checkpoint struct, which would in turn change the Coordinator’s memory layout. I wanted to avoid a redeployment caused by storage changes. I much prefer the suggested approach, but to avoid dealing with those changes I went in this direction, does that make sense?

}
}

fn default_checkpoint_dir() -> PathBuf {
Collaborator

should this not write to somewhere in the scratch dir in the container, so all our files end up in the same place?

Contributor Author

Yep, good catch. I changed it to check whether we have the scratch directory and to keep all the local checkpoints there.

let seed = sha256(&round.random_seed.to_le_bytes());

let checkpointers = max(
(coordinator.epoch_state.clients.len() / 3).min(SOLANA_MAX_NUM_CHECKPOINTERS),
Collaborator

will this be correct if a client has requested to be withdrawn e.g. within warmup? (it might be, just sanity checking)

Contributor Author

I think that on every state transition we call move_clients_to_exited to check whether any clients are unhealthy and move them from epoch_state.clients to epoch_state.exited_clients, so that should handle it. Worst case, if we don’t move one, the idea is to select multiple checkpointers at the end of an epoch to increase the chances that at least one of them is healthy and successfully uploads the full model.

Hub(HubRepo),
P2P(HubRepo),
Gcs(GcsRepo),
P2P(HubRepo),
Collaborator

doesn't re-ordering here mean that existing coordinators currently set to P2P will suddenly decode as Gcs?

Contributor Author

YES! nice catch, I changed it to keep the original order 🙂
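
For context, a small illustration of why the order matters, assuming an order-dependent serialization format (e.g. borsh-style tags assigned by declaration order); the placeholder types only exist to make the example stand alone:

struct HubRepo; // placeholder
struct GcsRepo; // placeholder

enum Checkpoint {
    Hub(HubRepo), // tag 0
    P2P(HubRepo), // tag 1 -- values already stored on-chain keep this tag
    Gcs(GcsRepo), // tag 2 -- new variant appended at the end, so old data still decodes
}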

.arg("--env-file")
.arg(&self.env_file);

if let Some(dir) = &self.scratch_dir {
Collaborator

doesn't this assert that you MUST pass google credentials for any run started via run-manager, meaning we can't do e.g. hf checkpointing anymore?

also, i would much rather we find a way to pass the creds via some other mechanism - maybe write them to a temp dir the same way we write the env file, or something? adding another "you must have this file in this dir" thing is janky IMO

Contributor Author

Yes! Also, thank you for catching that. I changed it to be optional and now we only mount the credentials into Docker if they exist locally, but this should support Hugging Face the same way as before.

I agree that asking to move the credentials to the scratch dir is kind of annoying. The only alternative I can think of is setting GOOGLE_CREDENTIALS_FILE_PATH in the .env to point to the local credential file, and then having the manager script automatically mount that file into the scratch dir and set the env variable. Does that sound better?

Collaborator

I agree that asking to move the credentials to the scratch dir is kind of annoying. The only alternative I can think of is setting GOOGLE_CREDENTIALS_FILE_PATH in the .env to point to the local credential file, and then having the manager script automatically mount that file into the scratch dir and set the env variable. Does that sound better?

Yeah, i think this would work better if that's cool :)
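
A possible sketch of that flow in the run manager (the env var name, container path, and the surrounding `command` builder are assumptions, not the agreed-on implementation):

// If GOOGLE_CREDENTIALS_FILE_PATH is set in the .env, mount that file into the
// container's scratch dir and point the standard GCP env var at it, instead of
// requiring the operator to copy the credentials there by hand.
if let Ok(creds_path) = std::env::var("GOOGLE_CREDENTIALS_FILE_PATH") {
    command
        .arg("-v")
        .arg(format!("{creds_path}:/scratch/gcp-credentials.json:ro"))
        .arg("--env")
        .arg("GOOGLE_APPLICATION_CREDENTIALS=/scratch/gcp-credentials.json");
}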
