
Cooldown state rework #455

Open

IAvecilla wants to merge 108 commits into gcs-2 from cooldown-rework

Conversation

@IAvecilla
Contributor

@IAvecilla IAvecilla commented Dec 23, 2025

Depends on #463

This PR contains several changes to client arguments, dependencies, and the general cooldown state logic. Here’s a summary of the changes made in this PR:

  • All clients are now potential checkpointers and must be prepared to perform that task. At the end of an epoch, during the Cooldown state, a subset of clients (1/3 of the total clients, capped at SOLANA_MAX_NUM_CHECKPOINTERS, currently 16) is deterministically selected as checkpointers using a seeded shuffle based on the round’s random seed (see the seeded-shuffle sketch after this list).
  • Cooldown is now an opportunistic state, just like the others. Once a client uploads the entire model to external storage, it sends a transaction that moves the run forward to the next epoch. If no transaction is received, the coordinator waits for the full cooldown_time before moving on.
  • Added cancellation support so that once one client successfully checkpoints, the others can cancel their upload tasks and stop uploading in order to move to the next epoch (see the cancellation sketch after this list).
  • The --hub-repo and --gcs-bucket flags were removed from the training arguments. Checkpoint destinations are now derived from the coordinator configuration, and there is a single repo or bucket that contains all the checkpoints for the run.
  • Added a new client flag, --skip-checkpoint-upload, for testing purposes. This avoids uploading and saving checkpoints locally and should not be used outside testing environments.
  • Updated the google-cloud-storage crate to version 1.6.
  • Added bucket and hub repo permission verification before joining a run to ensure that the client can eventually checkpoint.
  • The run manager now expects the GCP bucket credentials to be present inside the SCRATCH_DIR in order to work correctly when using GCP checkpoints.
  • Updated the psyche book with an explanation of the new cooldown behavior.
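
A minimal sketch of the checkpointer selection described above, assuming a seedable RNG; the helper name, the RNG choice, and the at-least-one floor are illustrative, not necessarily what the PR implements:

use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

const SOLANA_MAX_NUM_CHECKPOINTERS: usize = 16;

/// Deterministically pick this epoch's checkpointers from the client list.
fn select_checkpointers(num_clients: usize, round_seed: u64) -> Vec<usize> {
    // A third of the clients, at least one, capped at the on-chain maximum.
    let num_checkpointers = (num_clients / 3).clamp(1, SOLANA_MAX_NUM_CHECKPOINTERS);
    // Shuffle all client indices with an RNG seeded from the round's random seed,
    // so every client derives the same set without extra coordination.
    let mut indices: Vec<usize> = (0..num_clients).collect();
    let mut rng = ChaCha8Rng::seed_from_u64(round_seed);
    indices.shuffle(&mut rng);
    indices.truncate(num_checkpointers);
    indices
}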
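And a rough sketch of the cancellation side, assuming the upload tasks share a tokio_util CancellationToken; the function and the stand-in do_upload are hypothetical:

use tokio_util::sync::CancellationToken;

/// Stand-in for the real model upload; only here to make the sketch compile.
async fn do_upload() -> anyhow::Result<()> {
    Ok(())
}

/// Run the upload, but bail out as soon as some other checkpointer's
/// transaction has already moved the run to the next epoch.
async fn upload_checkpoint_with_cancel(cancel: CancellationToken) -> anyhow::Result<()> {
    tokio::select! {
        _ = cancel.cancelled() => Ok(()), // another client finished first
        result = do_upload() => result,
    }
}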

@IAvecilla IAvecilla self-assigned this Dec 23, 2025
hub_repo
)
let Model::LLM(LLM { checkpoint, .. }) = &self.coordinator_state.model;
if !self.skip_upload_check {
Contributor

maybe we can invert the check and return early

Contributor Author

We have to execute the rest of the code in that function, so I don’t think we can return early there, unless you meant something else?

Collaborator

considered pulling this out into a fn? this seems like something that could be deduplicated between the centralized and decentralized clients

Comment on lines +211 to +213
if skip_upload {
info!("Skipping checkpoint save and upload (skip_upload flag is set)");
checkpoint_completed.store(true, Ordering::SeqCst);
Contributor

you can return Ok(evals) early and you don't need to have an else branch here
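
For reference, a sketch of the early-return shape being suggested, assuming the surrounding function returns a Result and that evals is already in scope at this point:

if skip_upload {
    info!("Skipping checkpoint save and upload (skip_upload flag is set)");
    checkpoint_completed.store(true, Ordering::SeqCst);
    return Ok(evals);
}
// ...the save/upload path continues here without an `else` branch.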

Comment on lines 16 to 17
//#[error("GCS operation failed: {0}")]
//GcsStorage(#[from] google_cloud_storage::client::Error),
Contributor

should we delete this?

@IAvecilla IAvecilla marked this pull request as ready for review January 23, 2026 19:53

// Use HF_TOKEN from checkpoint_config for Hub uploads
if let Some(ref hub_token) = config.hub_token {
Some(UploadInfo::Hub(HubUploadInfo {
hub_repo: String::new(), // Will be validated when actual checkpoint is received
Collaborator

this feels a little strange to me, carrying a dummy hub repo around. any way to get this as some sort of enum with a filled/unfilled variant, or something?

Contributor Author

Totally true. Since right now the only way to get the checkpoint is through the Coordinator, we have to update the source once the client connects and receives the updates. I refactored it to use an enum that carries the minimum required info at first, and then switches to the other variant once all the necessary information is available.
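
Something like the following shape, where the type, variant names, and fields are illustrative rather than what the code ends up with:

/// Illustrative only: the upload target starts out with just the credentials and
/// is swapped to the resolved variant once the coordinator sends the checkpoint config.
enum HubUploadTarget {
    /// Only the token is known yet; the repo arrives later from the coordinator.
    Pending { hub_token: String },
    /// Everything needed to actually upload.
    Resolved { hub_token: String, hub_repo: String },
}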


let (mut app, allowlist, p2p, state_options) =
build_app(cancel, server_addr, tx_tui_state, args)
build_app(cancel, server_addr, tx_tui_state, args, false)
Collaborator

is_testing arg feels a little strange here, just passing a raw boolean.. but maybe this is fine

Contributor Author

Yeah, I don’t like it either, but it was a quick way to avoid checking credentials and permissions for uploading to external storage. I tried another approach to avoid adding new flags everywhere.

model::Checkpoint::Hub(HubRepo { repo_id, revision })
| model::Checkpoint::P2P(HubRepo { repo_id, revision }) => {
Some(UploadInfo::Hub(HubUploadInfo {
hub_repo: (&repo_id).into(),
Collaborator

weird, does the into() not autoref repo_id?

Contributor Author

With some refactors I made, I think this part is no longer needed. But yeah, I thought the same when using into, and it was kind of suggesting that approach.

"storage.objects.delete",
];

let resource = format!("projects/_/buckets/{}", gcs_info.gcs_bucket);
Collaborator

this seems like a lot of GCS-specific stuff to have in the main app file. have we considered making a tiny wrapper module in some other crate that makes this api a bit more like the hf one? or even a trait to abstract reading / writing checkpoints?

Contributor Author

I moved all the credential-specific logic into a new module, where I defined the UploadInfo and UploadCredentials structs, and we only call validate_credentials from there. Maybe the trait approach is even better, so that’s a good suggestion, I’ll try that. But at least this should be an improvement.
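
As a rough illustration of the trait direction (names and signatures here are hypothetical, not the PR's actual API):

use std::path::Path;

use async_trait::async_trait;

/// Placeholder error type for the sketch.
#[derive(Debug)]
struct CheckpointStoreError(String);

/// Both the Hugging Face Hub and GCS backends would implement this, and the
/// client would only hold a `Box<dyn CheckpointStore>` regardless of backend.
#[async_trait]
trait CheckpointStore: Send + Sync {
    /// Verify up front that the credentials allow writing checkpoints.
    async fn validate_credentials(&self) -> Result<(), CheckpointStoreError>;
    /// Upload the checkpoint directory produced at the end of an epoch.
    async fn upload_checkpoint(&self, local_dir: &Path) -> Result<(), CheckpointStoreError>;
}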


let upload_handle = tokio::task::spawn(async move {

let upload_info = match checkpoint {
Collaborator

this kind of duplication of structure makes me wonder if we shouldn't have a separate enum for, like

enum HostedModel {
    Gcp { /* whatever */ },
    Hub { /* whatever */ },
}

and then a checkpoint would be

enum Checkpoint {
    Hosted(HostedModel),
    P2P(HostedModel),
}

just to remove this weird duplication of p2p / p2pgcs. thoughts?

Contributor Author

The only reason I did this is that adding a new enum containing those variants would change the size of the Checkpoint struct, which would in turn change the Coordinator’s memory layout. I wanted to avoid a redeployment caused by storage changes. I much prefer the suggested approach, but to avoid dealing with those changes I went in this direction, does that make sense?

}
}

fn default_checkpoint_dir() -> PathBuf {
Collaborator

should this not write to somewhere in the scratch dir in the container, so all our files end up in the same place?

Contributor Author

Yep, good catch. I changed it to check whether we have the scratch directory and to keep all the local checkpoints there.

let seed = sha256(&round.random_seed.to_le_bytes());

let checkpointers = max(
(coordinator.epoch_state.clients.len() / 3).min(SOLANA_MAX_NUM_CHECKPOINTERS),
Collaborator

will this be correct if a client has requested to be withdrawn e.g. within warmup? (it might be, just sanity checking)

Contributor Author

I think that on every state transition we call move_clients_to_exited to check whether any clients are unhealthy and move them from epoch_state.clients to epoch_state.exited_clients, so that should handle it. Worst case, if we don’t move one, the idea is to select multiple checkpointers at the end of an epoch to increase the chances that at least one of them is healthy and successfully uploads the full model.

Hub(HubRepo),
P2P(HubRepo),
Gcs(GcsRepo),
P2P(HubRepo),
Collaborator

doesn't re-ordering here mean that existing coordinators currently set to P2P will suddenly decode as Gcs?

Contributor Author

YES! nice catch, I changed it to keep the original order 🙂
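
For context, a small illustration of why the order matters, assuming an order-dependent serialization format (e.g. borsh-style tags assigned by declaration order); the placeholder types only exist to make the example stand alone:

struct HubRepo; // placeholder
struct GcsRepo; // placeholder

enum Checkpoint {
    Hub(HubRepo), // tag 0
    P2P(HubRepo), // tag 1 -- values already stored on-chain keep this tag
    Gcs(GcsRepo), // tag 2 -- new variant appended at the end, so old data still decodes
}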

.arg("--env-file")
.arg(&self.env_file);

if let Some(dir) = &self.scratch_dir {
Collaborator

doesn't this assert that you MUST pass google credentials for any run started via run-manager, meaning we can't do e.g. hf checkpointing anymore?

also, i would much rather we find a way to pass the creds via some other mechanism - maybe write them to a temp dir the same way we write the env file, or something? adding another "you must have this file in this dir" thing is janky IMO

Contributor Author

Yes! Also, thank you for catching that. I changed it to be optional and now we only mount the credentials into Docker if they exist locally, but this should support Hugging Face the same way as before.

I agree that asking to move the credentials to the scratch dir is kind of annoying. The only alternative I can think of is setting GOOGLE_CREDENTIALS_FILE_PATH in the .env to point to the local credential file, and then having the manager script automatically mount that file into the scratch dir and set the env variable. Does that sound better?

Collaborator

I agree that asking to move the credentials to the scratch dir is kind of annoying. The only alternative I can think of is setting GOOGLE_CREDENTIALS_FILE_PATH in the .env to point to the local credential file, and then having the manager script automatically mount that file into the scratch dir and set the env variable. Does that sound better?

Yeah, i think this would work better if that's cool :)
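
A possible sketch of that flow in the run manager (the env var name, container path, and the surrounding `command` builder are assumptions, not the agreed-on implementation):

// If GOOGLE_CREDENTIALS_FILE_PATH is set in the .env, mount that file into the
// container's scratch dir and point the standard GCP env var at it, instead of
// requiring the operator to copy the credentials there by hand.
if let Ok(creds_path) = std::env::var("GOOGLE_CREDENTIALS_FILE_PATH") {
    command
        .arg("-v")
        .arg(format!("{creds_path}:/scratch/gcp-credentials.json:ro"))
        .arg("--env")
        .arg("GOOGLE_APPLICATION_CREDENTIALS=/scratch/gcp-credentials.json");
}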
