
Conversation

@Kostusas Kostusas (Collaborator) commented Oct 8, 2025

Why

  • Provide a supported path to run ISF on the ADA HPC cluster via Dask-Gateway.
  • Align with Dask’s cluster abstraction and, where appropriate, lean on gateway/cluster tooling instead of bespoke handlers.

Changes

  • Deps: added dask-gateway. It can be dropped later if needed; adding it mainly meant working out the minimal bumps to several Dask-adjacent pins so it installs cleanly. Changes were kept conservative given the age of the prior pins.
  • Interface: I.get_client(cluster=None, ...) now accepts a cluster object. When one is given, it returns cluster.get_client(); if cluster is None, behavior is unchanged (see the sketch after this list).
  • Lockfile: updated (pixi.lock).
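
A rough sketch of what the new dispatch in I.get_client amounts to (not the exact diff; _legacy_get_client is a hypothetical stand-in for the pre-existing code path):

def get_client(cluster=None, **kwargs):
    if cluster is not None:
        return cluster.get_client()          # any Dask cluster object exposes get_client()
    return _legacy_get_client(**kwargs)      # unchanged default behavior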

How to use (short)

# obtain a gateway cluster (site-specific)
from dask_gateway import Gateway

gw = Gateway("https://<gateway-endpoint>")   # address/auth are site-specific
cluster = gw.new_cluster()                   # or GatewayCluster(...)
dask_client = I.get_client(cluster=cluster)

How to use on ADA

from dask_gateway import GatewayCluster
from dask_gateway.auth import BasicAuth

cluster = GatewayCluster(
    "https://<gateway-endpoint>",
    auth=BasicAuth(username="...", password="..."),
    env='<pixi /bin/ path>',       # ADA-specific cluster options
    workdir='<ISF_dir>',
    worker_threads=1,              # important: more than 1 thread per worker breaks some ISF calculations
)
cluster.scale(n_workers)           # or cluster.adapt(minimum=0, maximum=n_workers)
dask_client = I.get_client(cluster=cluster)
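
As a quick sanity check once the client is up (a toy workload, not part of ISF), and to release the gateway allocation when finished:

future = dask_client.submit(lambda x: x + 1, 41)   # trivial task, just to confirm workers respond
print(future.result())                             # 42
cluster.close()                                    # shut the gateway cluster down when done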

Backwards compatibility

  • No breaking changes, unless bumping dask and flask breaks something. Existing flows that do not use clusters continue to work.

@Kostusas Kostusas requested a review from bgmeulem October 8, 2025 09:52
@Kostusas Kostusas self-assigned this Oct 8, 2025
codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.31%. Comparing base (b623cdb) to head (ac1506b).

Files with missing lines    Patch %    Lines
Interface/__init__.py       25.00%     3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #439   +/-   ##
=======================================
  Coverage   47.30%   47.31%           
=======================================
  Files         241      241           
  Lines       22404    22406    +2     
=======================================
+ Hits        10598    10601    +3     
+ Misses      11806    11805    -1     

☔ View full report in Codecov by Sentry.

bgmeulem (Collaborator) commented Oct 8, 2025

Hi Kostas!

Thanks for the PR. We pin down flask because of our old dask version. Until this PR, I was convinced we pinned down dask because of our old numpy and pandas versions, but that appears to be untrue, given that this PR just works...

I'm currently testing out some default workflows beyond the test suite locally on SOMA HPC. So far, everything seems to be working just fine with the new dask and flask. I'll let you know once I'm through all tests.

bgmeulem (Collaborator) commented Oct 8, 2025

Testing status on SOMA HPC:

  • Initialize databases from raw simulation data
  • Rerun simulations from db and parallelize across threads

bgmeulem (Collaborator) commented Oct 8, 2025

Tragically, simulation runs don't work on the newer dask. Workers die for unknown reasons, and I can't figure out why. The logs are unavailable (since the worker is no more), and when I reproduce locally everything works fine again. I'll investigate more tomorrow, but the dask bump can't be merged yet :(

Kostusas (Collaborator, Author) commented Oct 8, 2025

Could you provide a minimal code example to reproduce the error and consequently test the PR? I'd be happy to help with debugging whenever I have time.

I am not sure if it's relevant, but I found before that ISF fails unless worker threads are set to 1. Sometimes dask clusters default to 2 and it breaks everything, so it's probably some race condition.
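
If it helps with debugging, a minimal sketch of mirroring that single-threaded setup locally with a plain dask.distributed LocalCluster (worker counts are just an example):

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # mirrors worker_threads=1 on the gateway
client = Client(cluster)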

Also, the logs for the workers still exist, we just gotta look harder haha.
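
For what it's worth, assuming the distributed Client is still connected, the scheduler-side view can be pulled without shell access to the nodes (both are standard dask.distributed Client methods):

print(client.get_scheduler_logs())   # scheduler log; records worker removals
print(client.get_worker_logs())      # logs of workers that are still alive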

bgmeulem (Collaborator) commented Oct 9, 2025

I mailed you a sample database with 5 trials to test on.

MWE:

import Interface as I
I.logger.setLevel("DEBUG")
db = I.DataBase("path/to/extracted_db")
sample_indices = db['sim_trial_index']  # there's just 5 in here
ds = I.simrun_rerun_db(
    db=db,
    outdir="reproduce_db_init_test",
    stis=sample_indices,
)  # returns delayed objects to compute with scheduler
client = I.get_client()  # you will likely need to adapt this
r = client.gather(client.compute(ds))
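
For reference, when running this against a Dask-Gateway cluster (the point of this PR), the client line would presumably become:

client = I.get_client(cluster=cluster)   # with cluster = GatewayCluster(...) as in the PR description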
