Stability track scope review for the current local-use model #301
Replies: 9 comments
-
@SeaCelo Regarding #258, I agree with this triage. Narrowing the scope is the right move to keep the local-use model stable without adding unnecessary overhead. I've stripped away the broader locking framework and implemented an os.replace pattern to ensure crash-safe atomic writes for resData.json. The implementation uses os.fsync so that the primary state file is never corrupted, even if a power failure or system crash interrupts a write. PR: #302
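For readers following along, the write-temp-then-rename pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from PR #302; the function and file names are only for the example.

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON crash-safely: write to a temp file in the same
    directory, fsync it, then os.replace it over the target so
    readers never observe a partially written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # don't leave a stray temp file on failure
        raise
```

The key property is that a crash at any point leaves either the old complete file or the new complete file, never a truncated one.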
-
Hi @SeaCelo, thank you for taking the time to review this entire cluster of issues and PRs so thoroughly. I completely agree with your assessment. When I originally opened PRs #191 and #210, I was exploring ways to establish run-level identity, but I see now how this actively pulls the codebase toward an asynchronous/job-polling architecture that MUIOGO simply doesn't need for its current local, synchronous use cases. Building speculative infrastructure "just in case" is an anti-pattern, and I appreciate you guarding the codebase against that bloat.
-
For #259 / #257, the motivation was specifically the case where the solver subprocess never returns due to an unexpected state (for example, an infinite loop). In the current synchronous Flask execution model, that situation leaves the request that invoked the solver blocked indefinitely, because subprocess.run() waits until the process exits. The timeout was introduced to prevent a solver process from hanging permanently and leaving the API request stuck with no feedback. The default value was set very high and controlled through an environment variable so that deployments could adjust or effectively disable it if long-running jobs are expected. I'm open to narrowing the idea toward something closer to what you mentioned: indicating that the solver is still running rather than terminating the process. Happy to adjust or narrow the proposal if there is a better approach.
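To make the proposal concrete, here is a rough sketch of the env-var-controlled timeout idea described above. The variable name and default are illustrative only, not what the PR actually uses.

```python
import os
import subprocess

# Hypothetical names/values for illustration; the real ones live in the PR.
DEFAULT_TIMEOUT_SECONDS = 24 * 60 * 60  # very high default: one day

def run_solver(cmd):
    """Run the solver, but bound the wait so a hung solver cannot
    leave the API request stuck forever."""
    raw = os.environ.get("MUIOGO_SOLVER_TIMEOUT", "")
    timeout = float(raw) if raw else DEFAULT_TIMEOUT_SECONDS
    if timeout <= 0:
        timeout = None  # a non-positive value effectively disables the timeout
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the caller
        # gets feedback instead of an indefinitely blocked request.
        raise RuntimeError(f"solver exceeded {timeout}s timeout")
```

Deployments expecting long-running jobs would raise the variable or set it to 0 to disable enforcement entirely.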
-
@SeaCelo I completely agree with this aggressive triage. A standalone desktop app does not need complex job-polling frameworks, and I fully support stripping away the bloat. However, building on what @parthdagia05 just highlighted regarding API blocking, I want to make the case for keeping a heavily narrowed slice of #114 alive. There is a critical, present-day failure mode in the current local-use model that is caused purely by the synchronous HTTP lock: the browser-timeout / zombie-process loop. The current observed bug (local-use model):
The narrowed scope for #114: I propose we strip all Track 1 / future preparation out of #114 and constrain it purely to this bare-minimum HTTP decoupling to protect local-use stability.
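For context, "bare-minimum HTTP decoupling" could look something like the sketch below: the endpoint launches the solver with subprocess.Popen and returns immediately instead of blocking the request on subprocess.run(). This is a hypothetical illustration (class and method names invented here), not the proposed patch.

```python
import subprocess

class SolverRunner:
    """Minimal non-blocking launcher: the /run handler would call
    start() and return a response immediately, so the browser never
    sits on a request for the full solver runtime."""

    def __init__(self):
        self._proc = None  # one local run at a time

    def start(self, cmd):
        if self._proc is not None and self._proc.poll() is None:
            # Solver still running: report it instead of spawning a
            # second (zombie-prone) solver process.
            return {"status": "already-running", "pid": self._proc.pid}
        self._proc = subprocess.Popen(cmd)
        return {"status": "started", "pid": self._proc.pid}

    def status(self):
        if self._proc is None:
            return "idle"
        return "running" if self._proc.poll() is None else "finished"
```

A Flask handler would just wrap start()/status() and map "already-running" to an appropriate HTTP response, which keeps the change far smaller than any async framework.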
-
@SeaCelo I agree with the aggressive triage and fully support narrowing scope.
-
Hey @SeaCelo, the current implementation in #174 / #186 is definitely broader than what the local-use model needs right now; it was scoped toward a future async architecture rather than a present-day stability fix.
-
Thanks for the careful responses, everyone. It seems we are broadly on the same page, and I even got some additional ideas from this thread. For example, @brightyorcerf's work on atomic writes made me think about how an interrupted run (close your laptop, sleep, etc.) could be continued. This is actually similar to #114, I think. It would be a nice addition to allow a run to be paused, or to be gracefully interrupted and continued. I don't actually run the Osemosys model much, so I have never had to try. @autibet may have experience here, but I think these two PRs would be important for this as well. I'll give it a careful think, and if any of you already have thoughts on how this could be implemented with minimal architectural change (or if it's already possible but broken), this would be a good new issue. I'll be away for a few days, so don't expect a lot of communication from me for a bit.
-
@SeaCelo |
-
@SeaCelo the run interruption/continuation idea is really interesting. Building on @parthdagia05's suggestion of subprocess.Popen for basic process monitoring: a minimal implementation could store the solver PID alongside the case directory, check it on the next /run request, and offer the user a choice to cancel-and-restart or wait. No async framework needed, just a PID file and a poll() check. Happy to open a focused issue for this if it would be useful.
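One caveat on the mechanics: Popen.poll() only works on a handle held in memory, so a PID read back from disk after a restart needs a different liveness probe. A minimal sketch of the PID-file idea (file name invented here for illustration) might be:

```python
import os

def solver_pid_status(case_dir):
    """Check a solver PID recorded next to the case directory.
    Returns None (no PID file), ("running", pid), or ("stale", pid)."""
    pid_file = os.path.join(case_dir, "solver.pid")  # hypothetical name
    if not os.path.exists(pid_file):
        return None
    with open(pid_file) as f:
        pid = int(f.read().strip())
    try:
        # Signal 0 is a pure existence check on POSIX: no signal is
        # delivered. Note this is POSIX-specific; on Windows os.kill
        # with other values terminates the process, so a different
        # probe would be needed there.
        os.kill(pid, 0)
    except ProcessLookupError:
        return ("stale", pid)
    return ("running", pid)
```

The /run handler could call this first, treat "stale" as safe to start a fresh run, and offer the cancel-or-wait choice when it sees "running".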
-
There are several open issues and PRs in the repository that are addressing what we have been loosely calling “stability” issues.
Over the last few days, I spent time going through that cluster as a group: reading the issues and PRs, reading the comment threads, looking at how the proposals relate to each other, and checking how they line up with the code and with the way MUIOGO is actually used today.
My conclusion is that this queue currently mixes together several different kinds of work that are not equally urgent and do not all solve the same kind of problem. Some of these items are trying to add safety guardrails to the current synchronous app. Some are proposing a larger async execution architecture. Others are adding supporting infrastructure around that future architecture, like run IDs, status files, atomic writes, or run metadata.
After looking through the whole cluster, my current assessment is:
- #258, but only if narrowed all the way down to crash-safe atomic writes for resData.json
- #259 / #257, but only if reframed around some future non-destructive "still running" visibility, not timeout enforcement

Everything else in this cluster currently looks, to me, like future architecture or supporting infrastructure rather than something we need to treat as an active product problem right now.
So my current conclusion is that we should probably reduce this queue rather than keep carrying all of these items open by default.
That said, I do not want to close work that has a real reason to stay open. If there is a strong case for keeping one of these items alive under the current local-use model, I want to hear it before making that call.
The items I am reviewing in that light are:
My proposal is:
More specifically:
- #174 / #186, #114 / #146, #188 / #191, #209 / #210, and #235 currently look like async architecture, run identity, polling, status-tracking, or observability layers. Right now I do not see a strong reason to keep them open unless someone can point to a real present-day local-use stability problem they solve.
- #258 only seems worth keeping if the case is specifically crash-safe atomic writes for resData.json. I do not want to keep open a locking framework or a broad AtomicIO abstraction just for future possibilities.
- #259 / #257 only seem worth keeping if the case is some future run-status visibility. I do not currently want timeout-enforcement behavior such as a hard timeout counter or automatic kill-after-N-seconds logic.

If you think one of these items should remain open, please reply in this discussion and explain:
Please ground the answer in the current local-use model. Future Track 1 / OG-Core preparation, generic async architecture, speculative concurrency, “infrastructure just in case,” or observability without a current user pain point are not enough on their own.
I plan to review the responses here and then decide which of these items should stay open, which should be narrowed, and which should be closed.
If something does survive, my current view is that it should be narrowed immediately:
- #258 -> crash-safe atomic write hardening only
- #259 -> possible future run-status visibility only, with no timeout enforcement

I'm doing this as one discussion first rather than posting separate comments on every issue and PR, because I want to get the high-level scope question straight before making item-by-item cleanup decisions.
cc: @brightyorcerf
@Adityakushwaha2006
@parthdagia05
@sanvishukla
@NamanmeetSingh
@krishivsaini
@autibet