Stability track scope review for the current local-use model #301
Replies: 9 comments
-
@SeaCelo Regarding #258, I agree with this triage. Narrowing the scope is the right move to keep the local-use model stable without adding unnecessary overhead. I've stripped away the broader locking framework and implemented an os.replace pattern to ensure crash-safe atomic writes for resData.json. The implementation uses os.fsync so that the primary state file is never corrupted, even if a power failure or system crash interrupts a write. PR: #302
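For readers following along, the write-temp-then-rename pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from PR #302; the function and file names are only for the example.

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON crash-safely: write to a temp file in the same
    directory, fsync it, then os.replace it over the target so
    readers never observe a partially written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # don't leave a stray temp file on failure
        raise
```

The key property is that a crash at any point leaves either the old complete file or the new complete file, never a truncated one.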
-
Hi @SeaCelo, thank you for taking the time to review this entire cluster of issues and PRs so thoroughly. I completely agree with your assessment. When I originally opened PRs #191 and #210, I was exploring ways to establish run-level identity, but I see now how this actively pulls the codebase toward an asynchronous/job-polling architecture that MUIOGO simply doesn't need for its current local, synchronous use cases. Building speculative infrastructure "just in case" is an anti-pattern, and I appreciate you guarding the codebase against that bloat.
-
For #259 / #257, the motivation was specifically the case where the solver subprocess never returns due to an unexpected state (for example, an infinite loop). In the current synchronous Flask execution model, that situation leaves the request that invoked the solver blocked indefinitely, because subprocess.run() waits until the process exits. The timeout was introduced to prevent a solver process from hanging permanently and leaving the API request stuck with no feedback. The default value was set very high and controlled through an environment variable so that deployments could adjust or effectively disable it if long-running jobs are expected. I'm open to narrowing the idea toward something closer to what you mentioned: indicating that the solver is still running rather than terminating the process. Happy to adjust or narrow the proposal if there is a better approach.
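To make the proposal concrete, here is a rough sketch of the env-var-controlled timeout idea described above. The variable name and default are illustrative only, not what the PR actually uses.

```python
import os
import subprocess

# Hypothetical names/values for illustration; the real ones live in the PR.
DEFAULT_TIMEOUT_SECONDS = 24 * 60 * 60  # very high default: one day

def run_solver(cmd):
    """Run the solver, but bound the wait so a hung solver cannot
    leave the API request stuck forever."""
    raw = os.environ.get("MUIOGO_SOLVER_TIMEOUT", "")
    timeout = float(raw) if raw else DEFAULT_TIMEOUT_SECONDS
    if timeout <= 0:
        timeout = None  # a non-positive value effectively disables the timeout
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the caller
        # gets feedback instead of an indefinitely blocked request.
        raise RuntimeError(f"solver exceeded {timeout}s timeout")
```

Deployments expecting long-running jobs would raise the variable or set it to 0 to disable enforcement entirely.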
-
@SeaCelo I completely agree with this aggressive triage. A standalone desktop app does not need complex job-polling frameworks, and I fully support stripping away the bloat. However, building on what @parthdagia05 just highlighted regarding API blocking, I want to make the case for keeping a heavily narrowed slice of #114 alive. There is a critical, present-day failure mode in the current local-use model that is caused purely by the synchronous HTTP lock: the browser-timeout / zombie-process loop. The current observed bug (local-use model):
The narrowed scope for #114: I propose we strip all Track 1 / future preparation out of #114 and constrain it purely to this bare-minimum HTTP decoupling to protect local-use stability.
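For context, "bare-minimum HTTP decoupling" could look something like the sketch below: the endpoint launches the solver with subprocess.Popen and returns immediately instead of blocking the request on subprocess.run(). This is a hypothetical illustration (class and method names invented here), not the proposed patch.

```python
import subprocess

class SolverRunner:
    """Minimal non-blocking launcher: the /run handler would call
    start() and return a response immediately, so the browser never
    sits on a request for the full solver runtime."""

    def __init__(self):
        self._proc = None  # one local run at a time

    def start(self, cmd):
        if self._proc is not None and self._proc.poll() is None:
            # Solver still running: report it instead of spawning a
            # second (zombie-prone) solver process.
            return {"status": "already-running", "pid": self._proc.pid}
        self._proc = subprocess.Popen(cmd)
        return {"status": "started", "pid": self._proc.pid}

    def status(self):
        if self._proc is None:
            return "idle"
        return "running" if self._proc.poll() is None else "finished"
```

A Flask handler would just wrap start()/status() and map "already-running" to an appropriate HTTP response, which keeps the change far smaller than any async framework.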
-
@SeaCelo I agree with the aggressive triage and fully support narrowing scope.
-
Hey @SeaCelo, the current implementation in #174 / #186 is definitely broader than what the local-use model needs right now; it was scoped toward a future async architecture rather than a present-day stability fix.
-
Thanks for the careful responses, everyone. It seems we are broadly on the same page, and I even got some additional ideas from this thread. For example, @brightyorcerf's work on atomic writes made me think about how an interrupted run (close your laptop, sleep, etc.) could be continued. This is actually similar to #114, I think. It would be a nice addition to allow a run to be paused, or to be gracefully interrupted and continued. I don't actually run the Osemosys model much, so I have never had to try. @autibet may have experience here, but I think these two PRs would be important for this as well. I'll give it a careful think, and if any of you already have thoughts on how this could be implemented with minimal architectural change (or if it's already possible but broken), this would be a good new issue. I'll be away for a few days, so don't expect a lot of communication from me for a bit.
-
@SeaCelo |
-
@SeaCelo the run interruption/continuation idea is really interesting. Building on @parthdagia05's suggestion of subprocess.Popen for basic process monitoring: a minimal implementation could store the solver PID alongside the case directory, check it on the next /run request, and offer the user a choice to cancel-and-restart or wait. No async framework needed, just a PID file and a poll() check. Happy to open a focused issue for this if it would be useful.
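One caveat on the mechanics: Popen.poll() only works on a handle held in memory, so a PID read back from disk after a restart needs a different liveness probe. A minimal sketch of the PID-file idea (file name invented here for illustration) might be:

```python
import os

def solver_pid_status(case_dir):
    """Check a solver PID recorded next to the case directory.
    Returns None (no PID file), ("running", pid), or ("stale", pid)."""
    pid_file = os.path.join(case_dir, "solver.pid")  # hypothetical name
    if not os.path.exists(pid_file):
        return None
    with open(pid_file) as f:
        pid = int(f.read().strip())
    try:
        # Signal 0 is a pure existence check on POSIX: no signal is
        # delivered. Note this is POSIX-specific; on Windows os.kill
        # with other values terminates the process, so a different
        # probe would be needed there.
        os.kill(pid, 0)
    except ProcessLookupError:
        return ("stale", pid)
    return ("running", pid)
```

The /run handler could call this first, treat "stale" as safe to start a fresh run, and offer the cancel-or-wait choice when it sees "running".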
-
There are several open issues and PRs in the repository that are addressing what we have been loosely calling “stability” issues.
Over the last few days, I spent time going through that cluster as a group: reading the issues and PRs, reading the comment threads, looking at how the proposals relate to each other, and checking how they line up with the code and with the way MUIOGO is actually used today.
My conclusion is that this queue currently mixes together several different kinds of work that are not equally urgent and do not all solve the same kind of problem. Some of these items are trying to add safety guardrails to the current synchronous app. Some are proposing a larger async execution architecture. Others are adding supporting infrastructure around that future architecture, like run IDs, status files, atomic writes, or run metadata.
After looking through the whole cluster, my current assessment is:
- #258, but only if narrowed all the way down to crash-safe atomic writes for resData.json
- #259 / #257, but only if reframed around some future non-destructive "still running" visibility, not timeout enforcement

Everything else in this cluster currently looks, to me, like future architecture or supporting infrastructure rather than something we need to treat as an active product problem right now.
So my current conclusion is that we should probably reduce this queue rather than keep carrying all of these items open by default.
That said, I do not want to close work that has a real reason to stay open. If there is a strong case for keeping one of these items alive under the current local-use model, I want to hear it before making that call.
The items I am reviewing in that light are:
My proposal is:
More specifically:
- #174 / #186, #114 / #146, #188 / #191, #209 / #210, and #235 currently look like async architecture, run identity, polling, status-tracking, or observability layers. Right now I do not see a strong reason to keep them open unless someone can point to a real present-day local-use stability problem they solve.
- #258 only seems worth keeping if the case is specifically crash-safe atomic writes for resData.json. I do not want to keep open a locking framework or a broad AtomicIO abstraction just for future possibilities.
- #259 / #257 only seem worth keeping if the case is some future run-status visibility. I do not currently want timeout-enforcement behavior such as a hard timeout counter or automatic kill-after-N-seconds logic.

If you think one of these items should remain open, please reply in this discussion and explain:
Please ground the answer in the current local-use model. Future Track 1 / OG-Core preparation, generic async architecture, speculative concurrency, “infrastructure just in case,” or observability without a current user pain point are not enough on their own.
I plan to review the responses here and then decide which of these items should stay open, which should be narrowed, and which should be closed.
If something does survive, my current view is that it should be narrowed immediately:
- #258 -> crash-safe atomic write hardening only
- #259 -> possible future run-status visibility only, with no timeout enforcement

I'm doing this as one discussion first rather than posting separate comments on every issue and PR, because I want to get the high-level scope question straight before making item-by-item cleanup decisions.
cc: @brightyorcerf
@Adityakushwaha2006
@parthdagia05
@sanvishukla
@NamanmeetSingh
@krishivsaini
@autibet