Architecting Asynchronous Simulations: Handling Long-Running Phreeqc Tasks in a Web Service #23
Closed
Knguyen-dev started this conversation in General
Replies: 1 comment

> Not needed anymore, everything is in #38
Intro
A Phreeqc simulation can take anywhere from a couple of seconds to about 20 minutes to yield results, depending on the inputs. This is a problem because HTTP requests time out after a few seconds, so holding the user's request open for the entire simulation would simply time it out. To fix this, we have to use methods and techniques beyond a basic synchronous request-response flow.
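To make the rest of the explanation concrete, here's a minimal sketch of the pattern, using hypothetical names (`run_simulation`, `start_job`) rather than the project's actual code: spawn the binary in a background thread, record its status in a lock file, and return an id immediately so the client can poll.

```python
# Illustrative sketch only -- names and lock-file schema are assumptions,
# not the project's actual process_manager.py code.
import json
import subprocess
import threading
import uuid
from pathlib import Path

def run_simulation(binary_cmd, input_text, lock_path, timeout=20 * 60):
    """Worker thread: pipe input to the binary, record the outcome in the lock file."""
    lock_path.write_text(json.dumps({"status": "running"}))
    proc = subprocess.Popen(binary_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, text=True)
    try:
        proc.communicate(input_text, timeout=timeout)
        lock_path.write_text(json.dumps({"status": "completed"}))
    except subprocess.TimeoutExpired:
        proc.kill()
        lock_path.write_text(json.dumps({"status": "timed_out"}))

def start_job(binary_cmd, input_text, workdir):
    """Handler side: kick off the job, immediately return an experiment id."""
    experiment_id = uuid.uuid4().hex
    lock_path = Path(workdir) / f"{experiment_id}.lock"
    thread = threading.Thread(
        target=run_simulation, args=(binary_cmd, input_text, lock_path)
    )
    thread.start()
    return experiment_id  # the POST response returns this to the client
```

The real handler additionally validates input, tracks subprocesses in a dictionary, and runs a periodic cleanup job, all of which is described below.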
When the user does `POST /api/phreeqc`, assuming their input is correct, we run the binary asynchronously. So even after the request finishes, the simulation will still be running on our server. After we start the Phreeqc job, we make sure to return the `experiment_id` associated with the experiment. This is crucial for later: it identifies the specific simulation that was run, and also the location of the files related to that experiment. The app should then redirect the user to a results page like `/phreeqc/results/:experiment_id` or something similar. Here, the client will occasionally poll our backend via the `check_status` endpoint to see whether the simulation has finished. Once the phreeqc process is listed in a finished state, we send the data back to that page. That's the entire high-level overview of what happens. For other long-running binaries and simulations, we're going to use this same method.

This is all done in `phreeqc_interceptor`, the high-level route handler, but let's explore how we're running phreeqc asynchronously within `process_manager.py`. When starting a simulation, we first create a `.lockfile` that contains the status of the process that's currently running. We use multi-threading by creating a separate thread for running `run_phreeqc_process`. This separate thread handles updating the experiment's status to things like "running" (by modifying the `.lockfile`), and also handles sending (piping) data to the subprocess. Based on various scenarios, we write different statuses to that `.lockfile`, e.g. running, completed, or timed out (for phreeqc the timeout is about 20 minutes).

We keep a dictionary
`running_processes` that's used to keep track of all the running phreeqc simulations. Each entry contains a reference to the subprocess being run, when the simulation started, and the location of the `.lockfile` associated with the experiment. Its main use is giving us references to all running phreeqc subprocesses, so that through this dictionary we can kill and clean up simulations that run longer than expected. In the code where we start Phreeqc as a subprocess, if Phreeqc takes more than 20 minutes to complete, the `subprocess` library raises a timeout and we kill the process, then remove it from `running_processes`. However, if we run into an unrelated error and exit the function early, that binary will keep running and never hit the timeout catch block. At least we still have the process recorded in our dictionary! To cover cases like this, we periodically run a job to clean up and kill any long-running phreeqc processes:

1. Using the process's start time, if it has been running longer than 20 minutes, kill the process.
2. Update the `.lockfile` to indicate the process timed out.
3. Remove the entry from `running_processes` to make sure we're correctly keeping track of only running processes.

Again, one of the main purposes of this dictionary is that it lets us kill processes that run longer than usual. Its purpose is different from the `.lockfile`, which tracks the status of the simulation even after it's completed or timed out.

After getting confirmation that their simulation is working, we expose 3 important endpoints to the client:

- `GET /api/phreeqc/status/:experiment_id`: Gives the frontend a way to check the status of a job. This is the main consumer of the `.lockfile` that we keep saved as a log of a given simulation's state.
- `GET /api/phreeqc/result/:experiment_id`: Gives the frontend the results of phreeqc; it also works by reading the experiment's `.lockfile`.
- `GET /api/phreeqc/download/:experiment_id`: Downloads the zip file that was created for this experiment.

Other than that, this is the whole technical explanation behind how we handle long-running requests involving these binaries. We'll use the exact same technique when dealing with Supcrtbl.
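The core of the status endpoint is just reading the experiment's `.lockfile`. A minimal sketch, assuming a hypothetical directory layout and field names (the project's actual paths and schema may differ):

```python
import json
from pathlib import Path

def check_status(experiments_dir, experiment_id):
    """Return the status recorded in the experiment's lock file.
    Layout and field names here are illustrative assumptions."""
    lock_path = Path(experiments_dir) / experiment_id / ".lockfile"
    if not lock_path.exists():
        return {"status": "not_found"}
    # The worker thread keeps this file updated: running / completed / timed_out.
    return json.loads(lock_path.read_text())
```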
Note:

`thread.daemon = True`: By default, threads in Python are non-daemon, meaning they will prevent the program from exiting until they finish running. This is useful when a thread does critical work that must complete before the program ends. However, in many cases you don't want the program to wait for background threads to finish. By setting `daemon=True`, you are telling the Python interpreter to allow the main program to exit even if the daemon threads are still running. This is great for background tasks. In our code, we have a cleanup thread that periodically checks for long-running processes. This is a background task that runs independently of the main program and isn't critical to the main program flow, so we set it as a daemon so it doesn't block the main program. TL;DR: set `daemon=True` so background threads run without preventing the program from exiting.

Next: clean up `TDB_workdirs` once a week or something.
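A sketch of the cleanup daemon described above, applying the three steps (kill, mark the lock file as timed out, drop the entry). The entry field names and helper names are assumptions for illustration, not the project's actual code:

```python
import json
import threading
import time
from pathlib import Path

TIMEOUT_SECONDS = 20 * 60  # phreeqc's 20-minute limit

def cleanup_once(running_processes):
    """One cleanup pass over the running_processes dictionary."""
    now = time.time()
    for experiment_id in list(running_processes):
        entry = running_processes[experiment_id]
        if now - entry["started_at"] > TIMEOUT_SECONDS:
            entry["process"].kill()                      # 1. kill the process
            Path(entry["lock_path"]).write_text(         # 2. record the timeout
                json.dumps({"status": "timed_out"}))
            del running_processes[experiment_id]         # 3. stop tracking it

def start_cleanup_daemon(running_processes, interval=60.0):
    def loop():
        while True:
            cleanup_once(running_processes)
            time.sleep(interval)
    # daemon=True: the interpreter may exit even if this loop is still running.
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```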