Architecting Asynchronous Simulations: Handling Long-Running Phreeqc Tasks in a Web Service #23
Closed
Knguyen-dev started this conversation in General
Replies: 1 comment

> Not needed anymore, everything is in #38
Intro
A Phreeqc simulation can take anywhere from a couple of seconds to about 20 minutes to yield results, depending on the inputs. This is a problem because HTTP requests time out after a few seconds, so holding the user's request open for the entire simulation would simply time it out. To fix this, we have to use methods and techniques beyond a basic synchronous request-response flow.
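To make the rest of the explanation concrete, here's a minimal sketch of the pattern, using hypothetical names (`run_simulation`, `start_job`) rather than the project's actual code: spawn the binary in a background thread, record its status in a lock file, and return an id immediately so the client can poll.

```python
# Illustrative sketch only -- names and lock-file schema are assumptions,
# not the project's actual process_manager.py code.
import json
import subprocess
import threading
import uuid
from pathlib import Path

def run_simulation(binary_cmd, input_text, lock_path, timeout=20 * 60):
    """Worker thread: pipe input to the binary, record the outcome in the lock file."""
    lock_path.write_text(json.dumps({"status": "running"}))
    proc = subprocess.Popen(binary_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, text=True)
    try:
        proc.communicate(input_text, timeout=timeout)
        lock_path.write_text(json.dumps({"status": "completed"}))
    except subprocess.TimeoutExpired:
        proc.kill()
        lock_path.write_text(json.dumps({"status": "timed_out"}))

def start_job(binary_cmd, input_text, workdir):
    """Handler side: kick off the job, immediately return an experiment id."""
    experiment_id = uuid.uuid4().hex
    lock_path = Path(workdir) / f"{experiment_id}.lock"
    thread = threading.Thread(
        target=run_simulation, args=(binary_cmd, input_text, lock_path)
    )
    thread.start()
    return experiment_id  # the POST response returns this to the client
```

The real handler additionally validates input, tracks subprocesses in a dictionary, and runs a periodic cleanup job, all of which is described below.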
When the user does `POST /api/phreeqc`, assuming their input is correct, we run the binary asynchronously. So even after the request finishes, the simulation will still be running on our server. After we start the Phreeqc job, we make sure to return the `experiment_id` associated with the experiment. This is crucial for later: it identifies the specific simulation that was run, and also the location of the files related to that experiment. The app should then redirect the user to a results page like `/phreeqc/results/:experiment_id` or something similar. Here, the client will occasionally poll our backend via the `check_status` endpoint to see whether the simulation has finished. Once the phreeqc process is listed in a finished state, we send the data back to that page. That's the entire high-level overview of what happens. For other long-running binaries and simulations, we're going to use this same method.

This is all done in `phreeqc_interceptor`, the high-level route handler, but let's explore how we're running phreeqc asynchronously within `process_manager.py`. When starting a simulation, we first create a `.lockfile` that contains the status of the process that's currently running. We use multi-threading by creating a separate thread for running `run_phreeqc_process`. This separate thread handles updating the experiment's status to things like "running" (by modifying the `.lockfile`), and also handles sending (piping) data to the subprocess. Based on various scenarios, we write different statuses to that `.lockfile`, e.g. running, completed, or timed out (for phreeqc the timeout is about 20 minutes).

We keep a dictionary
`running_processes` that's used to keep track of all the running phreeqc simulations. Each entry contains a reference to the subprocess being run, when the simulation started, and the location of the `.lockfile` associated with the experiment. Its main use is giving us references to all running phreeqc subprocesses, so that through this dictionary we can kill and clean up simulations that run longer than expected. In the code where we start Phreeqc as a subprocess, if Phreeqc takes more than 20 minutes to complete, the `subprocess` library raises a timeout and we kill the process, then remove it from `running_processes`. However, if we run into an unrelated error and exit the function early, that binary will keep running and never hit the timeout catch block. At least we still have the process recorded in our dictionary! To cover cases like this, we periodically run a job to clean up and kill any long-running phreeqc processes:

1. Using the process's start time, if it has been running longer than 20 minutes, kill the process.
2. Update the `.lockfile` to indicate the process timed out.
3. Remove the entry from `running_processes` to make sure we're correctly keeping track of only running processes.

Again, one of the main purposes of this dictionary is that it lets us kill processes that run longer than usual. Its purpose is different from the `.lockfile`, which tracks the status of the simulation even after it's completed or timed out.

After getting confirmation that their simulation is working, we expose 3 important endpoints to the client:

- `GET /api/phreeqc/status/:experiment_id`: Gives the frontend a way to check the status of a job. This is the main consumer of the `.lockfile` that we keep saved as a log of a given simulation's state.
- `GET /api/phreeqc/result/:experiment_id`: Gives the frontend the results of phreeqc; it also works by reading the experiment's `.lockfile`.
- `GET /api/phreeqc/download/:experiment_id`: Downloads the zip file that was created for this experiment.

Other than that, this is the whole technical explanation behind how we handle long-running requests involving these binaries. We'll use the exact same technique when dealing with Supcrtbl.
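The core of the status endpoint is just reading the experiment's `.lockfile`. A minimal sketch, assuming a hypothetical directory layout and field names (the project's actual paths and schema may differ):

```python
import json
from pathlib import Path

def check_status(experiments_dir, experiment_id):
    """Return the status recorded in the experiment's lock file.
    Layout and field names here are illustrative assumptions."""
    lock_path = Path(experiments_dir) / experiment_id / ".lockfile"
    if not lock_path.exists():
        return {"status": "not_found"}
    # The worker thread keeps this file updated: running / completed / timed_out.
    return json.loads(lock_path.read_text())
```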
Note:

`thread.daemon = True`: By default, threads in Python are non-daemon, meaning they will prevent the program from exiting until they finish running. This is useful when a thread does critical work that must complete before the program ends. However, in many cases you don't want the program to wait for background threads to finish. By setting `daemon=True`, you are telling the Python interpreter to allow the main program to exit even if the daemon threads are still running. This is great for background tasks. In our code, we have a cleanup thread that periodically checks for long-running processes. This is a background task that runs independently of the main program and isn't critical to the main program flow, so we set it as a daemon so it doesn't block the main program. TL;DR: set `daemon=True` so background threads run without preventing the program from exiting.

Next: clean up `TDB_workdirs` once a week or something.
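A sketch of the cleanup daemon described above, applying the three steps (kill, mark the lock file as timed out, drop the entry). The entry field names and helper names are assumptions for illustration, not the project's actual code:

```python
import json
import threading
import time
from pathlib import Path

TIMEOUT_SECONDS = 20 * 60  # phreeqc's 20-minute limit

def cleanup_once(running_processes):
    """One cleanup pass over the running_processes dictionary."""
    now = time.time()
    for experiment_id in list(running_processes):
        entry = running_processes[experiment_id]
        if now - entry["started_at"] > TIMEOUT_SECONDS:
            entry["process"].kill()                      # 1. kill the process
            Path(entry["lock_path"]).write_text(         # 2. record the timeout
                json.dumps({"status": "timed_out"}))
            del running_processes[experiment_id]         # 3. stop tracking it

def start_cleanup_daemon(running_processes, interval=60.0):
    def loop():
        while True:
            cleanup_once(running_processes)
            time.sleep(interval)
    # daemon=True: the interpreter may exit even if this loop is still running.
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```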