js2-gateway technical overview and onboarding #38
Knguyen-dev started this conversation in General
Of course there could be some inaccuracies, and it could definitely use some visuals like graphs, but this should work for onboarding new people on what js2-gateway truly is. I just figured out that the new js2-gateway runs on the jet_stream remote: go there, navigate to a folder called "Calloway", and you should find it. I should note that after you make a commit and the build finishes, you'll go to jet_stream, pull down the changes, and run:

```shell
# Destroys the current stack and redeploys it. I don't think this is what
# Calloway intended, but it works. I resort to this because Watchtower should
# in theory poll every 5 minutes for new builds, but after waiting 15 minutes
# nothing happened.
docker stack rm js2-gateway-stack
make deploy
```
Js2-Gateway Introduction and Overview
Introduction
What is JS2-Gateway?
JS2-Gateway is a web app that contains various pieces of software ("calculators") created by the Zhu Laboratory to help other geochemical-modeling researchers. There are many calculators, but the most important ones are "Supcrtbl" (Super Critical Table) and probably "Phreeqc". Here you'll be maintaining the website and adding new features.
Project Setup
- Run `uv sync` on the backend. This should set up your backend.
- Run `npm install` on the frontend. This should set up your frontend.
- You should get the `.env` and `.env.dev` files from someone via IU Secure.

General tips working with build tools
Backend and Frontend build tools
For the backend you're using the uv build tool. I recommend watching a YouTube video to get familiar with how to use it. Other than that there's a `pyproject.toml` file, which you're unlikely to touch. Right now it just contains things like formatting configs for the Python formatter that we use, Ruff.

The frontend should be pretty simple. You're just running the scripts defined in `package.json` (not `package-lock.json`, which is the auto-generated dependency lockfile). If you're running a development environment, remember to open `127.0.0.1:4000` instead of `localhost:4000`, since the former opens the site in a dev/admin environment with no need to log in to use protected routes.

Makefile
Before we talk about containerization, we should talk about the `Makefile`. You're going to run commands defined in this file to run the project, as it's the centralized place to spin up the application, look at logs, exec into containers, even start the frontend, etc. If you haven't worked with Makefiles before, I recommend reading up on them. It's really helpful to understand not only what each command is doing, but also to be able to write your own commands if need be.

Connecting to MySQL Database with Ivanti Secure
Essentially, if you try to connect to the MySQL database from your local machine, you're going to get an error saying something along the lines of "Service Not Found". The reason is that the MySQL database lives on IU's internal intranet rather than the broader internet, so it isn't reachable from outside. Solve this by connecting to IU's VPN. After connecting to IU's VPN, you will be able to access the js2 MySQL database:
- VPN: `vpn.uits.iu.edu`
- Database host: `sasrdsmp01.uits.iu.edu`

After doing this, you should have database capabilities for the backend, which is really helpful. You should get the database credentials from the person who onboarded you.
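As a rough sketch (the environment-variable names here are assumptions — the real ones live in the `.env`/`.env.dev` files you get via IU Secure), the backend would read its database settings from the environment like this:

```python
import os


def mysql_settings() -> dict:
    """Assemble MySQL connection settings from the environment.

    The host defaults to the IU-internal server, which is only
    reachable while connected to the IU VPN.
    """
    return {
        "host": os.environ.get("DB_HOST", "sasrdsmp01.uits.iu.edu"),
        "port": int(os.environ.get("DB_PORT", "3306")),
        "user": os.environ["DB_USER"],          # credentials come via IU Secure
        "password": os.environ["DB_PASSWORD"],
        "database": os.environ["DB_NAME"],
    }
```

Whatever MySQL driver the backend actually uses, the point is the same: credentials never live in the repo, only in the `.env` files.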
Other onboarding info and standards
System Architecture Overview
High Level Architecture
Backend Architecture
API Routes
The API structure is subject to change, but you can always navigate to `http://127.0.0.1:4000/docs` to see the endpoints that the application has:

- `/api/co2`: Handles CO2 calculations.
- `/api/h2s`: Handles H2S calculations.
- `/api/phreeqc/*`: Handles scheduling Phreeqc simulations, checking status, getting the results, and downloading output files.
- `/api/rate/calculate`: Handles rate calculator calculations.
- `/api/supcrtbl`: Handles scheduling Supcrtbl jobs, getting results, downloading output files, and getting status. There's also an endpoint for seeing the logs, but it's not really for public use, more so for debugging. We may remove it or at least make it better.
- `/auth/*`: OIDC and authentication routes. There's also `/callback`, which you shouldn't change because it's what's registered with CILogon, our OIDC platform.
- `/api/species`: For getting mineral species, which is needed in multiple places. Right now it's needed for Supcrtbl and RateCalculator.

Database Schema and User Management
Here are the main tables that you need to know:
There are of course other tables that we use. For example, in the minerals API, we access MySQL tables like `geosci_consolidated_tables1` and its associates. Honestly it's not that important to know these, as they mostly have scientific meanings, and you'd ask the scientific staff at Zhu Lab if we need to use them. The main table you need to know is the `user_details` table, which we have shown above.

Most of the user management, such as the authentication flow and onboarding, will be covered in a different section.
Managing Calculators
There are different things to note about the calculators:
Frontend Architecture
Page Structure
For the calculators, each calculator has a landing page that describes what the calculator is about, with other links (e.g. `CO2Calculator`). Then there's the actual user-input page where the user enters parameters into the calculator (e.g. `CO2CalculatorOnline`). For calculators that have long-running tasks, you'd also have a page where the user does short-polling to get the status of the job, and eventually the result of the job.

- `CO2Calculator`: Files for the CO2 calculator
- `H2SCalculator`: Files for the H2S calculator
- `Phreeqc`: Files for the Phreeqc calculator
- `RateCalculator`: Files for the rate calculator
- `Supcrtbl`: Files for the Supcrtbl calculator
- `AdminPage`: Page for the admin dashboard.
- `Home`: Home page
- `Onboarding`: Page where the user sees their onboarding information and confirms their profile.
- `RateScripts`: Page linking to some Visual Basic rate scripts. This isn't a calculator, just a page linking to other scientific material.
- `AuthContext`: Defines the `AuthProvider`, which holds the authentication state of the user and handles persistence.

There's also a `/components` folder where we store React components that we re-use across the application.

Frontend Libraries and Standards
Async Jobs: Handling Long Running Tasks in a Web Service
A Phreeqc simulation can take anywhere from a couple of seconds up to 20 minutes to yield results, depending on the inputs. This is an issue because HTTP requests typically time out after a matter of seconds, so holding the user's request open for that long would just time it out. To fix this, we have to use techniques beyond a basic synchronous request-response flow.
When the user does `POST /api/phreeqc`, assuming their input is valid, we run the binary asynchronously. So even after the request finishes, the simulation is still running on our server. After we start the Phreeqc job, we make sure to return the `experiment_id` associated with the experiment. This is crucial later: it identifies the specific simulation that was run, and also the location of the files related to that experiment. The app then redirects the user to a results page like `/phreeqc/results/:experiment_id`. There, the client occasionally polls our backend via the `check_status` endpoint to see whether the simulation has finished. Once the Phreeqc process is listed in a finished state, we send the data back to that page. That's the entire high-level overview. For other long-running binaries and simulations, we use the same method. This is all orchestrated in `phreeqc_interceptor`, the high-level route handler, but let's explore how we run Phreeqc asynchronously within `process_manager.py`.

When starting a simulation, we first create a `.lockfile` that will contain the status of the currently running process. We use multi-threading, creating a separate thread to run `run_phreeqc_process`. This thread handles updating the experiment's status to things like "running" (by modifying the `.lockfile`), and also handles piping data to the subprocess. Based on various scenarios, we write different statuses to that `.lockfile`; for `phreeqc`, the timeout is about 20 minutes.

We keep a dictionary `running_processes` that tracks all running Phreeqc simulations. It contains the reference to each subprocess, when the simulation started, and the location of the `.lockfile` associated with the experiment. Its main use is that, by holding references to all running Phreeqc subprocesses, we can kill and clean up simulations that run longer than expected. In the code where we start Phreeqc as a subprocess, if it takes more than 20 minutes to complete, the `subprocess` library will kill the process, and we remove its entry from `running_processes`. However, if we hit an unrelated error and exit the function early, the binary keeps running and never reaches that timeout catch block. At least we still have the process recorded in our dictionary! To cover cases like this, we periodically run a cleanup job that kills any long-running Phreeqc processes:

1. Using the process's start time, if it has been running longer than 20 minutes, kill it.
2. Update the lock file to indicate the process timed out.
3. Remove the entry from `running_processes`, so the dictionary only tracks processes that are actually running. Its purpose is different from the `.lockfile`, which tracks the status of the simulation even after it has completed or timed out.

Once the simulation is started, we expose 3 important endpoints to the client:

- `GET /api/phreeqc/status/:experiment_id`: Gives the frontend a way to check the status of a job. This is the main consumer of the `.lockfile` that we keep as a record of a given simulation.
- `GET /api/phreeqc/result/:experiment_id`: Gives the frontend the results of Phreeqc, again by reading the experiment's `.lockfile`.
- `GET /api/phreeqc/download/:experiment_id`: Downloads the zip file that was created for this experiment.

That's the whole technical explanation of how we handle long-running requests involving these binaries. We use the exact same technique for Supcrtbl.
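To make the moving pieces concrete, here is a minimal, self-contained sketch of the pattern described above. The names and the `.lockfile` layout are simplified assumptions, not the actual code in `process_manager.py`:

```python
import json
import subprocess
import threading
import time
import uuid
from pathlib import Path

TIMEOUT_SECONDS = 20 * 60  # kill simulations that run longer than ~20 minutes

# experiment_id -> {"process": Popen, "started_at": float, "lockfile": Path}
running_processes = {}


def write_status(lockfile: Path, status: str) -> None:
    """Record the experiment's current status in its .lockfile."""
    lockfile.write_text(json.dumps({"status": status, "updated_at": time.time()}))


def start_simulation(workdir: Path, cmd: list) -> str:
    """Launch the binary asynchronously; return the experiment_id immediately."""
    experiment_id = uuid.uuid4().hex
    lockfile = workdir / f"{experiment_id}.lockfile"
    write_status(lockfile, "queued")

    def run():
        write_status(lockfile, "running")
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        running_processes[experiment_id] = {
            "process": proc, "started_at": time.time(), "lockfile": lockfile,
        }
        try:
            proc.wait(timeout=TIMEOUT_SECONDS)
            write_status(lockfile, "completed" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            proc.kill()
            write_status(lockfile, "timed_out")
        finally:
            running_processes.pop(experiment_id, None)

    # daemon=True: the worker must not block interpreter exit
    threading.Thread(target=run, daemon=True).start()
    return experiment_id


def check_status(workdir: Path, experiment_id: str) -> str:
    """Roughly what GET /api/phreeqc/status/:experiment_id would read."""
    lockfile = workdir / f"{experiment_id}.lockfile"
    return json.loads(lockfile.read_text())["status"]


def cleanup_stale_processes() -> None:
    """Periodic job: kill anything that escaped the per-process timeout."""
    now = time.time()
    for experiment_id, entry in list(running_processes.items()):
        if now - entry["started_at"] > TIMEOUT_SECONDS:
            entry["process"].kill()                       # 1. kill the process
            write_status(entry["lockfile"], "timed_out")  # 2. update the lock file
            running_processes.pop(experiment_id, None)    # 3. drop the dict entry
```

The key design point is the split of responsibilities: the `.lockfile` is the durable status record the status endpoint reads, while `running_processes` only exists so that stray subprocesses can be found and killed.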
Note:

- `thread.daemon = True`: By default, threads in Python are non-daemon, meaning they prevent the program from exiting until they finish. That's useful when threads do critical work that must complete before the program ends, but often you don't want the program to wait on background threads. Setting `daemon=True` tells the Python interpreter to allow the main program to exit even while daemon threads are still running, which is great for background tasks. In our code, the cleanup thread that periodically checks long-running processes is exactly that kind of independent background task: it's not critical to the main program flow, so we set it as a daemon so it doesn't block exit. TLDR: set `daemon=True` so background threads don't prevent the program from exiting.
- The `.lock` and any experiment-related files are stored inside the container's own filesystem. There are no bind mounts or volumes for persistence, so experiment-related files are deleted after a container stops running.
- Next: maybe clean up `TDB_workdirs` once a week or something.

Authentication System
What is OIDC and CILogon
OIDC (OpenID Connect) allows users to log in through third-party identity providers (like Google), but most major providers require you to register and sometimes pay. If you aren't familiar, watch "OAuth 2.0 and OpenID Connect (in plain English)" by OktaDev before you proceed.
Now, CILogon is a free identity platform designed for researchers, universities, and federally recognized institutions. What makes CILogon special is that it supports over 5,000 identity providers, including Google, Microsoft, and various universities and research institutions. However, to use CILogon, your app must be affiliated with an approved institution (like a university or research organization). So it's free, but it comes with that restriction.
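To make the redirect step concrete, here's a minimal sketch of building the CILogon authorization URL for the standard OIDC authorization-code flow. The `client_id` and `redirect_uri` values are placeholders; our actual routes are described under "Authentication Infrastructure and User Journey" below:

```python
from urllib.parse import urlencode


def cilogon_authorize_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Build the redirect URL that sends the user to CILogon to log in."""
    params = {
        "response_type": "code",   # standard OIDC authorization-code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,   # must match what's registered with CILogon
        "scope": "openid profile email org.cilogon.userinfo",
        "state": state,            # anti-CSRF token, echoed back on the callback
    }
    return "https://cilogon.org/authorize?" + urlencode(params)
```

After the user logs in, CILogon redirects back to `redirect_uri` with a one-time `code` that the backend exchanges for tokens.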
Scopes and Claims
When you authenticate with CILogon, you typically request the `openid` scope. Minimal claims guaranteed under `openid`:

- `sub`: Unique user ID across the provider
- `iss`: The issuer (CILogon)

Optional scopes:

- `email`: May provide the user's email (if available)
- `org.cilogon.userinfo`: Adds CILogon-specific claims:
  - `idp`: Identity provider's unique ID (e.g., Google, IU)
  - `idp_name`: Human-readable name of the provider

These optional claims may not always be present, which is why our app has an onboarding stage where users fill out these fields when they aren't there. We also have other mechanisms that check for fields that our app requires, such as `email`.

Authentication Infrastructure and User Journey
1. User Login Flow
User clicks "Sign In" → redirected to `/auth`.

- `/auth`: Begins the OIDC process, redirecting to `https://cilogon.org/authorize`.
- Requested scopes: `profile`, `email`, `org.cilogon.userinfo`.

2. Callback Handling (`/callback`)
/callback)Exchange authorization code for access token.
Use token to fetch user info.
On success:
3. Authentication Check (`/auth/me`)

Read token from cookie; if missing → reject.
Validate token with CILogon userinfo endpoint.
Extract and verify user email.
Query DB for the user and build the `user_auth` object (`approved: false`, etc.), then return the response.

4. Logout (`/auth/logout`)

Onboarding Flow
Post-login, users who aren't onboarded are redirected to `/onboarding`. The onboarding form submits to `/api/user/onboard`.

Onboarding Handler (`/api/user/onboard`)

Admin Approval Flow
Admins can update users via `/api/admin/users/:userId` (fields like `admin`, `approved`, etc.). When a user becomes `approved`, notify relevant parties.

Email Notifications
Deployment Overview
Basics: Containerization & Reverse Proxy
- The app listens on port `8000` inside the container.
- We reverse-proxy to port `8000` inside the app container using Caddy, forwarding traffic from `js2-gateway.ear180013.projects.jetstream-cloud.org` to port 8000, which is where our containers are located.

Caddy + Docker Swarm (Production Setup)
Caddy is a modern web server acting as a reverse proxy with automatic HTTPS/TLS certificate management (no manual setup).
Caddy is configured via container labels and used only in production (via `stack.yml`). Docker Swarm is what runs the production stack.
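For orientation, label-driven Caddy config in a Swarm stack file typically looks something like this. This is a hypothetical fragment in the style of the caddy-docker-proxy project — the service name, image, and exact labels are illustrative, not copied from our `stack.yml`:

```yaml
services:
  app:
    image: ghcr.io/example/js2-gateway:latest   # illustrative image name
    deploy:
      labels:
        # caddy-docker-proxy style: Caddy watches these labels and
        # reverse-proxies the public hostname to port 8000 in the container.
        caddy: js2-gateway.ear180013.projects.jetstream-cloud.org
        caddy.reverse_proxy: "{{upstreams 8000}}"
```

The benefit of label-based config is that the routing rules live next to the service definition, so adding a service to the proxy is just adding labels in `stack.yml`.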
GitHub Actions (CI)
Located in `.github/workflows/docker-publish.yml`.

Trigger: Runs on push to `main` or when relevant files are changed.
latestDocker Swarm + Watchtower (CD)
Deployment file:
stack.ymlWatchtower monitors running containers and:
Swarm deploy config ensures:
--
Deployment Flow Summary
mainlatestimageBeta Was this translation helpful? Give feedback.
All reactions