
Canary#42

Merged
antidodo merged 7 commits into main from canary
Mar 17, 2026

Conversation

@antidodo
Member

@antidodo antidodo commented Mar 17, 2026

Summary by CodeRabbit

  • New Features

    • Analysis stop events are now streamed as timestamped logs for each stopped analysis.
    • Status monitoring shares partner node statuses with peer endpoints for improved coordination.
  • Refactor

    • Status processing converted to a synchronous flow for more reliable internal polling and error handling.
    • Nginx configuration now proxies the designated storage service.

antidodo and others added 6 commits March 11, 2026 12:03
Co-authored-by: Nightknight3000 <alexander.roehl@uni-tuebingen.de>
Co-authored-by: Nightknight3000 <alexander.roehl@uni-tuebingen.de>
Co-authored-by: antidodo <albin2993@gmail.com>
Co-authored-by: antidodo <albin2993@gmail.com>
Co-authored-by: antidodo <albin2993@gmail.com>
Co-authored-by: Nightknight3000 <alexander.roehl@uni-tuebingen.de>
@coderabbitai

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

Replace hub_core_client with hub_client for hub interactions; add AnalysisStoppedLog and stream it when analyses stop; switch status polling code from async to sync HTTP client and add partner-status retrieval/submission; add helper functions for node status retrieval; update Kubernetes nginx service selector to use the storage service label.

Changes

Cohort / File(s) Summary
API & Logging
src/api/api.py, src/resources/log/entity.py
Replaced use of hub_core_client with hub_client; added AnalysisStoppedLog and stream it via hub_client when analyses stop.
Status & Hub utilities
src/status/status.py, src/utils/hub_client.py
Converted status polling and internal HTTP calls from async to sync (httpx AsyncClient → Client); added inform_analysis_of_partner_statuses() and helper functions get_analysis_node_statuses() / get_partner_node_statuses() to fetch and POST partner node statuses. Updated internal status/token-refresh functions to synchronous variants.
Kubernetes config
src/k8s/kubernetes.py
Changed nginx config map service selector label from component=flame-result-service to component=flame-storage-service.

Sequence Diagram(s)

sequenceDiagram
    participant Poller as StatusPoller
    participant Deploy as InternalDeployment
    participant Hub as HubClient
    participant Nginx as NginxEndpoint

    Poller->>Deploy: GET /analysis/healthz (sync)
    Deploy-->>Poller: deployment status
    alt node_analysis_id present
        Poller->>Hub: GET node analyses for analysis_id
        Hub-->>Poller: node statuses (map)
        Poller->>Nginx: POST /node/partner-status (partner statuses)
        Nginx-->>Poller: 200/error
    end
    alt analysis stopped
        Poller->>Hub: stream AnalysisStoppedLog(analysis_id, node_id)
        Hub-->>Poller: ack
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through logs and status streams,
Sent partner news through nginx beams.
When analyses fold and stillness drops,
I stream a note as heartbeat stops.
A little rabbit code applauds the ops. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation / Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.53%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
Title check | ❓ Inconclusive | The title 'Canary' is vague and does not clearly describe the substantive changes in the pull request, which include API logging, Kubernetes service configuration, status synchronization, and hub client utilities. Resolution: replace the title with a descriptive summary of the main changes, such as 'Add analysis stopped logging and sync status updates' or 'Refactor hub client usage and add partner status reporting'.
✅ Passed checks (1 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Co-authored-by: Nightknight3000 <alexander.roehl@uni-tuebingen.de>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/api/api.py`:
- Around line 199-205: Before calling stop_analysis('all', ...), capture the set
of analyses that were actually running by calling database.get_analysis_ids()
and filtering with database.analysis_is_running(analysis_id) (e.g.,
running_ids_before = [id for id in database.get_analysis_ids() if
database.analysis_is_running(id)]); then call stop_analysis('all',
self.database) and iterate only over running_ids_before (optionally re-check
database.analysis_is_running(analysis_id) is now False) when creating and
streaming AnalysisStoppedLog via stream_logs to ensure you only emit stopped
logs for analyses that truly transitioned to stopped.

In `@src/status/status.py`:
- Around line 180-185: The branch checking was_stopped never triggers because
_get_internal_deployment_status currently normalizes non-finished/running/stuck
statuses to failed; update that normalization so a true internal "stopped"
status is preserved (i.e., return AnalysisStatus.STOPPED or its .value from
_get_internal_deployment_status instead of mapping it to failed), and keep the
existing check that uses was_stopped (int_status ==
AnalysisStatus.STOPPED.value) so the finalization path for stopped runs is
executed; ensure any callers of _get_internal_deployment_status still handle the
preserved STOPPED enum/value.
- Around line 136-143: The code calls get_partner_node_statuses(...) and then
unconditionally uses node_statuses (len(node_statuses) and as JSON), which can
be None; change status submission logic in the status loop to first check
node_statuses is not None (or coerce to an empty list/dict) before computing
len() or including it in the POST body; specifically update the block that calls
get_partner_node_statuses, the subsequent print and the Client.post JSON payload
so it skips the HTTP call or sends an empty safe structure when node_statuses is
None, ensuring no TypeError is raised and the broad except in status_loop does
not hide this case.
- Around line 140-143: The per-request httpx.Client invocations in
src/status/status.py (e.g., the Client(...) used to POST to
"/analysis/partner_status" with deployment_name/PORTS and the similar clients at
the other locations) must be used as context managers (with Client(...) as
client: client.post(...)) so transports are closed instead of leaked; update
those call sites (the Client(...) creation around the partner_status POST and
the other two places around lines ~195-197) to use with and close the client
after the request. Also extend _refresh_keycloak_token() error handling to catch
httpx.ConnectError and httpx.ConnectTimeout (in addition to
httpx.HTTPStatusError), log the failure and either return an appropriate
sentinel/falsy value or retry/backoff as the surrounding code expects so
network/connectivity failures don’t bubble out of the long-running status loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6794f777-8eee-40d6-97bb-22efea4c431a

📥 Commits

Reviewing files that changed from the base of the PR and between cfd4539 and e4303e1.

📒 Files selected for processing (5)
  • src/api/api.py
  • src/k8s/kubernetes.py
  • src/resources/log/entity.py
  • src/status/status.py
  • src/utils/hub_client.py

Comment on lines +199 to +205
            response = stop_analysis('all', self.database)
            for analysis_id in self.database.get_analysis_ids():
                stream_logs(AnalysisStoppedLog(analysis_id),
                            self.node_id,
                            self.enable_hub_logging,
                            self.database,
                            self.hub_client)


⚠️ Potential issue | 🟠 Major

Only emit stop logs for analyses that actually transitioned to stopped.

database.get_analysis_ids() is treated elsewhere as the full set and filtered through database.analysis_is_running() in src/status/status.py Lines 55-56. Iterating it here after stop_analysis('all', ...) will also append AnalysisStoppedLog entries to finished/failed analyses, and stream_logs() can then propagate a bogus stopped lifecycle event.

Proposed fix
-            response = stop_analysis('all', self.database)
-            for analysis_id in self.database.get_analysis_ids():
+            analysis_ids = [
+                analysis_id
+                for analysis_id in self.database.get_analysis_ids()
+                if self.database.analysis_is_running(analysis_id)
+            ]
+            response = stop_analysis('all', self.database)
+            for analysis_id in analysis_ids:
                 stream_logs(AnalysisStoppedLog(analysis_id),
                             self.node_id,
                             self.enable_hub_logging,
                             self.database,
                             self.hub_client)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            response = stop_analysis('all', self.database)
-            for analysis_id in self.database.get_analysis_ids():
+            analysis_ids = [
+                analysis_id
+                for analysis_id in self.database.get_analysis_ids()
+                if self.database.analysis_is_running(analysis_id)
+            ]
+            response = stop_analysis('all', self.database)
+            for analysis_id in analysis_ids:
                 stream_logs(AnalysisStoppedLog(analysis_id),
                             self.node_id,
                             self.enable_hub_logging,
                             self.database,
                             self.hub_client)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/api.py` around lines 199 - 205, Before calling stop_analysis('all',
...), capture the set of analyses that were actually running by calling
database.get_analysis_ids() and filtering with
database.analysis_is_running(analysis_id) (e.g., running_ids_before = [id for id
in database.get_analysis_ids() if database.analysis_is_running(id)]); then call
stop_analysis('all', self.database) and iterate only over running_ids_before
(optionally re-check database.analysis_is_running(analysis_id) is now False)
when creating and streaming AnalysisStoppedLog via stream_logs to ensure you
only emit stopped logs for analyses that truly transitioned to stopped.
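The capture-before-stop pattern this finding asks for can be sketched in isolation. `FakeDatabase` and `stop_all_and_collect_stop_events` below are hypothetical stand-ins for the project's `Database` API and stop path, kept minimal so the ordering is the only point:

```python
from dataclasses import dataclass, field

@dataclass
class FakeDatabase:
    """Hypothetical stand-in for the project's Database object."""
    running: set = field(default_factory=set)
    finished: set = field(default_factory=set)

    def get_analysis_ids(self):
        return sorted(self.running | self.finished)

    def analysis_is_running(self, analysis_id):
        return analysis_id in self.running

    def stop_all(self):
        self.finished |= self.running
        self.running.clear()

def stop_all_and_collect_stop_events(db: FakeDatabase) -> list[str]:
    # Capture the running set BEFORE stopping, so analyses that were already
    # finished/failed do not receive a bogus "stopped" lifecycle event.
    running_before = [a for a in db.get_analysis_ids() if db.analysis_is_running(a)]
    db.stop_all()
    return running_before  # only these ids would get an AnalysisStoppedLog

db = FakeDatabase(running={"a2"}, finished={"a1"})
print(stop_all_and_collect_stop_events(db))  # ['a2']
```

Snapshotting the running ids before the mutation is the whole fix: iterating `get_analysis_ids()` afterwards can no longer distinguish analyses that just transitioned from those that ended long ago.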

Comment on lines +136 to +143
    node_statuses = get_partner_node_statuses(hub_client, analysis_id, node_analysis_id)
    deployment_name = database.get_latest_deployment(analysis_id).deployment_name
    try: # try except, in case analysis api is not yet ready
        print(f"Node statuses to be submitted to {f'http://nginx-{deployment_name}:80'}: {node_statuses} (size={len(node_statuses)})")
        response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post(url="/analysis/partner_status",
                                                                                               headers=[('Connection', 'close')],
                                                                                               json={'partner_status': node_statuses})
        response.raise_for_status()


⚠️ Potential issue | 🟡 Minor

Handle the None fallback before touching node_statuses.

get_partner_node_statuses() returns Optional[...], but Line 139 calls len(node_statuses) unconditionally and the POST body also assumes a dict. When hub lookup fails, this turns the intended no-op path into a TypeError that gets swallowed by the broad except in status_loop().

Proposed fix
 def inform_analysis_of_partner_statuses(database: Database,
                                         hub_client: flame_hub.CoreClient,
                                         analysis_id: str,
                                         node_analysis_id: str) -> Optional[dict[str, str]]:
     node_statuses = get_partner_node_statuses(hub_client, analysis_id, node_analysis_id)
+    if node_statuses is None:
+        return None
     deployment_name = database.get_latest_deployment(analysis_id).deployment_name
     try: # try except, in case analysis api is not yet ready
         print(f"Node statuses to be submitted to {f'http://nginx-{deployment_name}:80'}: {node_statuses} (size={len(node_statuses)})")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     node_statuses = get_partner_node_statuses(hub_client, analysis_id, node_analysis_id)
+    if node_statuses is None:
+        return None
     deployment_name = database.get_latest_deployment(analysis_id).deployment_name
     try: # try except, in case analysis api is not yet ready
         print(f"Node statuses to be submitted to {f'http://nginx-{deployment_name}:80'}: {node_statuses} (size={len(node_statuses)})")
         response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post(url="/analysis/partner_status",
                                                                                                headers=[('Connection', 'close')],
                                                                                                json={'partner_status': node_statuses})
         response.raise_for_status()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 136 - 143, The code calls
get_partner_node_statuses(...) and then unconditionally uses node_statuses
(len(node_statuses) and as JSON), which can be None; change status submission
logic in the status loop to first check node_statuses is not None (or coerce to
an empty list/dict) before computing len() or including it in the POST body;
specifically update the block that calls get_partner_node_statuses, the
subsequent print and the Client.post JSON payload so it skips the HTTP call or
sends an empty safe structure when node_statuses is None, ensuring no TypeError
is raised and the broad except in status_loop does not hide this case.

Comment on lines +140 to +143
response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post(url="/analysis/partner_status",
headers=[('Connection', 'close')],
json={'partner_status': node_statuses})
response.raise_for_status()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "status.py" -path "*/status/*" | head -20

Repository: PrivateAIM/node-pod-orchestration

Length of output: 99


🏁 Script executed:

cat -n src/status/status.py | head -250

Repository: PrivateAIM/node-pod-orchestration

Length of output: 15453


🏁 Script executed:

# Check imports and the _refresh_keycloak_token function
rg "def _refresh_keycloak_token|from httpx|import httpx|ConnectError|ConnectTimeout" src/status/status.py -A 3 -B 1

Repository: PrivateAIM/node-pod-orchestration

Length of output: 1170


Close the per-request httpx.Clients and add error handling for network failures in token refresh.

These branches allocate fresh Clients without closing them. In the long-running status loop, this accumulates open transports. Additionally, _refresh_keycloak_token() at line 239 only handles HTTPStatusError but will let ConnectError/ConnectTimeout bubble out.

Suggested pattern
-        response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post(url="/analysis/partner_status",
-                                                                                               headers=[('Connection', 'close')],
-                                                                                               json={'partner_status': node_statuses})
+        with Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}") as client:
+            response = client.post(url="/analysis/partner_status",
+                                   headers=[('Connection', 'close')],
+                                   json={'partner_status': node_statuses})

-            response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").get("/analysis/healthz",
-                                                                                                  headers=[('Connection', 'close')])
+            with Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}") as client:
+                response = client.get("/analysis/healthz",
+                                      headers=[('Connection', 'close')])

-        try:
-            response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post("/analysis/token_refresh",
-                                                                                                   json={'token': keycloak_token},
-                                                                                                   headers=[('Connection', 'close')])
+        try:
+            with Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}") as client:
+                response = client.post("/analysis/token_refresh",
+                                       json={'token': keycloak_token},
+                                       headers=[('Connection', 'close')])
             response.raise_for_status()
         except HTTPStatusError as e:
             print(f"Error: Failed to refresh keycloak token in deployment {deployment_name}.\n{e}")
+        except (ConnectError, ConnectTimeout) as e:
+            print(f"Error: Failed to reach deployment {deployment_name} for token refresh.\n{e}")

Also applies to: 195-197, 239-241

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 140 - 143, The per-request httpx.Client
invocations in src/status/status.py (e.g., the Client(...) used to POST to
"/analysis/partner_status" with deployment_name/PORTS and the similar clients at
the other locations) must be used as context managers (with Client(...) as
client: client.post(...)) so transports are closed instead of leaked; update
those call sites (the Client(...) creation around the partner_status POST and
the other two places around lines ~195-197) to use with and close the client
after the request. Also extend _refresh_keycloak_token() error handling to catch
httpx.ConnectError and httpx.ConnectTimeout (in addition to
httpx.HTTPStatusError), log the failure and either return an appropriate
sentinel/falsy value or retry/backoff as the surrounding code expects so
network/connectivity failures don’t bubble out of the long-running status loop.
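The two fixes this comment asks for (context-managed clients and broader exception handling) can be sketched without a live network. `DummyClient` below is a hypothetical stand-in for `httpx.Client`, and the plain built-in `ConnectionError` stands in for `httpx.ConnectError`/`httpx.ConnectTimeout`; the shape of the `with` block and the `except` clause is the point, not the client itself:

```python
class DummyClient:
    """Hypothetical stand-in for httpx.Client supporting the context-manager protocol."""
    def __init__(self, base_url):
        self.base_url = base_url
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.closed = True  # transport would be released here
        return False

    def post(self, url, json=None):
        if "unreachable" in self.base_url:
            raise ConnectionError(f"cannot reach {self.base_url}")
        return {"status": 200}

def refresh_token(base_url, token):
    try:
        with DummyClient(base_url) as client:  # client is always closed on exit
            return client.post("/analysis/token_refresh", json={"token": token})
    except ConnectionError as e:
        # With real httpx this clause would catch httpx.ConnectError / httpx.ConnectTimeout
        # in addition to the existing httpx.HTTPStatusError handler.
        print(f"token refresh failed: {e}")
        return None

print(refresh_token("http://nginx-deploy:80", "tkn"))  # {'status': 200}
print(refresh_token("http://unreachable:80", "tkn"))   # None
```

Because the failure path returns a sentinel instead of raising, a connectivity blip no longer escapes the long-running status loop, and every client created per request is deterministically closed.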

Comment on lines +180 to +185
    was_stopped = int_status == AnalysisStatus.STOPPED.value
    if is_stuck or is_slow:
        return 'unstuck'
    elif newly_running:
        return 'running'
-    elif speedy_finished or newly_ended or firmly_stuck:
+    elif speedy_finished or newly_ended or firmly_stuck or was_stopped:


⚠️ Potential issue | 🟠 Major

was_stopped is dead code until the health mapper returns stopped.

_get_internal_deployment_status() still normalizes every non-finished/running/stuck result to failed, so this branch never fires. A real internal stopped result will continue down the failure path instead of the intended finishing path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 180 - 185, The branch checking was_stopped
never triggers because _get_internal_deployment_status currently normalizes
non-finished/running/stuck statuses to failed; update that normalization so a
true internal "stopped" status is preserved (i.e., return AnalysisStatus.STOPPED
or its .value from _get_internal_deployment_status instead of mapping it to
failed), and keep the existing check that uses was_stopped (int_status ==
AnalysisStatus.STOPPED.value) so the finalization path for stopped runs is
executed; ensure any callers of _get_internal_deployment_status still handle the
preserved STOPPED enum/value.
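The mapping change can be illustrated in isolation. `normalize_internal_status` below is a hypothetical reduction of `_get_internal_deployment_status`, showing only the normalization step: STOPPED is passed through instead of being collapsed into FAILED, which is what makes the downstream `was_stopped` check reachable:

```python
from enum import Enum

class AnalysisStatus(Enum):
    FINISHED = "finished"
    RUNNING = "running"
    STUCK = "stuck"
    STOPPED = "stopped"
    FAILED = "failed"

def normalize_internal_status(raw: str) -> str:
    # Preserve STOPPED rather than mapping it to FAILED, so that
    # int_status == AnalysisStatus.STOPPED.value can actually hold downstream.
    passthrough = {s.value for s in (AnalysisStatus.FINISHED,
                                     AnalysisStatus.RUNNING,
                                     AnalysisStatus.STUCK,
                                     AnalysisStatus.STOPPED)}
    return raw if raw in passthrough else AnalysisStatus.FAILED.value

print(normalize_internal_status("stopped"))  # 'stopped'
print(normalize_internal_status("weird"))    # 'failed'
```

Any caller that previously treated everything unfamiliar as FAILED now needs to handle the preserved STOPPED value explicitly, which is the caveat the review raises.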


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (4)
src/status/status.py (4)

136-150: ⚠️ Potential issue | 🟡 Minor

Handle None return from get_partner_node_statuses and close the HTTP client.

Two issues remain:

  1. get_partner_node_statuses() returns Optional[...], but the code unconditionally sends node_statuses in the POST body. When the hub lookup fails, this sends {'partner_status': null} which may not be the intended behavior.
  2. The Client instance is not closed, leaking resources in the long-running status loop.
🔧 Proposed fix
 def inform_analysis_of_partner_statuses(database: Database,
                                         hub_client: flame_hub.CoreClient,
                                         analysis_id: str,
                                         node_analysis_id: str) -> Optional[dict[str, str]]:
     node_statuses = get_partner_node_statuses(hub_client, analysis_id, node_analysis_id)
+    if node_statuses is None:
+        return None
     deployment_name = database.get_latest_deployment(analysis_id).deployment_name
     try: # try except, in case analysis api is not yet ready
-        response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post(url="/analysis/partner_status",
-                                                                                               headers=[('Connection', 'close')],
-                                                                                               json={'partner_status': node_statuses})
+        with Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}") as client:
+            response = client.post(url="/analysis/partner_status",
+                                   headers=[('Connection', 'close')],
+                                   json={'partner_status': node_statuses})
         response.raise_for_status()
         return response.json()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 136 - 150, get_partner_node_statuses may
return None and the Client is never closed; update the block that builds
node_statuses and uses Client so you first check the result of
get_partner_node_statuses(…) and avoid sending {'partner_status': None} (either
return None/skip the POST or send an empty/validated payload) and ensure the
HTTP client is closed by using the Client as a context manager (with Client(...)
as client:) or explicitly calling client.close() after use; update the code
paths that reference node_statuses and the POST (the Client(...) .post and
response handling) so they respect the None case and always close the Client.

179-184: ⚠️ Potential issue | 🟡 Minor

was_stopped is unreachable with current status mapping.

The was_stopped check will never evaluate to True because _get_internal_deployment_status() (lines 220-228) maps any status that isn't FINISHED, RUNNING, or STUCK to FAILED. The STOPPED status is never preserved.

To enable this code path, update the status mapping in _get_internal_deployment_status:

🔧 Proposed fix in _get_internal_deployment_status
     # Map status from response to preset values
     if analysis_status == AnalysisStatus.FINISHED.value:
         health_status = AnalysisStatus.FINISHED.value
     elif analysis_status == AnalysisStatus.RUNNING.value:
         health_status = AnalysisStatus.RUNNING.value
     elif analysis_status == AnalysisStatus.STUCK.value:
         health_status = AnalysisStatus.STUCK.value
+    elif analysis_status == AnalysisStatus.STOPPED.value:
+        health_status = AnalysisStatus.STOPPED.value
     else:
         health_status = AnalysisStatus.FAILED.value
     return health_status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 179 - 184, The was_stopped branch is
unreachable because _get_internal_deployment_status currently collapses any
non-FINISHED/RUNNING/STUCK status into FAILED; update
_get_internal_deployment_status so it preserves AnalysisStatus.STOPPED (i.e.,
detect AnalysisStatus.STOPPED and return the corresponding internal status/value
rather than mapping it to FAILED) so that int_status can equal
AnalysisStatus.STOPPED.value and the was_stopped check in the status resolution
logic will be reachable; verify _get_internal_deployment_status and any
enum/value returned remain consistent with the was_stopped comparison.

238-244: ⚠️ Potential issue | 🟡 Minor

Close the client and handle network errors in token refresh.

Two issues:

  1. The Client is not closed, leaking resources.
  2. Only HTTPStatusError is caught; ConnectError and ConnectTimeout will propagate up and could disrupt the status loop.
🔧 Proposed fix
         try:
-            response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").post("/analysis/token_refresh",
-                                                                                                   json={'token': keycloak_token},
-                                                                                                   headers=[('Connection', 'close')])
+            with Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}") as client:
+                response = client.post("/analysis/token_refresh",
+                                       json={'token': keycloak_token},
+                                       headers=[('Connection', 'close')])
             response.raise_for_status()
         except HTTPStatusError as e:
             print(f"Error: Failed to refresh keycloak token in deployment {deployment_name}.\n{e}")
+        except (ConnectError, ConnectTimeout) as e:
+            print(f"Error: Failed to reach deployment {deployment_name} for token refresh.\n{e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 238 - 244, The token refresh code uses
Client(...) without closing it and only catches HTTPStatusError; change to
create the client with a context manager (use "with Client(base_url=...) as
client:" or ensure client.aclose() in a finally block) and replace the current
except to catch HTTPStatusError, ConnectError, and ConnectTimeout (import these
exceptions) so network/connect failures are handled (log/print the error along
with deployment_name and the exception); keep the existing
response.raise_for_status() call inside the try so HTTP errors are still raised
and handled.

195-197: ⚠️ Potential issue | 🟡 Minor

Close the httpx.Client to prevent resource leaks.

The Client instance is created but never closed. In the long-running status loop with retry logic, this accumulates open transports.

🔧 Proposed fix
         try:
-            response = Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}").get("/analysis/healthz",
-                                                                                                  headers=[('Connection', 'close')])
+            with Client(base_url=f"http://nginx-{deployment_name}:{PORTS['nginx'][0]}") as client:
+                response = client.get("/analysis/healthz",
+                                      headers=[('Connection', 'close')])
             response.raise_for_status()
             break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 195 - 197, The httpx.Client is created but
never closed, causing transport/resource leaks; update the call that creates
Client(...) (the instance passed to get and whose response.raise_for_status() is
used) to either be used in a context manager (with Client(base_url=...) as
client: response = client.get(...); response.raise_for_status()) or ensure
client.close() is called in a finally block after using
response.raise_for_status(), so the Client instance is properly disposed between
retries.
🧹 Nitpick comments (1)
src/status/status.py (1)

78-86: Consider narrowing the exception catch or logging the exception type.

The broad Exception catch prevents the status loop from crashing, which is appropriate for resilience. However, it may hide unexpected errors. Consider catching specific exceptions or at minimum logging the exception type for better debuggability.

♻️ Suggested improvement
                     try:
                         # Inform local analysis of partner node statuses
                         _ = inform_analysis_of_partner_statuses(database,
                                                                 hub_client,
                                                                 analysis_id,
                                                                 node_analysis_id)
                     except Exception as e:
-                        print(f"\tPO STATUS LOOP - Error when attempting to access partner_status endpoint of {analysis_id} ({repr(e)})")
+                        print(f"\tPO STATUS LOOP - Error when attempting to access partner_status endpoint of {analysis_id} ({type(e).__name__}: {e!r})")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/status/status.py` around lines 78 - 86, Narrow the broad except in the
try/except around the call to inform_analysis_of_partner_statuses(database,
hub_client, analysis_id, node_analysis_id) by either catching specific expected
exceptions (e.g., requests.exceptions.RequestException, ConnectionError,
TimeoutError, or any domain-specific errors thrown by
inform_analysis_of_partner_statuses) or, if you must keep a general fallback, at
minimum include the exception type in the log; replace the generic except
Exception as e with targeted except clauses and change the print to include
type(e).__name__ along with repr(e) so the status loop remains resilient but
errors are easier to diagnose.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/status/status.py`:
- Around line 136-150: get_partner_node_statuses may return None and the Client
is never closed; update the block that builds node_statuses and uses Client so
you first check the result of get_partner_node_statuses(…) and avoid sending
{'partner_status': None} (either return None/skip the POST or send an
empty/validated payload) and ensure the HTTP client is closed by using the
Client as a context manager (with Client(...) as client:) or explicitly calling
client.close() after use; update the code paths that reference node_statuses and
the POST (the Client(...) .post and response handling) so they respect the None
case and always close the Client.
- Around line 179-184: The was_stopped branch is unreachable because
_get_internal_deployment_status currently collapses any
non-FINISHED/RUNNING/STUCK status into FAILED; update
_get_internal_deployment_status so it preserves AnalysisStatus.STOPPED (i.e.,
detect AnalysisStatus.STOPPED and return the corresponding internal status/value
rather than mapping it to FAILED) so that int_status can equal
AnalysisStatus.STOPPED.value and the was_stopped check in the status resolution
logic will be reachable; verify _get_internal_deployment_status and any
enum/value returned remain consistent with the was_stopped comparison.
- Around line 238-244: The token refresh code uses Client(...) without closing
it and only catches HTTPStatusError; change to create the client with a context
manager (use "with Client(base_url=...) as client:" or ensure client.close() in
a finally block) and replace the current except to catch HTTPStatusError,
ConnectError, and ConnectTimeout (import these exceptions) so network/connect
failures are handled (log/print the error along with deployment_name and the
exception); keep the existing response.raise_for_status() call inside the try so
HTTP errors are still raised and handled.
- Around line 195-197: The httpx.Client is created but never closed, causing
transport/resource leaks; update the call that creates Client(...) (the instance
passed to get and whose response.raise_for_status() is used) to either be used
in a context manager (with Client(base_url=...) as client: response =
client.get(...); response.raise_for_status()) or ensure client.close() is called
in a finally block after using response.raise_for_status(), so the Client
instance is properly disposed between retries.

---
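For the unreachable `was_stopped` branch flagged above, the fix amounts to letting the internal mapping pass STOPPED through instead of collapsing it into FAILED. A sketch with an assumed enum (the real AnalysisStatus members and values may differ):

```python
from enum import Enum

class AnalysisStatus(Enum):
    # Hypothetical mirror of the orchestrator's status enum.
    RUNNING = "running"
    FINISHED = "finished"
    STUCK = "stuck"
    STOPPED = "stopped"
    FAILED = "failed"

# Statuses the internal mapping should report unchanged.
PASS_THROUGH = {AnalysisStatus.RUNNING, AnalysisStatus.FINISHED,
                AnalysisStatus.STUCK, AnalysisStatus.STOPPED}

def get_internal_deployment_status(raw: AnalysisStatus) -> str:
    """Preserve STOPPED so a later `int_status == AnalysisStatus.STOPPED.value`
    comparison (the was_stopped check) remains reachable."""
    if raw in PASS_THROUGH:
        return raw.value
    return AnalysisStatus.FAILED.value
```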

Nitpick comments:
In `@src/status/status.py`:
- Around line 78-86: Narrow the broad except in the try/except around the call
to inform_analysis_of_partner_statuses(database, hub_client, analysis_id,
node_analysis_id) by either catching specific expected exceptions (e.g.,
requests.exceptions.RequestException, ConnectionError, TimeoutError, or any
domain-specific errors thrown by inform_analysis_of_partner_statuses) or, if you
must keep a general fallback, at minimum include the exception type in the log;
replace the generic except Exception as e with targeted except clauses and
change the print to include type(e).__name__ along with repr(e) so the status
loop remains resilient but errors are easier to diagnose.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3b3b2d6f-5959-492a-be6a-028c429ad693

📥 Commits

Reviewing files that changed from the base of the PR and between e4303e1 and 28ec410.

📒 Files selected for processing (1)
  • src/status/status.py

@antidodo antidodo merged commit 67daf88 into main Mar 17, 2026
2 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 20, 2026
