
Conversation


@mihow mihow commented Oct 30, 2025

Summary

This PR adds retry logic and improved error handling for processing service health checks and ML pipeline requests. These are safe, non-breaking changes extracted from PR #981 to improve reliability when communicating with external processing services.

Details

Offline processing services

Currently the health check for a processing service fails very easily: a service goes to offline status as soon as it is added, and again the first time you process images after a period of inactivity. This makes most jobs fail on the first attempt, and periodically during the job.

[screenshot]

Now the health check handles cold starts better and retries on failure. It also no longer checks health before every image batch is sent; instead it relies on the periodic check that runs in the background.

Error messages from the processing service in the job logs

Currently, if a single image or image batch fails in an unhandled way on the processing service, only a generic, uninformative message is displayed in Antenna.

[screenshot]

Now it's a little better:

Failed to process pipeline request with 2 images and 0 detections to pipeline 'constant': HTTP 500: Internal Server Error | Response text: Internal Server Error

The message is still limited because, in the deep context where the exception is raised, we don't know which job, which batch number, or what error actually happened on the processing service side. But it's a little better. The next step is to update the processing services to catch deeper errors in PyTorch / the ML processing and translate them into more useful error messages in their API responses, but this will depend on the maintainers of each processing service.
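
As an illustration of what that could look like on the service side, a FastAPI handler might translate deep ML errors into the standard detail field that Antenna now extracts. This is a hypothetical sketch, not code from any existing processing service; run_pipeline is a placeholder, and torch.cuda.OutOfMemoryError requires a reasonably recent PyTorch:

# Hypothetical sketch: a processing-service endpoint that converts low-level
# ML errors into structured FastAPI error responses with a "detail" field.
import torch
from fastapi import FastAPI, HTTPException

app = FastAPI()


def run_pipeline(request: dict) -> dict:
    # Placeholder for the service's actual ML processing
    raise RuntimeError("model not loaded")


@app.post("/process")
def process(request: dict):
    try:
        return run_pipeline(request)
    except torch.cuda.OutOfMemoryError as e:
        # Surface GPU OOM clearly instead of a bare 500
        raise HTTPException(status_code=503, detail=f"CUDA out of memory: {e}")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"{type(e).__name__}: {e}")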

Motivation

External processing services (especially serverless ones) can experience:

  • Cold starts requiring 30-90s to load models into memory
  • Transient network failures causing temporary connection errors
  • Unclear error messages making debugging difficult

This PR addresses these issues with automatic retries and better error reporting.


Changes

1. Retry Mechanism for Health Checks (5c635c4)

File: ami/ml/models/processing_service.py

  • Add urllib3 Retry with exponential backoff to ProcessingService.get_status()
    • 3 retries with 2s backoff factor (delays: 0s, 2s, 4s)
    • Retries on connection errors and status codes: 500, 502, 503, 504
  • Increase timeout from 6s to 90s to accommodate serverless cold starts
  • Prevents spurious "service offline" errors from transient failures

Benefits:

  • Services that are temporarily unreachable are retried automatically
  • Cold starts don't immediately mark services as failed
  • More reliable health checking with minimal code change (see the sketch below)
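
A minimal sketch of this retry setup, assuming a plain requests session; the actual create_session helper in ami/utils/requests.py may differ in names and defaults, and the URL below is illustrative:

# Sketch only: retry-enabled session for health checks, with exponential
# backoff on connection errors and 500/502/503/504 responses.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_retry_session(retries: int = 3, backoff_factor: float = 2.0) -> requests.Session:
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


if __name__ == "__main__":
    # Long timeout so serverless cold starts are not marked offline
    session = create_retry_session()
    resp = session.get("https://processing-service.example.com/readyz", timeout=90)
    resp.raise_for_status()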

2. Use Cached Status Before Sending Images (fb6fdc6)

File: ami/ml/models/pipeline.py

  • Use cached last_checked_live and last_checked_latency fields instead of calling get_status() every time
  • Reduces unnecessary health checks when processing batches of images
  • Still checks all services, just uses the most recent cached status

Benefits:

  • Faster pipeline selection (no redundant health checks)
  • Lower load on processing services
  • Relies on periodic health checks (already running) for status updates (see the selection sketch below)
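
A rough sketch of the selection logic this enables; the attribute names follow the description above, but the real implementation in ami/ml/models/pipeline.py differs in structure and error types:

# Sketch only: pick the online service with the lowest cached latency,
# relying on the periodic background health check to keep the cached
# last_checked_live / last_checked_latency fields fresh.
def choose_processing_service(processing_services, pipeline_name: str):
    online = [s for s in processing_services if s.last_checked_live]
    if not online:
        raise RuntimeError(f'No processing services are online for the pipeline "{pipeline_name}".')
    with_latency = [s for s in online if s.last_checked_latency is not None]
    if with_latency:
        return min(with_latency, key=lambda s: s.last_checked_latency)
    # Fall back to any online service if no latency has been recorded yet
    return online[0]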

3. Better Error Messages from Processing Services (d3f5839)

File: ami/utils/requests.py

  • Add extract_error_message_from_response() utility function
  • Extracts detailed error info from FastAPI/HTTP responses
  • Prioritizes detail field (FastAPI standard), falls back to full JSON, text, or raw bytes
  • Limits output to 500 chars to avoid log spam (see the sketch below)
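
Roughly, the extraction order looks like the sketch below. The signature matches the one described in the review; the exact message format and field handling in ami/utils/requests.py may differ:

# Sketch only: prefer the FastAPI "detail" field, then the full JSON body,
# then the response text, then the raw bytes; cap the length for job logs.
import requests

MAX_ERROR_LENGTH = 500


def extract_error_message_from_response(resp: requests.Response) -> str:
    prefix = f"HTTP {resp.status_code}: {resp.reason}"
    try:
        data = resp.json()
        if isinstance(data, dict) and "detail" in data:
            body = str(data["detail"])
        elif isinstance(data, dict):
            body = ", ".join(f"{k}: {v}" for k, v in data.items())
        else:
            body = str(data)
        message = f"{prefix} | {body}"
    except ValueError:
        try:
            message = f"{prefix} | Response text: {resp.text}"
        except Exception:
            message = f"{prefix} | Response content: {resp.content!r}"
    return message[:MAX_ERROR_LENGTH]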

File: ami/ml/models/pipeline.py

  • Use new utility in process_images() to log clearer error messages
  • Prefixes errors with "Processing service request failed:" for clarity

File: ami/utils/tests.py

  • Add comprehensive test coverage for extract_error_message_from_response()

Benefits:

  • CUDA OOM errors, model loading failures, and other service errors are now clearly visible in job logs
  • Easier debugging for users and developers
  • Consistent error format across all processing service calls

Testing

These commits were cherry-picked from PR #981 where they have been tested. The changes are:

  • Non-breaking: Only adds retry logic and improves logging
  • Safe: Uses standard urllib3 Retry patterns
  • Backwards compatible: No API or schema changes

To test manually:

  1. Run a job with an external processing service
  2. Observe improved error messages if service fails
  3. Verify retries happen automatically on transient failures

Related

  • Extracted from PR #981 "Update status of disconnected and stale jobs" (feat/job-status-checks)
  • Part of broader effort to improve job reliability and monitoring
  • Follow-up PRs will add:
    • Periodic job status checking
    • Worker health monitoring
    • Docker healthchecks for Celery workers


netlify bot commented Oct 30, 2025

Deploy Preview for antenna-preview canceled.

  • 🔨 Latest commit: 0433918
  • 🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/6903d40905cb520008c86daf


coderabbitai bot commented Oct 30, 2025

Walkthrough

Processing pipeline error handling now uses a centralized extractor for HTTP responses; processing service readiness checks use a retry-enabled session with a 90s timeout; pipeline service selection uses tracked latency/live attributes to pick the lowest-latency online service.

Changes

  • Error extraction helper & tests (ami/utils/requests.py, ami/utils/tests.py): Added extract_error_message_from_response(resp: requests.Response) -> str to build detailed error messages (HTTP status + FastAPI detail → kv pairs → text → raw content). Added unit test test_extract_error_message_from_response covering JSON detail, generic KV JSON, JSON parse failure, and text/content fallbacks.
  • Pipeline request summary (ami/ml/schemas.py): Added PipelineRequest.summary() instance method to return a human-friendly summary of the request (counts of source images and detections, and target pipeline).
  • Pipeline error handling & selection (ami/ml/models/pipeline.py): Replaced inline response parsing with extract_error_message_from_response; on non-OK responses, constructs a message using request_data.summary() and the extracted error string, then logs and raises HTTPError if there is no job context. Service selection now initializes lowest_latency = float("inf"), checks last_checked_live and last_checked_latency to choose the lowest-latency online service, and raises if none are online.
  • Processing service: retry session & timeout (ami/ml/models/processing_service.py): Switched the readiness/status check to use a create_session (retry-enabled) and session.get rather than raw requests.get. Increased the default get_status timeout from 6 to 90 seconds and added the return type ProcessingServiceStatusResponse. Added a docstring and an assertion ensuring the response exists before parsing.
  • Pipeline tests: error propagation (ami/ml/tests.py): Added test_run_pipeline_with_errors_from_processing_service, which simulates a missing image to trigger processing-service error handling and asserts that the job logs contain "Failed to process".

Sequence Diagram(s)

sequenceDiagram
    participant Pipeline as Pipeline.process_images
    participant Service as ProcessingService
    participant Session as HTTP Session (retry)
    participant Extract as extract_error_message_from_response
    rect rgb(240,248,255)
    Note over Pipeline,Service: Service selection (uses last_checked_live & last_checked_latency)
    Pipeline->>Service: request status (get_status via session)
    Service->>Session: GET /status (retry-enabled, timeout=90s)
    Session-->>Service: status response
    end
sequenceDiagram
    participant Pipeline as Pipeline.process_images
    participant Session as HTTP Session (retry)
    participant Service as Processing endpoint
    participant Extract as extract_error_message_from_response
    rect rgb(245,245,220)
    Pipeline->>Session: POST /process (with request_data)
    Session->>Service: HTTP request
    alt 200 OK
        Service-->>Session: 200 response
        Session-->>Pipeline: success payload
    else non-OK
        Service-->>Session: non-OK response
        Session-->>Pipeline: response
        Pipeline->>Extract: extract_error_message_from_response(resp)
        Extract-->>Pipeline: error_msg (status | detail / kv / text / content)
        Pipeline->>Pipeline: msg = f"Failed to process {request_data.summary()}: {error_msg}"
        Pipeline-->>Pipeline: log and raise HTTPError (if no job context) / attach to job.logs
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring attention:

  • Correctness and edge cases of extract_error_message_from_response (JSON shapes, binary content limits).
  • Compatibility of increasing get_status timeout to 90s and retry configuration with deployment expectations.
  • Proper maintenance and freshness of last_checked_live and last_checked_latency used by service selection.
  • Tests: ensure the new tests reliably simulate the various response branches and do not flake due to mocked behavior.

Poem

🐰 I sniffed the logs and found a clue,
A central extractor to parse what's true.
Sessions retry while services wake,
Latency guides the route we take.
Hops are fewer — time for carrot cake! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 75.00%, which is insufficient. The required threshold is 80.00%. Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check (⚠️ Warning): The pull request description provides excellent detail, motivation, and context about the changes, including clear explanations of the problems being solved, benefits, and testing approach. However, it significantly deviates from the required template structure. The description is missing several key sections including a structured "List of Changes" bullet-point list, a properly formatted "Related Issues" section with standard issue linking (such as "Relates to #981"), a "Deployment Notes" section, and the verification checklist. While the content quality is high and most information is present, the organization does not match the template's required structure, section headings, and format requirements. Resolution: The PR description should be restructured to match the template format. Please add a "List of Changes" section with bullet points (e.g., "* Added retry mechanism to ProcessingService.get_status()"), add a "Related Issues" section that uses the standard linking format (e.g., "Relates to #981"), include a "Deployment Notes" section (even if it states "No special deployment steps required"), and complete the verification checklist at the bottom. The detailed technical content can be incorporated into the "Detailed Description" section or maintained as "Changes" if you restructure the overall layout to follow the template's required sections first.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The PR title "Add retry mechanism and improved error handling for processing services" directly captures the two primary objectives of this changeset: implementing retry logic with exponential backoff for health checks and enhancing error message extraction and logging from processing services. The title is concise, specific, and clearly summarizes the main changes without vague terminology or unnecessary details. A developer scanning the commit history would immediately understand that this PR introduces reliability improvements for external processing service interactions.


sentry bot commented Oct 30, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: ami/ml/models/pipeline.py

  • process_images: HTTPError: b'Internal Server Error' process_pipel... (Event Count: 16)
  • process_images: Exception: No processing services are online for the pipeline "World moths". ... (Event Count: 8)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ami/utils/tests.py (1)

39-60: Solid test coverage for the error extraction helper.

The test method covers the three primary scenarios: standard detail field extraction, fallback to non-standard fields, and fallback to text when JSON parsing fails. The use of Mock(spec=requests.Response) is appropriate.

Consider adding a fourth test case to exercise the raw bytes fallback at line 80 of extract_error_message_from_response (when both JSON parsing and resp.text fail):

# Test fallback to raw bytes when text access fails
mock_response.json.side_effect = ValueError("No JSON")
mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
mock_response.content = b"Raw error bytes"
result = extract_error_message_from_response(mock_response)
self.assertIn("Response content: b'Raw error bytes'", result)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b63064 and 58a7cd5.

📒 Files selected for processing (4)
  • ami/ml/models/pipeline.py (3 hunks)
  • ami/ml/models/processing_service.py (4 hunks)
  • ami/utils/requests.py (1 hunks)
  • ami/utils/tests.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
ami/ml/models/pipeline.py (2)
ami/utils/requests.py (2)
  • create_session (14-41)
  • extract_error_message_from_response (44-82)
ami/jobs/models.py (1)
  • logger (989-998)
ami/ml/models/processing_service.py (2)
ami/utils/requests.py (1)
  • create_session (14-41)
ami/ml/schemas.py (1)
  • ProcessingServiceStatusResponse (245-257)
ami/utils/tests.py (1)
ami/utils/requests.py (1)
  • extract_error_message_from_response (44-82)
🪛 Ruff (0.14.2)
ami/utils/requests.py

78-78: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Redirect rules
  • GitHub Check: Header rules
  • GitHub Check: Pages changed
  • GitHub Check: test
🔇 Additional comments (4)
ami/utils/requests.py (1)

44-82: LGTM! Solid error extraction with appropriate fallbacks.

The function correctly prioritizes the FastAPI "detail" field, falls back gracefully through multiple alternatives, and limits output to 500 characters to prevent log overflow. The broad Exception catch at line 78 (flagged by static analysis) is acceptable here as a final fallback when extracting resp.text fails—this ensures the function always returns a usable error message even in edge cases like encoding issues.

ami/ml/models/processing_service.py (1)

141-211: Well-designed retry mechanism for serverless cold starts.

The retry configuration (3 retries, 2s backoff, 90s timeout) appropriately handles transient failures and cold-start delays. The increased timeout from 6s to 90s makes sense for services loading multiple models into memory. The assertion at line 194 is safe since it only executes when last_checked_live=True (request succeeded).

ami/ml/models/pipeline.py (2)

245-251: Improved error handling with centralized message extraction.

The use of extract_error_message_from_response(resp) provides consistent, detailed error messages. The "Processing service request failed: " prefix makes it clear where errors originate in logs.


1041-1088: Cache staleness check is missing but acceptable given refresh interval—implement TODO in future.

The periodic task check_processing_services_online() runs every 5 minutes and refreshes the cached last_checked_live and last_checked_latency fields for all services. The method correctly uses these cached values to avoid redundant health checks.

However, the TODO at line 1044 identifies a gap: the method does not validate the max age of cached data before selecting a service. Currently, last_checked timestamps are recorded but never validated. A 5-minute maximum staleness is acceptable for typical workloads, and the retry mechanism in processing_service.py would handle transient failures from stale service selection.

Implementing the max age check is a reasonable future improvement but is not critical for this change. Document or track this as a follow-up.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ami/ml/models/pipeline.py (1)

1063-1090: Fix UnboundLocalError when all services are online but have no latency data.

If all processing services have last_checked_live=True but none have a valid last_checked_latency value, the variable processing_service_lowest_latency will never be assigned. This causes an UnboundLocalError when the code tries to log and return it on lines 1086-1090.

Apply this diff to ensure a service is always selected when services are online:

         # check the status of all processing services and pick the one with the lowest latency
         lowest_latency = float("inf")
         processing_services_online = False
+        processing_service_lowest_latency = None
 
         for processing_service in processing_services:
             if processing_service.last_checked_live:
                 processing_services_online = True
                 if (
                     processing_service.last_checked_latency
                     and processing_service.last_checked_latency < lowest_latency
                 ):
                     lowest_latency = processing_service.last_checked_latency
                     # pick the processing service that has lowest latency
                     processing_service_lowest_latency = processing_service
+                elif processing_service_lowest_latency is None:
+                    # Fallback: pick the first online service if no latency data available
+                    processing_service_lowest_latency = processing_service
 
         # if all offline then throw error
         if not processing_services_online:
             msg = f'No processing services are online for the pipeline "{pipeline_name}".'
             task_logger.error(msg)
 
             raise Exception(msg)
         else:
+            assert processing_service_lowest_latency is not None, "No service selected despite being online"
             task_logger.info(
                 f"Using processing service with latency {round(lowest_latency, 4)}: "
                 f"{processing_service_lowest_latency}"
             )
 
             return processing_service_lowest_latency
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58a7cd5 and 0433918.

📒 Files selected for processing (4)
  • ami/ml/models/pipeline.py (3 hunks)
  • ami/ml/schemas.py (1 hunks)
  • ami/ml/tests.py (1 hunks)
  • ami/utils/tests.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
ami/ml/schemas.py (2)
ui/src/data-services/models/occurrence-details.ts (1)
  • detections (108-110)
ui/src/data-services/models/job.ts (1)
  • pipeline (109-111)
ami/ml/models/pipeline.py (2)
ami/utils/requests.py (2)
  • create_session (14-41)
  • extract_error_message_from_response (44-82)
ami/ml/schemas.py (1)
  • summary (179-196)
ami/utils/tests.py (1)
ami/utils/requests.py (1)
  • extract_error_message_from_response (44-82)
ami/ml/tests.py (3)
ami/jobs/models.py (2)
  • Job (719-1004)
  • save (939-950)
ami/tests/fixtures/main.py (3)
  • setup_test_project (114-131)
  • create_captures_from_files (171-203)
  • create_processing_service (42-71)
ami/ml/models/pipeline.py (3)
  • save (1116-1122)
  • process_images (163-278)
  • process_images (1092-1106)
🪛 Ruff (0.14.2)
ami/utils/tests.py

64-64: Unused lambda argument: self

(ARG005)

ami/ml/tests.py

137-137: Probable insecure usage of temporary file or directory: "/tmp/nonexistent_image.jpg"

(S108)


139-139: Consider [error_image, *test_images[1:2]] instead of concatenation

(RUF005)


144-145: try-except-pass detected, consider logging the exception

(S110)


144-144: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (5)
ami/ml/schemas.py (1)

179-196: LGTM!

The summary() method provides clear, human-friendly request summaries with proper null handling and pluralization. This will improve error messages throughout the pipeline.

ami/utils/tests.py (1)

39-61: LGTM!

The test cases correctly validate the error message extraction logic for JSON detail fields, non-standard fields, and text fallback scenarios.

ami/ml/tests.py (1)

120-152: LGTM!

The test correctly validates that processing service errors are captured in job logs with the expected "Failed to process" message format. The use of /tmp/nonexistent_image.jpg and bare exception handling are appropriate for testing error scenarios.

ami/ml/models/pipeline.py (2)

54-54: LGTM!

Good addition of the centralized error message extractor.


244-247: LGTM!

The improved error handling provides clear, actionable error messages by combining the request summary with detailed error information from the response.

Comment on lines +62 to +67
# Test fallback to raw bytes when text access fails
mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
mock_response.content = b"Raw error bytes"
result = extract_error_message_from_response(mock_response)
self.assertIn("Response content: b'Raw error bytes'", result)


⚠️ Potential issue | 🟠 Major

Fix the mock property simulation for text access failure.

The current implementation assigns a property object directly to mock_response.text, which won't cause an exception when the attribute is accessed. The Mock framework doesn't interpret this as a property descriptor.

Apply this diff to properly simulate the text property raising an exception:

-        # Test fallback to raw bytes when text access fails
-        mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
-        mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
-        mock_response.content = b"Raw error bytes"
-        result = extract_error_message_from_response(mock_response)
-        self.assertIn("Response content: b'Raw error bytes'", result)
+        # Test fallback to raw bytes when text access fails
+        mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
+        type(mock_response).text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
+        mock_response.content = b"Raw error bytes"
+        result = extract_error_message_from_response(mock_response)
+        self.assertIn("Response content: b'Raw error bytes'", result)

Alternatively, use PropertyMock:

+        from unittest.mock import PropertyMock
+        
         # Test fallback to raw bytes when text access fails
         mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
-        mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
+        type(mock_response).text = PropertyMock(side_effect=Exception("text error"))
         mock_response.content = b"Raw error bytes"
         result = extract_error_message_from_response(mock_response)
         self.assertIn("Response content: b'Raw error bytes'", result)
🧰 Tools
🪛 Ruff (0.14.2)

64-64: Unused lambda argument: self

(ARG005)

🤖 Prompt for AI Agents
In ami/utils/tests.py around lines 62 to 67, the test attempts to simulate a
failing .text attribute by assigning a property object directly to
mock_response.text, which doesn't raise on attribute access; replace that with a
proper PropertyMock (or use patch.object on the mock's class) that has
side_effect=Exception("text error") so accessing mock_response.text raises the
exception and the fallback to .content is exercised.

@mihow mihow merged commit a6044f7 into main Oct 30, 2025
7 checks passed
@mihow mihow deleted the feat/processing-service-retries branch October 30, 2025 21:34