
Conversation


@mihow mihow commented Oct 30, 2025

Summary

This PR adds retry logic and improved error handling for processing service health checks and ML pipeline requests. These are safe, non-breaking changes extracted from PR #981 to improve reliability when communicating with external processing services.

Details

Offline processing services

Currently the health check for a processing service fails very easily: a service goes to offline status as soon as it is added, and again the first time you process images after a period of inactivity. This makes most jobs fail on the first attempt, and periodically during the job.

[screenshot]

Now the health check handles cold starts better and retries on failure. It also no longer checks health before every image batch is sent; instead it relies on the periodic check that runs in the background.

Error messages from the processing service in the job logs

Currently, if a single image or image batch fails in an unhandled way on the processing service, only a generic, uninformative message is displayed in Antenna.

[screenshot]

Now it's a little better:

Failed to process pipeline request with 2 images and 0 detections to pipeline 'constant': HTTP 500: Internal Server Error | Response text: Internal Server Error

The message is still limited because, in the deep context where the exception is raised, we don't know which job, which batch number, or what error actually happened on the processing service side. But it's a little better. The next step is to update the processing services to catch deeper errors in PyTorch / the ML processing and translate them into more useful error messages in their API responses, but this will depend on the maintainers of each processing service.
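
As an illustration of what that could look like on the service side, a FastAPI handler might translate deep ML errors into the standard detail field that Antenna now extracts. This is a hypothetical sketch, not code from any existing processing service; run_pipeline is a placeholder, and torch.cuda.OutOfMemoryError requires a reasonably recent PyTorch:

# Hypothetical sketch: a processing-service endpoint that converts low-level
# ML errors into structured FastAPI error responses with a "detail" field.
import torch
from fastapi import FastAPI, HTTPException

app = FastAPI()


def run_pipeline(request: dict) -> dict:
    # Placeholder for the service's actual ML processing
    raise RuntimeError("model not loaded")


@app.post("/process")
def process(request: dict):
    try:
        return run_pipeline(request)
    except torch.cuda.OutOfMemoryError as e:
        # Surface GPU OOM clearly instead of a bare 500
        raise HTTPException(status_code=503, detail=f"CUDA out of memory: {e}")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"{type(e).__name__}: {e}")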

Motivation

External processing services (especially serverless ones) can experience:

  • Cold starts requiring 30-90s to load models into memory
  • Transient network failures causing temporary connection errors
  • Unclear error messages making debugging difficult

This PR addresses these issues with automatic retries and better error reporting.


Changes

1. Retry Mechanism for Health Checks (5c635c4)

File: ami/ml/models/processing_service.py

  • Add urllib3 Retry with exponential backoff to ProcessingService.get_status()
    • 3 retries with 2s backoff factor (delays: 0s, 2s, 4s)
    • Retries on connection errors and status codes: 500, 502, 503, 504
  • Increase timeout from 6s to 90s to accommodate serverless cold starts
  • Prevents spurious "service offline" errors from transient failures

Benefits:

  • Services that are temporarily unreachable are retried automatically
  • Cold starts don't immediately mark services as failed
  • More reliable health checking with minimal code change (see the sketch below)
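
A minimal sketch of this retry setup, assuming a plain requests session; the actual create_session helper in ami/utils/requests.py may differ in names and defaults, and the URL below is illustrative:

# Sketch only: retry-enabled session for health checks, with exponential
# backoff on connection errors and 500/502/503/504 responses.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_retry_session(retries: int = 3, backoff_factor: float = 2.0) -> requests.Session:
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


if __name__ == "__main__":
    # Long timeout so serverless cold starts are not marked offline
    session = create_retry_session()
    resp = session.get("https://processing-service.example.com/readyz", timeout=90)
    resp.raise_for_status()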

2. Use Cached Status Before Sending Images (fb6fdc6)

File: ami/ml/models/pipeline.py

  • Use cached last_checked_live and last_checked_latency fields instead of calling get_status() every time
  • Reduces unnecessary health checks when processing batches of images
  • Still checks all services, just uses the most recent cached status

Benefits:

  • Faster pipeline selection (no redundant health checks)
  • Lower load on processing services
  • Relies on periodic health checks (already running) for status updates (see the selection sketch below)
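
A rough sketch of the selection logic this enables; the attribute names follow the description above, but the real implementation in ami/ml/models/pipeline.py differs in structure and error types:

# Sketch only: pick the online service with the lowest cached latency,
# relying on the periodic background health check to keep the cached
# last_checked_live / last_checked_latency fields fresh.
def choose_processing_service(processing_services, pipeline_name: str):
    online = [s for s in processing_services if s.last_checked_live]
    if not online:
        raise RuntimeError(f'No processing services are online for the pipeline "{pipeline_name}".')
    with_latency = [s for s in online if s.last_checked_latency is not None]
    if with_latency:
        return min(with_latency, key=lambda s: s.last_checked_latency)
    # Fall back to any online service if no latency has been recorded yet
    return online[0]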

3. Better Error Messages from Processing Services (d3f5839)

File: ami/utils/requests.py

  • Add extract_error_message_from_response() utility function
  • Extracts detailed error info from FastAPI/HTTP responses
  • Prioritizes detail field (FastAPI standard), falls back to full JSON, text, or raw bytes
  • Limits output to 500 chars to avoid log spam (see the sketch below)
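
Roughly, the extraction order looks like the sketch below. The signature matches the one described in the review; the exact message format and field handling in ami/utils/requests.py may differ:

# Sketch only: prefer the FastAPI "detail" field, then the full JSON body,
# then the response text, then the raw bytes; cap the length for job logs.
import requests

MAX_ERROR_LENGTH = 500


def extract_error_message_from_response(resp: requests.Response) -> str:
    prefix = f"HTTP {resp.status_code}: {resp.reason}"
    try:
        data = resp.json()
        if isinstance(data, dict) and "detail" in data:
            body = str(data["detail"])
        elif isinstance(data, dict):
            body = ", ".join(f"{k}: {v}" for k, v in data.items())
        else:
            body = str(data)
        message = f"{prefix} | {body}"
    except ValueError:
        try:
            message = f"{prefix} | Response text: {resp.text}"
        except Exception:
            message = f"{prefix} | Response content: {resp.content!r}"
    return message[:MAX_ERROR_LENGTH]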

File: ami/ml/models/pipeline.py

  • Use new utility in process_images() to log clearer error messages
  • Prefixes errors with "Processing service request failed:" for clarity

File: ami/utils/tests.py

  • Add comprehensive test coverage for extract_error_message_from_response()

Benefits:

  • CUDA OOM errors, model loading failures, and other service errors are now clearly visible in job logs
  • Easier debugging for users and developers
  • Consistent error format across all processing service calls

Testing

These commits were cherry-picked from PR #981 where they have been tested. The changes are:

  • Non-breaking: Only adds retry logic and improves logging
  • Safe: Uses standard urllib3 Retry patterns
  • Backwards compatible: No API or schema changes

To test manually:

  1. Run a job with an external processing service
  2. Observe improved error messages if service fails
  3. Verify retries happen automatically on transient failures

Related

  • Extracted from PR #981 "Update status of disconnected and stale jobs" (feat/job-status-checks)
  • Part of broader effort to improve job reliability and monitoring
  • Follow-up PRs will add:
    • Periodic job status checking
    • Worker health monitoring
    • Docker healthchecks for Celery workers


netlify bot commented Oct 30, 2025

Deploy Preview for antenna-preview canceled.

  • 🔨 Latest commit: 0433918
  • 🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/6903d40905cb520008c86daf


coderabbitai bot commented Oct 30, 2025

Walkthrough

Processing pipeline error handling now uses a centralized extractor for HTTP responses; processing service readiness checks use a retry-enabled session with a 90s timeout; pipeline service selection uses tracked latency/live attributes to pick the lowest-latency online service.

Changes

  • Error extraction helper & tests (ami/utils/requests.py, ami/utils/tests.py): Added extract_error_message_from_response(resp: requests.Response) -> str to build detailed error messages (HTTP status + FastAPI detail → kv pairs → text → raw content). Added unit test test_extract_error_message_from_response covering JSON detail, generic KV JSON, JSON parse failure, and text/content fallbacks.
  • Pipeline request summary (ami/ml/schemas.py): Added PipelineRequest.summary() instance method to return a human-friendly summary of the request (counts of source images and detections, and target pipeline).
  • Pipeline error handling & selection (ami/ml/models/pipeline.py): Replaced inline response parsing with extract_error_message_from_response; on non-OK responses, constructs a message using request_data.summary() and the extracted error string, then logs and raises HTTPError if there is no job context. Service selection now initializes lowest_latency = float("inf"), checks last_checked_live and last_checked_latency to choose the lowest-latency online service, and raises if none are online.
  • Processing service: retry session & timeout (ami/ml/models/processing_service.py): Switched the readiness/status check to use a create_session (retry-enabled) and session.get rather than raw requests.get. Increased the default get_status timeout from 6 to 90 seconds and added the return type ProcessingServiceStatusResponse. Added a docstring and an assertion ensuring the response exists before parsing.
  • Pipeline tests: error propagation (ami/ml/tests.py): Added test_run_pipeline_with_errors_from_processing_service, which simulates a missing image to trigger processing-service error handling and asserts that the job logs contain "Failed to process".

Sequence Diagram(s)

sequenceDiagram
    participant Pipeline as Pipeline.process_images
    participant Service as ProcessingService
    participant Session as HTTP Session (retry)
    participant Extract as extract_error_message_from_response
    rect rgb(240,248,255)
    Note over Pipeline,Service: Service selection (uses last_checked_live & last_checked_latency)
    Pipeline->>Service: request status (get_status via session)
    Service->>Session: GET /status (retry-enabled, timeout=90s)
    Session-->>Service: status response
    end
sequenceDiagram
    participant Pipeline as Pipeline.process_images
    participant Session as HTTP Session (retry)
    participant Service as Processing endpoint
    participant Extract as extract_error_message_from_response
    rect rgb(245,245,220)
    Pipeline->>Session: POST /process (with request_data)
    Session->>Service: HTTP request
    alt 200 OK
        Service-->>Session: 200 response
        Session-->>Pipeline: success payload
    else non-OK
        Service-->>Session: non-OK response
        Session-->>Pipeline: response
        Pipeline->>Extract: extract_error_message_from_response(resp)
        Extract-->>Pipeline: error_msg (status | detail / kv / text / content)
        Pipeline->>Pipeline: msg = f"Failed to process {request_data.summary()}: {error_msg}"
        Pipeline-->>Pipeline: log and raise HTTPError (if no job context) / attach to job.logs
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring attention:

  • Correctness and edge cases of extract_error_message_from_response (JSON shapes, binary content limits).
  • Compatibility of increasing get_status timeout to 90s and retry configuration with deployment expectations.
  • Proper maintenance and freshness of last_checked_live and last_checked_latency used by service selection.
  • Tests: ensure the new tests reliably simulate the various response branches and do not flake due to mocked behavior.

Poem

🐰 I sniffed the logs and found a clue,
A central extractor to parse what's true.
Sessions retry while services wake,
Latency guides the route we take.
Hops are fewer — time for carrot cake! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 75.00%, which is insufficient. The required threshold is 80.00%. Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check (⚠️ Warning): The pull request description provides excellent detail, motivation, and context about the changes, including clear explanations of the problems being solved, benefits, and testing approach. However, it significantly deviates from the required template structure. The description is missing several key sections including a structured "List of Changes" bullet-point list, a properly formatted "Related Issues" section with standard issue linking (such as "Relates to #981"), a "Deployment Notes" section, and the verification checklist. While the content quality is high and most information is present, the organization does not match the template's required structure, section headings, and format requirements. Resolution: The PR description should be restructured to match the template format. Please add a "List of Changes" section with bullet points (e.g., "* Added retry mechanism to ProcessingService.get_status()"), add a "Related Issues" section that uses the standard linking format (e.g., "Relates to #981"), include a "Deployment Notes" section (even if it states "No special deployment steps required"), and complete the verification checklist at the bottom. The detailed technical content can be incorporated into the "Detailed Description" section or maintained as "Changes" if you restructure the overall layout to follow the template's required sections first.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The PR title "Add retry mechanism and improved error handling for processing services" directly captures the two primary objectives of this changeset: implementing retry logic with exponential backoff for health checks and enhancing error message extraction and logging from processing services. The title is concise, specific, and clearly summarizes the main changes without vague terminology or unnecessary details. A developer scanning the commit history would immediately understand that this PR introduces reliability improvements for external processing service interactions.


sentry bot commented Oct 30, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: ami/ml/models/pipeline.py

  • process_images: HTTPError: b'Internal Server Error' process_pipel... (Event Count: 16)
  • process_images: Exception: No processing services are online for the pipeline "World moths". ... (Event Count: 8)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ami/utils/tests.py (1)

39-60: Solid test coverage for the error extraction helper.

The test method covers the three primary scenarios: standard detail field extraction, fallback to non-standard fields, and fallback to text when JSON parsing fails. The use of Mock(spec=requests.Response) is appropriate.

Consider adding a fourth test case to exercise the raw bytes fallback at line 80 of extract_error_message_from_response (when both JSON parsing and resp.text fail):

# Test fallback to raw bytes when text access fails
mock_response.json.side_effect = ValueError("No JSON")
mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
mock_response.content = b"Raw error bytes"
result = extract_error_message_from_response(mock_response)
self.assertIn("Response content: b'Raw error bytes'", result)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b63064 and 58a7cd5.

📒 Files selected for processing (4)
  • ami/ml/models/pipeline.py (3 hunks)
  • ami/ml/models/processing_service.py (4 hunks)
  • ami/utils/requests.py (1 hunks)
  • ami/utils/tests.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
ami/ml/models/pipeline.py (2)
ami/utils/requests.py (2)
  • create_session (14-41)
  • extract_error_message_from_response (44-82)
ami/jobs/models.py (1)
  • logger (989-998)
ami/ml/models/processing_service.py (2)
ami/utils/requests.py (1)
  • create_session (14-41)
ami/ml/schemas.py (1)
  • ProcessingServiceStatusResponse (245-257)
ami/utils/tests.py (1)
ami/utils/requests.py (1)
  • extract_error_message_from_response (44-82)
🪛 Ruff (0.14.2)
ami/utils/requests.py

78-78: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Redirect rules
  • GitHub Check: Header rules
  • GitHub Check: Pages changed
  • GitHub Check: test
🔇 Additional comments (4)
ami/utils/requests.py (1)

44-82: LGTM! Solid error extraction with appropriate fallbacks.

The function correctly prioritizes the FastAPI "detail" field, falls back gracefully through multiple alternatives, and limits output to 500 characters to prevent log overflow. The broad Exception catch at line 78 (flagged by static analysis) is acceptable here as a final fallback when extracting resp.text fails—this ensures the function always returns a usable error message even in edge cases like encoding issues.

ami/ml/models/processing_service.py (1)

141-211: Well-designed retry mechanism for serverless cold starts.

The retry configuration (3 retries, 2s backoff, 90s timeout) appropriately handles transient failures and cold-start delays. The increased timeout from 6s to 90s makes sense for services loading multiple models into memory. The assertion at line 194 is safe since it only executes when last_checked_live=True (request succeeded).

ami/ml/models/pipeline.py (2)

245-251: Improved error handling with centralized message extraction.

The use of extract_error_message_from_response(resp) provides consistent, detailed error messages. The "Processing service request failed: " prefix makes it clear where errors originate in logs.


1041-1088: Cache staleness check is missing but acceptable given refresh interval—implement TODO in future.

The periodic task check_processing_services_online() runs every 5 minutes and refreshes the cached last_checked_live and last_checked_latency fields for all services. The method correctly uses these cached values to avoid redundant health checks.

However, the TODO at line 1044 identifies a gap: the method does not validate the max age of cached data before selecting a service. Currently, last_checked timestamps are recorded but never validated. A 5-minute maximum staleness is acceptable for typical workloads, and the retry mechanism in processing_service.py would handle transient failures from stale service selection.

Implementing the max age check is a reasonable future improvement but is not critical for this change. Document or track this as a follow-up.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ami/ml/models/pipeline.py (1)

1063-1090: Fix UnboundLocalError when all services are online but have no latency data.

If all processing services have last_checked_live=True but none have a valid last_checked_latency value, the variable processing_service_lowest_latency will never be assigned. This causes an UnboundLocalError when the code tries to log and return it on lines 1086-1090.

Apply this diff to ensure a service is always selected when services are online:

         # check the status of all processing services and pick the one with the lowest latency
         lowest_latency = float("inf")
         processing_services_online = False
+        processing_service_lowest_latency = None
 
         for processing_service in processing_services:
             if processing_service.last_checked_live:
                 processing_services_online = True
                 if (
                     processing_service.last_checked_latency
                     and processing_service.last_checked_latency < lowest_latency
                 ):
                     lowest_latency = processing_service.last_checked_latency
                     # pick the processing service that has lowest latency
                     processing_service_lowest_latency = processing_service
+                elif processing_service_lowest_latency is None:
+                    # Fallback: pick the first online service if no latency data available
+                    processing_service_lowest_latency = processing_service
 
         # if all offline then throw error
         if not processing_services_online:
             msg = f'No processing services are online for the pipeline "{pipeline_name}".'
             task_logger.error(msg)
 
             raise Exception(msg)
         else:
+            assert processing_service_lowest_latency is not None, "No service selected despite being online"
             task_logger.info(
                 f"Using processing service with latency {round(lowest_latency, 4)}: "
                 f"{processing_service_lowest_latency}"
             )
 
             return processing_service_lowest_latency
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58a7cd5 and 0433918.

📒 Files selected for processing (4)
  • ami/ml/models/pipeline.py (3 hunks)
  • ami/ml/schemas.py (1 hunks)
  • ami/ml/tests.py (1 hunks)
  • ami/utils/tests.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
ami/ml/schemas.py (2)
ui/src/data-services/models/occurrence-details.ts (1)
  • detections (108-110)
ui/src/data-services/models/job.ts (1)
  • pipeline (109-111)
ami/ml/models/pipeline.py (2)
ami/utils/requests.py (2)
  • create_session (14-41)
  • extract_error_message_from_response (44-82)
ami/ml/schemas.py (1)
  • summary (179-196)
ami/utils/tests.py (1)
ami/utils/requests.py (1)
  • extract_error_message_from_response (44-82)
ami/ml/tests.py (3)
ami/jobs/models.py (2)
  • Job (719-1004)
  • save (939-950)
ami/tests/fixtures/main.py (3)
  • setup_test_project (114-131)
  • create_captures_from_files (171-203)
  • create_processing_service (42-71)
ami/ml/models/pipeline.py (3)
  • save (1116-1122)
  • process_images (163-278)
  • process_images (1092-1106)
🪛 Ruff (0.14.2)
ami/utils/tests.py

64-64: Unused lambda argument: self

(ARG005)

ami/ml/tests.py

137-137: Probable insecure usage of temporary file or directory: "/tmp/nonexistent_image.jpg"

(S108)


139-139: Consider [error_image, *test_images[1:2]] instead of concatenation

(RUF005)


144-145: try-except-pass detected, consider logging the exception

(S110)


144-144: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (5)
ami/ml/schemas.py (1)

179-196: LGTM!

The summary() method provides clear, human-friendly request summaries with proper null handling and pluralization. This will improve error messages throughout the pipeline.

ami/utils/tests.py (1)

39-61: LGTM!

The test cases correctly validate the error message extraction logic for JSON detail fields, non-standard fields, and text fallback scenarios.

ami/ml/tests.py (1)

120-152: LGTM!

The test correctly validates that processing service errors are captured in job logs with the expected "Failed to process" message format. The use of /tmp/nonexistent_image.jpg and bare exception handling are appropriate for testing error scenarios.

ami/ml/models/pipeline.py (2)

54-54: LGTM!

Good addition of the centralized error message extractor.


244-247: LGTM!

The improved error handling provides clear, actionable error messages by combining the request summary with detailed error information from the response.

Comment on lines +62 to +67
# Test fallback to raw bytes when text access fails
mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
mock_response.content = b"Raw error bytes"
result = extract_error_message_from_response(mock_response)
self.assertIn("Response content: b'Raw error bytes'", result)


⚠️ Potential issue | 🟠 Major

Fix the mock property simulation for text access failure.

The current implementation assigns a property object directly to mock_response.text, which won't cause an exception when the attribute is accessed. The Mock framework doesn't interpret this as a property descriptor.

Apply this diff to properly simulate the text property raising an exception:

-        # Test fallback to raw bytes when text access fails
-        mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
-        mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
-        mock_response.content = b"Raw error bytes"
-        result = extract_error_message_from_response(mock_response)
-        self.assertIn("Response content: b'Raw error bytes'", result)
+        # Test fallback to raw bytes when text access fails
+        mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
+        type(mock_response).text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
+        mock_response.content = b"Raw error bytes"
+        result = extract_error_message_from_response(mock_response)
+        self.assertIn("Response content: b'Raw error bytes'", result)

Alternatively, use PropertyMock:

+        from unittest.mock import PropertyMock
+        
         # Test fallback to raw bytes when text access fails
         mock_response.json.side_effect = ValueError("404 Not Found: Could not fetch image")
-        mock_response.text = property(lambda self: (_ for _ in ()).throw(Exception("text error")))
+        type(mock_response).text = PropertyMock(side_effect=Exception("text error"))
         mock_response.content = b"Raw error bytes"
         result = extract_error_message_from_response(mock_response)
         self.assertIn("Response content: b'Raw error bytes'", result)
🧰 Tools
🪛 Ruff (0.14.2)

64-64: Unused lambda argument: self

(ARG005)

🤖 Prompt for AI Agents
In ami/utils/tests.py around lines 62 to 67, the test attempts to simulate a
failing .text attribute by assigning a property object directly to
mock_response.text, which doesn't raise on attribute access; replace that with a
proper PropertyMock (or use patch.object on the mock's class) that has
side_effect=Exception("text error") so accessing mock_response.text raises the
exception and the fallback to .content is exercised.

@mihow mihow merged commit a6044f7 into main Oct 30, 2025
7 checks passed
@mihow mihow deleted the feat/processing-service-retries branch October 30, 2025 21:34