-
Notifications
You must be signed in to change notification settings - Fork 41
Add graph-level webhooks for execution failure and completion #607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add graph-level webhooks for execution failure and completion #607
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughSummary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings. WalkthroughAdds webhook support: new WebhookConfig model, optional Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Controller as errored_state\n(Controller)
participant DB as Database\n(GraphTemplate)
participant BG as Background\nTasks
participant Webhook as Webhook\nEndpoint
Client->>Controller: POST error notification
Controller->>DB: Load GraphTemplate (check `webhook`)
DB-->>Controller: GraphTemplate with/without webhook
alt webhook configured AND "GRAPH_FAILED" in events AND no retry created
Controller->>BG: Schedule dispatch_webhook(payload, headers)
Controller-->>Client: Return ErroredResponseModel
BG->>Webhook: POST JSON payload (5s timeout)
alt 2xx
Webhook-->>BG: Success
Note over BG: Delivery logged
else error/timeout
Webhook-->>BG: Error
Note over BG: Warning logged, error swallowed
end
else otherwise
Controller-->>Client: Return ErroredResponseModel (no webhook)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧬 Code graph analysis (1)state-manager/app/controller/executed_state.py (4)
🔇 Additional comments (7)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Brijesh-Thakkar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the observability and integration capabilities of the system by introducing graph-level webhooks. Graph templates can now be configured to send notifications to external systems upon successful execution or failure, providing real-time updates on their lifecycle. The implementation ensures that these notifications are dispatched asynchronously, maintaining system responsiveness, and includes a more robust approach to cleaning up old trigger data. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable feature for observability by adding support for graph-level webhooks, specifically for graph failure events. The implementation is mostly solid, with a robust background task for dispatching webhooks and a significant improvement in how old CRON triggers are handled (moving from deletion to a TTL-based cancellation). However, there are a couple of critical issues. The new webhook configuration isn't being saved to the database when a graph template is created or updated, which prevents the feature from working as intended. Also, the handling of BackgroundTasks in errored_state could lead to blocking calls. I've added specific comments with suggestions to fix these. Additionally, the PR description mentions a webhook for successful graph execution (STATE_EXECUTED), but this does not appear to be implemented yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
state-manager/app/tasks/trigger_cron.py (1)
86-95: Clarify the comment about enum serialization and address code inconsistency.Line 93 uses
TriggerStatusEnum.PENDINGdirectly (without.value) with a comment claiming "insert() converts." However, other functions in this file consistently use.valuewhen updatingtrigger_status(lines 24, 46, 62, 112).The comment is technically correct but misleading: When an enum class inherits from str, Pydantic automatically serializes enum values as strings. Since Beanie is built on Pydantic, Beanie Documents are abstractions over Pydantic BaseModel that allow working with Python objects at the application level and JSON objects at the database level. The serialization is handled by Pydantic, not explicitly by
insert().For consistency, either use
.valueexplicitly throughout the file (as in other methods) or replace the comment with a clearer explanation: "OK because TriggerStatusEnum inherits from str and Pydantic serializes it automatically."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (10)
state-manager/app/config/settings.py(1 hunks)state-manager/app/controller/errored_state.py(2 hunks)state-manager/app/controller/executed_state.py(1 hunks)state-manager/app/controller/upsert_graph_template.py(3 hunks)state-manager/app/models/db/graph_template_model.py(3 hunks)state-manager/app/models/db/trigger.py(1 hunks)state-manager/app/models/webhook_config_model.py(1 hunks)state-manager/app/tasks/init_tasks.py(1 hunks)state-manager/app/tasks/trigger_cron.py(5 hunks)state-manager/app/tasks/webhook.py(1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-22T14:23:18.774Z
Learnt from: NiveditJain
Repo: exospherehost/exospherehost PR: 464
File: state-manager/app/tasks/init_tasks.py:16-20
Timestamp: 2025-10-22T14:23:18.774Z
Learning: In state-manager/app/tasks/init_tasks.py, the init_tasks() function intentionally uses asyncio.gather with a list pattern (even with a single task) to provide an extensible structure for easily adding future startup tasks without refactoring.
Applied to files:
state-manager/app/tasks/init_tasks.py
📚 Learning: 2025-09-28T13:35:42.862Z
Learnt from: NiveditJain
Repo: exospherehost/exospherehost PR: 428
File: state-manager/app/tasks/verify_graph.py:4-5
Timestamp: 2025-09-28T13:35:42.862Z
Learning: In the cron trigger scheduling logic for state-manager/app/tasks/verify_graph.py, the system intentionally schedules at least one trigger beyond the trigger_ahead_time window to ensure continuity of scheduled executions. The current logic of appending an event then breaking is by design to guarantee "at least one next" trigger.
Applied to files:
state-manager/app/tasks/trigger_cron.py
📚 Learning: 2025-08-03T16:46:04.030Z
Learnt from: NiveditJain
Repo: exospherehost/exospherehost PR: 132
File: state-manager/app/controller/upsert_graph_template.py:11-27
Timestamp: 2025-08-03T16:46:04.030Z
Learning: In the exospherehost codebase, for upsert operations on graph templates, the team prioritizes API idempotency over avoiding race conditions in the database layer implementation. The approach of separate find and insert/update operations is acceptable when the API behavior remains consistent.
Applied to files:
state-manager/app/controller/upsert_graph_template.py
📚 Learning: 2025-08-06T14:04:44.515Z
Learnt from: NiveditJain
Repo: exospherehost/exospherehost PR: 158
File: state-manager/app/controller/upsert_graph_template.py:49-49
Timestamp: 2025-08-06T14:04:44.515Z
Learning: In the exospherehost codebase, the team prefers to run graph template verification on every upsert operation rather than optimizing by skipping verification when templates haven't changed, prioritizing comprehensive validation over performance optimization.
Applied to files:
state-manager/app/controller/upsert_graph_template.py
🧬 Code graph analysis (6)
state-manager/app/models/db/graph_template_model.py (1)
state-manager/app/models/webhook_config_model.py (1)
WebhookConfig(5-11)
state-manager/app/tasks/init_tasks.py (4)
state-manager/app/config/settings.py (1)
get_settings(34-38)state-manager/app/models/db/trigger.py (1)
DatabaseTriggers(9-53)state-manager/app/models/trigger_models.py (1)
TriggerStatusEnum(9-14)state-manager/app/singletons/logs_manager.py (2)
LogsManager(9-66)get_logger(65-66)
state-manager/app/tasks/trigger_cron.py (2)
state-manager/app/models/db/trigger.py (1)
DatabaseTriggers(9-53)state-manager/app/models/trigger_models.py (1)
TriggerStatusEnum(9-14)
state-manager/app/controller/errored_state.py (3)
state-manager/app/tasks/webhook.py (1)
dispatch_webhook(10-32)state-manager/app/singletons/logs_manager.py (2)
LogsManager(9-66)get_logger(65-66)state-manager/app/models/errored_models.py (2)
ErroredRequestModel(5-6)ErroredResponseModel(9-11)
state-manager/app/models/db/trigger.py (1)
state-manager/app/models/trigger_models.py (1)
TriggerStatusEnum(9-14)
state-manager/app/controller/upsert_graph_template.py (5)
state-manager/app/models/db/graph_template_model.py (3)
GraphTemplate(19-349)set_secrets(271-273)get_secrets(275-278)state-manager/app/tasks/verify_graph.py (1)
verify_graph(134-166)state-manager/app/models/db/trigger.py (1)
DatabaseTriggers(9-53)state-manager/app/models/trigger_models.py (2)
TriggerStatusEnum(9-14)TriggerTypeEnum(6-7)state-manager/app/config/settings.py (1)
get_settings(34-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (11)
state-manager/app/models/db/trigger.py (1)
43-52: LGTM! TTL index properly extended for CANCELLED triggers.The addition of
CANCELLEDto the TTL cleanup policy aligns with the new soft-cancellation workflow introduced intrigger_cron.py. This enables automatic cleanup of cancelled triggers after the retention period expires.state-manager/app/models/db/graph_template_model.py (1)
17-17: LGTM! Webhook field properly added to GraphTemplate.The addition of the optional
webhookfield with theWebhookConfigtype enables graph-level webhook configuration as intended by the PR. The field is appropriately optional and well-documented.Also applies to: 29-29
state-manager/app/tasks/trigger_cron.py (1)
52-75: LGTM! Cancellation helpers properly implement soft-delete with TTL.The
mark_as_cancelledandcancel_triggerfunctions correctly implement the soft-cancellation workflow:
- Sets
trigger_statustoCANCELLEDusing.valuefor MongoDB compatibility- Sets
expires_atfor TTL-based automatic cleanup- Aligns with the extended TTL index in
trigger.pyThe wrapper pattern (
cancel_trigger→mark_as_cancelled) provides a clean public API for other modules to use.state-manager/app/tasks/init_tasks.py (2)
1-11: LGTM! Imports and logger setup are appropriate.The imports cover all necessary dependencies for datetime operations, settings access, database models, and logging. The module-level logger initialization follows the codebase pattern.
51-54: LGTM!The
asyncio.gatherpattern with a single task is intentionally maintained for extensibility, consistent with the codebase convention. Based on learnings, this structure allows easily adding future startup tasks without refactoring.state-manager/app/controller/upsert_graph_template.py (6)
1-14: LGTM!Imports are well-organized and include all necessary dependencies for the new functionality (datetime utilities for TTL calculation,
get_settingsfor runtime configuration).
17-32: LGTM!Loading settings at request time ensures runtime configuration changes (like
trigger_retention_hours) are picked up without requiring a restart. The function signature correctly includes the request ID for traceability.
34-72: LGTM!The upsert logic correctly handles both update and insert paths. Capturing
old_triggersbefore the update ensures proper cleanup of stale CRON triggers. Based on learnings, the separate find and insert/update approach is acceptable for maintaining API idempotency.
73-82: LGTM!The ValueError handling correctly logs the error with request context and returns an appropriate 400 status code for validation failures.
84-115: LGTM! Correct use of soft cancellation for PENDING triggers.The bulk update correctly targets only
PENDINGtriggers (those scheduled but not yet fired), which is appropriate when replacing triggers during upsert. This differs intentionally frominit_tasks.pywhich targetsTRIGGERED/FAILEDlegacy documents for cleanup.The namespace filter (
"namespace": namespace_name) ensures isolation between namespaces—good defensive practice.
117-141: LGTM!The response correctly masks secret values (returning
Truefor presence rather than actual values), which is a good security practice. The outer exception handler appropriately logs with context and re-raises to preserve the original traceback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces graph-level webhook support for notifying external systems about graph execution lifecycle events, specifically for failure scenarios. However, the implementation is incomplete as described below.
Key Changes
- Added WebhookConfig model and webhook field to GraphTemplate for storing webhook configurations
- Implemented GRAPH_FAILED webhook dispatch in errored_state.py when a graph fails with no retries remaining
- Refactored trigger management to use CANCELLED status instead of deleting triggers, enabling TTL-based cleanup
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| state-manager/app/tasks/webhook.py | New webhook dispatch function with best-effort delivery semantics |
| state-manager/app/models/webhook_config_model.py | New WebhookConfig model defining URL, events, and optional headers |
| state-manager/app/models/db/graph_template_model.py | Added optional webhook field to GraphTemplate; minor spacing issue |
| state-manager/app/controller/upsert_graph_template.py | Refactored to cancel old triggers instead of deleting; webhook not saved to database |
| state-manager/app/controller/errored_state.py | Added GRAPH_FAILED webhook dispatch; uses deprecated datetime method |
| state-manager/app/controller/executed_state.py | Code formatting improvements; missing STATE_EXECUTED webhook |
| state-manager/app/tasks/trigger_cron.py | Fixed enum value usage; added cancel helper functions |
| state-manager/app/tasks/init_tasks.py | Changed to mark old triggers as CANCELLED instead of deleting |
| state-manager/app/models/db/trigger.py | Updated TTL index to include CANCELLED status |
| state-manager/app/config/settings.py | Minor whitespace formatting |
Comments suppressed due to low confidence (1)
state-manager/app/tasks/webhook.py:2
- Import of 'datetime' is not used.
from datetime import datetime
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6f9c901 to
a82709e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (8)
state-manager/app/models/webhook_config_model.py (1)
5-11: Add validation for webhook configuration fields.The model lacks essential validation that could prevent runtime errors:
- URL validation: The
urlfield accepts any string. Use Pydantic'sHttpUrltype or add a validator to ensure it's a valid URL.- Event validation: The
eventslist accepts arbitrary strings. Consider using an enum or validator to restrict to known event types (e.g.,GRAPH_FAILED,STATE_EXECUTED).- Headers security: No documentation warns that
headersmay contain sensitive authentication tokens that could be logged or exposed.Based on past review comments, these validation gaps remain unresolved.
state-manager/app/models/db/graph_template_model.py (1)
323-323: Missing space after comma.The
find_onecall lacks a space after the comma between arguments, which is inconsistent with PEP 8 style guidelines.state-manager/app/tasks/webhook.py (3)
2-2: Remove unused import.The
datetimeimport is not used anywhere in this file.
7-7: Consider using LogsManager for consistency.This module uses
logging.getLogger(__name__)while other modules in the codebase (e.g.,errored_state.py,trigger_cron.py) useLogsManager().get_logger()for structured logging. Using LogsManager would ensure consistent log formatting and configuration across the codebase.Based on past review comments highlighting consistency concerns.
10-26: Consider making webhook timeout configurable and logging response status.Two recommendations:
Hardcoded timeout: The 5-second timeout may not suit all webhook endpoints. Consider accepting a configurable timeout parameter (with 5 seconds as default) or reading from application settings.
Missing response logging: The function doesn't log or check the HTTP response status code. Consider logging successful dispatches (at debug or info level) and non-2xx responses (at warning level) to improve observability and debugging.
Based on past review comments regarding timeout configurability and response status logging.
state-manager/app/controller/executed_state.py (1)
15-95: Missing STATE_EXECUTED webhook dispatch mentioned in PR objectives.The PR summary states: "Dispatches STATE_EXECUTED webhook on successful execution," but this file contains no webhook dispatch logic. The
errored_state.pycontroller implements GRAPH_FAILED webhook dispatch (checking webhook config, using BackgroundTasks, calling dispatch_webhook), butexecuted_statelacks equivalent functionality.To implement STATE_EXECUTED webhooks consistently with the existing GRAPH_FAILED implementation, you would need to:
- Fetch the graph template and check for webhook configuration
- Conditionally dispatch the webhook when "STATE_EXECUTED" is in the configured events
- Include relevant payload data (namespace, graph_name, run_id, state_id(s), outputs, timestamp)
Should this functionality be added, or was it intentionally deferred?
Based on past review comments and PR objectives.
state-manager/app/controller/errored_state.py (2)
17-19: BackgroundTasks instantiation will not execute tasks.Creating a new
BackgroundTasks()instance whenNoneis passed will not work as expected.BackgroundTasksmust be managed by FastAPI's dependency injection system to ensure tasks are actually executed after the response is sent. Ifbackground_tasksisNone, the webhook dispatch on lines 81-95 will be scheduled but never run.Consider one of these solutions:
Option 1: Make background_tasks required
-async def errored_state(namespace_name: str, state_id: PydanticObjectId, body: ErroredRequestModel, x_exosphere_request_id: str, background_tasks: BackgroundTasks | None = None,) -> ErroredResponseModel: - if background_tasks is None: - background_tasks = BackgroundTasks() +async def errored_state( + namespace_name: str, + state_id: PydanticObjectId, + body: ErroredRequestModel, + x_exosphere_request_id: str, + background_tasks: BackgroundTasks, +) -> ErroredResponseModel:Option 2: Skip webhook dispatch if background_tasks is None
if ( not retry_created and graph_template.webhook and "GRAPH_FAILED" in graph_template.webhook.events + and background_tasks is not None ):Based on past review comments highlighting that FastAPI-managed BackgroundTasks is required for task execution.
92-92: Replace deprecateddatetime.utcnow()with timezone-aware alternative.Line 92 uses
datetime.utcnow(), which is deprecated in Python 3.12+ in favor ofdatetime.now(timezone.utc). The timezone-aware version is more explicit and aligns with modern Python best practices and is used consistently elsewhere in the codebase (e.g.,trigger_cron.py,init_tasks.py,upsert_graph_template.py).Apply this diff:
+from datetime import datetime, timezone - "timestamp": datetime.utcnow().isoformat(), + "timestamp": datetime.now(timezone.utc).isoformat(),Based on past review comments regarding deprecated datetime usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
state-manager/app/controller/errored_state.py(2 hunks)state-manager/app/controller/executed_state.py(1 hunks)state-manager/app/models/db/graph_template_model.py(3 hunks)state-manager/app/models/webhook_config_model.py(1 hunks)state-manager/app/tasks/webhook.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
state-manager/app/models/db/graph_template_model.py (1)
state-manager/app/models/webhook_config_model.py (1)
WebhookConfig(5-11)
state-manager/app/controller/executed_state.py (5)
state-manager/app/models/executed_models.py (2)
ExecutedRequestModel(5-6)ExecutedResponseModel(9-10)state-manager/app/models/db/state.py (2)
State(13-106)insert_many(57-63)state-manager/app/models/state_status_enum.py (1)
StateStatusEnum(4-20)state-manager/app/singletons/logs_manager.py (2)
LogsManager(9-66)get_logger(65-66)state-manager/app/tasks/create_next_states.py (1)
create_next_states(89-250)
state-manager/app/controller/errored_state.py (2)
state-manager/app/tasks/webhook.py (1)
dispatch_webhook(10-32)state-manager/app/models/errored_models.py (2)
ErroredRequestModel(5-6)ErroredResponseModel(9-11)
This PR adds support for graph-level webhooks to notify external systems
about graph execution lifecycle events.
Key changes:
This improves observability and enables external orchestration.
Fixes #445