feat: add node-level timeouts to prevent stuck queued states #462
base: main
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds per-node and global node timeout settings, records queued_at/timeout_at when enqueuing, stores timeout_minutes on states and registered nodes, introduces a TIMEDOUT status, schedules a per-minute job to mark overdue QUEUED states as TIMEDOUT, and updates controllers, models, and tests accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor Scheduler
participant App as Application
participant Cron as check_node_timeout
participant Config as Settings
participant DB as State Collection
Note over Cron,Config: runs every 1 minute
Scheduler->>App: start & register job (check_node_timeout)
Scheduler-->>Cron: trigger
Cron->>Config: get_settings().node_timeout_minutes
Cron->>Cron: now_ms = time.time()*1000
Cron->>DB: update_many({status: QUEUED, timeout_at: {$lte: now_ms}} -> set status=TIMEDOUT, set error_message)
DB-->>Cron: {modified_count}
Cron-->>App: log results
```

```mermaid
sequenceDiagram
autonumber
participant Controller as enqueue_states.find_state
participant Config as Settings
participant DB as State Collection
Controller->>Controller: current_time_ms = time.time()*1000
Controller->>Config: get_settings().node_timeout_minutes
Controller->>DB: find_one_and_update(match CREATED & enqueue_after <= now, aggregation -> set status=QUEUED, queued_at=current_time_ms, timeout_at = ( $ifNull(state.timeout_minutes, global) * 60000 ) + current_time_ms)
DB-->>Controller: updated document or null
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
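To make the enqueue diagram concrete, here is a minimal sketch of the pipeline-style update it describes, assuming an async PyMongo/Motor collection; the function name and parameters are illustrative, not the PR's actual code:

```python
import time

from pymongo import ReturnDocument


async def enqueue_one_state(collection, namespace_name, nodes, global_timeout_minutes):
    # Capture the timestamp once so the filter and the update stay consistent.
    current_time_ms = int(time.time() * 1000)
    return await collection.find_one_and_update(
        {
            "namespace_name": namespace_name,
            "status": "CREATED",
            "node_name": {"$in": nodes},
            "enqueue_after": {"$lte": current_time_ms},
        },
        # An aggregation-pipeline update (MongoDB 4.2+) can read the state's own
        # timeout_minutes field, so no join against registered nodes is needed.
        [
            {
                "$set": {
                    "status": "QUEUED",
                    "queued_at": current_time_ms,
                    "timeout_at": {
                        "$add": [
                            current_time_ms,
                            {
                                "$multiply": [
                                    # Per-state timeout wins; otherwise the global setting.
                                    {"$ifNull": ["$timeout_minutes", global_timeout_minutes]},
                                    60000,  # minutes -> milliseconds
                                ]
                            },
                        ]
                    },
                }
            }
        ],
        return_document=ReturnDocument.AFTER,
    )
```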
Pre-merge checks: ❌ 1 warning · ✅ 4 passed
Summary of Changes

Hello @agam1092005, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses the issue of nodes getting indefinitely stuck in a QUEUED state.
Code Review
This pull request introduces a valuable feature to prevent nodes from being stuck in a 'QUEUED' state by implementing a timeout mechanism. The changes are well-structured, including a new configuration, a scheduled task for checking timeouts, updates to the database model with a necessary index, and a new 'TIMEDOUT' status. The accompanying unit tests are comprehensive and cover the new logic well.
My review includes a few suggestions to improve robustness and correctness:
- Handling potential configuration errors more gracefully.
- Ensuring consistency in timestamping.
- Enhancing error logging for better debuggability.
Overall, this is a solid implementation of an important feature.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
- state-manager/app/config/settings.py (2 hunks)
- state-manager/app/controller/enqueue_states.py (1 hunks)
- state-manager/app/main.py (2 hunks)
- state-manager/app/models/db/state.py (2 hunks)
- state-manager/app/models/state_status_enum.py (1 hunks)
- state-manager/app/tasks/check_node_timeout.py (1 hunks)
- state-manager/tests/unit/tasks/test_check_node_timeout.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
state-manager/app/controller/enqueue_states.py (1)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)

state-manager/app/tasks/check_node_timeout.py (4)
- state-manager/app/models/db/state.py: State (13-114)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
- state-manager/app/singletons/logs_manager.py: LogsManager (9-66), get_logger (65-66)
- state-manager/app/config/settings.py: get_settings (34-38)

state-manager/tests/unit/tasks/test_check_node_timeout.py (2)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
- state-manager/app/tasks/check_node_timeout.py: check_node_timeout (10-36)

state-manager/app/main.py (1)
- state-manager/app/tasks/check_node_timeout.py: check_node_timeout (10-36)
🔇 Additional comments (9)
state-manager/app/models/state_status_enum.py (1)

14-14: LGTM! The new TIMEDOUT status is correctly added to the StateStatusEnum and appropriately placed in the "Errored" section.

state-manager/app/controller/enqueue_states.py (1)

26-29: LGTM! The addition of the queued_at timestamp when transitioning to QUEUED status is correctly implemented and consistent with the timestamp format used elsewhere in the codebase (milliseconds since epoch).

state-manager/app/main.py (1)

80-88: LGTM! The scheduled job for timeout checking is correctly configured with appropriate parameters. Running every minute with max_instances=1 and coalesce=True ensures efficient and non-overlapping execution.

state-manager/app/models/db/state.py (2)

31-31: LGTM! The queued_at field is correctly defined as Optional[int] with an appropriate description, consistent with other timestamp fields in the model.

107-113: LGTM! The timeout_query_index on (status, queued_at) is well-designed to efficiently support the timeout detection query in check_node_timeout (a sketch of such an index follows this list).

state-manager/app/config/settings.py (1)

25-26: LGTM! The environment variable loading is correctly implemented with appropriate type conversion and default value.

state-manager/tests/unit/tasks/test_check_node_timeout.py (3)

9-40: LGTM! Comprehensive test covering the happy path, with appropriate assertions on the query structure, the update operation, and the modified count.

42-62: LGTM! Good coverage of the edge case where no states are timed out.

64-85: LGTM! Exception handling is properly tested, with verification of error logging.
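For reference, the kind of compound index praised above can be declared with PyMongo's IndexModel; this is a sketch, not the PR's exact model code:

```python
import pymongo

# Equality on status first, range on queued_at second, matching the query
# shape {"status": QUEUED, "queued_at": {"$lte": threshold}}.
timeout_query_index = pymongo.IndexModel(
    [("status", pymongo.ASCENDING), ("queued_at", pymongo.ASCENDING)],
    name="timeout_query_index",
)
```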
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- state-manager/app/config/settings.py (2 hunks)
- state-manager/app/controller/enqueue_states.py (1 hunks)
- state-manager/app/tasks/check_node_timeout.py (1 hunks)
- state-manager/tests/unit/tasks/test_check_node_timeout.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
state-manager/app/tasks/check_node_timeout.py (4)
- state-manager/app/models/db/state.py: State (13-114)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
- state-manager/app/singletons/logs_manager.py: LogsManager (9-66), get_logger (65-66)
- state-manager/app/config/settings.py: get_settings (34-38)

state-manager/tests/unit/tasks/test_check_node_timeout.py (2)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
- state-manager/app/tasks/check_node_timeout.py: check_node_timeout (10-36)

state-manager/app/controller/enqueue_states.py (2)
- state-manager/app/models/db/state.py: State (13-114)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
🔇 Additional comments (3)
state-manager/app/controller/enqueue_states.py (1)

16-30: LGTM! Timestamp consistency addressed. The change correctly captures the current timestamp once and reuses it for both the query filter and the update operation, eliminating the inconsistency flagged in the previous review. The logic for setting queued_at when transitioning to QUEUED is sound.

state-manager/app/tasks/check_node_timeout.py (1)

10-36: LGTM! Well-implemented timeout detection. The implementation correctly:
- Computes the timeout threshold using the configured node_timeout_minutes
- Queries for QUEUED states with a non-null queued_at that exceeds the threshold
- Updates timed-out states to TIMEDOUT status with a descriptive error message
- Logs the modified count for observability
- Handles exceptions with full stack traces (addresses previous review feedback)

state-manager/tests/unit/tasks/test_check_node_timeout.py (1)

7-114: LGTM! Comprehensive test coverage. The test suite thoroughly covers the timeout detection functionality:
- Verifies states are marked as TIMEDOUT with correct error messages
- Tests the zero-modification case
- Validates exception handling with proper logging
- Confirms threshold calculation accuracy (with a realistic mock timestamp, addressing previous feedback)
All tests use appropriate mocking and assertions.
Hi @NiveditJain, thanks for the feedback! I have a few questions about implementing node-level timeouts:
Let me know your preference and I'll proceed accordingly!
@agam1092005, please ensure that we do not do any merges/joins in the enqueue request API, as it is the most frequently used API; we should add timeouts to states at the point of creation.
@NiveditJain, added timeout_minutes to:
Performance optimized per your requirement:
Backward compatible:
Also fixed:
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (10)
- state-manager/app/config/settings.py (2 hunks)
- state-manager/app/controller/register_nodes.py (3 hunks)
- state-manager/app/controller/trigger_graph.py (3 hunks)
- state-manager/app/models/db/registered_node.py (2 hunks)
- state-manager/app/models/db/state.py (2 hunks)
- state-manager/app/models/register_nodes_request.py (1 hunks)
- state-manager/app/models/register_nodes_response.py (1 hunks)
- state-manager/app/tasks/check_node_timeout.py (1 hunks)
- state-manager/app/tasks/create_next_states.py (3 hunks)
- state-manager/tests/unit/tasks/test_check_node_timeout.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
state-manager/app/tasks/check_node_timeout.py (4)
- state-manager/app/models/db/state.py: State (13-116)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
- state-manager/app/singletons/logs_manager.py: LogsManager (9-66), get_logger (65-66)
- state-manager/app/config/settings.py: get_settings (34-38)

state-manager/app/controller/trigger_graph.py (2)
- state-manager/app/models/db/registered_node.py: RegisteredNode (8-45), get_by_name_and_namespace (28-32)
- state-manager/app/config/settings.py: get_settings (34-38)

state-manager/tests/unit/tasks/test_check_node_timeout.py (2)
- state-manager/app/tasks/check_node_timeout.py: check_node_timeout (10-42)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)

state-manager/app/tasks/create_next_states.py (1)
- state-manager/app/config/settings.py: get_settings (34-38)

state-manager/app/controller/register_nodes.py (1)
- state-manager/app/models/db/registered_node.py: RegisteredNode (8-45)
🔇 Additional comments (27)
state-manager/app/models/register_nodes_response.py (2)

2-2: LGTM: Import addition is correct. Adding Optional to the imports properly supports the new timeout_minutes field.

10-10: LGTM: Field definition follows best practices. The timeout_minutes field is correctly defined with:
- A proper type annotation (Optional[int])
- A validation constraint (gt=0) preventing zero/negative values
- A clear description explaining the fallback behavior
- Consistency with the DB model and request model definitions

state-manager/app/models/db/registered_node.py (2)

3-3: LGTM: Import correctly supports the new field.

16-16: LGTM: Field definition is consistent and validated. The timeout_minutes field on the DB model matches the response model definition, with proper validation (gt=0) and a clear description. No indexing is needed, since this field is only read for timeout resolution, not queried.

state-manager/app/models/register_nodes_request.py (2)

2-2: LGTM: Import supports the new optional field.

10-10: LGTM: Request model field is consistent with the response and DB models. The field definition is identical across the request, response, and database models, ensuring consistency throughout the API layer.

state-manager/app/controller/trigger_graph.py (3)

10-10 (also applies to 13-13): LGTM: Imports correctly support timeout resolution. Adding the RegisteredNode and get_settings imports properly enables per-node timeout lookup with global fallback.

96-104: LGTM: Timeout resolution correctly implements the per-node override with global fallback. The logic properly:
- Queries the RegisteredNode for the root node
- Uses the per-node timeout_minutes when available
- Falls back to the global settings.node_timeout_minutes when not set
- Follows the precedence specified in the PR objectives

106-118: LGTM: The State constructor correctly includes timeout_minutes. The resolved timeout_minutes value is properly passed to the State constructor, ensuring timeout configuration is propagated from node registration through state creation.

state-manager/app/tasks/create_next_states.py (3)

13-13: LGTM: Import enables the global timeout fallback.

166-174: LGTM: Timeout resolution efficiently reuses the cached RegisteredNode. The logic correctly:
- Leverages the existing get_registered_node cache to avoid redundant DB queries
- Implements the same per-node > global precedence as trigger_graph.py
- Maintains consistency across state creation paths

176-189: LGTM: State construction includes the resolved timeout. timeout_minutes is correctly passed to the State constructor, ensuring timeout configuration is propagated to all next states in the graph execution.

state-manager/app/controller/register_nodes.py (3)

28-36: LGTM: The update flow correctly includes timeout_minutes. The timeout_minutes field is properly added to the update payload alongside other node attributes, so existing nodes can have their timeout configuration updated.

40-52: LGTM: The create flow correctly includes timeout_minutes. timeout_minutes is properly set on the RegisteredNode constructor when creating new nodes, consistent with other fields.

54-62: LGTM: The response model construction includes timeout_minutes. The API response correctly includes timeout_minutes in the RegisteredNodeModel, ensuring clients receive the timeout configuration for registered nodes.

state-manager/app/config/settings.py (2)

16-16: LGTM: The field includes validation as requested in past reviews. The node_timeout_minutes field correctly includes:
- gt=0 validation preventing zero/negative timeouts (addresses past review feedback)
- A reasonable default of 30 minutes
- A clear description

25-26: LGTM: Environment loading uses consistent string defaults. Both trigger_workers and node_timeout_minutes now use string defaults, allowing Pydantic to handle type coercion and validation consistently. This addresses the inconsistency flagged in past reviews.

state-manager/app/tasks/check_node_timeout.py (5)

1-7: LGTM: Imports and logger setup are correct. All necessary dependencies are imported, and the logger is properly initialized.

10-15: LGTM: Task initialization and timestamp calculation are correct. The function correctly loads settings and calculates the current timestamp in milliseconds for timeout comparisons.

23-34: LGTM: Timeout resolution and threshold logic are correct. The implementation:
- Uses the per-state timeout_minutes when available, falling back to the global setting
- Calculates the timeout threshold in milliseconds
- Compares queued_at against the threshold using <= (boundary included)
- Sets a descriptive error message with the actual timeout duration

36-39: LGTM: The bulk update efficiently handles timed-out states. Using State.save_all() for bulk updates is efficient, and logging the count provides useful observability.

41-42: LGTM: Exception handling includes the stack trace. The error handling correctly uses exc_info=True to log the full stack trace for debugging, as requested in past reviews. The task does not crash on error, allowing the scheduler to continue.

state-manager/tests/unit/tasks/test_check_node_timeout.py (5)

7-60: LGTM! Comprehensive test coverage for different timeout scenarios. The test correctly validates:
- Per-node timeout handling (state1 with a custom 30-minute timeout)
- That states not exceeding the timeout remain QUEUED (state2)
- The global timeout fallback (state3 with timeout_minutes set to None)
- The bulk save operation with only timed-out states
The realistic timestamp and clear test structure make this easy to understand and maintain.

63-92: LGTM! Good negative test case. The test correctly verifies that:
- States not exceeding the timeout remain in QUEUED status
- No database updates occur when no states time out

95-108: LGTM! Exception handling is properly tested. The test verifies that exceptions are caught and logged with stack trace information (exc_info=True), ensuring errors don't crash the scheduled task.

111-152: LGTM! Critical test for custom timeout precedence. The test validates that:
- A custom per-node timeout_minutes takes precedence over the global setting
- The global timeout is used as a fallback when timeout_minutes is None
- Only states exceeding their respective timeouts are saved
This test is essential for the per-node timeout feature described in the PR objectives.

1-152: Excellent test coverage and structure. The test suite comprehensively covers:
- ✅ Multiple timeout scenarios (per-node and global)
- ✅ The negative case (no timeouts)
- ✅ Exception handling
- ✅ Custom timeout precedence
The tests use realistic timestamps (addressing a previous review) and clear assertions. Well done!
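For flavor, a pytest-style sketch of the zero-modification case; the patch target and collection API here mirror the update_many-based revision of the task and are assumptions, not the PR's actual test code:

```python
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from app.tasks.check_node_timeout import check_node_timeout


@pytest.mark.asyncio  # requires pytest-asyncio
async def test_no_states_timed_out():
    result = MagicMock(modified_count=0)
    collection = MagicMock(update_many=AsyncMock(return_value=result))
    with patch("app.tasks.check_node_timeout.State") as state_cls:
        state_cls.get_pymongo_collection.return_value = collection
        await check_node_timeout()
    # A single bulk query ran and nothing was modified.
    collection.update_many.assert_awaited_once()
```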
|
@agam1092005 lets also fix the failing tests. Reviewing PR in greater detail meanwhile. |
```diff
 data = await State.get_pymongo_collection().find_one_and_update(
     {
         "namespace_name": namespace_name,
         "status": StateStatusEnum.CREATED,
         "node_name": {
             "$in": nodes
         },
-        "enqueue_after": {"$lte": int(time.time() * 1000)}
+        "enqueue_after": {"$lte": current_time_ms}
     },
     {
-        "$set": {"status": StateStatusEnum.QUEUED}
+        "$set": {
+            "status": StateStatusEnum.QUEUED,
+            "queued_at": current_time_ms
+        }
     },
     return_document=ReturnDocument.AFTER
 )
```
The query seems to be correct; however, how will we repick timed-out states?
state-manager/app/models/db/state.py (Outdated)

```python
queued_at: Optional[int] = Field(None, description="Unix time in milliseconds when state was queued")
timeout_minutes: Optional[int] = Field(None, gt=0, description="Timeout in minutes for this specific state, taken from node registration")
```
A better model here is to store timeout_at: pre-compute when the timeout would occur and store that.
While this model of periodic jobs will work, it's unnecessary, as we can write a database query to figure out timed-out nodes. We probably do not need to set a TIMEDOUT status at all: if the status is QUEUED and current_time > timeout_at, we can infer it.
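A sketch of the query-side inference suggested here, assuming standard PyMongo filter syntax and the timeout_at field introduced in this PR; whether to adopt it instead of the scheduled job is the open design question:

```python
import time


def overdue_filter() -> dict:
    # A QUEUED state whose precomputed deadline has passed is effectively
    # timed out -- readable straight from a query, no status flip required.
    now_ms = int(time.time() * 1000)
    return {"status": "QUEUED", "timeout_at": {"$ne": None, "$lte": now_ms}}
```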
```python
scheduler.add_job(
    check_node_timeout,
    CronTrigger.from_crontab("* * * * *"),
    replace_existing=True,
    misfire_grace_time=60,
    coalesce=True,
    max_instances=1,
    id="check_node_timeout_task"
)
```
not needed with db queries
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- state-manager/app/controller/enqueue_states.py (1 hunks)
- state-manager/app/models/db/state.py (2 hunks)
- state-manager/app/tasks/check_node_timeout.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
state-manager/app/tasks/check_node_timeout.py (3)
- state-manager/app/models/db/state.py: State (13-116)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
- state-manager/app/singletons/logs_manager.py: LogsManager (9-66), get_logger (65-66)

state-manager/app/controller/enqueue_states.py (3)
- state-manager/app/config/settings.py: get_settings (34-38)
- state-manager/app/models/db/state.py: State (13-116)
- state-manager/app/models/state_status_enum.py: StateStatusEnum (4-21)
🔇 Additional comments (5)
state-manager/app/models/db/state.py (2)
31-33: LGTM! The timeout fields correctly implement the pre-computed timeout strategy. The three fields work together effectively:
- timeout_minutes stores the per-state timeout configuration with proper validation (gt=0)
- queued_at records when the state enters QUEUED status
- timeout_at stores the pre-computed expiration timestamp for efficient queries
This design follows the reviewer's suggestion to pre-compute timeout_at rather than calculating it at query time, which improves query performance and simplifies the timeout detection logic.

109-115: LGTM! Index ordering is optimal for timeout queries. The compound index (status, timeout_at) correctly supports the timeout detection query with:
- status as the first field (equality filter on QUEUED)
- timeout_at as the second field (range filter <= current_time_ms)
This ordering allows MongoDB to efficiently scan only QUEUED states and then apply the timeout threshold, which is the optimal access pattern for the check_node_timeout task.

state-manager/app/tasks/check_node_timeout.py (1)

9-31: LGTM! Efficient bulk update implementation. The function correctly uses a single update_many operation to mark timed-out states in bulk, which is much more efficient than fetching states individually and updating them one by one. The query properly leverages the timeout_query_index on (status, timeout_at) for optimal performance. The implementation follows the reviewer's guidance to use a database query to identify timed-out nodes rather than fetching and filtering in application code.
state-manager/app/controller/enqueue_states.py (2)

17-18: Good practice: the timestamp is captured once for consistency. Pre-computing current_time_ms once and reusing it throughout the function ensures consistency between the query filter and the values being set, addressing the concern raised in previous reviews about potential timing inconsistencies.

61-84: LGTM! The aggregation pipeline correctly implements the timeout calculation. The pipeline properly:
- Sets queued_at to the current timestamp
- Calculates timeout_at using $add and $multiply with the correct conversion factor (60000 ms/minute)
- Falls back to the global node_timeout_minutes setting, via $ifNull, when the per-state timeout_minutes is null
This approach avoids database joins (as requested by the reviewer) by computing timeout_at inline during the state transition to QUEUED, using either the state's own timeout_minutes or the global default.
```python
result = await State.get_pymongo_collection().update_many(
    {
        "status": StateStatusEnum.QUEUED,
        "timeout_at": {"$ne": None, "$lte": current_time_ms}
    },
    {
        "$set": {
            "status": StateStatusEnum.TIMEDOUT,
            "error": "Node execution timed out"
        }
    }
)
```
🧩 Analysis chain
Verify retry/recovery mechanism for TIMEDOUT states.
The implementation correctly marks states as TIMEDOUT, but there's no visibility into how these states are subsequently handled. As noted in previous review comments, it's important to understand how timed-out states are re-picked or retried.
Run the following script to check if there's a retry mechanism for TIMEDOUT states:
🏁 Script executed:
```bash
#!/bin/bash
# Description: Search for retry/recovery logic for TIMEDOUT states

# Search for TIMEDOUT status handling
rg -n "TIMEDOUT" --type=py -C 5 | rg -i "retry|recover|manual"

# Search for state status transitions from TIMEDOUT
ast-grep --pattern $'StateStatusEnum.TIMEDOUT'

# Check if TIMEDOUT states can be manually retried or recovered
rg -n "manual_retry|retry_state" --type=py -A 10 | rg -i "timedout"
```

Length of output: 484
Implement retry or recovery for TIMEDOUT states. I didn’t find any logic that re-queues or retries tasks once their status is set to TIMEDOUT. Add a mechanism (e.g., re-pick policy, manual recovery endpoint, or scheduled retry) in state-manager/app/tasks/check_node_timeout.py to handle these states.
🤖 Prompt for AI Agents

In state-manager/app/tasks/check_node_timeout.py around lines 16 to 27, the code marks timed-out states as TIMEDOUT but provides no recovery. Implement an automatic retry path: (1) for timed-out docs with retry_count < MAX_RETRIES, set status back to QUEUED, increment retry_count, clear the error, and set a next_attempt_at or timeout_at to now + backoff_ms; (2) for timed-out docs with retry_count >= MAX_RETRIES, set status to TIMEDOUT and leave the error as final; (3) define MAX_RETRIES and the backoff policy as config constants, and ensure updates are done atomically (use update_many with $inc and $set filters, or two distinct update_many calls filtered by retry_count) so tasks are re-queued automatically while preventing infinite retries.
Actionable comments posted: 1
♻️ Duplicate comments (1)
state-manager/app/main.py (1)
87-95: Verify necessity given past review feedback. A previous review comment on these lines stated "not needed with db queries." However, this scheduled job registration appears necessary to trigger the timeout checks periodically; without it, the check_node_timeout task would never execute. Please clarify:
- Is there alternative or duplicate timeout-checking logic elsewhere?
- Was the previous comment resolved, or does it still apply?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- state-manager/app/config/settings.py (2 hunks)
- state-manager/app/main.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
state-manager/app/main.py (1)
- state-manager/app/tasks/check_node_timeout.py: check_node_timeout (9-33)
🔇 Additional comments (2)
state-manager/app/main.py (1)
41-41: LGTM: Import is correctly placed. The import is appropriately positioned alongside other task imports.

state-manager/app/config/settings.py (1)

16-16: LGTM: Validation correctly implemented. The field definition properly includes gt=0 validation as requested in previous feedback, ensuring only positive timeout values are accepted.
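A sketch of the validated setting described above, assuming a plain Pydantic v2 model (the project's actual base class and loading code may differ):

```python
from pydantic import BaseModel, Field


class Settings(BaseModel):
    # gt=0 rejects zero and negative timeouts at load time.
    node_timeout_minutes: int = Field(default=30, gt=0)


# Pydantic coerces string env values and enforces the constraint, which is
# why passing os.getenv(...) strings straight through works.
assert Settings(node_timeout_minutes="45").node_timeout_minutes == 45
```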
state-manager/app/config/settings.py (Outdated)

```python
node_timeout_minutes=os.getenv("NODE_TIMEOUT_MINUTES", "30") # type: ignore
trigger_workers=int(os.getenv("TRIGGER_WORKERS", 1)), # type: ignore
```
Critical syntax error: missing trailing comma on line 26.
Line 26 is missing a trailing comma, which will cause a Python syntax error when parsing this constructor call.
Additionally, line 27 uses explicit int() casting while line 26 correctly passes a string to let Pydantic handle type conversion and validation. This inconsistency was flagged in previous reviews.
Apply this diff to fix both issues:

```diff
-    node_timeout_minutes=os.getenv("NODE_TIMEOUT_MINUTES", "30") # type: ignore
-    trigger_workers=int(os.getenv("TRIGGER_WORKERS", 1)), # type: ignore
+    node_timeout_minutes=os.getenv("NODE_TIMEOUT_MINUTES", "30"), # type: ignore
+    trigger_workers=os.getenv("TRIGGER_WORKERS", "1"), # type: ignore
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
node_timeout_minutes=os.getenv("NODE_TIMEOUT_MINUTES", "30"), # type: ignore
trigger_workers=os.getenv("TRIGGER_WORKERS", "1"), # type: ignore
```
🤖 Prompt for AI Agents

In state-manager/app/config/settings.py around lines 26 to 27, add the missing trailing comma at the end of the node_timeout_minutes=... line, and make trigger_workers consistent by removing the explicit int() cast and passing the environment value as a string (e.g., os.getenv("TRIGGER_WORKERS", "1")) so Pydantic handles type conversion. Ensure both lines end with commas and keep any existing type-ignore comments as needed.
Updated the default value of node_timeout_minutes to enhance the timeout duration for nodes stuck in QUEUED status, improving system resilience and performance.
… enhancing state attributes
- Removed the unused pipeline for calculating timeout_at in the find_state function.
- Added timeout_at and timeout_minutes attributes to the state model for better state management.
- Updated the errored_state and manual_retry_state functions to include timeout_at in state creation.
- Set timeout_at and queued_at to None in the re_queue_after_signal function to reset state attributes upon re-queuing.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
- state-manager/app/controller/enqueue_states.py (1 hunks)
- state-manager/app/controller/errored_state.py (1 hunks)
- state-manager/app/controller/manual_retry_state.py (1 hunks)
- state-manager/app/controller/re_queue_after_signal.py (1 hunks)
- state-manager/app/models/db/state.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
state-manager/app/controller/enqueue_states.py (2)
- state-manager/app/config/settings.py: get_settings (36-40)
- state-manager/app/models/db/state.py: State (13-116)
🔇 Additional comments (7)
state-manager/app/controller/re_queue_after_signal.py (1)

24-25: LGTM: Timeout tracking is correctly reset for re-queue. Clearing timeout_at and queued_at ensures a fresh timeout calculation when the state is re-enqueued, while preserving timeout_minutes (the static configuration). This aligns with the enqueue logic, which recalculates timeout_at from timeout_minutes or the global fallback.

state-manager/app/models/db/state.py (2)

31-33: LGTM: The timeout fields are well-defined. The three new fields are correctly specified:
- queued_at and timeout_at track runtime timestamps (optional, default None)
- timeout_minutes is validated with gt=0 to ensure positive values
- All fields are optional for backward compatibility

109-115: The index is already being used efficiently; the query already pushes the threshold to MongoDB. The implementation in check_node_timeout.py (lines 16-19) already includes both required conditions in the MongoDB query:
- "status": StateStatusEnum.QUEUED (equality filter)
- "timeout_at": {"$ne": None, "$lte": current_time_ms} (threshold comparison)
The compound index on (status, timeout_at) is correctly ordered to support this query pattern. The threshold comparison is pushed to the database (not filtered in Python), so the index is already being leveraged efficiently. No changes are needed.
Regarding "repicking timed-out states": TIMEDOUT is a terminal status with no retry or requeue logic, similar to ERRORED and CANCELLED. If automatic retry for timeouts is desired, that would be a separate feature request requiring explicit state-transition logic.
Likely an incorrect or invalid review comment.

state-manager/app/controller/manual_retry_state.py (1)

34-35: LGTM: Manual retry correctly preserves the timeout configuration. Preserving only timeout_minutes (not timeout_at) is the correct approach: it allows the enqueue logic to calculate a fresh timeout_at deadline for the retry state, ensuring adequate execution time. This contrasts with the issue in errored_state.py (line 57), where timeout_at is incorrectly preserved, causing automatic retries to inherit stale deadlines.

state-manager/app/controller/enqueue_states.py (3)

17-18: LGTM: Timestamp consistency fixed. Fetching current_time_ms once and reusing it for both the query filter and the update pipeline ensures consistency and addresses the previous review concern about the timestamp being fetched twice.

24-25: LGTM: The query filter is correct. The filter correctly identifies states ready for enqueuing by checking enqueue_after against the current timestamp.

27-50: The aggregation pipeline correctly implements the atomic timeout calculation. The pipeline atomically updates three fields:
- Sets status to QUEUED
- Records the queued_at timestamp
- Calculates timeout_at using the per-state timeout_minutes, falling back to the global node_timeout_minutes
The timeout calculation is correct: current_time_ms + (timeout_minutes * 60000).
However, regarding the reviewer's question "how will we repick timed-out states?": the current query filters for status == CREATED, which means states marked TIMEDOUT will not be automatically re-enqueued. Please clarify:
- Is this intentional? Should TIMEDOUT states require explicit manual retry or re-queue?
- Or should they be automatically retried? If so, you'll need to either:
  - Update the query to include StateStatusEnum.TIMEDOUT in the status filter, OR
  - Have check_node_timeout transition timed-out states to ERRORED instead of TIMEDOUT to leverage the existing retry logic, OR
  - Add explicit retry logic for TIMEDOUT states, similar to ERRORED states
The current design treats TIMEDOUT as a terminal state requiring manual intervention, which may or may not be the intended behavior.
state-manager/app/controller/errored_state.py

```python
fanout_id=state.fanout_id,
timeout_at=state.timeout_at
```
Retry states should get a fresh timeout window, not inherit the original deadline.
Preserving timeout_at=state.timeout_at means the retry state inherits the original timeout deadline. If the errored state was close to timing out, the retry could immediately timeout without adequate execution time.
Compare with manual_retry_state.py (line 35), which only preserves timeout_minutes and lets the enqueue logic recalculate a fresh timeout_at. This creates an inconsistency: manual retries get fresh timeouts, but automatic retries inherit stale deadlines.
Apply this diff to give retry states a fresh timeout window:
```diff
 retry_state = State(
     node_name=state.node_name,
     namespace_name=state.namespace_name,
     identifier=state.identifier,
     graph_name=state.graph_name,
     run_id=state.run_id,
     status=StateStatusEnum.CREATED,
     inputs=state.inputs,
     outputs={},
     error=None,
     parents=state.parents,
     does_unites=state.does_unites,
     enqueue_after=int(time.time() * 1000) + graph_template.retry_policy.compute_delay(state.retry_count + 1),
     retry_count=state.retry_count + 1,
     fanout_id=state.fanout_id,
-    timeout_at=state.timeout_at
+    timeout_minutes=state.timeout_minutes
 )
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
fanout_id=state.fanout_id,
timeout_minutes=state.timeout_minutes
```
🤖 Prompt for AI Agents

In state-manager/app/controller/errored_state.py around lines 56-57, the retry state incorrectly inherits the original deadline via timeout_at=state.timeout_at. Change this to preserve only timeout_minutes (i.e., pass timeout_minutes=state.timeout_minutes) and drop timeout_at so the enqueue logic recalculates a fresh timeout_at (keep fanout_id as-is), making automatic retries receive a new timeout window, consistent with manual_retry_state.py line 35.
Description
Implements node-level timeouts to prevent nodes from getting stuck in QUEUED status indefinitely.
Fixes #444
Changes
Core Implementation
- Added a TIMEDOUT status to StateStatusEnum for timed-out nodes
- Added a queued_at timestamp field to the State model, with an optimized index for timeout queries
- Added a NODE_TIMEOUT_MINUTES environment variable (default: 30 minutes)
- Added a check_node_timeout task that runs every minute to detect and time out stuck nodes
- Enqueuing now records the queued_at timestamp

Testing
How It Works
- When a state transitions to QUEUED, its queued_at timestamp is recorded
- The timeout duration is configured via the NODE_TIMEOUT_MINUTES environment variable, with per-node timeout_minutes taking precedence (a sketch of the resolution follows)
- The scheduled task marks QUEUED states that exceed their timeout as TIMEDOUT
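A sketch of the per-node-over-global resolution, assuming get_by_name_and_namespace takes the node name and namespace (the helper exists per the code graph above, but its exact signature is an assumption):

```python
from app.config.settings import get_settings               # project helper
from app.models.db.registered_node import RegisteredNode   # project model


async def resolve_timeout_minutes(node_name: str, namespace: str) -> int:
    node = await RegisteredNode.get_by_name_and_namespace(node_name, namespace)
    if node is not None and node.timeout_minutes is not None:
        return node.timeout_minutes  # per-node override set at registration
    return get_settings().node_timeout_minutes  # global default (30 min)
```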
Testing

```bash
cd state-manager
uv run pytest tests/unit/tasks/test_check_node_timeout.py -v
```