ft_launcher: integrate log analysis attribution for restart decisions #269
Open
namitdhameja wants to merge 1 commit into main
Contributor
Greptile Summary
Integrates log analysis attribution into the FT launcher restart decision path. After worker failures, the launcher analyzes cycle logs to determine if the fault is transient before restarting. Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Agent as LocalElasticAgent
participant Handler as _handle_restart_decision
participant Attribution as LogAnalysisClient
participant Runner as runner (lib/mcp)
participant Service as AttributionService (url)
participant Progress as ProgressTracker
participant Workers as _restart_workers
Agent->>Handler: Worker failure detected
Handler->>Handler: Notify peers & open rendezvous
Handler->>Attribution: _run_attribution()
alt lib or mcp mode
Attribution->>Runner: fetch_result(log_path)
Runner->>Runner: run_log_analysis_sync()
Runner-->>Attribution: result or None
else url mode
Attribution->>Service: GET /logs?log_path=...
Service-->>Attribution: HTTP 200 with result
end
Attribution-->>Handler: should_stop decision
alt attribution says stop
Handler->>Handler: Check dry_run flag
alt dry_run enabled
Handler->>Handler: Log but proceed
else dry_run disabled
Handler-->>Agent: return False (no restart)
end
end
Handler->>Progress: analyze_previous_cycle()
Progress-->>Handler: should_terminate_early
alt no progress detected
Handler-->>Agent: return False (no restart)
else restarts available
Handler->>Workers: _restart_workers(time_consumed)
Workers->>Workers: _stop_workers(will_restart=True)
Workers->>Workers: Deduct attribution time from GPU reclaim budget
Workers->>Workers: _start_workers()
Handler-->>Agent: return True (restarted)
end
Last reviewed commit: abc9fb5
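The decision flow in the diagram can be sketched roughly as below. This is an illustrative reduction, not the code in this PR: `RestartInputs`, `decide_restart`, and `reclaim_budget_after_attribution` are hypothetical names, and returning False when no restarts remain is an assumption the diagram does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartInputs:
    # Hypothetical stand-ins for state the launcher would hold.
    attribution_should_stop: Optional[bool]  # None: no attribution result available
    dry_run: bool
    no_progress: bool
    restarts_remaining: int


def decide_restart(inputs: RestartInputs) -> bool:
    """Return True if workers should be restarted, False otherwise."""
    # Attribution only blocks a restart when dry_run is disabled.
    if inputs.attribution_should_stop and not inputs.dry_run:
        return False
    # The progress tracker can still terminate early.
    if inputs.no_progress:
        return False
    # Assumption: with no restarts left, do not restart.
    return inputs.restarts_remaining > 0


def reclaim_budget_after_attribution(budget_s: float, attribution_s: float) -> float:
    """Deduct attribution wall time from the GPU reclaim budget, never below zero."""
    return max(0.0, budget_s - attribution_s)
```

Under these assumptions, a missing attribution result (None) behaves like "do not stop", so the flow falls through to the progress check.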
    if _lib_loop_starting:
        pass  # another caller already started the thread; wait below
    else:
        _lib_loop_starting = True
Contributor
_lib_loop_starting is never reset to False on failure. If the thread fails to start, subsequent callers will skip thread creation but wait indefinitely for _lib_loop_ready.
Suggested change
-        _lib_loop_starting = True
+        _lib_loop_starting = True
+        _lib_loop_ready.clear()
Comment on lines +87 to +89
    if not use_lib:
        future = asyncio.run_coroutine_threadsafe(_lib_analyzer.connect_mcp(), _lib_loop)
        future.result(timeout=30)
Contributor
If the MCP connect fails, _lib_analyzer is left in a partially initialized state. The next call will skip init (the line 74 check passes), but the analyzer won't be connected.
5b32812 to 6f1637a
6f1637a to abc9fb5
Adds log analysis attribution to the FT launcher restart path. After workers fail, the launcher runs log analysis on the cycle log before deciding whether to restart. If attribution identifies a non-transient fault, the launcher stops instead of restarting. If attribution takes longer than the configured timeout, the decision is made without waiting for the attribution result.
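The timeout behavior described above could look roughly like this. It is a minimal sketch under stated assumptions: `run_with_timeout` and the `analyze` callable are hypothetical, and the PR's actual mechanism may differ. It returns the elapsed wall time alongside the result so a caller can account for time spent on attribution.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(analyze, log_path, timeout_s):
    """Run analyze(log_path); return (result_or_None, elapsed_seconds)."""
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(analyze, log_path)
    try:
        result = future.result(timeout=timeout_s)
    except FutureTimeout:
        result = None  # too slow: decide without an attribution result
    finally:
        pool.shutdown(wait=False)  # do not block on a still-running analysis
    return result, time.monotonic() - start
```

A None result here only means "no attribution available"; the restart decision then proceeds on the remaining checks.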
CLI options: --ft-attribution-loganalysis, --ft-attribution-timeout, --ft-slack-channel, --ft-slack-token-file, --ft-dataflow-index.
Attribution library:
1. Sync bridge to the async LogAnalyzer via a dedicated daemon thread + event loop; results cached per path
2. Centralize post-processing code (Slack, dataflow), making it available for both service and library
FT launcher integration:
1. Three modes: lib (in-process), mcp (subprocess), url (HTTP service).
2. Config for Slack, dataflow, attribution (mode + timeout)
3. Invoked in _handle_restart_decision: attribution wall time deducted from GPU reclaim timeout budget.
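For reviewers unfamiliar with the pattern, the sync bridge with per-path caching can be sketched as follows. This is an illustrative standalone version, not the library code: `SyncBridge` is a hypothetical class, and `analyze_async` stands in for whatever coroutine the async LogAnalyzer exposes.

```python
import asyncio
import threading


class SyncBridge:
    """Call an async analyzer from synchronous code via a background event loop."""

    def __init__(self, analyze_async):
        self._analyze_async = analyze_async  # hypothetical coroutine function
        self._cache = {}
        self._lock = threading.Lock()
        self._loop = asyncio.new_event_loop()
        # Daemon thread so the loop never keeps the launcher process alive.
        threading.Thread(target=self._loop.run_forever, daemon=True).start()

    def fetch_result(self, log_path, timeout_s=30.0):
        with self._lock:
            if log_path in self._cache:
                return self._cache[log_path]  # cached per path
        future = asyncio.run_coroutine_threadsafe(
            self._analyze_async(log_path), self._loop
        )
        result = future.result(timeout=timeout_s)  # blocks the calling thread only
        with self._lock:
            self._cache[log_path] = result
        return result
```

The lock guards only the cache, so two concurrent callers for the same uncached path may both run the analysis; deduplicating in-flight requests is left out of this sketch.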