
ft_launcher: integrate log analysis attribution for restart decisions#269

Open
namitdhameja wants to merge 1 commit into main from attr-ft-integration

Conversation

@namitdhameja
Contributor

Adds log analysis attribution to the FT launcher restart path. After workers fail, the launcher runs log analysis on the cycle log before deciding whether to restart. If attribution identifies a non-transient fault, it stops instead of restarting. If attribution latency exceeds the configured timeout, the decision is made without waiting for the attribution result.

CLI options:
--ft-attribution-loganalysis,
--ft-attribution-timeout,
--ft-slack-channel,
--ft-slack-token-file,
--ft-dataflow-index.

Attribution library:
1. Sync bridge to the async LogAnalyzer via a dedicated daemon thread + event loop; results cached per path
2. Centralized post-processing code (Slack, dataflow), making it available to both the service and the library

FT launcher integration
1. Three modes: lib (in-process), mcp (subprocess), url (HTTP service).
2. Config for Slack, dataflow, and attribution (mode + timeout)
3. Invoked in _handle_restart_decision: attribution wall time is deducted from the GPU reclaim timeout budget.
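The budget accounting in point 3 can be sketched like this. The function names (`restart_with_budget` and its callbacks) are hypothetical; only the deduction idea comes from the PR.

```python
import time

def restart_with_budget(run_attribution, restart_workers, gpu_reclaim_timeout):
    # Measure attribution wall time and charge it against the reclaim budget.
    start = time.monotonic()
    should_stop = run_attribution()  # may block up to its own configured timeout
    elapsed = time.monotonic() - start
    if should_stop:
        return False  # non-transient fault: stop instead of restarting
    # Pass the reduced budget on so attribution latency never extends
    # the overall restart window.
    remaining = max(0.0, gpu_reclaim_timeout - elapsed)
    restart_workers(remaining)
    return True
```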

@namitdhameja namitdhameja self-assigned this Feb 27, 2026
@namitdhameja namitdhameja added the ci-approved Approved to run CI label Feb 27, 2026
@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

Integrates log analysis attribution into the FT launcher restart decision path. After worker failures, the launcher analyzes cycle logs to determine if the fault is transient before restarting.

Key changes:

  • Three attribution modes: lib (in-process), mcp (subprocess), url (HTTP service)
  • New LogAnalysisClient with sync bridge to async LogAnalyzer via dedicated event loop thread
  • Attribution time is deducted from GPU reclaim timeout budget to prevent delays
  • Centralized postprocessing (Slack, dataflow) shared between lib/mcp and service modes
  • Six new CLI options: --ft-attribution-loganalysis, --ft-attribution-timeout, --ft-attribution-dry-run, --ft-slack-channel, --ft-slack-token-file, --ft-dataflow-index
  • Restart decision flow: attribution → progress tracker → remaining restarts check
  • If attribution identifies non-transient fault, stops instead of restarting (unless dry-run mode)
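The decision order in the last three bullets can be condensed into a minimal sketch; the predicate names are illustrative, not the launcher's actual signatures.

```python
def decide_restart(attribution_says_stop, dry_run, made_progress, restarts_left):
    # Order mirrors the flow above: attribution -> progress tracker -> restart budget.
    if attribution_says_stop and not dry_run:
        return False  # non-transient fault: stop (dry-run only logs and proceeds)
    if not made_progress:
        return False  # progress tracker vetoes the restart
    return restarts_left > 0  # finally, check the remaining-restarts budget
```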

Confidence Score: 5/5

  • Safe to merge - well-structured integration with proper error handling and timeout management
  • Clean architecture with three distinct modes, proper thread safety in the event loop initialization, comprehensive timeout accounting, and good error handling throughout. Previous review concerns have been addressed with proper flag resets and analyzer cleanup.
  • No files require special attention - implementation is solid across all changed files

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/fault_tolerance/ft_attribution.py New attribution integration module - creates HTTP client and LogAnalysisClient with three modes (lib, mcp, url)
src/nvidia_resiliency_ext/attribution/log_analyzer/runner.py New sync bridge to async LogAnalyzer - uses dedicated thread + event loop for in-process and MCP modes
src/nvidia_resiliency_ext/fault_tolerance/launcher.py Integrates attribution into restart decision flow with timeout accounting and GPU reclaim budget management
src/nvidia_resiliency_ext/fault_tolerance/config.py Adds SlackConfig dataclass and attribution config fields with file-based token support
src/nvidia_resiliency_ext/fault_tolerance/ft_rendezvous_barrier.py Replaces old AttributionService with LogAnalysisClient integration in rendezvous handler
src/nvidia_resiliency_ext/attribution/postprocessing/config.py Adds configure_postprocessing_resolved for centralized Slack/dataflow setup with env fallback

Sequence Diagram

sequenceDiagram
    participant Agent as LocalElasticAgent
    participant Handler as _handle_restart_decision
    participant Attribution as LogAnalysisClient
    participant Runner as runner (lib/mcp)
    participant Service as AttributionService (url)
    participant Progress as ProgressTracker
    participant Workers as _restart_workers

    Agent->>Handler: Worker failure detected
    Handler->>Handler: Notify peers & open rendezvous
    Handler->>Attribution: _run_attribution()
    
    alt lib or mcp mode
        Attribution->>Runner: fetch_result(log_path)
        Runner->>Runner: run_log_analysis_sync()
        Runner-->>Attribution: result or None
    else url mode
        Attribution->>Service: GET /logs?log_path=...
        Service-->>Attribution: HTTP 200 with result
    end
    
    Attribution-->>Handler: should_stop decision
    
    alt attribution says stop
        Handler->>Handler: Check dry_run flag
        alt dry_run enabled
            Handler->>Handler: Log but proceed
        else dry_run disabled
            Handler-->>Agent: return False (no restart)
        end
    end
    
    Handler->>Progress: analyze_previous_cycle()
    Progress-->>Handler: should_terminate_early
    
    alt no progress detected
        Handler-->>Agent: return False (no restart)
    else restarts available
        Handler->>Workers: _restart_workers(time_consumed)
        Workers->>Workers: _stop_workers(will_restart=True)
        Workers->>Workers: Deduct attribution time from GPU reclaim budget
        Workers->>Workers: _start_workers()
        Handler-->>Agent: return True (restarted)
    end

Last reviewed commit: abc9fb5

@greptile-apps greptile-apps bot left a comment

21 files reviewed, 2 comments


if _lib_loop_starting:
    pass  # another caller already started the thread; wait below
else:
    _lib_loop_starting = True

_lib_loop_starting is never reset to False on failure. If the thread fails to start, subsequent callers will skip thread creation but wait indefinitely for _lib_loop_ready.

Suggested change
-        _lib_loop_starting = True
+        _lib_loop_starting = True
+        _lib_loop_ready.clear()

Comment on lines +87 to +89

if not use_lib:
    future = asyncio.run_coroutine_threadsafe(_lib_analyzer.connect_mcp(), _lib_loop)
    future.result(timeout=30)

If the MCP connect fails, _lib_analyzer is left in a partially initialized state. The next call will skip initialization (the line 74 check passes), but the analyzer won't be connected.


Labels

ci-approved Approved to run CI
