[AIROCMLIR-43] tuningRunner improvements - Add state file for crash recovery and improve logging by mirza-halilcevic · Pull Request #2208 · ROCm/rocMLIR

mirza-halilcevic · 2026-01-18T12:19:04Z

Motivation

Improve crash recovery, informational output, and error reporting.

Technical Details

Introduce a state file that tracks each config's state. It is used to detect crashes and to skip configs that repeatedly fail or crash across multiple runs.
Use the Python logger instead of prints.
Introduce a --verbose flag for debug output. Keep --debug only for debug file generation.
Support stdin for configs input.
Improve error reporting and let fatal exceptions propagate.
Save elapsed tuning time to the output file and track ETA.
Introduce a --timeout flag to specify a timeout for the tuning-driver.
Add tuningSpace, commitId, timestamp, and durationSec as fields in the output.
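
To illustrate the logger, --verbose, and --timeout items above, here is a minimal sketch of running an external command under a timeout and reporting the outcome through the Python logger. It is not the code from tuningRunner.py: the logger name, the function names, the returned status strings, and the assumption that the timeout is given in seconds are all hypothetical.

```python
import logging
import subprocess
import sys

logger = logging.getLogger("tuning_sketch")  # hypothetical logger name


def configure_logging(verbose: bool) -> None:
    """Emit debug-level messages only when verbose output is requested."""
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(levelname)s: %(message)s",
    )


def run_with_timeout(cmd, timeout_sec):
    """Run cmd and return (status, stdout).

    status is 'ok', 'failed', or 'timeout' (names chosen for this sketch).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_sec
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        logger.error("command timed out after %s s: %s", timeout_sec, cmd)
        return "timeout", ""
    if proc.returncode != 0:
        logger.error(
            "command failed (exit %d): %s", proc.returncode, proc.stderr.strip()
        )
        return "failed", proc.stdout
    logger.debug("command succeeded: %s", cmd)
    return "ok", proc.stdout


if __name__ == "__main__":
    configure_logging(verbose=True)
    # A Python one-liner stands in for the real tuning-driver here.
    print(run_with_timeout([sys.executable, "-c", "print('hi')"], 10))
```

Returning a status rather than raising keeps one timed-out or failed command from aborting the whole run, while truly fatal exceptions can still propagate from the caller.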

Test Plan

This branch was used to create the tuning databases from which the quick-tune lists in #2212 were generated.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR enhances the tuning infrastructure with state file management for crash recovery and comprehensive logging improvements. The changes enable the tuner to track configuration states (running, failed, crashed, interrupted), persist them across runs, and recover gracefully from interruptions or crashes.
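
The crash-detection idea can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the four state names come from the summary above, but the flat JSON layout (config string mapped to state), the function names, and the atomic-write strategy are hypothetical.

```python
import json
import os
import tempfile

# State names from the review summary; everything else is assumed.
RUNNING, FAILED, CRASHED, INTERRUPTED = "running", "failed", "crashed", "interrupted"


def load_state(path):
    """Load the state file.

    A config still marked 'running' means the previous process died while
    tuning it, so it is reclassified as 'crashed'.
    """
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        state = json.load(f)
    for config, status in state.items():
        if status == RUNNING:
            state[config] = CRASHED
    return state


def save_state(path, state):
    """Write via a temp file and rename so a crash cannot truncate the file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)


def should_skip(state, config, retry_failed=False):
    """Skip configs that previously failed or crashed unless a retry is requested."""
    return state.get(config) in (FAILED, CRASHED) and not retry_failed
```

In this sketch a caller would mark a config as running and save the state before launching the driver, then update or clear the entry afterwards. Interrupted configs are not skipped, so they are retried on the next run.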

Changes:

Added JSON state file mechanism to track tuning progress and enable crash recovery
Introduced structured logging with color-coded output and tqdm integration
Enhanced error reporting with detailed context and formatted output
Added --retry-failed flag to selectively retry failed/crashed configs
Improved progress tracking with ETA estimation based on median completion times
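
A median-based ETA, as mentioned in the last item, could look roughly like the sketch below. The function names and the output format are assumptions; only the use of the median of completion times comes from the summary.

```python
import statistics


def estimate_eta_sec(completed_durations, remaining_configs):
    """Estimate seconds left as median duration times configs remaining.

    Returns None when no config has completed yet.
    """
    if not completed_durations:
        return None
    return statistics.median(completed_durations) * remaining_configs


def format_eta(seconds):
    """Render seconds as H:MM:SS, or 'unknown' when no estimate exists."""
    if seconds is None:
        return "unknown"
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"
```

The median is less sensitive than the mean to a single outlier: with durations of 10, 12, and 300 seconds and 5 configs left, the estimate is 12 × 5 = 60 seconds, whereas the mean would give roughly 537 seconds.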

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
mlir/utils/performance/tuningRunner.py	Core implementation of state management, logging infrastructure, and enhanced error handling
mlir/utils/performance/perfRunner.py	Simplified tuning database reader to handle variable column counts
mlir/utils/jenkins/Jenkinsfile.downstream	Removed `--quiet` flag from CI tuning commands
mlir/utils/jenkins/Jenkinsfile	Removed `--quiet` flag from fusion tuning commands
mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp	Removed obsolete comment about hidden warning

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

dorde-antic

lgtm

…ecovery and improve logging (#2208)

* Improve time and metadata tracking, output file format has changed.
* Add support for stdin.
* Revert to old output format.
* Add state file for crash and interrupt recovery.
* Add --retry-failed option.
* Use proper python logger.
* Improve readability of output.
* Log warnings from tuning driver.
* Reintroduce --quiet flag.
* Let important exceptions propagate and clean up code.
* Simplify state file and support multiple contexts.
* Address copilot comments.
* Show tuning-driver output during failures.
* Add --status option.
* Improve order of logs for easier tracking.
* Update github ci python version.
* Fix state transitions.
* Add timeout option.
* Improve output file format.
* Simplify output file writing.
* Address code review comments.
* Use llvm dbgs instead of errs where appropriate.
* Warn if env vars are set.

mirza-halilcevic added 11 commits January 15, 2026 20:50

Improve time and metadata tracking, output file format has changed.

9c0838e

Add support for stdin.

9f6c671

Revert to old output format.

7eb9851

Add state file for crash and interrupt recovery.

44ead00

Add --retry-failed option.

8a93456

Use proper python logger.

aeedb2a

Improve readability of output.

d54ab9e

Log warnings from tuning driver.

a2c338e

Reintroduce --quiet flag.

58bada8

Let important exceptions propagate and clean up code.

1cf2e36

Simplify state file and support multiple contexts.

110dccb

mirza-halilcevic requested a review from Copilot January 18, 2026 12:19

Copilot AI reviewed Jan 18, 2026

mirza-halilcevic added 3 commits January 18, 2026 12:25

Address copilot comments.

90132b2

Show tuning-driver output during failures.

b781b40

Add --status option.

056b12c

mirza-halilcevic marked this pull request as ready for review January 20, 2026 00:13

mirza-halilcevic requested a review from causten as a code owner January 20, 2026 00:13

mirza-halilcevic requested review from dhernandez0, dorde-antic, pabloantoniom and umangyadav January 20, 2026 00:14

mirza-halilcevic added 3 commits January 21, 2026 01:26

Merge remote-tracking branch 'origin/develop' into tuning-logging

baa2279

Improve order of logs for easier tracking.

7041c64

Update github ci python version.

8f4259e

mirza-halilcevic changed the title ~~tuningRunner improvements - Add state file for crash recovery and improve logging~~ [AIROCMLIR-43] tuningRunner improvements - Add state file for crash recovery and improve logging Jan 23, 2026

mirza-halilcevic and others added 4 commits January 23, 2026 16:15

Merge branch 'develop' into tuning-logging

daacd66

Merge remote-tracking branch 'origin/develop' into tuning-logging

58f4f2f

Merge remote-tracking branch 'origin/develop' into tuning-logging

e6cb102

Fix state transitions.

aeb430c

mirza-halilcevic and others added 4 commits February 2, 2026 20:53

Add timeout option.

c39a3d2

Improve output file format.

2b6e394

Simplify output file writing.

f1a885f

Merge branch 'develop' into tuning-logging

03ffa08

mirza-halilcevic requested a review from djramic February 4, 2026 11:07

mirza-halilcevic mentioned this pull request Feb 4, 2026

Ignore unnecessary tuning output fields ROCm/MITuna#1026

Merged

1 task

mirza-halilcevic requested a review from Copilot February 4, 2026 14:33

Copilot AI reviewed Feb 4, 2026

dorde-antic approved these changes Feb 4, 2026

mirza-halilcevic and others added 15 commits February 5, 2026 00:48

Merge branch 'develop' into tuning-logging

aa6a586

Address code review comments.

b2dc435

Merge remote-tracking branch 'origin/tuning-logging' into tuning-logging

2b562ae

Use llvm dbgs instead of errs where appropriate.

d964457

Warn if env vars are set.

408100a

Merge remote-tracking branch 'origin/develop' into tuning-logging

cd459b2

Merge branch 'develop' into tuning-logging

54eec0c

Merge branch 'develop' into tuning-logging

7ef8977

Merge branch 'develop' into tuning-logging

73f5725

Merge branch 'develop' into tuning-logging

62dbb51

Merge branch 'develop' into tuning-logging

ef7ef50

Merge branch 'develop' into tuning-logging

80b79e4

Merge branch 'develop' into tuning-logging

9019bac

Merge branch 'develop' into tuning-logging

d40663e

Merge branch 'develop' into tuning-logging

94b59df

mirza-halilcevic merged commit 821465d into develop Feb 20, 2026
7 of 14 checks passed

mirza-halilcevic deleted the tuning-logging branch February 20, 2026 22:09

mirza-halilcevic merged 40 commits into develop from tuning-logging
