[AIROCMLIR-43] tuningRunner improvements - Add state file for crash recovery and improve logging #2208
Pull request overview
This PR enhances the tuning infrastructure with state file management for crash recovery and comprehensive logging improvements. The changes enable the tuner to track configuration states (running, failed, crashed, interrupted), persist them across runs, and recover gracefully from interruptions or crashes.
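The state tracking described above could be sketched roughly as follows. This is a minimal illustration, not the code in tuningRunner.py: the function names, the flat `{config: state}` JSON layout, and the choice to always rerun interrupted configs are all assumptions made for the example. Only the state names (running, failed, crashed, interrupted) come from the overview.

```python
import json
import os
import tempfile

# States that are skipped on a rerun unless a retry is requested (assumed policy).
RECOVERABLE = {"failed", "crashed"}


def load_state(path):
    """Return the saved state dict, or an empty dict if no state file exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    """Write to a temp file, then rename, so a crash mid-write leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)


def recover(state):
    """A config still marked 'running' at startup means the previous run died on it."""
    return {cfg: ("crashed" if s == "running" else s) for cfg, s in state.items()}


def configs_to_run(all_configs, state, retry_failed=False):
    """Skip failed/crashed configs unless retry_failed is set; run everything else."""
    skip = set() if retry_failed else RECOVERABLE
    return [c for c in all_configs if state.get(c) not in skip]
```

Under these assumptions, a config is marked running before it starts, and its entry is updated or removed when it finishes, so whatever is still marked running on the next startup identifies the config that caused the crash.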
Changes:
- Added JSON state file mechanism to track tuning progress and enable crash recovery
- Introduced structured logging with color-coded output and tqdm integration
- Enhanced error reporting with detailed context and formatted output
- Added `--retry-failed` flag to selectively retry failed/crashed configs
- Improved progress tracking with ETA estimation based on median completion times
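A median-based ETA of the kind mentioned in the last item might look like the sketch below. The function name and its signature are hypothetical; the only detail taken from the PR is that the estimate uses the median of completion times, which keeps a few unusually slow configs from inflating it the way a mean would.

```python
import statistics


def estimate_eta_seconds(completed_durations, remaining):
    """Estimate remaining time as (median duration so far) * (configs left).

    Returns None when nothing has completed yet, since there is no basis
    for an estimate.
    """
    if remaining <= 0:
        return 0.0
    if not completed_durations:
        return None
    return statistics.median(completed_durations) * remaining
```

For example, with completed durations of 1, 2, and 100 seconds and 4 configs left, the median is 2 seconds and the estimate is 8 seconds, whereas a mean-based estimate would be over two minutes.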
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mlir/utils/performance/tuningRunner.py | Core implementation of state management, logging infrastructure, and enhanced error handling |
| mlir/utils/performance/perfRunner.py | Simplified tuning database reader to handle variable column counts |
| mlir/utils/jenkins/Jenkinsfile.downstream | Removed --quiet flag from CI tuning commands |
| mlir/utils/jenkins/Jenkinsfile | Removed --quiet flag from fusion tuning commands |
| mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp | Removed obsolete comment about hidden warning |
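As an illustration of the perfRunner.py row above, a reader that tolerates variable column counts could be sketched as follows. The tab-separated layout, the `#` comment convention, the meaning of the first two columns, and the function name are all assumptions for this example and are not taken from the PR.

```python
import csv
import io


def read_tuning_rows(stream, min_columns=2):
    """Parse tab-separated rows, keeping any trailing columns as a list.

    Blank lines, '#' comment lines, and rows with fewer than min_columns
    fields are skipped rather than raising an error.
    """
    rows = []
    for fields in csv.reader(stream, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) < min_columns:
            continue
        # The first two columns are treated as required; the rest are optional.
        rows.append((fields[0], fields[1], fields[2:]))
    return rows


# Usage with an in-memory file; the column values are made up.
sample = io.StringIO("# arch\tconfig\tperf\ngfx90a\tconvA\tperf1\ngfx90a\tconvB\tperf2\t1.5\n")
parsed = read_tuning_rows(sample)
```

Collecting optional columns into a list means older database files with fewer columns and newer ones with more can go through the same code path.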
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
…ecovery and improve logging (#2208)
* Improve time and metadata tracking, output file format has changed.
* Add support for stdin.
* Revert to old output format.
* Add state file for crash and interrupt recovery.
* Add --retry-failed option.
* Use proper python logger.
* Improve readability of output.
* Log warnings from tuning driver.
* Reintroduce --quiet flag.
* Let important exceptions propagate and clean up code.
* Simplify state file and support multiple contexts.
* Address copilot comments.
* Show tuning-driver output during failures.
* Add --status option.
* Improve order of logs for easier tracking.
* Update github ci python version.
* Fix state transitions.
* Add timeout option.
* Improve output file format.
* Simplify output file writing.
* Address code review comments.
* Use llvm dbgs instead of errs where appropriate.
* Warn if env vars are set.
Motivation
Improve crash recovery, informational output, and error reporting.
Technical Details
Test Plan
This branch was used to create the tuning databases from which the quick-tune lists in #2212 were generated.
Submission Checklist