Add script for test execution times comparison #411
Conversation
Summary of Changes
Hello @wmlynik, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've added a new Python script and a related configuration file to enhance our test analysis capabilities. The primary goal of this change is to provide a powerful tool for comparing test execution times between different runs, which will be invaluable for detecting performance changes. This new utility will allow us to quickly identify which tests have become slower or faster, and also track changes in test status or presence, streamlining our performance monitoring and debugging workflows.
Highlights
- New Test Comparison Script: I've introduced a new Python script, `cmp.py`, which is a command-line utility designed to compare two JUnit XML test result files. This script is crucial for identifying performance regressions or improvements by analyzing the execution times of tests between different runs.
- Flexible Filtering Options: The `cmp.py` script offers robust filtering capabilities, allowing users to focus comparisons on specific targets, directories, test suites, or individual test cases. This enables targeted analysis of performance changes within a large test suite.
- Threshold-Based Performance Highlighting: The script can highlight significant performance changes by applying configurable absolute and relative thresholds. This means it can automatically flag tests that have increased or decreased in execution time beyond a specified percentage or absolute value, making it easier to spot critical changes.
- Test Status and Failure Reporting: In addition to time comparisons, `cmp.py` can also report on differences in test status, such as tests that are present in one report but missing in another, or tests whose status has changed (e.g., from 'OK' to 'FAIL' or 'SKIP'). It can also list all failing tests from either report.
- New Benchmark Definition File: I've added `benchmarks/coremark_pro.yaml`, a new YAML file that defines the structure and cases for the CoreMark-Pro benchmark. This file will likely be used by `cmp.py` to understand and filter test sets related to CoreMark-Pro, providing a structured way to manage benchmark definitions.
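To make the threshold behaviour concrete, here is a minimal sketch of how absolute and relative thresholds might be combined when flagging a test. The function name, parameters, and default values are illustrative assumptions, not `cmp.py`'s actual interface:

```python
def exceeds_threshold(old_time, new_time, abs_thresh=0.5, rel_thresh=10.0):
    """Return True when the time delta crosses either the absolute
    threshold (seconds) or the relative threshold (percent).

    This is a hypothetical helper sketching the idea, not cmp.py's code.
    """
    delta = new_time - old_time
    # Absolute change in seconds is significant on its own.
    if abs(delta) >= abs_thresh:
        return True
    # Relative change is only meaningful when there is a nonzero baseline.
    if old_time > 0 and abs(delta) / old_time * 100 >= rel_thresh:
        return True
    return False

print(exceeds_threshold(1.0, 1.2))    # -> True (0.2 s is only 0.2 s, but +20 %)
print(exceeds_threshold(10.0, 10.2))  # -> False (0.2 s and +2 %: neither trips)
```

A test that got 20 % slower is flagged even though the absolute delta is small, while the same 0.2 s delta on a 10 s test is ignored.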
Code Review
This pull request introduces a new script for comparing test execution times, along with a benchmark configuration file. The script is a great addition, but I've found several issues, including some critical bugs in the filtering logic that need to be addressed. I've also suggested improvements for maintainability and correctness in other parts of the script. Please review the comments for details.
@gemini-code-assist review
Code Review
This pull request introduces a new script for comparing test execution times, which is a valuable addition for performance tracking. The script is comprehensive and well-structured. My review focuses on improving correctness, robustness, and maintainability by addressing a potential bug in argument parsing, refactoring complex and duplicated code, and improving type hints for better clarity and static analysis.
nalajcie
left a comment
I know this is just a draft; I've just pointed out some places where the code might be improved upon, to make it more maintainable in the future.
I didn't look at the tests.
I'm not sure, but maybe this tool could/should be put in a separate dir if it would help with extra tooling setup for auto-testing (mypy, tox, etc.). Not a requirement, as we might just treat it as a development tool, but if we do have automatic tests, it might be good to run them in CI :)
    import yaml
    from junitparser import Error, Failure, JUnitXml, Skipped


    ESCAPE_RED = "\033[31m"
Use colorama? That way it would also be compatible with e.g. Windows?
Colorama doesn't support italics and underline, because they're not supported by cmd.exe on Windows. Windows Terminal (the default in modern versions of Windows) supports ANSI codes. Should we adjust the styling to maintain compatibility with the legacy console?
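A portable middle ground is to emit ANSI styling only when writing to an interactive terminal. This is a minimal stdlib-only sketch; the `style` helper and the plain-text fallback are assumptions for illustration, not what `cmp.py` currently does:

```python
import sys

# ANSI SGR codes like the ones used in the script; italics (3) and
# underline (4) are the styles legacy cmd.exe cannot render, while
# Windows Terminal handles them fine.
ESCAPE_RED = "\033[31m"
ESCAPE_ITALIC = "\033[3m"
ESCAPE_RESET = "\033[0m"


def style(text: str, *codes: str) -> str:
    """Wrap text in ANSI codes only when stdout is an interactive terminal.

    When output is piped to a file or a CI log, return plain text so the
    escape sequences do not pollute the capture.
    """
    if not sys.stdout.isatty():
        return text
    return "".join(codes) + text + ESCAPE_RESET


print(style("FAIL", ESCAPE_RED, ESCAPE_ITALIC))
```

This sidesteps the legacy-console question entirely for redirected output, which is where CI consumes the script anyway.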
test_compare/cmp.py (outdated diff)
        return only_old + only_new + with_children


    def find_fails(results, args, unfiltered=False, level=0, path=None):
could probably be implemented inside Testcase / Testsuite class
I don't see how the entire function could be implemented in Testcase / Testsuite class as it starts recursing from target level. It could make sense to move the part of logic executed at testsuite level to Testsuite class.
The code is really hard to follow; use classes as a code decomposition tool (and for hiding implementation details), not only as storage. In this particular example:

    class Testsuite:
        # [...]
        def failures(self) -> something:
            """Returns failed testcases from this testsuite"""
            return [case for case in self.cases.values() if case.status == Status.FAIL]

- If we want to filter something, we pass `filter_fun(item)` as an argument, instead of using a global filtering function with some strange local variables as params.
- The `Testcase` should probably know its fully-qualified name; that way filtering would be much easier to implement (and more readable).

(The same approach should be applied to the other classes: move all code which depends only on the contents of a class into the class itself.)
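A runnable sketch of the suggested decomposition might look like this. `Status`, `Testcase`, and `Testsuite` follow the names used in the comment above, but the fields, the `full_name` property, and the sample data are assumptions about `cmp.py`'s data model, not its actual code:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    OK = "OK"
    FAIL = "FAIL"
    SKIP = "SKIP"


@dataclass
class Testcase:
    name: str
    status: Status
    time: float
    parent_path: str = ""  # e.g. "target1/mysuite" (hypothetical layout)

    @property
    def full_name(self) -> str:
        """Fully-qualified name, so filters need no recursion state."""
        return f"{self.parent_path}/{self.name}" if self.parent_path else self.name


@dataclass
class Testsuite:
    name: str
    cases: Dict[str, Testcase] = field(default_factory=dict)

    def failures(self,
                 filter_fun: Callable[[Testcase], bool] = lambda c: True
                 ) -> List[Testcase]:
        """Failed testcases from this suite, optionally narrowed by filter_fun."""
        return [c for c in self.cases.values()
                if c.status == Status.FAIL and filter_fun(c)]


suite = Testsuite("mysuite", {
    "a": Testcase("a", Status.OK, 0.1, "target1/mysuite"),
    "b": Testcase("b", Status.FAIL, 2.5, "target1/mysuite"),
})
print([c.full_name for c in suite.failures()])  # -> ['target1/mysuite/b']
```

With `full_name` on the case itself, a name-based filter becomes a one-line `filter_fun` instead of threaded `path`/`level` parameters.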
test_compare/cmp.py (outdated diff)

    def find_fails(results, args, unfiltered=False, level=0, path=None):
        if level == 3:
hard-coding max recursion level in multiple places, would be better to just always check if we have sub-results or not
The recursion levels are hard-coded because the hierarchy has a fixed set of levels and there are differences between the levels. It should now be clearer after replacing the numbers with an enum.
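For illustration, the fixed hierarchy could be encoded as an `IntEnum` so the recursion bound is a named level rather than a bare `3`. The level names and the dict-based result structure below are hypothetical, not taken from `cmp.py`:

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical names for the fixed JUnit result hierarchy."""
    TARGET = 0
    DIRECTORY = 1
    SUITE = 2
    CASE = 3


def find_fails(results, level=Level.TARGET):
    # Recursion bottoms out at the testcase level instead of `level == 3`.
    if level == Level.CASE:
        return [r for r in results if r.get("status") == "FAIL"]
    fails = []
    for child in results:
        fails.extend(find_fails(child["children"], Level(level + 1)))
    return fails


data = [{"children": [{"children": [{"children": [
    {"name": "t1", "status": "FAIL"},
    {"name": "t2", "status": "OK"},
]}]}]}]
print([c["name"] for c in find_fails(data)])  # -> ['t1']
```

The magic number still exists, but in exactly one place (the enum definition), which was the point of the review comment.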
test_compare/cmp.py
Outdated
| else: | ||
| print( | ||
| f"{ESCAPE_BOLD}{ESCAPE_YELLOW}Warning:{ESCAPE_RESET} " | ||
| f"{count_fails(fails)} tests failed in the {name} file " |
We were talking about that for a while and came to the conclusion that we want to keep the approach with file1 and file2 (instead of old and new), because the file path may be too long and the file name may be the same: we can compare XMLs with the same name from different directories.
    if __name__ == "__main__":
        main()
Think about when the program should exit with a non-zero return code.
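One common convention for a diff-style tool is to reserve distinct exit codes for "differences found" and "bad input", so CI jobs can gate on the script. This sketch is an assumption about how `cmp.py` might adopt that convention; the `compare` entry point and the `EXIT_*` names are made up:

```python
import sys

EXIT_OK = 0           # no significant differences found
EXIT_DIFFERENCES = 1  # time regressions or status changes detected
EXIT_BAD_INPUT = 2    # e.g. unreadable or malformed JUnit XML


def compare(argv):
    """Placeholder for the real comparison logic; returns found regressions."""
    return []


def main(argv=None):
    try:
        regressions = compare(argv)
    except (OSError, ValueError) as exc:
        # Usage/IO errors get their own code, distinct from "diffs found".
        print(f"error: {exc}", file=sys.stderr)
        return EXIT_BAD_INPUT
    return EXIT_DIFFERENCES if regressions else EXIT_OK


if __name__ == "__main__":
    sys.exit(main())
```

This mirrors how `diff(1)` behaves (0 = same, 1 = different, 2 = trouble), which makes the script easy to wire into shell pipelines.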
| print(f"Failed tests in {name} file:") | ||
| print_fails(fails_filtered, args) | ||
| else: | ||
| print( |
This is a script for diff'ing 2 JUnit XML files. Use-case wise, I would like to see that a test started (or stopped) failing without any extra params, as this is crucial info.
As the test results are taken from the CI, where fails are not intended, we are leaving the warning if any fail happens, and even if a fail occurs in both XML files it's good to leave the info about that. We really want to make the output clear and readable, but if there are some fails (say, some known issues) they can take up most of the output. We wanted to focus this script on mostly one utility, one that is not supported yet, so we chose times comparison; about fails we know from the ci-support chat. Please let us know whether you agree with that approach. Here is the approach for fails investigation:

JIRA: CI-578
Older Python versions (before 3.12) don't support nesting the same kind of quotes in an f-string. JIRA: CI-578
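The quoting limitation mentioned in the commit message looks like this in practice (PEP 701, shipped in Python 3.12, lifted it):

```python
d = {"name": "cmp"}

# Portable across Python versions: the inner quotes differ from the outer ones.
ok = f"tool is {d['name']}"

# f"tool is {d["name"]}" is a SyntaxError before Python 3.12, because the
# outer quote character could not be reused inside a replacement field
# until PEP 701.
print(ok)  # -> tool is cmp
```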