
Add script for test execution times comparison#411

Draft
wmlynik wants to merge 33 commits into master from wmlynik/tests_compare

Conversation

@wmlynik wmlynik commented Aug 11, 2025

Description

Motivation and Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (refactoring, style fixes, git/CI config, submodule management, no code logic changes)

How Has This Been Tested?

  • Already covered by automatic testing.
  • New test added: (add PR link here).
  • Tested by hand on: (list targets here).

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing linter checks and tests passed.
  • My changes generate no new compilation warnings for any of the targets.

Special treatment

  • This PR needs additional PRs to work (list the PRs, preferably in merge-order).
  • I will merge this PR by myself when appropriate.


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @wmlynik, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've added a new Python script and a related configuration file to enhance our test analysis capabilities. The primary goal of this change is to provide a powerful tool for comparing test execution times between different runs, which will be invaluable for detecting performance changes. This new utility will allow us to quickly identify which tests have become slower or faster, and also track changes in test status or presence, streamlining our performance monitoring and debugging workflows.

Highlights

  • New Test Comparison Script: I've introduced a new Python script, cmp.py, which is a command-line utility designed to compare two JUnit XML test result files. This script is crucial for identifying performance regressions or improvements by analyzing the execution times of tests between different runs.
  • Flexible Filtering Options: The cmp.py script offers robust filtering capabilities, allowing users to focus comparisons on specific targets, directories, test suites, or individual test cases. This enables targeted analysis of performance changes within a large test suite.
  • Threshold-Based Performance Highlighting: The script can highlight significant performance changes by applying configurable absolute and relative thresholds. This means it can automatically flag tests that have increased or decreased in execution time beyond a specified percentage or absolute value, making it easier to spot critical changes.
  • Test Status and Failure Reporting: In addition to time comparisons, cmp.py can also report on differences in test status, such as tests that are present in one report but missing in another, or tests whose status has changed (e.g., from 'OK' to 'FAIL' or 'SKIP'). It can also list all failing tests from either report.
  • New Benchmark Definition File: I've added benchmarks/coremark_pro.yaml, a new YAML file that defines the structure and cases for the CoreMark-Pro benchmark. This file will likely be used by cmp.py to understand and filter test sets related to CoreMark-Pro, providing a structured way to manage benchmark definitions.
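The time comparison described in the highlights can be sketched with only the standard library. This is an illustrative assumption, not cmp.py's actual implementation: the load_times helper, the threshold values, and the "suite/case" key format are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical thresholds, mirroring the absolute/relative idea described above.
ABS_THRESHOLD_S = 0.5   # flag changes larger than 0.5 s
REL_THRESHOLD = 0.10    # flag changes larger than 10 %

def load_times(path):
    """Map 'suite/case' names to execution times from a JUnit XML file."""
    times = {}
    for suite in ET.parse(path).getroot().iter("testsuite"):
        for case in suite.iter("testcase"):
            key = f"{suite.get('name')}/{case.get('name')}"
            times[key] = float(case.get("time", 0.0))
    return times

def compare(old, new):
    """Yield (name, old_time, new_time) for tests exceeding either threshold."""
    for name in old.keys() & new.keys():
        delta = new[name] - old[name]
        rel = abs(delta) / old[name] if old[name] else float("inf")
        if abs(delta) > ABS_THRESHOLD_S or rel > REL_THRESHOLD:
            yield name, old[name], new[name]
```

The real script additionally reports tests missing from one report and status changes, which a sketch like this would handle by diffing the key sets and a status attribute as well.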
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new script for comparing test execution times, along with a benchmark configuration file. The script is a great addition, but I've found several issues, including some critical bugs in the filtering logic that need to be addressed. I've also suggested improvements for maintainability and correctness in other parts of the script. Please review the comments for details.


github-actions bot commented Aug 11, 2025

Unit Test Results

9 462 tests  +16   8 873 ✅ +16   51m 7s ⏱️ + 3m 16s
  561 suites +16     589 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit 6638ab3. ± Comparison against base commit 7688d5f.

♻️ This comment has been updated with latest results.

@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch 3 times, most recently from 6f4315c to 0d7c1d4 Compare August 11, 2025 14:56
@wmlynik
Author

wmlynik commented Aug 11, 2025

@gemini-code-assist review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new script for comparing test execution times, which is a valuable addition for performance tracking. The script is comprehensive and well-structured. My review focuses on improving correctness, robustness, and maintainability by addressing a potential bug in argument parsing, refactoring complex and duplicated code, and improving type hints for better clarity and static analysis.

@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from 5d73f2e to 18653f5 Compare August 11, 2025 15:16
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch 2 times, most recently from 97097ab to bf092b4 Compare September 3, 2025 14:54
Member

@nalajcie nalajcie left a comment


I know this is just a draft; I've just pointed out some places where the code might be improved, to make it more maintainable in the future.

I didn't look at the tests.

I'm not sure but maybe this tool could/should be put in a separate dir if it would help with extra tooling setup for auto-testing (mypy, tox, etc.). Not a requirement as we might just treat it as a development tool - but if we do have automatic tests, it might be good to run them in CI :)

import yaml
from junitparser import Error, Failure, JUnitXml, Skipped

ESCAPE_RED = "\033[31m"
Member


use colorama? this way it would be compatible also with e.g. Windows?

Author


Colorama doesn't support italics and underline, because they're not supported by cmd.exe on Windows. Windows Terminal (the default in modern versions of Windows) supports ANSI codes. Should we adjust the styling to maintain compatibility with the legacy console?

return only_old + only_new + with_children


def find_fails(results, args, unfiltered=False, level=0, path=None):
Member


could probably be implemented inside Testcase / Testsuite class

Author


I don't see how the entire function could be implemented in the Testcase / Testsuite class, as it starts recursing from the target level. It could make sense to move the part of the logic executed at the testsuite level to the Testsuite class.

Member


The code is really hard to follow; use classes as a code decomposition tool (and for hiding implementation details), not only as storage - in this particular example:

class Testsuite:
    # [...]
    def failures(self) -> something:
        """Returns failed testcases from this testsuite"""
        return [case for case in self.cases.values() if case.status == Status.FAIL]
  • If we want to filter something, we pass filter_fun(item) as an argument instead of using a global filtering function with some strange local variables as params
  • the Testcase should probably know its fully-qualified name - that way filtering would be much easier to implement (and more readable)

(the same approach should be done for other classes - move all code which depend only on the contents of the class to the class itself)
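The suggested decomposition might look roughly like this. It is a sketch of the reviewer's idea, not the script's actual classes: the Status enum, the full_name property, and the filter_fun signature are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    OK = "OK"
    FAIL = "FAIL"
    SKIP = "SKIP"

@dataclass
class Testcase:
    name: str
    suite: str
    status: Status

    @property
    def full_name(self) -> str:
        """Fully-qualified name, so filters can match on a single string."""
        return f"{self.suite}/{self.name}"

@dataclass
class Testsuite:
    name: str
    cases: dict = field(default_factory=dict)

    def failures(self, filter_fun=lambda case: True):
        """Failed testcases from this suite, optionally narrowed by filter_fun."""
        return [c for c in self.cases.values()
                if c.status is Status.FAIL and filter_fun(c)]
```

With this shape, the target-level recursion in find_fails reduces to concatenating suite.failures(filter_fun) over the suites, with the filtering logic passed in rather than read from globals.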



def find_fails(results, args, unfiltered=False, level=0, path=None):
if level == 3:
Member


hard-coding the max recursion level in multiple places; it would be better to just always check whether we have sub-results or not

Author


The recursion levels are hard-coded because the hierarchy has a fixed set of levels and there are differences between the levels. It should now be clearer after replacing the numbers with an enum.
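The enum approach mentioned here might look roughly as follows. The level names are assumptions based on the hierarchy described in the PR (targets, directories, suites, cases), not the script's actual identifiers.

```python
from enum import IntEnum

class Level(IntEnum):
    """Fixed hierarchy of the results tree, replacing bare numbers like 3."""
    TARGET = 0
    DIRECTORY = 1
    SUITE = 2
    CASE = 3

def walk(results, level=Level.TARGET):
    """Recurse through the fixed levels; stop at Level.CASE, not 'level == 3'."""
    if level is Level.CASE:
        return [results]                     # leaf: a single testcase record
    found = []
    for child in results.values():           # descend one level per call
        found.extend(walk(child, Level(level + 1)))
    return found
```

An IntEnum keeps the arithmetic (level + 1) working while giving each hard-coded number a name, so per-level special cases read as `if level is Level.SUITE:` instead of `if level == 2:`.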

else:
print(
f"{ESCAPE_BOLD}{ESCAPE_YELLOW}Warning:{ESCAPE_RESET} "
f"{count_fails(fails)} tests failed in the {name} file "
Member


use the real filename, not an alias

Contributor


We were talking about that for a while and came to the conclusion that we want to keep the approach with file1 and file2 (instead of old and new), because the file path may be too long and the file names may be the same - we can compare XMLs with the same name from different directories.



if __name__ == "__main__":
main()
Member


think about when the program should exit with a non-zero return code
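One common convention for diff-style tools (like diff itself) is 0 for "no differences", 1 for "differences found", and 2 for errors. The sketch below is hypothetical; run_comparison and the exact codes are placeholders, not cmp.py's behavior.

```python
import sys

# Hypothetical exit codes; 0 only when nothing notable was found.
EXIT_OK = 0            # no differences above the thresholds
EXIT_DIFFERENCES = 1   # regressions / status changes found
EXIT_ERROR = 2         # bad arguments or unreadable input files

def run_comparison(argv):
    """Placeholder: parse argv, compare the two reports, return regressions."""
    if not argv:
        raise ValueError("two JUnit XML files are required")
    return []

def main(argv):
    try:
        regressions = run_comparison(argv)
    except (OSError, ValueError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return EXIT_DIFFERENCES if regressions else EXIT_OK

# In the script's entry point:
#     sys.exit(main(sys.argv[1:]))
```

A distinct non-zero code for "differences found" also makes the tool scriptable in CI, where a regression can fail the job directly.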

print(f"Failed tests in {name} file:")
print_fails(fails_filtered, args)
else:
print(
Member


this is a script for diff'ing 2 JUnit XML files - use-case-wise, I would like to see that a test started (or stopped) failing without any extra params, as this is crucial info.

Contributor


Since the test results come from CI, where failures are not expected, we print a warning whenever any failure happens; even if a failure occurs in both XML files, it's worth reporting. We want to keep the output clear and readable, but if there are failures (say, known issues), they could take up most of the output. We wanted to focus this script on one utility that isn't supported yet, so we chose time comparison - we already learn about failures from the ci-support chat. Please let us know whether you agree with that approach; here is how failure investigation looks:
[Screenshot from 2025-09-16 17-34-15]

@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from 71d449f to 3901410 Compare September 9, 2025 11:34
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch 3 times, most recently from 4f40784 to 1a753ad Compare September 17, 2025 11:40
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from ff476e7 to 0997a80 Compare October 23, 2025 15:03
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch 2 times, most recently from a98ec57 to 518e55c Compare October 28, 2025 11:51
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from abc542f to 86b02e8 Compare December 17, 2025 14:12
Władysław Młynik added 23 commits December 17, 2025 15:40
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from 86b02e8 to 2527c98 Compare December 17, 2025 14:40
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from 2527c98 to ee31bd2 Compare December 17, 2025 15:17
@wmlynik wmlynik force-pushed the wmlynik/tests_compare branch from ee31bd2 to 6638ab3 Compare December 17, 2025 15:53