Conversation

@dscho
Member

@dscho dscho commented Sep 15, 2025

From time to time, the Functional Tests fail, and re-running "fixes" the failures, i.e. those tests are flaky. When that happens, we have to invest a lot of time into the re-run: currently 35 minutes for the x86_64 jobs and 1 hour 15 minutes for the ARM64 ones. That's unwelcome friction.

Let's follow the playbook of git/git, which runs its test suite (a suite that, due to the heavy over-use of Bash scripting, is very, very slow on Windows) in parallel on Windows, slicing up the tests so that the individual test slices take roughly the same amount of time to run.

This PR is stacked on top of #1866 and will need its (empty) tip commit to be dropped to force GitHub to update the PR correctly (i.e. no longer show the commits of #1866 once they are merged).

@dscho dscho self-assigned this Sep 15, 2025
@dscho dscho force-pushed the run-functional-tests-in-parallel branch 2 times, most recently from 3ff37b9 to 1c0a86d Compare September 16, 2025 08:04
@dscho
Member Author

dscho commented Sep 16, 2025

Hmm. Something in those numbers does not add up:

  • Here is a non-parallelized run. It took 3h 35m 27s total time (even though the wall-clock time was "only" 1h 13m 41s).
  • Here is a parallelized run. It took 10h 30m 55s total time (even though the wall-clock time of the first run was only 19m 54s; I don't count the re-run because it skipped the build, of course).

The discrepancy is quite stark, and I cannot quite explain it. Sure, each of the now 40 (instead of 4) matrix jobs duplicates the setup (including "Install product"), but by my count that only takes around 1.5 minutes per job. And 36 * 1.5 = 54 minutes is much, much smaller than the ~7-hour difference in total time.

I cannot explain this discrepancy even when accounting for the difference in build times (4 and 6¼ minutes vs 3 and 5 minutes for the Debug and Release builds, respectively), which likely indicates a busy-neighbor problem: the runner VMs were simply slower during the parallelized run because more people ran workflows during that time of day.

@dscho
Member Author

dscho commented Sep 16, 2025

> Here is a parallelized run. It took 10h 30m 55s total time

Ooooh, here is something interesting: that page lists every job twice! I suspect a bug that double-counts jobs when there has been a re-run. Maybe GitHub double-bills jobs that were skipped as part of a "re-run failed jobs" because they had not failed in the first place?

The only job for which two different times are posted is "Functional Tests (Debug, x86_64, 5)", which makes sense because that one was the failed job in the first attempt: 6m15s vs 6m22s, which is "same enough" for all practical purposes.

So now the numbers make more sense, at least after dividing 10h30m55s by 2: 5h15m28s. The difference in total run time compared to the non-parallelized run is now more in the vicinity of 1h40m = 100 minutes, and dividing that by the 36 additional jobs means that each additional job adds about 3 minutes to the total run time. That is still quite a bit more than the 1.5 minutes I estimated, even accounting for every "Run functional tests" step adding 20 seconds before actually running the tests and 20 seconds afterwards, but the remaining difference can be chalked up to "noisy neighbors in busy times slowing everyone else down".

@dscho
Member Author

dscho commented Sep 16, 2025

One more data point to add: the current run (which, as of time of writing, just failed) reports the following (5h 7m 49s):

Jobs/total run time

| Job | Run time |
| --- | --- |
| Validation | 27s |
| Build and Unit Test (Release) | 3m 41s |
| Build and Unit Test (Debug) | 5m 7s |
| Functional Tests (Debug, x86_64, 1) | 3m 45s |
| Functional Tests (Debug, x86_64, 2) | 6m 50s |
| Functional Tests (Debug, x86_64, 0) | 5m 50s |
| Functional Tests (Debug, x86_64, 6) | 5m 55s |
| Functional Tests (Debug, x86_64, 8) | 5m 47s |
| Functional Tests (Debug, x86_64, 5) | 5m 35s |
| Functional Tests (Debug, arm64, 6) | 12m 52s |
| Functional Tests (Debug, arm64, 3) | 6m 0s |
| Functional Tests (Debug, x86_64, 4) | 5m 46s |
| Functional Tests (Debug, arm64, 1) | 6m 15s |
| Functional Tests (Debug, arm64, 8) | 9m 52s |
| Functional Tests (Debug, x86_64, 7) | 5m 39s |
| Functional Tests (Debug, arm64, 5) | 11m 27s |
| Functional Tests (Debug, x86_64, 9) | 6m 45s |
| Functional Tests (Debug, x86_64, 3) | 3m 31s |
| Functional Tests (Debug, arm64, 0) | 10m 20s |
| Functional Tests (Debug, arm64, 4) | 8m 54s |
| Functional Tests (Release, x86_64, 1) | 3m 12s |
| Functional Tests (Release, x86_64, 0) | 5m 50s |
| Functional Tests (Debug, arm64, 7) | 10m 33s |
| Functional Tests (Debug, arm64, 9) | 10m 24s |
| Functional Tests (Release, x86_64, 9) | 5m 50s |
| Functional Tests (Release, x86_64, 2) | 4m 52s |
| Functional Tests (Debug, arm64, 2) | 9m 51s |
| Functional Tests (Release, x86_64, 3) | 3m 39s |
| Functional Tests (Release, x86_64, 5) | 6m 12s |
| Functional Tests (Release, x86_64, 8) | 5m 58s |
| Functional Tests (Release, arm64, 6) | 13m 29s |
| Functional Tests (Release, x86_64, 7) | 6m 33s |
| Functional Tests (Release, arm64, 1) | 5m 31s |
| Functional Tests (Release, arm64, 0) | 10m 10s |
| Functional Tests (Release, x86_64, 4) | 5m 21s |
| Functional Tests (Release, arm64, 2) | 10m 20s |
| Functional Tests (Release, arm64, 3) | 5m 54s |
| Functional Tests (Release, arm64, 9) | 10m 16s |
| Functional Tests (Release, arm64, 4) | 9m 23s |
| Functional Tests (Release, arm64, 8) | 10m 16s |
| Functional Tests (Release, arm64, 5) | 9m 45s |
| Functional Tests (Release, x86_64, 6) | 5m 10s |
| Functional Tests (Release, arm64, 7) | 9m 2s |
| Functional Tests | 0s |
| Build, Unit and Functional Tests Successful | 0s |
| Total | 5h 7m 49s |

@dscho dscho force-pushed the run-functional-tests-in-parallel branch from 1c0a86d to 2b6940e Compare September 16, 2025 10:39
@mjcheetham
Member

I wonder if it's also time to drop the Debug flag for Actions, since that is adding a lot of noise to the output.

This teaches `GVFS.FunctionalTests.exe` a new command-line option
that allows running only a slice of all tests at a time, via
`--slice=<m>,<n>`, where `m` is the zero-based index of the slice
to run, and `n` is the total number of slices to split the tests into.

The idea is to distribute the test cases evenly across the slices,
so that running all slices in parallel will finish in about the same
time as running just a single slice (see the sketch below).

This is slightly tricky because the test cases of classes within the
`EnlistmentPerFixture` namespace are known to rely on side effects of
previous test cases within the same class. Therefore, all test cases
of such a class must be run in the same slice.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
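
For illustration, here is a minimal sketch of how such slicing could work. This is hypothetical code, not the actual `GVFS.FunctionalTests` implementation; the helper names and the balancing strategy are assumptions:

```csharp
// Hypothetical sketch of the slicing idea -- not the actual implementation.
// Fixtures (test classes) are assigned to slices as whole units, so test
// cases that rely on side effects of earlier test cases in the same class
// always land in the same slice.
using System;
using System.Collections.Generic;
using System.Linq;

public static class TestSlicer
{
    // Parses "--slice=<m>,<n>" into (sliceIndex, totalSlices).
    public static (int Index, int Total) ParseSliceOption(string arg)
    {
        string[] parts = arg.Substring("--slice=".Length).Split(',');
        return (int.Parse(parts[0]), int.Parse(parts[1]));
    }

    // Orders fixtures by descending test count, then deals them out
    // round-robin; this keeps the per-slice test counts roughly even.
    public static List<string> FixturesForSlice(
        IReadOnlyDictionary<string, int> testsPerFixture, int index, int total)
    {
        return testsPerFixture
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal) // deterministic tie-break
            .Where((kv, i) => i % total == index)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

With `--slice=3,10`, for example, only the fixtures that `FixturesForSlice` assigns to slice 3 of 10 would run.
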
Following the example of git/git, which runs the Windows tests in
parallel to compensate for a very long run time, we split the Functional
Tests into 10 buckets and run them in parallel.

This is particularly useful in light of some flaky tests that frequently
need to be re-run. That way, the cost of re-running a failed test is
reduced by virtue of only having to re-run a _slice_ of the tests.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
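
On the workflow side, the matrix could look roughly like the following sketch; the job id, runner image, and step details are assumptions, not the actual `build.yaml`:

```yaml
# Sketch only -- the actual build.yaml may differ.
functional_test:
  strategy:
    fail-fast: false
    matrix:
      configuration: [ Debug, Release ]
      architecture: [ x86_64, arm64 ]
      slice: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
  name: Functional Tests (${{ matrix.configuration }}, ${{ matrix.architecture }}, ${{ matrix.slice }})
  runs-on: windows-latest  # placeholder
  steps:
    - name: Run functional tests
      run: GVFS.FunctionalTests.exe --slice=${{ matrix.slice }},10
```

With ten slices per configuration/architecture pair, re-running a failed job repeats only a tenth of the tests.
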
@dscho dscho force-pushed the run-functional-tests-in-parallel branch from 2b6940e to e62ad85 Compare September 16, 2025 12:49
@dscho
Member Author

dscho commented Sep 16, 2025

> I wonder if it's also time to drop the Debug flag for Actions, since that is adding a lot of noise to the output.

It is. Yet it helped me debug things ;-)

@dscho dscho marked this pull request as ready for review September 16, 2025 12:50
@dscho dscho enabled auto-merge September 16, 2025 13:17
Member

@mjcheetham mjcheetham left a comment


Just a nit

architecture: [ x86_64, arm64 ]

steps:
- name: Report success! (for easier identification of successful runs in the Checks Required for Pull Requests)
Member


Do we need such a long task name? :) Why not just `Success`?

Member Author


needs: [functional_test]

steps:
- name: Report success! (for easier identification of successful runs in the Checks Required for Pull Requests)
Member


Same here.

Member Author


In the GitHub UI, it is possible to specify Checks that need to be
successful before a PR can be merged. These "Checks" are actually the
names of jobs in the workflow. It would be much better to specify a
required _workflow_, but only workflow _jobs_ can be specified.

The Required Checks are currently:

  - Functional Tests (Debug, arm64)
  - Functional Tests (Debug, x86_64)
  - Functional Tests (Release, arm64)
  - Functional Tests (Release, x86_64)

So let's make sure that there are job names that match the recorded
names and that indicate a successful workflow run.

An alternative would be to add a single job with a defined name that
depends on all the other jobs, and that can compensate for the missing
GitHub feature to require a whole workflow instead of individual jobs
to have run successfully. But that would interfere with other PRs that
are currently in flight, and will therefore have to wait its turn in
a separate PR.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
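
Concretely, the work-around described above could look something like this sketch, where the job id and runner image are placeholders and only the display labels matter:

```yaml
# Sketch: a matrix job whose display labels reproduce the four recorded
# Required Checks, while the actual testing happens in the sliced jobs.
functional_test_summary:
  name: Functional Tests (${{ matrix.configuration }}, ${{ matrix.architecture }})
  needs: [ functional_test ]
  strategy:
    matrix:
      configuration: [ Debug, Release ]
      architecture: [ x86_64, arm64 ]
  runs-on: ubuntu-latest  # placeholder
  steps:
    - name: Report success!
      run: echo "All ${{ matrix.configuration }}/${{ matrix.architecture }} slices passed."
```
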
GitHub has a feature called "Require status checks before merging":
https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches#require-status-checks-before-merging

What they actually mean by "status checks" is "successful workflow job
runs". And what they _really_ actually mean is "_display labels_ of
successful workflow job runs". That is, workflow job runs are identified
by the label they display.

Example: In VFSforGit, there is currently a "required status check"
that reads "Functional Tests (Debug, arm64)". This is the label that is
displayed for the matrix job with vector {configuration: Debug,
architecture: arm64} in the "functional_test" workflow job.

This is quite restrictive! In VFSforGit, specifying those four matrix
job labels is a _work-around_ for the _actual_ requirement, namely that
the workflow defined in `build.yaml` succeeds. It just so happens that
those four matrix jobs are the leaf jobs, i.e. when they all succeed,
the workflow _run_ has succeeded. And vice versa, if the workflow run
failed, at least one of those four matrix jobs must have failed or not
even run.

Now that I have multiplied the matrix jobs even further by running the
Functional Tests in parallel, the _display labels_ (and the number) of
the matrix jobs have changed. As a consequence, to appease the "Require
status checks before merging" rule, I had to add _another_ set of
matrix jobs just to guarantee that the same four matrix job labels
exist. This is silly, because those four matrix jobs are not needed at
all for actually testing the code. They are just there to make GitHub
happy.

This commit prepares to change that. It adds a new workflow job
that fits the bill "if the job succeeded, the workflow run must have
succeeded as a whole, and vice versa". This new job will be made the
only "required status check" once this PR has been merged, and
then we can remove the silly "duplicate" matrix jobs again.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
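
Such an aggregate job could be sketched as follows; the job ids in `needs` are assumptions, and the actual `build.yaml` may differ:

```yaml
# Sketch of a single "the whole workflow run succeeded" job. Because it
# needs every leaf job, it succeeds if and only if the workflow run did.
workflow_succeeded:
  name: Build, Unit and Functional Tests Successful
  needs: [ validation, build_unit_test, functional_test ]  # assumed job ids
  runs-on: ubuntu-latest  # placeholder
  steps:
    - name: Report success!
      run: echo "All required jobs succeeded."
```
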
@dscho dscho force-pushed the run-functional-tests-in-parallel branch from e62ad85 to 040240b Compare September 16, 2025 14:28
@dscho dscho merged commit f47a37a into master Sep 17, 2025
97 of 98 checks passed
@dscho dscho deleted the run-functional-tests-in-parallel branch September 17, 2025 08:51
@mjcheetham mjcheetham mentioned this pull request Sep 18, 2025