Conversation

@dscho
Member

@dscho dscho commented Sep 15, 2025

From time to time, the Functional Tests fail, and re-running "fixes" the failures, i.e. those tests are flaky. When that happens, we have to invest a lot of time into the re-run: currently 35 minutes for the x86_64 jobs and 1 hour 15 minutes for the ARM64 ones. That's unwelcome friction.

Let's follow the playbook of git/git, which runs its test suite (a suite that, due to the heavy over-use of Bash scripting, is very, very slow on Windows) in parallel on Windows, slicing up the tests so that the individual test slices take roughly the same amount of time to run.

This PR is stacked on top of #1866 and will need its (empty) tip commit to be dropped to force GitHub to update the PR correctly (i.e. no longer show the commits of #1866 once they are merged).

@dscho dscho self-assigned this Sep 15, 2025
@dscho dscho force-pushed the run-functional-tests-in-parallel branch 2 times, most recently from 3ff37b9 to 1c0a86d Compare September 16, 2025 08:04
@dscho
Member Author

dscho commented Sep 16, 2025

Hmm. Something in those numbers does not add up:

  • Here is a non-parallelized run. It took 3h 35m 27s total time (even though the wall-clock time was "only" 1h 13m 41s).
  • Here is a parallelized run. It took 10h 30m 55s total time (even though the wall-clock time of the first run was only 19m 54s; I don't count the re-run because it skipped the build, of course).

The discrepancy is quite stark, and I cannot quite explain it. Sure, each of the now 40 (instead of 4) matrix jobs duplicates the setup (including "Install product"), but by my count that only takes around 1.5 minutes per job. And 36 * 1.5 = 54 minutes is much, much smaller than the ~7-hour difference in total time.

I cannot explain this discrepancy even when accounting for the difference in build times (4 and 6¼ minutes vs 3 and 5 minutes for the Debug and Release builds, respectively), which likely indicates a busy-neighbor problem: the runner VMs were simply slower during the parallelized run because more people ran workflows during that time of day.

@dscho
Member Author

dscho commented Sep 16, 2025

> Here is a parallelized run. It took 10h 30m 55s total time

Ooooh, here is something interesting: that page lists every job twice! I suspect a bug that double-counts jobs when there has been a re-run. Maybe GitHub double-bills jobs that were skipped as part of a "re-run failed jobs" because they had not failed in the first place?

The only job for which two different times are posted is "Functional Tests (Debug, x86_64, 5)", which makes sense because that one was the failed job in the first attempt: 6m15s vs 6m22s, which is "same enough" for all practical purposes.

So now the numbers make more sense, at least after dividing 10h30m55s by 2: 5h15m28s. The difference in total run time compared to the non-parallelized run is now more in the vicinity of 1h40m = 100 minutes, and dividing that by the 36 additional jobs means that each additional job adds about 3 minutes to the total run time. That is still quite a bit more than the 1.5 minutes I estimated, even accounting for every "Run functional tests" step adding 20 seconds before actually running the tests and 20 seconds afterwards, but the remaining difference can be chalked up to "noisy neighbors in busy times slowing everyone else down".

@dscho
Member Author

dscho commented Sep 16, 2025

One more data point to add: the current run (which, as of time of writing, just failed) reports the following (5h 7m 49s):

Jobs/total run time

| Job | Run time |
| --- | --- |
| Validation | 27s |
| Build and Unit Test (Release) | 3m 41s |
| Build and Unit Test (Debug) | 5m 7s |
| Functional Tests (Debug, x86_64, 1) | 3m 45s |
| Functional Tests (Debug, x86_64, 2) | 6m 50s |
| Functional Tests (Debug, x86_64, 0) | 5m 50s |
| Functional Tests (Debug, x86_64, 6) | 5m 55s |
| Functional Tests (Debug, x86_64, 8) | 5m 47s |
| Functional Tests (Debug, x86_64, 5) | 5m 35s |
| Functional Tests (Debug, arm64, 6) | 12m 52s |
| Functional Tests (Debug, arm64, 3) | 6m 0s |
| Functional Tests (Debug, x86_64, 4) | 5m 46s |
| Functional Tests (Debug, arm64, 1) | 6m 15s |
| Functional Tests (Debug, arm64, 8) | 9m 52s |
| Functional Tests (Debug, x86_64, 7) | 5m 39s |
| Functional Tests (Debug, arm64, 5) | 11m 27s |
| Functional Tests (Debug, x86_64, 9) | 6m 45s |
| Functional Tests (Debug, x86_64, 3) | 3m 31s |
| Functional Tests (Debug, arm64, 0) | 10m 20s |
| Functional Tests (Debug, arm64, 4) | 8m 54s |
| Functional Tests (Release, x86_64, 1) | 3m 12s |
| Functional Tests (Release, x86_64, 0) | 5m 50s |
| Functional Tests (Debug, arm64, 7) | 10m 33s |
| Functional Tests (Debug, arm64, 9) | 10m 24s |
| Functional Tests (Release, x86_64, 9) | 5m 50s |
| Functional Tests (Release, x86_64, 2) | 4m 52s |
| Functional Tests (Debug, arm64, 2) | 9m 51s |
| Functional Tests (Release, x86_64, 3) | 3m 39s |
| Functional Tests (Release, x86_64, 5) | 6m 12s |
| Functional Tests (Release, x86_64, 8) | 5m 58s |
| Functional Tests (Release, arm64, 6) | 13m 29s |
| Functional Tests (Release, x86_64, 7) | 6m 33s |
| Functional Tests (Release, arm64, 1) | 5m 31s |
| Functional Tests (Release, arm64, 0) | 10m 10s |
| Functional Tests (Release, x86_64, 4) | 5m 21s |
| Functional Tests (Release, arm64, 2) | 10m 20s |
| Functional Tests (Release, arm64, 3) | 5m 54s |
| Functional Tests (Release, arm64, 9) | 10m 16s |
| Functional Tests (Release, arm64, 4) | 9m 23s |
| Functional Tests (Release, arm64, 8) | 10m 16s |
| Functional Tests (Release, arm64, 5) | 9m 45s |
| Functional Tests (Release, x86_64, 6) | 5m 10s |
| Functional Tests (Release, arm64, 7) | 9m 2s |
| Functional Tests | 0s |
| Build, Unit and Functional Tests Successful | 0s |
| Total | 5h 7m 49s |

@dscho dscho force-pushed the run-functional-tests-in-parallel branch from 1c0a86d to 2b6940e Compare September 16, 2025 10:39
@mjcheetham
Member

I wonder if it's also time to drop the Debug flag for Actions, since that is adding a lot of noise to the output.

This teaches `GVFS.FunctionalTests.exe` a new command-line option
that allows running only a slice of all tests at a time, via
`--slice=<m>,<n>`, where `m` is the zero-based index of the slice
to run, and `n` is the total number of slices to split the tests into.

The idea is to distribute the test cases evenly across the slices,
so that running all slices in parallel will finish in about the same
time as running just a single slice (see the sketch below).

This is slightly tricky because the test cases of classes within the
`EnlistmentPerFixture` namespace are known to rely on side effects of
previous test cases within the same class. Therefore, all test cases
of such a class must be run in the same slice.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
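
For illustration, here is a minimal sketch of how such slicing could work. This is hypothetical code, not the actual `GVFS.FunctionalTests` implementation; the helper names and the balancing strategy are assumptions:

```csharp
// Hypothetical sketch of the slicing idea -- not the actual implementation.
// Fixtures (test classes) are assigned to slices as whole units, so test
// cases that rely on side effects of earlier test cases in the same class
// always land in the same slice.
using System;
using System.Collections.Generic;
using System.Linq;

public static class TestSlicer
{
    // Parses "--slice=<m>,<n>" into (sliceIndex, totalSlices).
    public static (int Index, int Total) ParseSliceOption(string arg)
    {
        string[] parts = arg.Substring("--slice=".Length).Split(',');
        return (int.Parse(parts[0]), int.Parse(parts[1]));
    }

    // Orders fixtures by descending test count, then deals them out
    // round-robin; this keeps the per-slice test counts roughly even.
    public static List<string> FixturesForSlice(
        IReadOnlyDictionary<string, int> testsPerFixture, int index, int total)
    {
        return testsPerFixture
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal) // deterministic tie-break
            .Where((kv, i) => i % total == index)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

With `--slice=3,10`, for example, only the fixtures that `FixturesForSlice` assigns to slice 3 of 10 would run.
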
Following the example of git/git, which runs the Windows tests in
parallel to compensate for a very long run time, we split the Functional
Tests into 10 buckets and run them in parallel.

This is particularly useful in light of some flaky tests that frequently
need to be re-run. That way, the cost of re-running a failed test is
reduced by virtue of only having to re-run a _slice_ of the tests.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
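
On the workflow side, the matrix could look roughly like the following sketch; the job id, runner image, and step details are assumptions, not the actual `build.yaml`:

```yaml
# Sketch only -- the actual build.yaml may differ.
functional_test:
  strategy:
    fail-fast: false
    matrix:
      configuration: [ Debug, Release ]
      architecture: [ x86_64, arm64 ]
      slice: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
  name: Functional Tests (${{ matrix.configuration }}, ${{ matrix.architecture }}, ${{ matrix.slice }})
  runs-on: windows-latest  # placeholder
  steps:
    - name: Run functional tests
      run: GVFS.FunctionalTests.exe --slice=${{ matrix.slice }},10
```

With ten slices per configuration/architecture pair, re-running a failed job repeats only a tenth of the tests.
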
@dscho dscho force-pushed the run-functional-tests-in-parallel branch from 2b6940e to e62ad85 Compare September 16, 2025 12:49
@dscho
Member Author

dscho commented Sep 16, 2025

> I wonder if it's also time to drop the Debug flag for Actions, since that is adding a lot of noise to the output.

It is. Yet it helped me debug things ;-)

@dscho dscho marked this pull request as ready for review September 16, 2025 12:50
@dscho dscho enabled auto-merge September 16, 2025 13:17
Member

@mjcheetham mjcheetham left a comment


Just a nit

architecture: [ x86_64, arm64 ]

steps:
- name: Report success! (for easier identification of successful runs in the Checks Required for Pull Requests)
Member


Do we need such a long task name? :) Why not just `Success`?

Member Author


needs: [functional_test]

steps:
- name: Report success! (for easier identification of successful runs in the Checks Required for Pull Requests)
Member


Same here.

Member Author


In the GitHub UI, it is possible to specify Checks that need to be
successful before a PR can be merged. These "Checks" are actually the
names of jobs in the workflow. It would be much better to specify a
required _workflow_, but only workflow _jobs_ can be specified.

The Required Checks are currently:

  - Functional Tests (Debug, arm64)
  - Functional Tests (Debug, x86_64)
  - Functional Tests (Release, arm64)
  - Functional Tests (Release, x86_64)

So let's make sure that there are job names that match the recorded
names and that indicate a successful workflow run.

An alternative would be to add a single job with a defined name that
depends on all the other jobs, and that can compensate for the missing
GitHub feature to require a whole workflow instead of individual jobs
to have run successfully. But that would interfere with other PRs that
are currently in flight, and will therefore have to wait its turn in
a separate PR.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
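
Concretely, the work-around described above could look something like this sketch, where the job id and runner image are placeholders and only the display labels matter:

```yaml
# Sketch: a matrix job whose display labels reproduce the four recorded
# Required Checks, while the actual testing happens in the sliced jobs.
functional_test_summary:
  name: Functional Tests (${{ matrix.configuration }}, ${{ matrix.architecture }})
  needs: [ functional_test ]
  strategy:
    matrix:
      configuration: [ Debug, Release ]
      architecture: [ x86_64, arm64 ]
  runs-on: ubuntu-latest  # placeholder
  steps:
    - name: Report success!
      run: echo "All ${{ matrix.configuration }}/${{ matrix.architecture }} slices passed."
```
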
GitHub has a feature called "Require status checks before merging":
https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches#require-status-checks-before-merging

What they actually mean by "status checks" is "successful workflow job
runs". And what they _really_ actually mean is "_display labels_ of
successful workflow job runs". That is, workflow job runs are identified
by the label they display.

Example: In VFSforGit, there is currently a "required status check"
that reads "Functional Tests (Debug, arm64)". This is the label that is
displayed for the matrix job with vector {configuration: Debug,
architecture: arm64} in the "functional_test" workflow job.

This is quite restrictive! In VFSforGit, specifying those four matrix
job labels is a _work-around_ for the _actual_ requirement, namely that
the workflow defined in `build.yaml` succeeds. It just so happens that
those four matrix jobs are the leaf jobs, i.e. when they all succeed,
the workflow _run_ has succeeded. And vice versa, if the workflow run
failed, at least one of those four matrix jobs must have failed or not
even run.

Now that I have multiplied the matrix jobs even further by running the
Functional Tests in parallel, the _display labels_ (and the number) of
the matrix jobs have changed. As a consequence, to appease the "Require
status checks before merging" rule, I had to add _another_ set of
matrix jobs just to guarantee that the same four matrix job labels
exist. This is silly, because those four matrix jobs are not needed at
all for actually testing the code. They are just there to make GitHub
happy.

This commit prepares to change that. It adds a new workflow job
that fits the bill "if the job succeeded, the workflow run must have
succeeded as a whole, and vice versa". This new job will be made the
only "required status check" once this PR has been merged, and
then we can remove the silly "duplicate" matrix jobs again.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
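
Such an aggregate job could be sketched as follows; the job ids in `needs` are assumptions, and the actual `build.yaml` may differ:

```yaml
# Sketch of a single "the whole workflow run succeeded" job. Because it
# needs every leaf job, it succeeds if and only if the workflow run did.
workflow_succeeded:
  name: Build, Unit and Functional Tests Successful
  needs: [ validation, build_unit_test, functional_test ]  # assumed job ids
  runs-on: ubuntu-latest  # placeholder
  steps:
    - name: Report success!
      run: echo "All required jobs succeeded."
```
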
@dscho dscho force-pushed the run-functional-tests-in-parallel branch from e62ad85 to 040240b Compare September 16, 2025 14:28
@dscho dscho merged commit f47a37a into master Sep 17, 2025
97 of 98 checks passed
@dscho dscho deleted the run-functional-tests-in-parallel branch September 17, 2025 08:51
@mjcheetham mjcheetham mentioned this pull request Sep 18, 2025