feat(verilator,gsim): add PGO_BOLT option #756

Merged
poemonsense merged 1 commit into master from tmp_pgo_bolt
Dec 5, 2025

@cyyself
Member

@cyyself cyyself commented Dec 4, 2025

BOLT applies PGO data, collected through hardware branch tracing with nearly zero runtime overhead, directly to the already compiled binary. This makes the PGO build really quick (~1 min) and avoids a full recompilation, so the build is much faster while still benefiting from the performance improvements provided by PGO.

With PGO_BOLT=1 and PGO_MAX_CYCLE=100000, building the Verilator emulator with PGO takes only 1:27, compared to 10+ minutes traditionally. The performance is nearly the same: both PGO builds finish CoreMark in ~26s on my 13900K with LLVM 22, while the non-PGO build takes 84s.
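
As a quick sanity check on these figures, the arithmetic can be spelled out. This is only a sketch: it treats "10+ minutes" as exactly 600 s, so the build ratio is a lower bound, and the variable names are made up for illustration.

```python
# Figures quoted in the PR description (13900K, LLVM 22).
bolt_build_s = 1 * 60 + 27      # 1:27 with PGO_BOLT=1 and PGO_MAX_CYCLE=100000
traditional_build_s = 10 * 60   # "10+ minutes": assumed lower bound of 600 s
pgo_coremark_s = 26             # approximate, for both PGO flavours
plain_coremark_s = 84           # non-PGO build

# Ratio of build times and of CoreMark simulation times.
build_ratio = traditional_build_s / bolt_build_s
run_speedup = plain_coremark_s / pgo_coremark_s

print(f"PGO build is at least {build_ratio:.1f}x faster with BOLT")
print(f"CoreMark simulation is about {run_speedup:.1f}x faster with PGO")
```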

This process requires Linux-perf to collect the profile data, and BOLT to apply the optimizations. When profiling with Linux-perf, please ensure that the system has set sysctl -w kernel.perf_event_paranoid=-1 to allow perf to collect the necessary data.
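
For readers unfamiliar with the flow, here is a minimal sketch of the perf-then-BOLT pipeline described above. It only builds the command lines and checks the sysctl value; it does not run anything. The helper names, the `./build/emu` path, and the workload arguments are assumptions for illustration, and the optimization flags follow the general BOLT documentation rather than the Makefile rules added in this PR, which may differ.

```python
import shlex
from pathlib import Path


def perf_paranoid_level(path="/proc/sys/kernel/perf_event_paranoid"):
    """Return the current perf_event_paranoid level, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def perf_bolt_commands(binary, workload_args, workdir="."):
    """Build (not run) the perf -> perf2bolt -> llvm-bolt command lines."""
    perf_data = str(Path(workdir) / "perf.data")
    fdata = str(Path(workdir) / "perf.fdata")
    return [
        # Sample user-space cycles and record taken branches in hardware.
        ["perf", "record", "-e", "cycles:u", "-j", "any,u",
         "-o", perf_data, "--", binary, *workload_args],
        # Convert the perf samples into BOLT's profile format.
        ["perf2bolt", "-p", perf_data, "-o", fdata, binary],
        # Rewrite the already compiled binary using the profile.
        ["llvm-bolt", binary, "-o", binary + ".bolt", f"-data={fdata}",
         "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
         "-split-functions", "-split-all-cold", "-dyno-stats"],
    ]


if __name__ == "__main__":
    level = perf_paranoid_level()
    if level != -1:
        print(f"perf_event_paranoid is {level}; the PR asks for -1")
    # Hypothetical emulator path and workload, for illustration only.
    for cmd in perf_bolt_commands("./build/emu", ["-i", "coremark.bin"]):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to pass them to `subprocess.run` later without shell-quoting problems.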

@cyyself cyyself requested a review from poemonsense December 4, 2025 09:17
poemonsense previously approved these changes Dec 4, 2025
Member

@poemonsense poemonsense left a comment

Generally LGTM

Regarding a CI test, is it possible to use GitHub Runners for this?

@cyyself
Member Author

cyyself commented Dec 4, 2025

> Generally LGTM
>
> Regarding a CI test, is it possible to use GitHub Runners for this?

There is no llvm-bolt package in the official Ubuntu 22.04 repo. Thus, the simplest way to get llvm-bolt is to bump the base image of our run env; otherwise, we will need to compile LLVM ourselves and install it in our env.

Besides, perf does not work in GitHub CI. I have tested perf stat and perf record, and neither of them is supported:

https://github.com/cyyself/ci-perf-test/actions/runs/19931456329/job/57144359206

https://github.com/cyyself/ci-perf-test/actions/runs/19931456329/job/57144359231

@cyyself cyyself force-pushed the tmp_pgo_bolt branch 2 times, most recently from 83382a6 to 36b34d9 on December 4, 2025 15:19
@cyyself
Member Author

cyyself commented Dec 4, 2025

I have resolved the dependency on linux-perf by providing a fallback path that uses BOLT instrumentation. The fallback is still much faster than using -fprofile-generate: it takes only 4:37 on my local 13900K with LLVM 22. But we still need a way to install llvm-bolt in CI to test it. Should we bump the CI env to Ubuntu 24.04?
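
A rough sketch of what such an instrumentation fallback can look like is given below. It relies on the `-instrument` and `-instrumentation-file` options from the general BOLT documentation; the helper name, the `.inst` and `.bolt` suffixes, and the example paths are hypothetical, and the actual rules in this PR may be organized differently.

```python
import shlex


def bolt_instrumentation_commands(binary, workload_args, fdata="prof.fdata"):
    """Build (not run) the commands for a perf-free BOLT profile."""
    instrumented = binary + ".inst"
    return [
        # 1. Rewrite the binary with profiling counters added.
        ["llvm-bolt", binary, "-instrument",
         f"-instrumentation-file={fdata}", "-o", instrumented],
        # 2. Run the workload; the instrumented binary writes `fdata` on exit.
        [instrumented, *workload_args],
        # 3. Optimize the original binary using the collected profile.
        ["llvm-bolt", binary, "-o", binary + ".bolt", f"-data={fdata}",
         "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
         "-split-functions", "-split-all-cold"],
    ]


if __name__ == "__main__":
    # Hypothetical emulator path and workload, for illustration only.
    for cmd in bolt_instrumentation_commands("./build/emu", ["-i", "coremark.bin"]):
        print(shlex.join(cmd))
```

Unlike hardware branch sampling, step 2 runs a slower instrumented binary, which is consistent with the longer build time reported for the fallback.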

@cyyself cyyself force-pushed the tmp_pgo_bolt branch 2 times, most recently from 4619cdb to eabf2f4 on December 4, 2025 15:28
@cyyself
Member Author

cyyself commented Dec 4, 2025

Waiting for OpenXiangShan/xs-env#68 to be published.

@poemonsense
Member

> I have resolved the dependency of linux-perf by providing a fallback path using bolt instrumentation. It's still much faster than using -fprofile-generate, which only takes 4:37 on my local 13900K with LLVM-22. But here we need a way to install llvm-bolt in CI to test it. Should we bump the CI env to Ubuntu 24.04?

I just saw this comment.

Then I think we can test it in DiffTest instead of adding the dependency to xs-env. We can test DiffTest in Ubuntu 24.04 and apt install the dependency.

xs-env provides a Docker image for basic XiangShan support (not all dependencies). It should be kept simple, as we use it in GitHub Actions runners, which do not have much storage space.

Maybe we can simply add a test here in DiffTest? Then let the user know they can use bolt (as added in this PR)

@cyyself
Member Author

cyyself commented Dec 4, 2025

> > I have resolved the dependency of linux-perf by providing a fallback path using bolt instrumentation. It's still much faster than using -fprofile-generate, which only takes 4:37 on my local 13900K with LLVM-22. But here we need a way to install llvm-bolt in CI to test it. Should we bump the CI env to Ubuntu 24.04?
>
> Just see this comment.
>
> Then I think we can test it in DiffTest instead of adding the dependency to xs-env. We can test DiffTest in Ubuntu 24.04 and apt install the dependency.
>
> xs-env provides a docker for basic xiangshan support (not all dependencies). It should be simplified as we are using it in GitHub Action Runners which do not have much storage space.
>
> Maybe we can simply add a test here in DiffTest? Then let the user know they can use bolt (as added in this PR)

Indeed. But we have already merged the apt install llvm-bolt change in xs-env, and I have also written a test here. So we only need to wait for it to be published; rerunning the CI afterwards should be OK.

I have also modified the script to detect llvm-bolt. When it is available, the script sets PGO_BOLT=1, so the feature can be used seamlessly after bumping DiffTest.
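
The detection logic can be pictured roughly as follows. This is not the script from the PR, only a minimal sketch of the same idea; the function names and the `emu` make target are assumptions.

```python
import shutil


def pgo_make_variables(which=shutil.which):
    """Return extra make variables: PGO_BOLT=1 only if llvm-bolt is on PATH."""
    variables = {}
    if which("llvm-bolt") is not None:
        variables["PGO_BOLT"] = "1"
    return variables


def make_command(target="emu", which=shutil.which):
    """Assemble the make invocation, adding PGO_BOLT=1 when it is usable."""
    extra = [f"{key}={value}" for key, value in pgo_make_variables(which).items()]
    return ["make", target, *extra]


if __name__ == "__main__":
    print(" ".join(make_command()))
```

Passing `which` as a parameter keeps the lookup replaceable, so the behaviour with and without llvm-bolt can be checked on any machine.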

poemonsense previously approved these changes Dec 4, 2025
@cyyself cyyself force-pushed the tmp_pgo_bolt branch 2 times, most recently from 2a41930 to 1d709f3 on December 4, 2025 18:37
BOLT applies PGO data, collected through hardware branch tracing with
nearly zero runtime overhead, directly to the already compiled binary.
This makes the PGO build really quick (~1 min) and avoids a full
recompilation, so the build is much faster while still benefiting from
the performance improvements provided by PGO.

With `PGO_BOLT=1` and `PGO_MAX_CYCLE=100000`, building the Verilator
emulator with PGO takes only 1:27, compared to 10+ minutes
traditionally. The performance is nearly the same: both PGO builds
finish CoreMark in ~26s on my 13900K with LLVM 22, while the non-PGO
build takes 84s.

This process requires Linux-perf to collect the profile data, and BOLT
to apply the optimizations. When profiling with Linux-perf, please
ensure that the system has set `sysctl -w kernel.perf_event_paranoid=-1`
to allow perf to collect the necessary data.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>
@poemonsense
Member

If possible, write a short introduction for this feature and send it to BOSC, and add it to the README in DiffTest as well. Thanks.

@klin02 I think we should improve the README to include more instructions on how to use DiffTest. When I was developing the footprint memory, I added the argument to the README. But we didn't ensure this in the past. Now with every new feature merged, we should check whether the README is updated.

Basically the current README contains:

  • support for different CPUs (how to adapt it to a new CPU), including the mill/scala integration and DiffTest Bundle interfaces.

  • Run-time arguments: only footprint memory now

  • some plugins for DiffTest

We should also include:

  • support for different RTL simulators: software simulators (Verilator, GSIM, VCS, GalaxSim), emulators (Cadence Palladium), and FPGAs (Xilinx FPGAs)

  • build-time arguments (those defined in Makefile)

  • more run-time arguments (as well as their support on different platforms)

  • different checkers in DiffTest (which I will refactor into multiple separate files in the near future)

Please help improve the doc together. This will benefit open-source users and XiangShan developers in BOSC.

@poemonsense poemonsense merged commit 93007b7 into master Dec 5, 2025
5 checks passed
@poemonsense poemonsense deleted the tmp_pgo_bolt branch December 5, 2025 07:22