feat(verilator,gsim): add PGO_BOLT option #756
poemonsense
left a comment
Generally LGTM
Regarding a CI test, is it possible to use GitHub Runners for this?
There is no Besides, https://github.com/cyyself/ci-perf-test/actions/runs/19931456329/job/57144359206 https://github.com/cyyself/ci-perf-test/actions/runs/19931456329/job/57144359231 |
I have resolved the dependency of |
Wait for OpenXiangShan/xs-env#68 to be published. |
Just saw this comment. In that case, I think we can test it in DiffTest instead of adding the dependency to xs-env. We can test DiffTest on Ubuntu 24.04 and `apt install` the dependency. xs-env provides a Docker image with basic XiangShan support (not all dependencies). It should stay minimal because we use it in GitHub Actions runners, which do not have much storage space. Maybe we can simply add a test here in DiffTest and then let users know they can use BOLT (as added in this PR)? |
Indeed. But now we have already merged the I have also modified the script to detect |
The PGO data is generated by hardware branch tracing with nearly zero runtime overhead and is applied directly to the already compiled binary. This makes the PGO build really quick (~1 min) and avoids a full recompilation, so the build is much faster while still benefiting from the performance improvements provided by PGO.

The results show that it takes only 1:27 with `PGO_BOLT=1` and `PGO_MAX_CYCLE=100000` to build the Verilator emulator with PGO, compared to 10+ minutes with the traditional PGO flow. The performance is nearly the same: both PGO builds finish CoreMark in ~26 s on my 13900K with LLVM 22, while the non-PGO build takes 84 s.

This process requires Linux perf to collect the profile data and BOLT to apply the optimizations. When profiling with Linux perf, please ensure that the system has set `sysctl -w kernel.perf_event_paranoid=-1` to allow perf to collect the necessary data.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>
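For readers unfamiliar with the perf + BOLT flow described in the commit message, here is a minimal shell sketch of what such a flow typically looks like when run by hand. It is not the Makefile logic from this PR. The function names and the file names (`perf.data`, `perf.fdata`, the `.bolt` suffix) are hypothetical, and the `perf`, `perf2bolt`, and `llvm-bolt` invocations follow the general usage shown in the BOLT README rather than the exact flags this PR uses.

```shell
#!/bin/sh
# Illustrative sketch only; names below are assumptions, not taken from the PR.

# Return 0 when the given perf_event_paranoid level is at most -1,
# the setting the commit message asks for.
paranoid_ok() {
    [ "$1" -le -1 ]
}

# Print each required tool that is missing from PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Profile an already built binary with hardware branch records, convert the
# profile, and let BOLT rewrite the binary. No recompilation is involved.
# Note: the BOLT README recommends linking the input with -Wl,--emit-relocs.
bolt_optimize() {
    bin=$1
    shift
    perf record -e cycles:u -j any,u -o perf.data -- "$bin" "$@" &&
    perf2bolt -p perf.data -o perf.fdata "$bin" &&
    llvm-bolt "$bin" -o "$bin.bolt" -data=perf.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -dyno-stats
}
```

A preflight check could then be `paranoid_ok "$(cat /proc/sys/kernel/perf_event_paranoid)"` followed by `missing_tools perf perf2bolt llvm-bolt`. With this PR the whole sequence is meant to be driven by the build itself through `PGO_BOLT=1` and `PGO_MAX_CYCLE=100000`, so the manual steps above are only useful for understanding or debugging.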
|
If possible, write a short introduction for this feature and send it to BOSC as well as the README in DiffTest. Thanks. @klin02 I think we should improve the README to include more instructions on how to use DiffTest. When I was developing the footprint memory, I added the argument to the README. But we didn't ensure this in the past. Now with every new feature merged, we should check whether the README is updated. Basically the current README contains:
We should also include:
Please help improve the doc together. This will benefit open-source users and XiangShan developers in BOSC. |