Reassess mad/MEs split in tmad timing measurements
Something is clearly off in the mad/MEs timing split. It is most obvious for ggttggg, but there are smaller signs of it in the other processes too: the "mad" part decreases from SIMD/none to higher SIMD levels exactly as the MEs do. I would tend to exclude that the Fortran is vectorized (also because I think the compiler flags that select SIMD/none vs SIMD/avx2 are only set in cudacpp).
Note also that the "mad" part is VERY HIGH in C++/none (115.27 s for ggttggg) and much lower in Fortran-only (5.01 s)...
Most likely, some other part (in the bridge??) is now wrongly attributed to the Fortran "mad" component, while it is in fact C++ or CUDA.
It would be nice not only to assign it to the right component in the timing, but also to speed it up...
See the summary table for ggttggg:
https://github.com/madgraph5/madgraph4gpu/blob/d72071a332b98f06dfd7cc6f625748143e8d4c50/epochX/cudacpp/tmad/summaryTable_ggttggg.txt
=========================================================================================================
|            | mad                        | mad              | mad              | sa/brdg   | sa/full   |
---------------------------------------------------------------------------------------------------------
| ggttggg    | [sec] tot = mad + MEs      | [TOT/sec]        | [MEs/sec]        | [MEs/sec] | [MEs/sec] |
=========================================================================================================
| nevt/grid  | 8192                       | 8192             | 8192             | 8192      | 8192      |
| nevt total | 90112                      | 90112            | 90112            | 256*32*1  | 256*32*1  |
---------------------------------------------------------------------------------------------------------
| FORTRAN    | 1226.61 =   5.01 + 1221.60 | 7.35e+01 (= 1.0) | 7.38e+01 (= 1.0) | ---       | ---       |
| CPP/none   | 1576.95 = 115.27 + 1461.68 | 5.71e+01 (x 0.8) | 6.16e+01 (x 0.8) | 7.46e+01  | 7.45e+01  |
| CPP/sse4   |  841.07 =  63.89 +  777.19 | 1.07e+02 (x 1.5) | 1.16e+02 (x 1.6) | 1.40e+02  | 1.40e+02  |
| CPP/avx2   |  412.51 =  33.40 +  379.11 | 2.18e+02 (x 3.0) | 2.38e+02 (x 3.2) | 2.89e+02  | 2.89e+02  |
| CPP/512y   |  375.07 =  30.44 +  344.63 | 2.40e+02 (x 3.3) | 2.61e+02 (x 3.5) | 3.23e+02  | 3.22e+02  |
| CPP/512z   |  342.04 =  31.18 +  310.86 | 2.63e+02 (x 3.6) | 2.90e+02 (x 3.9) | 3.14e+02  | 3.15e+02  |
| CUDA/8192  |   19.54 =   7.47 +   12.08 | 4.61e+03 (x62.8) | 7.46e+03 (x101.) | 7.47e+03  | 9.16e+03  |
=========================================================================================================
| nevt/grid  |                                                                  | 16384     | 16384     |
| nevt total |                                                                  | 512*32*1  | 512*32*1  |
--------------                                                                  -------------------------
| CUDA/max   |                                                                  | 9.35e+03  | 9.52e+03  |
|            |                                                                  |           | (x129.)   |
==============                                                                  =========================
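To make the suspicious scaling easier to see, here is a minimal Python sketch that recomputes, from the ggttggg numbers in the table above, how much the "mad" and MEs parts each speed up relative to CPP/none, and what fraction of the total time is attributed to "mad". The (mad, MEs) values are copied from the table; the names `timings`, `speedups` and `mad_fraction` are made up for this illustration and are not part of the tmad scripts.

```python
# (mad, MEs) timings in seconds for ggttggg, copied from the summary table.
timings = {
    "FORTRAN": (5.01, 1221.60),
    "CPP/none": (115.27, 1461.68),
    "CPP/sse4": (63.89, 777.19),
    "CPP/avx2": (33.40, 379.11),
    "CPP/512y": (30.44, 344.63),
    "CPP/512z": (31.18, 310.86),
    "CUDA/8192": (7.47, 12.08),
}


def speedups(timings, baseline="CPP/none"):
    """Return {backend: (mad speedup, MEs speedup)} relative to the baseline."""
    base_mad, base_mes = timings[baseline]
    return {
        name: (base_mad / mad, base_mes / mes)
        for name, (mad, mes) in timings.items()
    }


def mad_fraction(timings):
    """Return {backend: fraction of the total time attributed to "mad"}."""
    return {name: mad / (mad + mes) for name, (mad, mes) in timings.items()}


# Print one line per backend: if "mad" were pure Fortran, its speedup
# relative to CPP/none would be expected to stay close to 1 for all C++ builds.
ratios = speedups(timings)
fractions = mad_fraction(timings)
for name in timings:
    mad_x, mes_x = ratios[name]
    print(f"{name:10s} mad x{mad_x:6.2f}  MEs x{mes_x:7.2f}  "
          f"mad/tot = {100 * fractions[name]:5.2f}%")
```

With these inputs, the "mad" part speeds up by roughly 3.5x from CPP/none to CPP/avx2 while the MEs speed up by roughly 3.9x, and "mad" is below 1% of the total in Fortran-only but above 7% in CPP/none. This is the pattern that suggests some C++ work is being counted as "mad".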