56 commits
9f0a76e
[hack_ihel4p2] ignore perf.data* in epochX/cudacpp/.gitignore
valassi Oct 20, 2025
f67f27e
[csm] add two patches (derived from branch paper25v2) for instrumenti…
valassi Nov 22, 2025
e56ef48
[csm] gg_ttggg.mad: instrument color sums with timers using patchS an…
valassi Nov 30, 2025
5167d5a
[csm] add PAPER25/colortimer.sh from branch paper25v2 (commit cd5d62860)
valassi Dec 6, 2025
c795cd5
[csm] PAPER25/colortimer.sh: add ggttggg SIMD scans with skipCuda opt…
valassi Dec 6, 2025
e9d80be
[csm] PAPER25/colortimer.sh: run PAPER25/simdparser.py to produce a s…
valassi Dec 6, 2025
a9b52d2
[csm] add raw and summary results from gg_ttggg on gold91 (using upst…
valassi Dec 6, 2025
ccc12b1
[csm] CODEGEN color_sum.cc patch1 (for colorsum mixed SIMD #1072): us…
valassi Nov 24, 2025
f8a9c96
[csm] regenerate gg_ttggg.mad with patch1 (use fptype2 deltaMEs insid…
valassi Dec 6, 2025
00bcbb0
[csm] rerun ggttggg SIMD tests with patch1 (use fptype2 deltaMEs insi…
valassi Dec 6, 2025
2743b00
[csm] gg_ttggg.mad color_sum.cc patch2a: precompute jampR_sv also for…
valassi Dec 6, 2025
d268a7a
[csm] retest ggttggg with patch2a: precompute jampR_sv also for mixed…
valassi Dec 6, 2025
6d59bbd
[csm] gg_ttggg.mad color_sum.cc patch2b: precompute jampR_sv for mixe…
valassi Dec 6, 2025
06a832a
[csm] retest ggttggg with patch2b: precompute jampR_sv for mixed/nosi…
valassi Dec 6, 2025
d7bdf46
[csm] gg_ttggg.mad color_sum.cc patch2c: precompute jampR_sv for mixe…
valassi Dec 7, 2025
804bb62
[csm] retest ggttggg with patch2c: precompute jampR_sv for mixed/nosi…
valassi Dec 7, 2025
17c72bd
[csm] gg_ttggg.mad colorsum TEST1 code/results (will revert): disable…
valassi Dec 7, 2025
f74d34b
[csm] gg_ttggg.mad colorsum revert TEST1 code/results
valassi Dec 7, 2025
7a23fb2
[csm] gg_ttggg.mad colorsum TEST2 code/results (will revert): disable…
valassi Dec 7, 2025
0099352
[csm] gg_ttggg.mad color_sum.cc complete patch2 (comment out TEST2 co…
valassi Dec 7, 2025
96995b5
[csm] retest ggttggg with patch2: precompute jampR_sv for m/nosimd an…
valassi Dec 7, 2025
d195f21
[csm] CODEGEN color_sum.cc patch2 (for colorsum mixed SIMD #1072): pr…
valassi Dec 7, 2025
9c20f96
[csm] regenerate gg_ttggg.mad with patch2 (precompute jampR_sv) and a…
valassi Dec 7, 2025
f54ec8f
[csm] gg_ttggg.mad: move fpvsplit/fpvmerge to a separate mgOnGpuVecto…
valassi Dec 7, 2025
c27d724
[csm] CODEGEN: move fpvsplit/fpvmerge to a separate mgOnGpuVectorsSpl…
valassi Dec 7, 2025
7ad51cc
[csm] regenerate gg_ttggg.mad with mgOnGpuVectorsSplitMerge.h and add…
valassi Dec 7, 2025
8bf25fd
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: reimplement fpvmerge u…
valassi Dec 6, 2025
dc06151
[csm] ggttggg results using intrinsics for fpvmerge: ~1% better in 51…
valassi Dec 7, 2025
83dc09b
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: test scalar fpvmerge
valassi Dec 7, 2025
b4ce7a2
[csm] ggttggg results using scalar fpvmerge: as fast as initlist? (au…
valassi Dec 7, 2025
819f8ca
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: test scalar fpvmerge, …
valassi Dec 7, 2025
2130953
[csm] ggttggg results using scalar fpvmerge without autovectorization…
valassi Dec 7, 2025
a1e372e
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: default initlist fpvme…
valassi Dec 7, 2025
6fe60f9
[csm] ggttggg results using fpvmerge/initlist without autovectorizati…
valassi Dec 7, 2025
4da5f64
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: back to defaults (init…
valassi Dec 7, 2025
99d19a4
[csm] ggttggg results using defaults (fpvmerge/initlist with autovect…
valassi Dec 7, 2025
32af60a
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: reimplement fpvmerge u…
valassi Dec 7, 2025
a9ac24d
[csm] ggttggg results using fpvmerge/experimentalSIMD: clearly worse …
valassi Dec 7, 2025
cf4293c
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: back to default initli…
valassi Dec 7, 2025
cdc75a7
[csm] ggttggg results using defaults again (fpvmerge/initlist with au…
valassi Dec 7, 2025
570c54d
[csm] CODEGEN mgOnGpuVectorsSplitMerge.h: clean up fpvmerge, add intr…
valassi Dec 7, 2025
2a4ff80
[csm] CODEGEN mgOnGpuVectorsSplitMerge.h: fix clang-format for unions
valassi Dec 7, 2025
8b32815
[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: fix clang-format for u…
valassi Dec 7, 2025
2ee77e5
[csm] regenerate gg_ttggg.mad with final mgOnGpuVectorsSplitMerge.h a…
valassi Dec 7, 2025
240b5f5
[csm] TMP ggttggg code/results using fpvmerge/initlist but without au…
valassi Dec 7, 2025
59db527
[csm] back to ggttggg code/results using defaults
valassi Dec 7, 2025
738f362
[csm] TMP ggttggg code/results using upstream/master but without auto…
valassi Dec 7, 2025
ef121ba
[csm] back to ggttggg code/results using defaults
valassi Dec 7, 2025
db5093f
[csm] CLEANUP: move to PAPER25 the two patches for instrumenting colo…
valassi Dec 7, 2025
0d942d4
[csm] CLEANUP: remove the PAPER25 directory
valassi Dec 7, 2025
e6a139e
[csm] regenerate all processes with colorsum/simd patches and a separ…
valassi Dec 7, 2025
ca36ab7
[csm] rerun 138 tput tests on LUMI - all ok
valassi Dec 7, 2025
8eaabcb
[csm] rerun 30 tmad tests on LUMI - all ok
valassi Dec 7, 2025
1ba0e92
[csm] go back from csm/LUMI to hack_ihel3p1/itscrd90 logs
valassi Dec 11, 2025
967e077
[csm] rerun 144 tput tests on itscrd90 - all ok
valassi Dec 7, 2025
d3ee3cb
[csm] ** COMPLETE CSM ** rerun 30 tmad tests on itscrd90 - all ok
valassi Dec 7, 2025
2 changes: 2 additions & 0 deletions epochX/cudacpp/.gitignore
@@ -6,3 +6,5 @@ run_[0-9]*
 events.lhe*
 
 py3_model.pkl
+
+perf.data*
@@ -3,9 +3,16 @@
 // Created by: A. Valassi (Sep 2025) for the MG5aMC CUDACPP plugin.
 // Further modified by: A. Valassi (2025) for the MG5aMC CUDACPP plugin.
 
+#include "mgOnGpuConfig.h"
+
+// For tests: disable autovectorization in gcc (in the cppnone mode only)
+//#ifndef MGONGPU_CPPSIMD
+//#pragma GCC optimize("no-tree-vectorize")
+//#endif
+
 #include "color_sum.h"
 
-#include "mgOnGpuConfig.h"
+#include "mgOnGpuVectorsSplitMerge.h"
 
 #include "MemoryAccessMatrixElements.h"

@@ -88,60 +95,69 @@ namespace mg5amcCpu
     // and also use constexpr to compute "2*" and "/colorDenom[icol]" once and for all at compile time:
     // we gain (not a factor 2...) in speed here as we only loop over the up diagonal part of the matrix.
     // Strangely, CUDA is slower instead, so keep the old implementation for the moment.
-    fptype_sv deltaMEs = { 0 };
-#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
-    fptype_sv deltaMEs_next = { 0 };
-    // Mixed mode: merge two neppV vectors into one neppV2 vector
+    fptype2_sv deltaMEs2 = { 0 };
+#if not defined MGONGPU_CPPSIMD or ( defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT )
+    // Mixed mode: must convert from double to float and possibly merge SIMD vectors
+    // Double/float mode without SIMD: pre-create jampR_sv/jampI_sv vectors (faster and more robust)
     fptype2_sv jampR_sv[ncolor];
     fptype2_sv jampI_sv[ncolor];
     for( int icol = 0; icol < ncolor; icol++ )
     {
+#if defined MGONGPU_CPPSIMD
+      // Mixed mode with SIMD: merge two neppV double vectors into one neppV2 float vector
       jampR_sv[icol] = fpvmerge( cxreal( allJamp_sv[icol] ), cxreal( allJamp_sv[ncolor + icol] ) );
       jampI_sv[icol] = fpvmerge( cximag( allJamp_sv[icol] ), cximag( allJamp_sv[ncolor + icol] ) );
+#else
+      // Mixed mode without SIMD: convert double to float
+      // Double/float mode without SIMD: pre-create jampR_sv/jampI_sv vectors (faster and more robust)
+      jampR_sv[icol] = cxreal( allJamp_sv[icol] );
+      jampI_sv[icol] = cximag( allJamp_sv[icol] );
+#endif
     }
 #else
+    // Double/float mode with SIMD: do not pre-create jampR_sv/jampI_sv vectors (would be slower)
     const cxtype_sv* jamp_sv = allJamp_sv;
 #endif
     // Loop over icol
     for( int icol = 0; icol < ncolor; icol++ )
     {
       // Diagonal terms
-#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
-      fptype2_sv& jampRi_sv = jampR_sv[icol];
-      fptype2_sv& jampIi_sv = jampI_sv[icol];
+#if not defined MGONGPU_CPPSIMD or ( defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT )
+      const fptype2_sv& jampRi_sv = jampR_sv[icol];
+      const fptype2_sv& jampIi_sv = jampI_sv[icol];
 #else
-      fptype2_sv jampRi_sv = (fptype2_sv)( cxreal( jamp_sv[icol] ) );
-      fptype2_sv jampIi_sv = (fptype2_sv)( cximag( jamp_sv[icol] ) );
+      const fptype2_sv& jampRi_sv = cxreal( jamp_sv[icol] );
+      const fptype2_sv& jampIi_sv = cximag( jamp_sv[icol] );
 #endif
       fptype2_sv ztempR_sv = cf2.value[icol][icol] * jampRi_sv;
       fptype2_sv ztempI_sv = cf2.value[icol][icol] * jampIi_sv;
       // Loop over jcol
       for( int jcol = icol + 1; jcol < ncolor; jcol++ )
       {
         // Off-diagonal terms
-#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
-        fptype2_sv& jampRj_sv = jampR_sv[jcol];
-        fptype2_sv& jampIj_sv = jampI_sv[jcol];
+#if not defined MGONGPU_CPPSIMD or ( defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT )
+        const fptype2_sv& jampRj_sv = jampR_sv[jcol];
+        const fptype2_sv& jampIj_sv = jampI_sv[jcol];
 #else
-        fptype2_sv jampRj_sv = (fptype2_sv)( cxreal( jamp_sv[jcol] ) );
-        fptype2_sv jampIj_sv = (fptype2_sv)( cximag( jamp_sv[jcol] ) );
+        const fptype2_sv& jampRj_sv = cxreal( jamp_sv[jcol] );
+        const fptype2_sv& jampIj_sv = cximag( jamp_sv[jcol] );
 #endif
         ztempR_sv += cf2.value[icol][jcol] * jampRj_sv;
         ztempI_sv += cf2.value[icol][jcol] * jampIj_sv;
       }
-      fptype2_sv deltaMEs2 = ( jampRi_sv * ztempR_sv + jampIi_sv * ztempI_sv ); // may underflow #831
-#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
-      deltaMEs += fpvsplit0( deltaMEs2 );
-      deltaMEs_next += fpvsplit1( deltaMEs2 );
-#else
-      deltaMEs += deltaMEs2;
-#endif
+      deltaMEs2 += ( jampRi_sv * ztempR_sv + jampIi_sv * ztempI_sv ); // may underflow #831
     }
     // *** STORE THE RESULTS ***
     using E_ACCESS = HostAccessMatrixElements; // non-trivial access: buffer includes all events
     fptype* MEs = E_ACCESS::ieventAccessRecord( allMEs, ievt0 );
     // NB: color_sum ADDS |M|^2 for one helicity to the running sum of |M|^2 over helicities for the given event(s)
     fptype_sv& MEs_sv = E_ACCESS::kernelAccess( MEs );
+#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
+    fptype_sv deltaMEs = fpvsplit0( deltaMEs2 );
+    fptype_sv deltaMEs_next = fpvsplit1( deltaMEs2 );
+#else
+    fptype_sv deltaMEs = deltaMEs2;
+#endif
     MEs_sv += deltaMEs; // fix #435
 #if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
     fptype* MEs_next = E_ACCESS::ieventAccessRecord( allMEs, ievt0 + neppV );
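
Reviewer note (illustrative only, not part of the diff): the comments at the top of this hunk describe folding the symmetric color matrix onto its upper triangle, with the "2*" and the division by the denominator precomputed at compile time. The scalar sketch below shows that idea under stated assumptions: the 3x3 `colorMatrix`, the single common `colorDenom`, the struct name `TriangularNormalizedColorMatrix` and the helper functions are hypothetical stand-ins chosen for the example, not the generated code, and plain `double` replaces the SIMD types.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical toy inputs: 3 color flows, a symmetric color matrix, one common denominator.
constexpr int ncolor = 3;
constexpr double colorDenom = 3.;
constexpr double colorMatrix[ncolor][ncolor] = { { 16, -2, -2 }, { -2, 16, -2 }, { -2, -2, 16 } };

// Folded matrix built at compile time: diagonal terms once, upper off-diagonal terms doubled,
// everything already divided by the denominator (lower triangle is left at zero and never read).
struct TriangularNormalizedColorMatrix
{
  constexpr TriangularNormalizedColorMatrix() : value()
  {
    for( int icol = 0; icol < ncolor; icol++ )
    {
      value[icol][icol] = colorMatrix[icol][icol] / colorDenom;
      for( int jcol = icol + 1; jcol < ncolor; jcol++ )
        value[icol][jcol] = 2 * colorMatrix[icol][jcol] / colorDenom;
    }
  }
  double value[ncolor][ncolor];
};
constexpr TriangularNormalizedColorMatrix cf2{};

// Reference form: full double loop over the symmetric matrix.
inline double colorSumFull( const double* jampR, const double* jampI )
{
  double me = 0;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    double ztempR = 0, ztempI = 0;
    for( int jcol = 0; jcol < ncolor; jcol++ )
    {
      ztempR += colorMatrix[icol][jcol] * jampR[jcol];
      ztempI += colorMatrix[icol][jcol] * jampI[jcol];
    }
    me += ( jampR[icol] * ztempR + jampI[icol] * ztempI ) / colorDenom;
  }
  return me;
}

// Folded form: loop over the diagonal and the upper triangle only.
inline double colorSumTriangular( const double* jampR, const double* jampI )
{
  double me = 0;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    double ztempR = cf2.value[icol][icol] * jampR[icol];
    double ztempI = cf2.value[icol][icol] * jampI[icol];
    for( int jcol = icol + 1; jcol < ncolor; jcol++ )
    {
      ztempR += cf2.value[icol][jcol] * jampR[jcol];
      ztempI += cf2.value[icol][jcol] * jampI[jcol];
    }
    me += jampR[icol] * ztempR + jampI[icol] * ztempI;
  }
  return me;
}

// Usage check: the two forms agree up to floating-point rounding.
inline void checkColorSums()
{
  const double jampR[ncolor] = { 1., 2., -0.5 };
  const double jampI[ncolor] = { 0.25, -1., 3. };
  assert( std::fabs( colorSumFull( jampR, jampI ) - colorSumTriangular( jampR, jampI ) ) < 1e-9 );
}
```

The folding is only valid because the matrix is symmetric and, in this toy case, the denominator is the same for every row. The inner loop runs over roughly half as many terms as the full form.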
@@ -1,7 +1,7 @@
-// Copyright (C) 2020-2024 CERN and UCLouvain.
+// Copyright (C) 2020-2025 CERN and UCLouvain.
 // Licensed under the GNU Lesser General Public License (version 3 or later).
 // Created by: A. Valassi (Nov 2020) for the MG5aMC CUDACPP plugin.
-// Further modified by: S. Roiser, A. Valassi, Z. Wettersten (2020-2024) for the MG5aMC CUDACPP plugin.
+// Further modified by: S. Roiser, A. Valassi, Z. Wettersten (2020-2025) for the MG5aMC CUDACPP plugin.
 
 #ifndef MGONGPUVECTORS_H
 #define MGONGPUVECTORS_H 1
@@ -744,92 +744,6 @@ namespace mg5amcCpu
 
 #endif // #ifdef MGONGPU_CPPSIMD
 
-  //--------------------------------------------------------------------------
-
-  // Functions and operators for fptype2_v
-
-#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
-
-  inline fptype2_v
-  fpvmerge( const fptype_v& v1, const fptype_v& v2 )
-  {
-    // This code is not very efficient! It makes mixed precision FFV/color not faster than double on C++ (#537).
-    // I considered various alternatives, including
-    // - in gcc12 and clang, __builtin_shufflevector (works with different vector lengths, BUT the same fptype...)
-    // - casting vector(4)double to vector(4)float and then assigning via reinterpret_cast... but how to do the cast?
-    // Probably the best solution is intrinsics?
-    // - see https://stackoverflow.com/questions/5139363
-    // - see https://stackoverflow.com/questions/54518744
-    /*
-    fptype2_v out;
-    for( int ieppV = 0; ieppV < neppV; ieppV++ )
-    {
-      out[ieppV] = v1[ieppV];
-      out[ieppV+neppV] = v2[ieppV];
-    }
-    return out;
-    */
-#if MGONGPU_CPPSIMD == 2
-    fptype2_v out =
-      { (fptype2)v1[0], (fptype2)v1[1], (fptype2)v2[0], (fptype2)v2[1] };
-#elif MGONGPU_CPPSIMD == 4
-    fptype2_v out =
-      { (fptype2)v1[0], (fptype2)v1[1], (fptype2)v1[2], (fptype2)v1[3], (fptype2)v2[0], (fptype2)v2[1], (fptype2)v2[2], (fptype2)v2[3] };
-#elif MGONGPU_CPPSIMD == 8
-    fptype2_v out =
-      { (fptype2)v1[0], (fptype2)v1[1], (fptype2)v1[2], (fptype2)v1[3], (fptype2)v1[4], (fptype2)v1[5], (fptype2)v1[6], (fptype2)v1[7], (fptype2)v2[0], (fptype2)v2[1], (fptype2)v2[2], (fptype2)v2[3], (fptype2)v2[4], (fptype2)v2[5], (fptype2)v2[6], (fptype2)v2[7] };
-#endif
-    return out;
-  }
-
-  inline fptype_v
-  fpvsplit0( const fptype2_v& v )
-  {
-    /*
-    fptype_v out = {}; // see #594
-    for( int ieppV = 0; ieppV < neppV; ieppV++ )
-    {
-      out[ieppV] = v[ieppV];
-    }
-    */
-#if MGONGPU_CPPSIMD == 2
-    fptype_v out =
-      { (fptype)v[0], (fptype)v[1] };
-#elif MGONGPU_CPPSIMD == 4
-    fptype_v out =
-      { (fptype)v[0], (fptype)v[1], (fptype)v[2], (fptype)v[3] };
-#elif MGONGPU_CPPSIMD == 8
-    fptype_v out =
-      { (fptype)v[0], (fptype)v[1], (fptype)v[2], (fptype)v[3], (fptype)v[4], (fptype)v[5], (fptype)v[6], (fptype)v[7] };
-#endif
-    return out;
-  }
-
-  inline fptype_v
-  fpvsplit1( const fptype2_v& v )
-  {
-    /*
-    fptype_v out = {}; // see #594
-    for( int ieppV = 0; ieppV < neppV; ieppV++ )
-    {
-      out[ieppV] = v[ieppV+neppV];
-    }
-    */
-#if MGONGPU_CPPSIMD == 2
-    fptype_v out =
-      { (fptype)v[2], (fptype)v[3] };
-#elif MGONGPU_CPPSIMD == 4
-    fptype_v out =
-      { (fptype)v[4], (fptype)v[5], (fptype)v[6], (fptype)v[7] };
-#elif MGONGPU_CPPSIMD == 8
-    fptype_v out =
-      { (fptype)v[8], (fptype)v[9], (fptype)v[10], (fptype)v[11], (fptype)v[12], (fptype)v[13], (fptype)v[14], (fptype)v[15] };
-#endif
-    return out;
-  }
-
-#endif // #if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
-
 #endif // #ifndef MGONGPUCPP_GPUIMPL
 
 //==========================================================================
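
Reviewer note (illustrative only, not part of the diff): the functions removed from this header merge two double-precision SIMD vectors into one float vector of twice the length, and split it back. The sketch below reproduces those semantics in a standalone form, following the loop variants that appear commented out in the removed code. It assumes the GCC/Clang `vector_size` extension and uses stand-in typedefs for the two-lane case (`neppV = 2`, matching `MGONGPU_CPPSIMD == 2`); it makes no claim about how the new mgOnGpuVectorsSplitMerge.h, which is not shown here, implements them.

```cpp
#include <cassert>

// Stand-in types (assumption): 2 doubles and 4 floats, both 16 bytes wide,
// declared with the GCC/Clang vector_size extension.
typedef double fptype;
typedef float fptype2;
constexpr int neppV = 2;
typedef fptype fptype_v __attribute__( ( vector_size( 16 ) ) );   // neppV doubles
typedef fptype2 fptype2_v __attribute__( ( vector_size( 16 ) ) ); // 2 * neppV floats
static_assert( sizeof( fptype_v ) == neppV * sizeof( fptype ), "unexpected fptype_v size" );
static_assert( sizeof( fptype2_v ) == 2 * neppV * sizeof( fptype2 ), "unexpected fptype2_v size" );

// Merge: narrow v1 into the low half and v2 into the high half of one float vector.
inline fptype2_v fpvmerge( const fptype_v& v1, const fptype_v& v2 )
{
  fptype2_v out = {};
  for( int ieppV = 0; ieppV < neppV; ieppV++ )
  {
    out[ieppV] = (fptype2)v1[ieppV];
    out[ieppV + neppV] = (fptype2)v2[ieppV];
  }
  return out;
}

// Split: widen the low half of a float vector back to doubles.
inline fptype_v fpvsplit0( const fptype2_v& v )
{
  fptype_v out = {};
  for( int ieppV = 0; ieppV < neppV; ieppV++ ) out[ieppV] = (fptype)v[ieppV];
  return out;
}

// Split: widen the high half of a float vector back to doubles.
inline fptype_v fpvsplit1( const fptype2_v& v )
{
  fptype_v out = {};
  for( int ieppV = 0; ieppV < neppV; ieppV++ ) out[ieppV] = (fptype)v[ieppV + neppV];
  return out;
}

// Usage check: values exactly representable in float survive the merge/split round trip.
inline void roundTripDemo()
{
  const fptype_v a = { 1.5, -2. };
  const fptype_v b = { 3.25, 4. };
  const fptype2_v m = fpvmerge( a, b );
  const fptype_v lo = fpvsplit0( m );
  const fptype_v hi = fpvsplit1( m );
  for( int i = 0; i < neppV; i++ ) assert( lo[i] == a[i] && hi[i] == b[i] );
}
```

Values that are not exactly representable in single precision lose their low-order bits in `fpvmerge` and do not round-trip exactly. That precision loss is the trade-off accepted in mixed mode.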