Conversation
I should note that the timings above are for Nx=1600 and Ny=1275.
Ok, after coming back to this, my timing results are now looking much better! I wonder if the specific compute node I am on is better, or if I was doing something wrong before. Either way, we are now seeing better speedups! Here is a little table:

threads, timing, speedup

Generally, increasing the thread count past this point isn't actually useful.
Ok, I am having trouble getting replicable results, unfortunately. Now when I connect to the academic cluster, I get the performance below for the exact same experiment as in the previous comment:

threads, timing, speedup
1, 0.51881068991497159, 1

This is discouraging, because without replicability we can't properly test or benchmark our implementations. So we need to sort this out ASAP.
Ok, some really great results! I switched to AWS to sanity check what was happening with the performance and ran the same experiment with much better success! On a t2.2xlarge instance, I was able to get nearly perfect linear scaling for the loop 3 runtime!! Phew, this is a huge relief.

threads, timing, speedup

Even though this is a single loop, speeding it up dramatically impacts the overall runtime of the code. For a single time step in serial we have a runtime of 3.93 s, but with 8 threads I am getting 2.33 s, which is a 1.69x speedup! Pretty awesome for a single parallel loop.
I am also parallelizing loop 5 in the calc_explicit call. With 8 threads, its timing drops from 0.11551769900006548 s to 0.015365126999995482 s, a 7.5x speedup, and this brings the overall timing of a single iteration down to 1.98 s, which gives an overall speedup of 3.93/1.98 = 1.98x for the entire code!
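For context, the change is just the standard parallel-do pattern over the x modes. A minimal sketch, assuming loop 5 is a per-mode explicit update of this shape; the names explicit_rhs, nonlin, and lap_term are placeholders, not the real calc_explicit variables:

subroutine update_explicit_sketch(explicit_rhs, nonlin, lap_term, Nx, Ny)
  ! Sketch only: placeholder names, assuming a per-mode explicit update.
  implicit none
  integer, intent(in) :: Nx, Ny
  integer, parameter :: dp = kind(1.0d0)
  real(dp), intent(out) :: explicit_rhs(Ny,Nx)
  real(dp), intent(in)  :: nonlin(Ny,Nx), lap_term(Ny,Nx)
  integer :: it

  !$OMP PARALLEL DO num_threads(8)
  do it = 1, Nx
     ! each column (x wavenumber) is updated independently
     explicit_rhs(:,it) = nonlin(:,it) + lap_term(:,it)
  end do
  !$OMP END PARALLEL DO
end subroutine update_explicit_sketch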
Running 50 iterations, we get the following timing output: in serial, real 3m34.483s, and in parallel, real 1m51.155s. So we are seeing a 214 s / 111 s = 1.93x speedup. I also verified correctness by checking the Nusselt numbers, and they are identical for both runs.
Ok, I made each loop in calc_explicit parallel and got the overall runtime down to real 1m39.510s. So we have a 214 s / 99 s = 2.16x speedup!
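Side note on structure (sketch only; the array names below are placeholders, not the real calc_explicit arrays): when several independent loops are parallelized back to back, putting them inside one PARALLEL region with multiple DO constructs forks the thread team once instead of once per loop.

subroutine calc_explicit_sketch(rhs_T, rhs_phi, nl_T, nl_phi, diff_T, diff_phi, Nx, Ny)
  ! Sketch only: placeholder arrays standing in for the calc_explicit loops.
  ! A single PARALLEL region with several DO constructs forks the thread team
  ! once, instead of once per parallel loop.
  implicit none
  integer, intent(in) :: Nx, Ny
  integer, parameter :: dp = kind(1.0d0)
  real(dp), intent(out) :: rhs_T(Ny,Nx), rhs_phi(Ny,Nx)
  real(dp), intent(in)  :: nl_T(Ny,Nx), nl_phi(Ny,Nx), diff_T(Ny,Nx), diff_phi(Ny,Nx)
  integer :: it

  !$OMP PARALLEL num_threads(8) private(it)
  !$OMP DO
  do it = 1, Nx
     rhs_T(:,it) = nl_T(:,it) + diff_T(:,it)        ! placeholder loop
  end do
  !$OMP END DO
  !$OMP DO
  do it = 1, Nx
     rhs_phi(:,it) = nl_phi(:,it) + diff_phi(:,it)  ! placeholder loop
  end do
  !$OMP END DO
  !$OMP END PARALLEL
end subroutine calc_explicit_sketch

Whether this matters in practice depends on how large each loop is relative to the fork/join overhead.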
OK, I am trying to parallelize other parts of the main loop besides calc_explicit, and am running into some weird behavior. It can be boiled down to the example below.

!$OMP PARALLEL DO num_threads(8) private(tmp_uy) schedule(dynamic)
do it = 1,Nx
   !$OMP CRITICAL
   ! Solve for v
   call calc_vi(tmp_uy, phi(:,it))
   uy(:,it) = tmp_uy
   ! Solve for u
   if (kx(it) /= 0.0_dp) then
      !ux(:,it) = -CI*d1y(tmp_uy)/kx(it)
      ux(:,it) = CI*d1y(tmp_uy)/kx(it)
   else if (kx(it) == 0.0_dp) then
      ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
   end if
   !$OMP END CRITICAL
end do
!$OMP END PARALLEL DO

This is my sanity check for the loop iterations being independent: each iteration is wrapped in a critical region, so the iterations run in a random order, but only one at a time. Yet this actually breaks the code, and the Nusselt number quickly explodes into a NaN. I believe this means that the loop iterations are not independent, but I can't quite make out why? It seems like
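One way to narrow this down (a sketch only, untested against the real code): CRITICAL serializes the iterations but does not preserve their order, whereas an ORDERED region forces them to run one at a time in the original it = 1..Nx order. If the variant below still produces NaNs, that points at shared state inside calc_vi or d1y (for example a module-level work array) rather than at the iterations needing to run in order; if it runs clean, the iterations really are order-dependent.

! Same loop as above, but with ORDERED instead of CRITICAL, so the iterations
! still run one at a time but now in the original it = 1..Nx order.
!$OMP PARALLEL DO num_threads(8) private(tmp_uy) ordered schedule(static,1)
do it = 1,Nx
   !$OMP ORDERED
   ! Solve for v
   call calc_vi(tmp_uy, phi(:,it))
   uy(:,it) = tmp_uy
   ! Solve for u
   if (kx(it) /= 0.0_dp) then
      ux(:,it) = CI*d1y(tmp_uy)/kx(it)
   else if (kx(it) == 0.0_dp) then
      ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
   end if
   !$OMP END ORDERED
end do
!$OMP END PARALLEL DO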
I added a lot more parallelism today, including in the x direction of stages 1-3. We are seeing really great performance results with Nx=4800, Ny=3825 on an m4.4xlarge instance (16 cores) running 16 threads.
A few timing results:

num threads, timing of loop 3 (s)
1, 0.92242285818792880
2, 0.66375376703217626
4, 0.66562559804879129
8, 0.67282426310703158
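For what it's worth, a sweep like this can also be produced in a single run with omp_set_num_threads(); a minimal sketch, where the timed loop body is again a placeholder and not the real loop 3:

program thread_sweep_sketch
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: Nx = 4800, Ny = 3825
  integer, parameter :: nthreads_list(4) = (/ 1, 2, 4, 8 /)
  real(dp), allocatable :: a(:,:), b(:,:)
  real(dp) :: t0, t1, t_serial
  integer :: k, it, iy

  allocate(a(Ny,Nx), b(Ny,Nx))
  a = 1.0_dp
  b = 0.0_dp   ! touch b once so the first timed pass isn't paying for page faults
  t_serial = 0.0_dp

  do k = 1, size(nthreads_list)
     call omp_set_num_threads(nthreads_list(k))
     t0 = omp_get_wtime()
     !$OMP PARALLEL DO private(iy)
     do it = 1, Nx
        do iy = 1, Ny
           b(iy,it) = 2.0_dp*a(iy,it)   ! placeholder work, not the real loop 3
        end do
     end do
     !$OMP END PARALLEL DO
     t1 = omp_get_wtime()
     if (k == 1) t_serial = t1 - t0
     ! prints: num threads, timing, speedup
     print *, nthreads_list(k), t1 - t0, t_serial/(t1 - t0)
  end do
end program thread_sweep_sketch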