
Conversation

@LonelyCat124
Collaborator

No description provided.

@LonelyCat124 LonelyCat124 changed the title Initial implementation of maximal parallel region trans. Tests TODO (Closes #3157) Initial implementation of maximal parallel region trans. Oct 31, 2025
@codecov

codecov bot commented Oct 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.95%. Comparing base (ee25f8a) to head (67ae4c0).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3205   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files         376      378    +2     
  Lines       53485    53564   +79     
=======================================
+ Hits        53463    53542   +79     
  Misses         22       22           


@LonelyCat124
Collaborator Author

@MetBenjaminWent Chris said you had some cases worth trying this with as functionality tests?

@LonelyCat124
Collaborator Author

Made a bit more progress with this now - there was definitely some missing logic for bdy_impl3 to work.

One thing that is apparent is that applying things in this way results in barriers that live outside parallel regions; these should be purged by OMPMinimiseSyncTrans, but currently aren't.

@sergisiso @arporter Am I ok to make that change as part of this PR?

@LonelyCat124
Collaborator Author

I fixed the previous issue, and this has raised some other challenges with the cases I received from MO.

  1. The OpenMP sentinel isn't kept, and we don't seem to have an option in PSyclone to keep statements that use it, which is a problem for some pre-existing files that rely on the sentinel for conditional compilation. @hiker I think you've mentioned this previously - do we have any branch of PSyclone that handles this at all? I know it's in fparser, so the frontend can presumably do something with it in PSyclone.
  2. Assignments are currently always excluded from parallel regions, but this is not a good idea. For example, we could have a statement such as x = y / omp_get_thread_num() which would have to be inside a parallel region. I was wondering about just treating scalar assignments as valid inside parallel regions, but this is not so straightforward and I don't have a good answer for how to do it - perhaps scalar assignments to local variables are allowed and others aren't? I'm sure there are still some edge cases here though (a rough sketch of the kind of check I mean is below, after the code in point 3).
  3. We can't handle some loop structures, e.g. we don't parallelise over jj here:
omp_block = tdims%j_end
!$ omp_block = ceiling(tdims%j_end/real(omp_get_num_threads()))

!$OMP do SCHEDULE(STATIC)
do jj = tdims%j_start, tdims%j_end, omp_block
  do k = blm1, 2, -1
    l = 0
    do j = jj, min(jj+omp_block-1, tdims%j_end)
      do i = tdims%i_start, tdims%i_end
        r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
        rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
        dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
...

I assume we just need to use force=True for this loop, but in theory we could evaluate the i,j,k indices and find that they are guaranteed non-overlapping (I think? since we step by omp_block in jj, and j is thus independent), but it's not straightforward. I can't remember if we already have an issue about this kind of analysis? @hiker @sergisiso
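
For point 2, a minimal sketch of the kind of check I mean, assuming the omp_ call appears in the PSyIR as a Call/Reference rather than a CodeBlock; the helper name and the simple "omp_" prefix test are just illustrative, not code from this PR:

from psyclone.psyir.nodes import Assignment, Reference


def rhs_uses_omp_runtime(assignment: Assignment) -> bool:
    """Return True if the RHS of `assignment` references a symbol that
    looks like an OpenMP runtime routine (name starting with "omp_").
    Such an assignment must not be left outside a parallel region."""
    for ref in assignment.rhs.walk(Reference):
        if ref.symbol.name.lower().startswith("omp_"):
            return True
    return False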

@sergisiso
Collaborator

Regarding 3: this is what @mn416's #3213 could solve; it may be worth trying with his dep_analysis.

@LonelyCat124
Collaborator Author

Yeah, I'd just been looking at the PR earlier today after I looked at this - I guess it's still a while until we'd have it ready, but if it could handle cases like this it would definitely help (assuming the rest is otherwise ok).

@mn416
Collaborator

mn416 commented Nov 26, 2025

I copied this chunk of code into a minimally compiling module:

module example_module
  implicit none

  type :: Dims
     integer :: j_start, j_end, i_start, i_end
  end type

contains

  subroutine sub(tdims, r_rho_levels, dtrdz_charney_grid, fqw, dqw, dqw_nt, gamma2, blm1, omp_block)
    type(Dims), intent(inout) :: tdims
    integer, intent(inout) :: r_rho_levels(:,:,:)
    integer, intent(inout) :: dtrdz_charney_grid(:,:,:)
    integer, intent(inout) :: fqw(:,:,:)
    integer, intent(inout) :: dqw(:,:,:)
    integer, intent(inout) :: dqw_nt(:,:,:)
    integer, intent(inout) :: gamma2(:,:)
    integer, intent(inout) :: blm1, omp_block
    integer :: jj, k, r_sq, rr_sq, l, i, j

    do jj = tdims%j_start, tdims%j_end, omp_block
      do k = blm1, 2, -1
        l = 0
        do j = jj, min(jj+omp_block-1, tdims%j_end)
          do i = tdims%i_start, tdims%i_end
            r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
            rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
            dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
          end do
        end do
      end do
    end do

  end subroutine
end module

The analysis gives the following output:

Routine sub: 
  Loop jj: conflict free
  Loop k: conflict free
  Loop j: conflict free
  Loop i: conflict free

Interestingly, it does take a few seconds to prove that the jj loop is conflict free.

@LonelyCat124
Collaborator Author

I used force=True for the jj loops in the bdy_impl3.F90 file, and I get only 3 parallel regions.

There are 2 things left to potentially look at.

  1. How to determine whether an Assignment is safe to go inside a parallel region - or whether we should worry about that right now (and whether we can safely determine it and that the assigned variable is private). There are some slightly naive rules I could make (e.g. scalar assignments to local variables are fine), but it's difficult to determine whether they actually hold. I can't even rely on whether they're read-only, as the value could need to be read at some location that ends up outside the parallel region, and since the parallel region is built up manually I'm not sure how to solve this. Any ideas? @sergisiso @hiker
  2. The common pattern of omp end do nowait followed by omp barrier. I end up (after barrier reduction) with some of these, and I think it might be slightly neater to just convert this pattern into a simple omp end do, but it shouldn't move the needle in any relevant way w.r.t. performance, so I'm happy to leave this until later (a rough sketch of such a peephole is below).
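
For point 2, the peephole I have in mind would be something like the following sketch. The OMPDoDirective/OMPBarrierDirective class names and the boolean nowait attribute are assumptions about the node interface, not code from this PR:

from psyclone.psyir.nodes import OMPBarrierDirective, OMPDoDirective


def fuse_nowait_with_barrier(schedule):
    """Where an 'omp do nowait' is immediately followed by a stand-alone
    barrier, drop both the nowait clause and the explicit barrier so the
    directive's implicit barrier is used instead."""
    for directive in schedule.walk(OMPDoDirective):
        if not getattr(directive, "nowait", False):
            continue
        siblings = directive.parent.children
        pos = directive.position
        if pos + 1 < len(siblings) and isinstance(siblings[pos + 1],
                                                  OMPBarrierDirective):
            directive.nowait = False      # fall back to the implicit barrier
            siblings[pos + 1].detach()    # remove the now-redundant barrier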

@sergisiso
Collaborator

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@sergisiso I'm slightly hesitant, since some of the examples I was given by MO do need these (and would in fact produce wrong code without them if the transformation were applied). I think at the least I'd need/want to add a check, for now, that fails validation if any omp_... calls appear on the RHS of an assignment. What do you think?

I had started implementing some logic for Assignment now, but the next-accesses check is somewhat complicated.

Edit: Though I think we're also limited without Matthew's #3213 anyway, so maybe the assignment checks aren't so urgent.

Edit 2: This was my current plan:

        # Assignments can be in the parallel region if they write to a locally
        # defined scalar variable (i.e. the symbol has an AutomaticInterface).
        # TODO The lhs variable needs to be private - not sure we can do that
        # yet.
        if isinstance(node, Assignment):
            if not isinstance(node.lhs.symbol.interface, AutomaticInterface):
                return False
            if node.lhs.symbol.is_array:
                return False
            # TODO We need to mark the LHS variable as private.
            # TODO We need to ensure the next access to the lhs variable is
            # also contained in the same parallel region.
            return True

But the next-access check needs to happen in apply, when applying the parallel transformation, and I think it becomes a bit of a mess (since we may end up having to split the parallel regions again at that point). A rough sketch of the check itself, independent of where it ends up living, is below.
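
For what it's worth, once the candidate region's nodes are known, the check could perhaps be phrased as a plain tree walk over whatever follows the region. A rough sketch only - the helper name is hypothetical and it only looks at later siblings of the region's parent:

from psyclone.psyir.nodes import Assignment, Reference


def lhs_read_after_region(assignment: Assignment, region_nodes) -> bool:
    """Return True if the variable written by `assignment` is referenced
    again after the last node of the candidate parallel region, i.e. its
    value would have to survive beyond the region (so it cannot simply
    be made private)."""
    symbol = assignment.lhs.symbol
    last = region_nodes[-1]
    for sibling in last.parent.children[last.position + 1:]:
        for ref in sibling.walk(Reference):
            if ref.symbol is symbol:
                return True
    return False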

@sergisiso
Collaborator

I'm slightly hesitant, since some of the examples I was given by MO do need these (and would in fact produce wrong code without them if the transformation were applied).

@LonelyCat124 I need some context about this issue - can you paste an example here?

@sergisiso
Collaborator

@LonelyCat124 I thought that, without assignments, it would encapsulate in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does that example also have this problem then?

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 I thought that, without assignments, it would encapsulate in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does that example also have this problem then?

Yeah it would probably be similar, but I was testing it with some code (https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/physics_schemes/source/boundary_layer/bdy_impl3.F90) and we do a lot worse than the manual implementation. I think omp_block = ceiling(tdims%j_end/real(omp_get_num_threads())) is safe to have outside the parallel region, but since we can't put the following loop in an omp do anyway (maybe with force we can) it might not be important.

I'm happy to just go with this simple version initially though and then improve it; it probably makes reviewing more straightforward. I'll write some tests for the current functionality and add some stuff to the docstrings.

@LonelyCat124
Collaborator Author

I'm slightly hesitant, since some of the examples I was given by MO do need these (and would in fact produce wrong code without them if the transformation were applied).

@LonelyCat124 I need some context about this issue - can you paste an example here?

@sergisiso An example of one we'd get wrong right now would be any code where we have x = omp_get_thread_num() as a direct child of a Routine, since I think that statement is only meaningful inside a parallel region, and we'd never consider putting the assignment in one.

@sergisiso
Collaborator

sergisiso commented Jan 12, 2026

These last two would also be an issue without this transformation if we do the typical for loop in psyir.walk(Loop): OMPLoopTrans approach (sketched below for context). In both cases (new transformation or previous approach) they need to be found and added to the parallel region independently. So I would still go ahead without the Assignments for this first PR.
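
For context, the "typical" per-loop approach I mean is roughly the following. A sketch only, using the standard OMPLoopTrans/OMPParallelTrans transformations; psyir here is the routine's PSyIR:

from psyclone.psyir.nodes import Loop
from psyclone.transformations import OMPLoopTrans, OMPParallelTrans


def naive_omp(psyir):
    """Give every outermost loop its own OMP do inside its own (small)
    parallel region - the approach the new transformation improves on."""
    loop_trans = OMPLoopTrans()
    region_trans = OMPParallelTrans()
    for loop in psyir.walk(Loop):
        if loop.ancestor(Loop):
            continue  # only outermost loops, to avoid nested directives
        loop_trans.apply(loop)
        # OMPLoopTrans wraps the loop in an OMP do directive; wrap that
        # directive in its own parallel region.
        region_trans.apply(loop.parent.parent)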

@LonelyCat124 LonelyCat124 marked this pull request as ready for review January 12, 2026 14:15
@LonelyCat124
Collaborator Author

Blocked due to circular dependencies in examples - will update with blocking PR when created.

@LonelyCat124 LonelyCat124 added the Blocked An issue/PR that is blocked by one or more issues/PRs. label Jan 12, 2026