
Conversation

@LonelyCat124
Collaborator

No description provided.

@LonelyCat124 LonelyCat124 changed the title Initial implementation of maximal parallel region trans. Tests TODO (Closes #3157) Initial implementation of maximal parallel region trans. Oct 31, 2025
@codecov

codecov bot commented Oct 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.95%. Comparing base (ee25f8a) to head (67ae4c0).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3205   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files         376      378    +2     
  Lines       53485    53564   +79     
=======================================
+ Hits        53463    53542   +79     
  Misses         22       22           


@LonelyCat124
Collaborator Author

@MetBenjaminWent Chris said you had some cases worth trying this with as functionality tests?

@LonelyCat124
Collaborator Author

Made a bit more progress with this now - there was definitely some missing logic for bdy_impl3 to work.

One thing that is apparent is that applying things in this way results in barriers that live outside parallel regions; these should be purged by OMPMinimiseSyncTrans, but currently aren't.

@sergisiso @arporter Am I ok to make that change as part of this PR?

@LonelyCat124
Collaborator Author

I fixed the previous issue, and this has raised some other challenges with the cases I received from MO.

  1. The OpenMP sentinel isn't kept, and we don't seem to have an option in PSyclone to keep statements that use it, which is a problem for some pre-existing files that rely on the sentinel for conditional compilation. @hiker I think you've mentioned this previously - do we have any branch of PSyclone that handles this at all? I know it's in fparser, so the frontend can presumably do something with it in PSyclone.
  2. Assignments are currently always excluded from parallel regions, but this is not a good idea. For example, we could have a statement such as x = y / omp_get_thread_num() which would have to be inside a parallel region. I was wondering about just treating scalar assignments as valid inside parallel regions, but this is not so straightforward and I don't have a good answer for how to do it - perhaps scalar assignments to local variables are allowed and others aren't? I'm sure there are still some edge cases here though (a rough sketch of the kind of check I mean is below, after the code in point 3).
  3. We can't handle some loop structures, e.g. we don't parallelise over jj here:
omp_block = tdims%j_end
!$ omp_block = ceiling(tdims%j_end/real(omp_get_num_threads()))

!$OMP do SCHEDULE(STATIC)
do jj = tdims%j_start, tdims%j_end, omp_block
  do k = blm1, 2, -1
    l = 0
    do j = jj, min(jj+omp_block-1, tdims%j_end)
      do i = tdims%i_start, tdims%i_end
        r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
        rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
        dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
...

I assume we just need to use force=True for this loop, but in theory we could evaluate the i,j,k indices and find that they are guaranteed non-overlapping (I think? since we step by omp_block in jj, and j is thus independent), but it's not straightforward. I can't remember if we already have an issue about this kind of analysis? @hiker @sergisiso
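
For point 2, a minimal sketch of the kind of check I mean, assuming the omp_ call appears in the PSyIR as a Call/Reference rather than a CodeBlock; the helper name and the simple "omp_" prefix test are just illustrative, not code from this PR:

from psyclone.psyir.nodes import Assignment, Reference


def rhs_uses_omp_runtime(assignment: Assignment) -> bool:
    """Return True if the RHS of `assignment` references a symbol that
    looks like an OpenMP runtime routine (name starting with "omp_").
    Such an assignment must not be left outside a parallel region."""
    for ref in assignment.rhs.walk(Reference):
        if ref.symbol.name.lower().startswith("omp_"):
            return True
    return False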

@sergisiso
Collaborator

Regarding 3: this is what @mn416's #3213 could solve; it may be worth trying with his dep_analysis.

@LonelyCat124
Collaborator Author

Yeah, I'd just been looking at the PR earlier today after I looked at this - I guess it's still a while until we'd have it ready, but if it could handle cases like this it would definitely help (assuming the rest is otherwise ok).

@mn416
Collaborator

mn416 commented Nov 26, 2025

I copied this chunk of code into a minimally compiling module:

module example_module
  implicit none

  type :: Dims
     integer :: j_start, j_end, i_start, i_end
  end type

contains

  subroutine sub(tdims, r_rho_levels, dtrdz_charney_grid, fqw, dqw, dqw_nt, gamma2, blm1, omp_block)
    type(Dims), intent(inout) :: tdims
    integer, intent(inout) :: r_rho_levels(:,:,:)
    integer, intent(inout) :: dtrdz_charney_grid(:,:,:)
    integer, intent(inout) :: fqw(:,:,:)
    integer, intent(inout) :: dqw(:,:,:)
    integer, intent(inout) :: dqw_nt(:,:,:)
    integer, intent(inout) :: gamma2(:,:)
    integer, intent(inout) :: blm1, omp_block
    integer :: jj, k, r_sq, rr_sq, l, i, j

    do jj = tdims%j_start, tdims%j_end, omp_block
      do k = blm1, 2, -1
        l = 0
        do j = jj, min(jj+omp_block-1, tdims%j_end)
          do i = tdims%i_start, tdims%i_end
            r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
            rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
            dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
          end do
        end do
      end do
    end do

  end subroutine
end module

The analysis gives the following output:

Routine sub: 
  Loop jj: conflict free
  Loop k: conflict free
  Loop j: conflict free
  Loop i: conflict free

Interestingly, it does take a few seconds to prove that the jj loop is conflict free.

@LonelyCat124
Collaborator Author

I used force=True for the jj loops in the bdy_impl3.F90 file, and I get only 3 parallel regions.

There are 2 things left to potentially look at.

  1. How to determine whether an Assignment is safe to go inside a parallel region - or whether we should worry about that right now (and whether we can safely determine it and that the assigned variable is private). There are some slightly naive rules I could make (e.g. scalar assignments to local variables are fine), but it's difficult to determine whether they actually hold. I can't even rely on whether they're read-only, as the value could need to be read at some location that ends up outside the parallel region, and since the parallel region is built up manually I'm not sure how to solve this. Any ideas? @sergisiso @hiker
  2. The common pattern of omp end do nowait followed by omp barrier. I end up (after barrier reduction) with some of these, and I think it might be slightly neater to just convert this pattern into a simple omp end do, but it shouldn't move the needle in any relevant way w.r.t. performance, so I'm happy to leave this until later (a rough sketch of such a peephole is below).
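
For point 2, the peephole I have in mind would be something like the following sketch. The OMPDoDirective/OMPBarrierDirective class names and the boolean nowait attribute are assumptions about the node interface, not code from this PR:

from psyclone.psyir.nodes import OMPBarrierDirective, OMPDoDirective


def fuse_nowait_with_barrier(schedule):
    """Where an 'omp do nowait' is immediately followed by a stand-alone
    barrier, drop both the nowait clause and the explicit barrier so the
    directive's implicit barrier is used instead."""
    for directive in schedule.walk(OMPDoDirective):
        if not getattr(directive, "nowait", False):
            continue
        siblings = directive.parent.children
        pos = directive.position
        if pos + 1 < len(siblings) and isinstance(siblings[pos + 1],
                                                  OMPBarrierDirective):
            directive.nowait = False      # fall back to the implicit barrier
            siblings[pos + 1].detach()    # remove the now-redundant barrier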

@sergisiso
Collaborator

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@sergisiso I'm slightly hesitant, since some of the examples I was given by MO do need these (and would in fact produce wrong code without them if the transformation were applied). I think at the least I'd need/want to add a check, for now, that fails validation if any omp_... calls appear on the RHS of an assignment. What do you think?

I had started implementing some logic for Assignment now, but the next-accesses check is somewhat complicated.

Edit: Though I think we're also limited without Matthew's #3213 anyway, so maybe the assignment checks aren't so urgent.

Edit 2: This was my current plan:

        # Assignments can be in the parallel region if they write to a locally
        # defined scalar variable (i.e. the symbol has an AutomaticInterface).
        # TODO The lhs variable needs to be private - not sure we can do that
        # yet.
        if isinstance(node, Assignment):
            if not isinstance(node.lhs.symbol.interface, AutomaticInterface):
                return False
            if node.lhs.symbol.is_array:
                return False
            # TODO We need to mark the LHS variable as private.
            # TODO We need to ensure the next access to the lhs variable is
            # also contained in the same parallel region.
            return True

But the next-access check needs to happen in apply, when applying the parallel transformation, and I think it becomes a bit of a mess (since we may end up having to split the parallel regions again at that point). A rough sketch of the check itself, independent of where it ends up living, is below.
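
For what it's worth, once the candidate region's nodes are known, the check could perhaps be phrased as a plain tree walk over whatever follows the region. A rough sketch only - the helper name is hypothetical and it only looks at later siblings of the region's parent:

from psyclone.psyir.nodes import Assignment, Reference


def lhs_read_after_region(assignment: Assignment, region_nodes) -> bool:
    """Return True if the variable written by `assignment` is referenced
    again after the last node of the candidate parallel region, i.e. its
    value would have to survive beyond the region (so it cannot simply
    be made private)."""
    symbol = assignment.lhs.symbol
    last = region_nodes[-1]
    for sibling in last.parent.children[last.position + 1:]:
        for ref in sibling.walk(Reference):
            if ref.symbol is symbol:
                return True
    return False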

@sergisiso
Collaborator

I'm slightly hesitant, since some of the examples I was given by MO do need these (and would in fact produce wrong code without them if the transformation were applied).

@LonelyCat124 I need some context about this issue - can you paste an example here?

@sergisiso
Collaborator

@LonelyCat124 I thought that, without assignments, it would encapsulate in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does that example also have this problem then?

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 I thought that, without assignments, it would encapsulate in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does that example also have this problem then?

Yeah it would probably be similar, but I was testing it with some code (https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/physics_schemes/source/boundary_layer/bdy_impl3.F90) and we do a lot worse than the manual implementation. I think omp_block = ceiling(tdims%j_end/real(omp_get_num_threads())) is safe to have outside the parallel region, but since we can't put the following loop in an omp do anyway (maybe with force we can) it might not be important.

I'm happy to just go with this simple version initially though and then improve it; it probably makes reviewing more straightforward. I'll write some tests for the current functionality and add some stuff to the docstrings.

@LonelyCat124
Collaborator Author

I'm slightly hesitant, since some of the examples I was given by MO do need these (and would in fact produce wrong code without them if the transformation were applied).

@LonelyCat124 I need some context about this issue - can you paste an example here?

@sergisiso An example of one we'd get wrong right now would be any code where we have x = omp_get_thread_num() as a direct child of a Routine, since I think that statement is only meaningful inside a parallel region, and we'd never consider putting the assignment in one.

@sergisiso
Collaborator

sergisiso commented Jan 12, 2026

These last two would also be an issue without this transformation if we do the typical for loop in psyir.walk(Loop): OMPLoopTrans approach (sketched below for context). In both cases (new transformation or previous approach) they need to be found and added to the parallel region independently. So I would still go ahead without the Assignments for this first PR.
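
For context, the "typical" per-loop approach I mean is roughly the following. A sketch only, using the standard OMPLoopTrans/OMPParallelTrans transformations; psyir here is the routine's PSyIR:

from psyclone.psyir.nodes import Loop
from psyclone.transformations import OMPLoopTrans, OMPParallelTrans


def naive_omp(psyir):
    """Give every outermost loop its own OMP do inside its own (small)
    parallel region - the approach the new transformation improves on."""
    loop_trans = OMPLoopTrans()
    region_trans = OMPParallelTrans()
    for loop in psyir.walk(Loop):
        if loop.ancestor(Loop):
            continue  # only outermost loops, to avoid nested directives
        loop_trans.apply(loop)
        # OMPLoopTrans wraps the loop in an OMP do directive; wrap that
        # directive in its own parallel region.
        region_trans.apply(loop.parent.parent)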

@LonelyCat124 LonelyCat124 marked this pull request as ready for review January 12, 2026 14:15
@LonelyCat124
Collaborator Author

Blocked due to circular dependencies in examples - will update with blocking PR when created.

@LonelyCat124 LonelyCat124 added the Blocked An issue/PR that is blocked by one or more issues/PRs. label Jan 12, 2026