CBC_FLUDS uses a memory pool allocator to minimize memory usage for cell angular fluxes during sweeps
#790
Conversation
@wdarylhawkins I'd like to request a review for this PR. I forgot to add assignees to this PR, and I'm not able to edit the PR to add reviewers.
ragusa left a comment:
@wdhawkins I have no problem approving this. However, for this type of PR, should we get in the habit of copy/pasting scaling results in the discussion/conversation dialog box of the PR? I kinda feel I need to know whether the results changed or not. And if this does not apply here, just tell me as well.
My bad. That information was already present. I just didn't see it without scrolling more.
@wdhawkins I made the following changes:
This PR introduces a free-list memory pool allocator using the Boost Pool library's simple segregated storage class to manage angular flux data within the `CBC_FLUDS` class. During a sweep, as soon as the CBC algorithm sends a cell's angular flux data to all of its local downwind dependencies, the storage for that cell's angular flux data can be reused for another local cell that has not yet been solved.
For a given `CBC_SPDS`, the CBC algorithm currently stores angular fluxes for all cells in the `CBC_SPDS`, for all angles in the quadrature, and for all groups in a given group set, which results in high memory usage during sweeps. The changes in this PR include the following:

- The `CBC_SPDS` class has a new method to simulate a local sweep, which determines the minimum number of free slots (blocks of memory) that the `CBC_FLUDS` class needs to properly store cell angular fluxes during a given sweep. The simulated sweep iterates through the `CBC_SPDS`'s task list in the same manner as the `CBC_AngleSet::AngleSetAdvance` method, tracking when cells need a free slot assigned to them and when cells with associated slots can return those slots to the pool. This is done for cells that have purely local upwind and downwind dependencies. For cells that have either remote upwind or remote downwind dependencies, the simulated sweep sets aside a permanent slot for each such cell. Cells with remote upwind or remote downwind dependencies cannot have their memory blocks returned to the free pool, even after all of their local downwind dependencies have received the appropriate angular fluxes, because of the non-deterministic and asynchronous nature of the CBC algorithm's communication patterns: a simulated sweep can determine ahead of time neither when a cell with remote upwind dependencies will receive its necessary angular fluxes nor when a cell with remote downwind dependencies will communicate its fluxes to those dependencies. (A sketch of this slot-counting pass appears after this list.)
- The `CBC_FLUDS` class uses the Boost Pool library's simple segregated storage class to implement a free-list pool allocator. Using the minimum number of slots calculated by the simulated sweep, the `CBC_FLUDS` class constructs a backing buffer with as many elements as the product of the minimum number of pool slots, the number of angles in the associated angle set, and the number of groups in the associated group set. The simple segregated storage object manages this backing buffer and uses an internal free list to hand out free pool slots and reclaim slots for cells whose angular flux data no longer needs to be stored during the sweep.
- The `CBC_AngleSet::AngleSetAdvance` method uses the `CBC_FLUDS::Allocate` method to associate a free slot with a cell that is ready to be solved. After a cell has been swept, its local predecessors have their dependency consumption counts incremented. When a local predecessor's dependency consumption count equals or exceeds its local downwind dependency count, its slot is returned to the pool via the `CBC_FLUDS::Deallocate` method. (The second sketch after this list illustrates this allocation/deallocation pattern.)
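To make the slot-counting pass concrete, here is a minimal sketch of how such a simulated sweep could track the peak number of live slots. The `Task` fields, the function name `SimulateSweepSlotCount`, and the flattened task-list view are illustrative assumptions, not the actual `CBC_SPDS` interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative task record: these field names are assumptions, not the
// actual CBC_SPDS task structure.
struct Task
{
  std::vector<std::size_t> local_predecessors; // task-list indices of local upwind cells
  std::size_t num_local_successors;            // number of local downwind dependencies
  bool has_remote_dependency;                  // any remote upwind or downwind dependency
};

// Walk the task list in sweep order and return the number of slots the
// pool must provide: the peak number of simultaneously live slots for
// purely local cells, plus one permanent slot per cell with a remote
// dependency.
std::size_t
SimulateSweepSlotCount(const std::vector<Task>& tasks)
{
  std::size_t live = 0, peak = 0, permanent = 0;
  // Unserved local downwind dependencies per task (0 = not pool-recyclable).
  std::vector<std::size_t> remaining(tasks.size(), 0);

  for (std::size_t t = 0; t < tasks.size(); ++t)
  {
    const auto& task = tasks[t];
    if (task.has_remote_dependency)
      ++permanent; // held for the whole sweep; never returned to the pool
    else
    {
      ++live; // the cell acquires a slot when it becomes ready to solve
      remaining[t] = task.num_local_successors;
      peak = std::max(peak, live);
      if (remaining[t] == 0)
        --live; // no local consumers: the slot is recyclable immediately
    }

    // Solving this cell serves one downwind dependency of each local
    // predecessor; a purely local predecessor's slot is freed once all
    // of its local downwind dependencies have been served.
    for (auto p : task.local_predecessors)
      if (remaining[p] > 0 && --remaining[p] == 0)
        --live;
  }
  return peak + permanent;
}

int main()
{
  // Three-cell chain 0 -> 1 -> 2 with no remote dependencies: cell 0's
  // slot is freed as soon as cell 1 is solved, so two slots suffice.
  std::vector<Task> chain = {{{}, 1, false}, {{0}, 1, false}, {{1}, 0, false}};
  std::cout << SimulateSweepSlotCount(chain) << "\n"; // prints 2
}
```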
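And here is a hedged sketch of how the pool itself could be wired up with Boost's `simple_segregated_storage`, together with the allocate-on-ready / deallocate-on-last-consumption pattern from the last two bullets. The class `PooledFludsStorage`, the helper `OnCellSwept`, and the sizing parameter `max_cell_dofs` are illustrative stand-ins; this does not claim to reproduce the actual `CBC_FLUDS` interface:

```cpp
#include <boost/pool/simple_segregated_storage.hpp>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for CBC_FLUDS: a fixed-size backing buffer
// partitioned into per-cell slots managed by Boost's free list.
class PooledFludsStorage
{
public:
  PooledFludsStorage(std::size_t num_slots,
                     std::size_t num_angles,
                     std::size_t num_groups,
                     std::size_t max_cell_dofs /* assumed sizing parameter */)
    : slot_bytes_(num_angles * num_groups * max_cell_dofs * sizeof(double)),
      buffer_(num_slots * slot_bytes_)
  {
    // Hand the whole backing buffer to the free-list manager,
    // partitioned into num_slots fixed-size chunks.
    storage_.add_block(buffer_.data(), buffer_.size(), slot_bytes_);
  }

  // Associate a free slot with a cell that is ready to be solved.
  double* Allocate(std::uint64_t cell_id)
  {
    auto* slot = static_cast<double*>(storage_.malloc());
    cell_slots_[cell_id] = slot;
    return slot;
  }

  // Return a cell's slot to the pool once its fluxes are dead.
  void Deallocate(std::uint64_t cell_id)
  {
    auto it = cell_slots_.find(cell_id);
    storage_.free(it->second);
    cell_slots_.erase(it);
  }

private:
  std::size_t slot_bytes_;
  std::vector<char> buffer_;
  boost::simple_segregated_storage<std::size_t> storage_;
  std::unordered_map<std::uint64_t, double*> cell_slots_;
};

// Sweep-side trigger (third bullet): after a cell is swept, bump each
// local predecessor's consumption count and recycle a predecessor's
// slot once all of its local downwind dependencies have been served.
// Cells with remote dependencies would be excluded from this countdown.
inline void
OnCellSwept(const std::vector<std::uint64_t>& predecessor_ids,
            std::unordered_map<std::uint64_t, std::size_t>& consumed,
            const std::unordered_map<std::uint64_t, std::size_t>& num_local_downwind,
            PooledFludsStorage& fluds)
{
  for (auto pid : predecessor_ids)
    if (++consumed[pid] >= num_local_downwind.at(pid))
      fluds.Deallocate(pid);
}
```

One property worth noting: `simple_segregated_storage` keeps no bookkeeping beyond the free list embedded in the unused chunks themselves, so `malloc()` and (unordered) `free()` are constant-time pointer pops and pushes, which suits the inner sweep loop.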
Below is a strong-scaling plot for the CBC algorithm run with the `CBC_FLUDS` class with this pool allocator on LLNL Dane, whose node specs are given below. The scaling study was run on 1, 2, 4, 8, 16, 32, 64, and 128 nodes, with 64 ranks per node (rpn).
Below is a plot showing the strong-scaling results from PR 808 and this PR for 2, 4, 8, 16, 32, and 128 nodes. The 1-node results for both PRs have been omitted from this plot because PR 808 cannot run on 1 node on LLNL Dane due to memory limitations.
Below is a weak-scaling plot for the CBC algorithm run with the `CBC_FLUDS` class with this pool allocator on LLNL Dane. The weak-scaling study was run on 1, 2, 4, 8, 16, 32, and 64 nodes, with 64 rpn.