Version 1 of device CBC sweep chunk kernel #868
@wdarylhawkins @quocdang1998 Requesting review for this PR.
fe8de5e to d2177bf
andrsd left a comment
First rough pass through the PR.
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/angle_set/cbc_angle_set.cc
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/angle_set/cbc_angle_set.h
...es/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/communicators/cbc_async_comm.cc
...les/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/communicators/cbc_async_comm.h
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/fluds/cbc_fluds.cc
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/scheduler/sweep_scheduler.cc
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep/scheduler/sweep_scheduler.h
...les/linear_boltzmann_solvers/discrete_ordinates_problem/sweep_chunks/cbc_gpu_kernel/buffer.h
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep_chunks/cbc_sweep_chunk.cc
modules/linear_boltzmann_solvers/discrete_ordinates_problem/sweep_chunks/cbc_sweep_chunk.h
9d0105f to f153101
In this PR, version 1 of the device CBC sweep chunk kernel has been implemented. The following are the major changes:
- The function `SweepScheduler::DeviceScheduleAlgoFIFO` performs sweeps asynchronously over all anglesets in a given groupset, using `caribou::Stream`s. The main while-loop iterates over all anglesets and checks for cells that can be swept. The ready cells for a given angleset are batched and handed to the device CBC sweep chunk kernel, which is executed asynchronously. The loop then does the same for another angleset's ready cells. The `CBC_AngleSet`'s asynchronous communicator still receives upwind angular fluxes and sends downwind angular fluxes as soon as possible, which mirrors the host CBC sweep chunk kernel.
- There is a device version of the `CBC_FLUDS` class (`CBCD_FLUDS`) that contains host and device buffers for boundary, local, and non-local angular flux data. Local cell angular flux data is kept on the device. The `CBCD_FLUDS` class provides methods for asynchronously copying incoming/outgoing boundary and non-local cell angular flux data to the appropriate host and device buffers during sweeps. The transfers use the associated `CBC_AngleSet`'s `caribou::Stream`. The device sweep chunk kernel works in parallel over a given set of ready cells and over all groups in a given groupset.
- The `CBCD_FLUDSCommonData` class provides functionality similar to that of the `AAHD_FLUDSCommonData` class for encoding/decoding the locations that cell face nodes are read from and written to in the local, boundary, and non-local host/device buffers.
- The device CBC sweep chunk kernel is templated on the number of cell spatial DOFs. The device kernel is structured similarly to the device AAH sweep chunk kernel.
- `caribou`'s `Stream` class now contains methods for asynchronously copying data between `DeviceMemory` and `HostVector` objects using a provided stream.

The device CBC sweep chunk kernel passes all of the GPU regression tests except `transport_3d_4_cycles_1_gpu.py`, because the CBC sweep algorithm does not handle cyclic dependencies.

Caveats:
- The device CBC sweep chunk kernel does not yet have the functionality to save cell angular fluxes. This will be added in a later PR.
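The scheduling loop described above can be sketched on the host as a round-robin over anglesets that batches whatever cells are currently ready. This is only an illustration under stated assumptions: `AngleSetTasks`, `CollectReadyCells`, and `SweepAllSketch` are hypothetical names, each batch "completes" immediately instead of running asynchronously on a stream, and there is no MPI communication. It is not the code in `SweepScheduler::DeviceScheduleAlgoFIFO`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-angleset task graph: cell c is ready once num_upwind[c]
// reaches zero; downwind[c] lists the cells that depend on c.
struct AngleSetTasks
{
  std::vector<int> num_upwind;
  std::vector<std::vector<int>> downwind;
  std::vector<bool> swept;
};

// Gather every cell of one angleset that is ready and not yet swept.
inline std::vector<int> CollectReadyCells(const AngleSetTasks& as)
{
  std::vector<int> ready;
  for (std::size_t c = 0; c < as.num_upwind.size(); ++c)
    if (!as.swept[c] && as.num_upwind[c] == 0)
      ready.push_back(static_cast<int>(c));
  return ready;
}

// Visit the anglesets in turn until a full pass makes no progress.
// Returns the total number of cells swept. A real device scheduler would
// launch each batch on the angleset's stream and move on without waiting.
inline std::size_t SweepAllSketch(std::vector<AngleSetTasks>& angle_sets)
{
  std::size_t num_swept = 0;
  bool progressed = true;
  while (progressed)
  {
    progressed = false;
    for (auto& as : angle_sets)
    {
      const std::vector<int> batch = CollectReadyCells(as);
      for (int c : batch)
      {
        as.swept[c] = true;           // stand-in for the kernel launch
        for (int d : as.downwind[c])  // release downwind dependencies
          --as.num_upwind[d];
      }
      num_swept += batch.size();
      progressed = progressed || !batch.empty();
    }
  }
  return num_swept;
}
```

In this sketch, an angleset whose task graph contains a cycle never produces a ready cell, so the loop stalls on it. That is consistent with the note that the CBC sweep algorithm does not handle cyclic dependencies.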
f153101 to f186816
    /**
     * @brief Get the underlying CUDA stream.
     */
    inline ::cudaStream_t get() const { return this->operator->(); }
Remove this. The `get()` method is already inherited directly from `std::shared_ptr`.
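A minimal sketch of the point being made, assuming the stream wrapper publicly derives from `std::shared_ptr` as the comment implies. `FakeStreamImpl` and `StreamSketch` are hypothetical stand-ins, not the `caribou` types, and no CUDA is involved.

```cpp
#include <cassert>
#include <memory>

// Hypothetical payload standing in for the object a CUDA stream handle
// points to.
struct FakeStreamImpl
{
  int id;
};

// Publicly deriving from std::shared_ptr makes get(), operator->() and
// operator bool available on the wrapper without writing them again.
class StreamSketch : public std::shared_ptr<FakeStreamImpl>
{
public:
  explicit StreamSketch(int id)
    : std::shared_ptr<FakeStreamImpl>(std::make_shared<FakeStreamImpl>(FakeStreamImpl{id}))
  {
  }
};
```

With this layout, `StreamSketch s(7); s.get();` already returns the raw pointer, so an extra `get()` member would only shadow the inherited one.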
      if (use_gpus)
      {
        CreateStream();
        AssociateAngleSetWithFLUDS();
      }
    }

    CBC_AngleSet::~CBC_AngleSet()
    {
      if (use_gpus_)
        DestroyStream();
    }
Unlike `MemoryPinningManager`, which is based on `std::unique_ptr` (non-copyable), `Stream` is based on `std::shared_ptr` (and is copyable).
Use `std::any` here to store the `caribou::Stream` on the angleset. You won't have to keep track of creating and destroying the `Stream`.
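As a rough illustration of this suggestion, the sketch below stores a copyable, `std::shared_ptr`-based handle in a `std::any` member so that no user-written destructor is needed. `FakeStream`, `MakeFakeStream`, `AngleSetHandleSketch`, and the global counter are hypothetical and exist only for the example; the custom deleter stands in for whatever call destroys the real stream. Copyability matters here because `std::any` requires the held type to be copy-constructible.

```cpp
#include <any>
#include <cassert>
#include <memory>

// Counts released fake streams (illustration only).
inline int g_destroyed_streams = 0;

// Hypothetical copyable handle standing in for a shared_ptr-based stream.
using FakeStream = std::shared_ptr<int>;

// The deleter plays the role of the stream-destroy call.
inline FakeStream MakeFakeStream(int id)
{
  return FakeStream(new int(id),
                    [](int* p)
                    {
                      ++g_destroyed_streams;
                      delete p;
                    });
}

// Hypothetical angleset: its header needs only <any>, not the stream type.
class AngleSetHandleSketch
{
public:
  explicit AngleSetHandleSketch(bool use_gpus)
  {
    if (use_gpus)
      stream_ = MakeFakeStream(0);
  }
  // No destructor: std::any releases the held handle automatically.

  bool HasStream() const { return stream_.has_value(); }

  // Only code that knows the concrete type performs the cast.
  FakeStream GetStream() const { return std::any_cast<FakeStream>(stream_); }

private:
  std::any stream_;
};
```

The handle is released exactly once, when the owning object goes out of scope, without separate create and destroy calls to keep in sync.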
    #include "external/caribou/caribou.h"
    #include "external/caribou/cuda/stream.hpp"
Remove the "external/" prefix.
    #include "framework/logging/log.h"
    #include "framework/runtime.h"
Remove these include headers.
    namespace caribou
    {
    class Stream;
    }
Remove these. Use `std::any` in the header and include `caribou/caribou.hpp` directly in the `.cu` files.
    /**
     * \brief Host-analogous version of GetCBCDCellFaceDataIndex for host copy of the map
     * \param host_cell_face_node_map Reference to the host copy of the flattened node index structure.
     * \param cell_local_idx Cell local index.
     * \return Pointer to the indexes of the cell and the number of indexes (total number of cell face
     * nodes) for the cell.
     */
    constexpr std::pair<const std::uint64_t*, std::uint64_t>
    GetCBCDCellFaceDataIndexHost(const std::vector<uint64_t>& host_cell_face_node_map,
                                 const std::uint32_t& cell_local_idx)
    {
      const std::uint64_t* cell_face_data =
        host_cell_face_node_map.data() + static_cast<std::uint64_t>(2 * cell_local_idx);
      return {host_cell_face_node_map.data() + cell_face_data[0], cell_face_data[1]};
    }
This function does exactly the same thing as `GetCBCDCellFaceDataIndex`. The only difference is that the argument changes from a pointer to a `std::vector`!
When marked with `constexpr`, the function can be used on both host and device. Find a name that covers both cases.
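One way to read this suggestion is a single pointer-based function that host callers reach through `std::vector::data()`. The sketch below assumes the map layout visible in the quoted code (entry `2*c` is the offset of cell `c`'s face-node indices, entry `2*c + 1` is their count). The name `GetCellFaceDataIndex` is a placeholder, and any device-side compiler flags or qualifiers are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical unified lookup: returns a pointer to the cell's face-node
// indices and how many there are. Taking a raw pointer lets the same
// function serve a host std::vector (via data()) and a device buffer.
constexpr std::pair<const std::uint64_t*, std::uint64_t>
GetCellFaceDataIndex(const std::uint64_t* cell_face_node_map, std::uint32_t cell_local_idx)
{
  // Widen before multiplying so 2 * idx cannot wrap in 32 bits.
  const std::uint64_t* cell_face_data =
    cell_face_node_map + 2 * static_cast<std::uint64_t>(cell_local_idx);
  return {cell_face_node_map + cell_face_data[0], cell_face_data[1]};
}
```

A host caller with a `std::vector<std::uint64_t> host_map` would call `GetCellFaceDataIndex(host_map.data(), cell_local_idx)`, so no separate `...Host` overload is required.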
     * Once created, the indices are utilized by the CBCD_FLUDS to access the correct
     * location in the flux storage arrays.
Why is there a weird indent here?
    /**
     * \brief Represents a cell face node.
     * \details This class represents a cell face node. It stores the cell local index, the face index
     * within the cell, and the index of the node relative to the face as a compact 64-bit integer. This
     * class abstracts the node for usage with std::map and std::set.
     */
    class CBCD_FaceNode
    {
    public:
      /// Default constructor.
      constexpr CBCD_FaceNode() = default;

      /// Member constructor.
      constexpr CBCD_FaceNode(std::uint32_t cell_idx,
                              std::uint16_t face_idx,
                              std::uint16_t face_node_idx)
        : value_(0)
      {
        value_ |= static_cast<std::uint64_t>(cell_idx) << 32;
        value_ |= static_cast<std::uint64_t>(face_idx) << 16;
        value_ |= static_cast<std::uint64_t>(face_node_idx);
      }

      /// Comparison operator for ordering.
      constexpr bool operator<(const CBCD_FaceNode& other) const { return value_ < other.value_; }

      /// Equality operator.
      constexpr bool operator==(const CBCD_FaceNode& other) const { return value_ == other.value_; }

      /// Get cell local index.
      constexpr std::uint32_t GetCellIndex() const { return static_cast<std::uint32_t>(value_ >> 32); }

      /// Get face index.
      constexpr std::uint16_t GetFaceIndex() const
      {
        return static_cast<std::uint16_t>((value_ >> 16) & 0xFFFFU);
      }

      /// Get face node index.
      constexpr std::uint16_t GetFaceNodeIndex() const
      {
        return static_cast<std::uint16_t>(value_ & 0xFFFFU);
      }

      /// Check if face node is initialized.
      constexpr bool IsInitialized() const
      {
        return value_ != std::numeric_limits<std::uint64_t>::max();
      }

    private:
      /// Core value.
      std::uint64_t value_ = std::numeric_limits<std::uint64_t>::max();
    };
It is exactly the same implementation as `AAHD_FaceNode`.
If you're using the same implementation, just copy the class implementation into a file (something like `fluds_common_data_structs.h`), rename the class (and change the name of all occurrences), and include the new header in both `aahd_structs.h` and `cbcd_structs.h`.
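For illustration, a shared face-node class of the kind suggested here might look like the sketch below. It mirrors the bit layout of the quoted `CBCD_FaceNode` (cell index in the upper 32 bits, face index in the next 16, face-node index in the low 16). The class name `FLUDSFaceNodeSketch` is a placeholder and is not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical shared face-node type that both FLUDS variants could include.
class FLUDSFaceNodeSketch
{
public:
  /// Default constructor: leaves the node uninitialized (all bits set).
  constexpr FLUDSFaceNodeSketch() = default;

  /// Pack cell index, face index and face-node index into 64 bits.
  constexpr FLUDSFaceNodeSketch(std::uint32_t cell_idx,
                                std::uint16_t face_idx,
                                std::uint16_t face_node_idx)
    : value_((static_cast<std::uint64_t>(cell_idx) << 32) |
             (static_cast<std::uint64_t>(face_idx) << 16) |
             static_cast<std::uint64_t>(face_node_idx))
  {
  }

  /// Ordering on the packed value, so the type works as a std::map key.
  constexpr bool operator<(const FLUDSFaceNodeSketch& o) const { return value_ < o.value_; }
  constexpr bool operator==(const FLUDSFaceNodeSketch& o) const { return value_ == o.value_; }

  constexpr std::uint32_t GetCellIndex() const { return static_cast<std::uint32_t>(value_ >> 32); }
  constexpr std::uint16_t GetFaceIndex() const
  {
    return static_cast<std::uint16_t>((value_ >> 16) & 0xFFFFU);
  }
  constexpr std::uint16_t GetFaceNodeIndex() const
  {
    return static_cast<std::uint16_t>(value_ & 0xFFFFU);
  }
  constexpr bool IsInitialized() const
  {
    return value_ != std::numeric_limits<std::uint64_t>::max();
  }

private:
  std::uint64_t value_ = std::numeric_limits<std::uint64_t>::max();
};
```

Because the packed value sorts by cell index first, then face, then face node, a `std::map` keyed on this type iterates nodes cell by cell.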
    /**
     * \brief Class that represents a 64-bit integer encoding the index and the host/device information
     * of a face node.
     * \note The index follows the following mutually exclusive rules:
     *   - Boundary status is checked BEFORE local status. Although boundary nodes are treated as
     *     local nodes, their locations are set to an undefined value.
     *   - A node cannot be both a boundary node and a non-local node.
     */
    class CBCD_NodeIndex
    {
    public:
      /// Default constructor
      CBCD_NodeIndex() = default;

      /// Direct assign core value
      constexpr CBCD_NodeIndex(const std::uint64_t& value) : value_(value) {}

      /**
       * Construct a non-boundary node index.
       * \param index Index into the corresponding bank. Cannot exceed 2^60 - 1.
       * \param is_outgoing Flag indicating if the node corresponds to an outgoing face.
       * \param is_local Flag indicating if the index is in a local bank.
       */
      CBCD_NodeIndex(std::uint64_t index, bool is_outgoing, bool is_local)
      {
        if (index >= (std::uint64_t(1) << 60) - 1)
          throw std::runtime_error("Cannot hold an index greater than 2^60.");
        SetInOut(is_outgoing);
        SetLocal(is_local);
        SetBoundary(false);
        SetIndex(index);
      }

      /**
       * Construct a boundary node index.
       * \param index Index into the corresponding bank. Cannot exceed 2^40 - 1.
       * \param is_outgoing Flag indicating if the node corresponds to an outgoing face.
       */
      CBCD_NodeIndex(std::uint64_t index, bool is_outgoing)
      {
        if (index >= (std::uint64_t(1) << 60) - 1)
          throw std::runtime_error("Cannot hold an index greater than 2^60.");
        SetInOut(is_outgoing);
        SetLocal(true);
        SetBoundary(true);
        SetIndex(index);
      }

      /// Check if the current node's index is undefined.
      constexpr bool IsUndefined() const noexcept
      {
        return value_ == std::numeric_limits<std::uint64_t>::max();
      }

      /// Check if the node corresponds to an outgoing face.
      constexpr bool IsOutgoing() const noexcept { return (value_ & inout_bit_mask) != 0; }

      /// Check if the node corresponds to a boundary face.
      constexpr bool IsBoundary() const noexcept { return (value_ & boundary_bit_mask) != 0; }

      /// Check if the node index corresponds to a local face.
      constexpr bool IsLocal() const noexcept { return (value_ & local_bit_mask) != 0; }

      /// Get the node index into the appropriate host/device buffer.
      constexpr std::uint64_t GetIndex() const noexcept { return value_ & index_bit_mask; }

      /// Get the node's core value (can be used with the node tracker map to retrieve a given node's
      /// index into the appropriate host/device buffer).
      constexpr std::uint64_t GetCoreValue() const noexcept { return value_; }

    private:
      /// Incoming/outgoing bit
      /// First bit mask (``1`` followed by 63 zeros).
      static constexpr std::uint64_t inout_bit_mask = std::uint64_t(1) << (64 - 1);

      /// Encode the node as incoming or outgoing.
      constexpr void SetInOut(bool is_outgoing) noexcept
      {
        if (is_outgoing)
          value_ |= inout_bit_mask;
        else
          value_ &= ~inout_bit_mask;
      }

      /// Boundary bit
      /// Second bit mask (``01`` followed by 62 zeros).
      static constexpr std::uint64_t boundary_bit_mask = std::uint64_t(1) << (64 - 2);

      /// Encode the node as either a boundary node or non-boundary node.
      constexpr void SetBoundary(bool is_boundary) noexcept
      {
        if (is_boundary)
          value_ |= boundary_bit_mask;
        else
          value_ &= ~boundary_bit_mask;
      }

      /// Local bit
      /// Third bit mask (``001`` followed by 61 zeros).
      static constexpr std::uint64_t local_bit_mask = std::uint64_t(1) << (64 - 3);

      /// Encode the node as either a local or non-local node.
      constexpr void SetLocal(bool is_local) noexcept
      {
        if (is_local)
          value_ |= local_bit_mask;
        else
          value_ &= ~local_bit_mask;
      }

      // Index bits
      // Index bit mask (``1`` at the last 61 bits)
      static constexpr std::uint64_t index_bit_mask = (std::uint64_t(1) << (64 - 3)) - 1;

      /// Encode the node index.
      constexpr void SetIndex(const std::uint64_t& index) noexcept
      {
        value_ &= ~index_bit_mask;
        value_ |= (index & index_bit_mask);
      }

      /**
       * \brief Core value.
       * \details The value's bits contain the following information:
       *   - 1 bit for incoming/outgoing node
       *   - 1 bit for boundary or non-boundary node
       *   - 1 bit for local or non-local node
       *   - 61 bits for index into the appropriate host/device buffer
       */
      std::uint64_t value_ = std::numeric_limits<std::uint64_t>::max();
    };
Similarly here. Factor the common implementations out and put them in the same file.
Next, use inheritance for the implementation of both CBCD_NodeIndex and AAHD_NodeIndex.
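A possible shape for this refactoring is sketched below, with the caveat that the layout of `AAHD_NodeIndex` is not shown in this thread. The sketch therefore assumes only that both classes wrap a 64-bit core value with flag bits and an index field. `NodeIndexBaseSketch` and `CBCD_NodeIndexSketch` are placeholder names, and the range checks from the quoted constructors are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical base: owns the core value and the generic bit helpers.
class NodeIndexBaseSketch
{
public:
  /// True if the core value is the all-ones sentinel.
  constexpr bool IsUndefined() const noexcept { return value_ == undefined_value; }
  constexpr std::uint64_t GetCoreValue() const noexcept { return value_; }

protected:
  constexpr NodeIndexBaseSketch() = default;

  constexpr bool TestBits(std::uint64_t mask) const noexcept { return (value_ & mask) != 0; }
  constexpr std::uint64_t ReadBits(std::uint64_t mask) const noexcept { return value_ & mask; }
  /// Set or clear the bits selected by mask.
  constexpr void SetFlag(std::uint64_t mask, bool on) noexcept
  {
    value_ = on ? (value_ | mask) : (value_ & ~mask);
  }
  /// Overwrite the bits selected by mask with the matching bits of `bits`.
  constexpr void WriteBits(std::uint64_t mask, std::uint64_t bits) noexcept
  {
    value_ = (value_ & ~mask) | (bits & mask);
  }

private:
  static constexpr std::uint64_t undefined_value = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t value_ = undefined_value;
};

// Derived class: only the masks and the public vocabulary live here.
class CBCD_NodeIndexSketch : public NodeIndexBaseSketch
{
public:
  constexpr CBCD_NodeIndexSketch() = default;
  constexpr CBCD_NodeIndexSketch(std::uint64_t index, bool is_outgoing, bool is_boundary, bool is_local)
  {
    SetFlag(inout_bit_mask, is_outgoing);
    SetFlag(boundary_bit_mask, is_boundary);
    SetFlag(local_bit_mask, is_local);
    WriteBits(index_bit_mask, index);
  }

  constexpr bool IsOutgoing() const noexcept { return TestBits(inout_bit_mask); }
  constexpr bool IsBoundary() const noexcept { return TestBits(boundary_bit_mask); }
  constexpr bool IsLocal() const noexcept { return TestBits(local_bit_mask); }
  constexpr std::uint64_t GetIndex() const noexcept { return ReadBits(index_bit_mask); }

private:
  static constexpr std::uint64_t inout_bit_mask = std::uint64_t(1) << 63;
  static constexpr std::uint64_t boundary_bit_mask = std::uint64_t(1) << 62;
  static constexpr std::uint64_t local_bit_mask = std::uint64_t(1) << 61;
  static constexpr std::uint64_t index_bit_mask = (std::uint64_t(1) << 61) - 1;
};
```

An AAHD counterpart would derive from the same base and supply its own masks, so the set/clear logic is written once.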
    /// Move constructor
    CBCD_FLUDSPointerSet(CBCD_FLUDSPointerSet&&) = default;

    /// Copy and move assignment operator
This is only the copy assignment operator.
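The distinction matters beyond the wording of the comment: a class that declares a move constructor and only a copy assignment operator gets no implicit move assignment, so assigning from an rvalue silently copies. The sketch below shows both variants with hypothetical types; it is not the `CBCD_FLUDSPointerSet` code.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical type with a move constructor but only copy assignment.
struct CopyAssignOnlySketch
{
  CopyAssignOnlySketch() = default;
  CopyAssignOnlySketch(const CopyAssignOnlySketch&) = default;
  CopyAssignOnlySketch(CopyAssignOnlySketch&&) = default;
  /// Copy assignment operator
  CopyAssignOnlySketch& operator=(const CopyAssignOnlySketch&) = default;

  std::shared_ptr<int> data;
};

// Same type with both operators declared and documented separately.
struct CopyAndMoveAssignSketch
{
  CopyAndMoveAssignSketch() = default;
  CopyAndMoveAssignSketch(const CopyAndMoveAssignSketch&) = default;
  CopyAndMoveAssignSketch(CopyAndMoveAssignSketch&&) = default;
  /// Copy assignment operator
  CopyAndMoveAssignSketch& operator=(const CopyAndMoveAssignSketch&) = default;
  /// Move assignment operator
  CopyAndMoveAssignSketch& operator=(CopyAndMoveAssignSketch&&) = default;

  std::shared_ptr<int> data;
};
```

With the first type, `b = std::move(a)` leaves `a.data` intact because the copy assignment operator is selected; with the second, the source pointer is emptied.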