@eappen-nelluvelil
Contributor

This PR implements version 1 of the device CBC sweep chunk kernel.

The following are the major changes:

  • The function SweepScheduler::DeviceScheduleAlgoFIFO performs sweeps asynchronously over all anglesets in a given groupset, using caribou::Stream objects. The main while-loop iterates over all anglesets and checks for cells that can be swept. The ready cells for a given angleset are batched and passed to the device CBC sweep chunk kernel, which is launched asynchronously. The loop then does the same for the next angleset's ready cells. The device sweep chunk kernel works in parallel over a given set of ready cells and over all groups in a given groupset. The CBC_AngleSet's asynchronous communicator still receives upwind angular fluxes and sends downwind angular fluxes as soon as possible, mirroring the host CBC sweep chunk kernel.
  • A device version of the CBC_FLUDS class, CBCD_FLUDS, contains host and device buffers for boundary, local, and non-local angular flux data. Local cell angular flux data is kept on the device. The CBCD_FLUDS class provides methods for asynchronously copying incoming/outgoing boundary and non-local cell angular flux data to the appropriate host and device buffers during sweeps. The transfers use the associated CBC_AngleSet's caribou::Stream.
  • The CBCD_FLUDSCommonData class provides functionality similar to that of the AAHD_FLUDSCommonData class for encoding/decoding the locations that cell face nodes need to be read from and written to in the local, boundary, and non-local host/device buffers.
  • The device CBC sweep chunk kernel is templated on the number of cell spatial DOFs. The device kernel is structured similarly to the device AAH sweep chunk kernel.
  • caribou's Stream class now contains methods for asynchronously copying data between DeviceMemory and HostVector objects using a provided stream.
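
The scheduling loop in the first bullet can be pictured with a host-only sketch. This is not the code from the PR: `AngleSetSketch`, `InitReadyCells`, `SweepReadyBatch`, and `SweepAllAngleSets` are hypothetical names, the dependency bookkeeping is an assumption about how "ready" cells are tracked, and the point where a real implementation would launch the device kernel on the angleset's stream is only marked by a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one angleset's sweep state (not the OpenSn API).
struct AngleSetSketch
{
  std::vector<std::uint32_t> num_unmet_deps;        // per cell: upwind cells not yet swept
  std::vector<std::vector<std::uint32_t>> downwind; // per cell: downwind neighbor cells
  std::vector<std::uint32_t> ready;                 // cells whose dependencies are all met
  std::size_t num_swept = 0;
};

// Seed the ready list with cells that have no upwind dependencies.
inline void InitReadyCells(AngleSetSketch& as)
{
  as.ready.clear();
  for (std::uint32_t c = 0; c < as.num_unmet_deps.size(); ++c)
    if (as.num_unmet_deps[c] == 0)
      as.ready.push_back(c);
}

// Take the current ready cells as one batch and release their downwind cells.
inline std::vector<std::uint32_t> SweepReadyBatch(AngleSetSketch& as)
{
  std::vector<std::uint32_t> batch;
  batch.swap(as.ready);
  // A device implementation would launch the sweep kernel for `batch` here,
  // asynchronously, on this angleset's stream.
  for (auto c : batch)
    for (auto d : as.downwind[c])
      if (--as.num_unmet_deps[d] == 0)
        as.ready.push_back(d);
  as.num_swept += batch.size();
  return batch;
}

// Round-robin over anglesets until none has ready cells; returns the batch count.
inline std::size_t SweepAllAngleSets(std::vector<AngleSetSketch>& angle_sets)
{
  for (auto& as : angle_sets)
    InitReadyCells(as);
  std::size_t num_batches = 0;
  bool work_done = true;
  while (work_done)
  {
    work_done = false;
    for (auto& as : angle_sets)
    {
      if (as.ready.empty())
        continue;
      SweepReadyBatch(as);
      ++num_batches;
      work_done = true;
    }
  }
  return num_batches;
}
```

Because each angleset's batch is only handed off before the loop moves to the next angleset, work from different anglesets can overlap when the hand-off is asynchronous.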

The device CBC sweep chunk kernel passes all of the GPU regression tests except transport_3d_4_cycles_1_gpu.py, because the CBC sweep algorithm does not handle cyclic dependencies.
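
One way to see why a cell-by-cell sweep stalls on cycles, assuming cells become ready only once all of their upwind cells have been swept: cells on a cycle wait on each other and never enter the ready list. The function below is a hypothetical illustration (`CountUnsweepableCells` is not part of the PR) that counts such cells.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns how many cells can never be swept because their upwind dependencies
// are never all satisfied (for example, cells on or downstream of a cycle).
inline std::size_t
CountUnsweepableCells(std::vector<std::uint32_t> num_unmet_deps,
                      const std::vector<std::vector<std::uint32_t>>& downwind)
{
  std::vector<std::uint32_t> ready;
  for (std::uint32_t c = 0; c < num_unmet_deps.size(); ++c)
    if (num_unmet_deps[c] == 0)
      ready.push_back(c);

  std::size_t swept = 0;
  while (!ready.empty())
  {
    const std::uint32_t c = ready.back();
    ready.pop_back();
    ++swept;
    for (auto d : downwind[c])
      if (--num_unmet_deps[d] == 0)
        ready.push_back(d);
  }
  return num_unmet_deps.size() - swept;
}
```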

Caveats:

  • The device CBC sweep chunk kernel cannot yet save cell angular fluxes. This will be added in a later PR.

@eappen-nelluvelil
Contributor Author

@wdarylhawkins @quocdang1998 Requesting review for this PR.

Collaborator

@andrsd left a comment

First rough pass through the PR.

@eappen-nelluvelil force-pushed the cbc-gpu-sweep-chunk-2 branch 6 times, most recently from 9d0105f to f153101 (December 18, 2025 22:59)
Comment on lines +57 to +61
/**
* @brief Get the underlying CUDA stream.
*/
inline ::cudaStream_t get() const { return this->operator->(); }

Collaborator

Remove this. The get() method is already inherited directly from std::shared_ptr.
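
A minimal sketch of the point, assuming (as the comment implies) that the stream wrapper derives publicly from std::shared_ptr. `FakeStreamImpl` and `StreamSketch` are hypothetical stand-ins, not caribou types.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the object a native stream handle points to.
struct FakeStreamImpl
{
  int id;
};

// A wrapper that publicly derives from std::shared_ptr inherits get(),
// operator->, use_count(), and so on, without redeclaring them.
class StreamSketch : public std::shared_ptr<FakeStreamImpl>
{
public:
  explicit StreamSketch(int id)
    : std::shared_ptr<FakeStreamImpl>(std::make_shared<FakeStreamImpl>(FakeStreamImpl{id}))
  {
  }
};
```

With this layout, `s.get()` and `s.operator->()` return the same raw pointer, so a hand-written `get()` that forwards to `operator->()` adds nothing.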

Comment on lines +30 to 41
  if (use_gpus)
  {
    CreateStream();
    AssociateAngleSetWithFLUDS();
  }
}

CBC_AngleSet::~CBC_AngleSet()
{
  if (use_gpus_)
    DestroyStream();
}
Collaborator

Unlike MemoryPinningManager, which is based on std::unique_ptr (non-copyable), Stream is based on std::shared_ptr, which is copyable.
Use std::any here to store the caribou::Stream on the angleset. You won't have to keep track of the creation and destruction of the Stream.
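
A sketch of what this suggestion could look like, with a plain `std::shared_ptr<int>` standing in for the copyable stream handle. `AngleSetWithStream` and its members are hypothetical names. Since std::any requires a copy-constructible value, a shared_ptr-based handle fits, and the header holding the member only needs `<any>`.

```cpp
#include <any>
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical angleset that owns a type-erased, copyable stream handle.
class AngleSetWithStream
{
public:
  // Store any copy-constructible handle; no device header is needed here.
  void SetStream(std::any stream) { stream_ = std::move(stream); }

  bool HasStream() const { return stream_.has_value(); }

  // The caller names the concrete type; std::any_cast throws
  // std::bad_any_cast if the stored type differs.
  template <class T>
  T& GetStream()
  {
    return std::any_cast<T&>(stream_);
  }

private:
  std::any stream_; // released automatically when the angleset is destroyed
};
```

No explicit create/destroy pair is needed: the stored handle's reference count drops when the angleset goes out of scope.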

Comment on lines +6 to +7
#include "external/caribou/caribou.h"
#include "external/caribou/cuda/stream.hpp"
Collaborator

Remove the "external/" prefix.

Comment on lines +8 to +9
#include "framework/logging/log.h"
#include "framework/runtime.h"
Collaborator

Remove these include headers.

Comment on lines +11 to +14
namespace caribou
{
class Stream;
}
Collaborator

Remove these. Use std::any in the header and include caribou/caribou.hpp directly in the .cu files.

Comment on lines +32 to +46
/**
 * \brief Host-analogous version of GetCBCDCellFaceDataIndex for host copy of the map
 * \param host_cell_face_node_map Reference to the host copy of the flattened node index structure.
 * \param cell_local_idx Cell local index.
 * \return Pointer to the indexes of the cell and the number of indexes (total number of cell face
 * nodes) for the cell.
 */
constexpr std::pair<const std::uint64_t*, std::uint64_t>
GetCBCDCellFaceDataIndexHost(const std::vector<uint64_t>& host_cell_face_node_map,
                             const std::uint32_t& cell_local_idx)
{
  const std::uint64_t* cell_face_data =
    host_cell_face_node_map.data() + static_cast<std::uint64_t>(2 * cell_local_idx);
  return {host_cell_face_node_map.data() + cell_face_data[0], cell_face_data[1]};
}
Collaborator

This function does exactly the same thing as GetCBCDCellFaceDataIndex. The only change is the argument type, from a pointer to a std::vector!

When marked with constexpr, the function can be used on both host and device. Find a name that covers both cases.
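
One possible shape for a single lookup, sketched under assumptions: `GetCellFaceDataIndexSketch` is a hypothetical name, and the map layout (an offset/count pair per cell at the front of the flattened array, as read off the quoted hunk) is inferred rather than confirmed. Taking a raw pointer lets host callers pass `vec.data()` while device callers pass the device pointer. This sketch also widens the cell index before multiplying by 2, which the quoted hunk does after.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical unified lookup: returns {pointer to the cell's indices, count}.
// Assumed layout: entry 2*c holds the offset of cell c's indices from the start
// of the map, and entry 2*c + 1 holds how many indices cell c has.
constexpr std::pair<const std::uint64_t*, std::uint64_t>
GetCellFaceDataIndexSketch(const std::uint64_t* cell_face_node_map, std::uint32_t cell_local_idx)
{
  const std::uint64_t* cell_face_data =
    cell_face_node_map + 2 * static_cast<std::uint64_t>(cell_local_idx);
  return {cell_face_node_map + cell_face_data[0], cell_face_data[1]};
}
```

A host caller holding a `std::vector<std::uint64_t>` would call `GetCellFaceDataIndexSketch(map.data(), cell_local_idx)`.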

Comment on lines +57 to +58
* Once created, the indices are utilized by the CBCD_FLUDS to access the correct
* location in the flux storage arrays.
Collaborator

Why is there a weird indent here?

Comment on lines +15 to +68
/**
 * \brief Represents a cell face node.
 * \details This class represents a cell face node. It stores the cell local index, the face index
 * within the cell, and the index of the node relative to the face as a compact 64-bit integer. This
 * class abstracts the node for usage with std::map and std::set.
 */
class CBCD_FaceNode
{
public:
  /// Default constructor.
  constexpr CBCD_FaceNode() = default;

  /// Member constructor.
  constexpr CBCD_FaceNode(std::uint32_t cell_idx,
                          std::uint16_t face_idx,
                          std::uint16_t face_node_idx)
    : value_(0)
  {
    value_ |= static_cast<std::uint64_t>(cell_idx) << 32;
    value_ |= static_cast<std::uint64_t>(face_idx) << 16;
    value_ |= static_cast<std::uint64_t>(face_node_idx);
  }

  /// Comparison operator for ordering.
  constexpr bool operator<(const CBCD_FaceNode& other) const { return value_ < other.value_; }

  /// Equality operator.
  constexpr bool operator==(const CBCD_FaceNode& other) const { return value_ == other.value_; }

  /// Get cell local index.
  constexpr std::uint32_t GetCellIndex() const { return static_cast<std::uint32_t>(value_ >> 32); }

  /// Get face index.
  constexpr std::uint16_t GetFaceIndex() const
  {
    return static_cast<std::uint16_t>((value_ >> 16) & 0xFFFFU);
  }

  /// Get face node index.
  constexpr std::uint16_t GetFaceNodeIndex() const
  {
    return static_cast<std::uint16_t>(value_ & 0xFFFFU);
  }

  /// Check if face node is initialized.
  constexpr bool IsInitialized() const
  {
    return value_ != std::numeric_limits<std::uint64_t>::max();
  }

private:
  /// Core value.
  std::uint64_t value_ = std::numeric_limits<std::uint64_t>::max();
};
Collaborator

It is the exact same implementation as AAHD_FaceNode.

If you're using the same implementation, just put the class implementation into a shared file (something like fluds_common_data_structs.h), rename the class (updating all occurrences), and include the new header in both aahd_structs.h and cbcd_structs.h.

Comment on lines +70 to +200
/**
 * \brief Class that represents a 64-bit integer encoding the index and the host/device information
 * of a face node.
 * \note The index follows the following mutually exclusive rules:
 *   - Boundary status is checked BEFORE local status. Although boundary nodes are treated as
 *     local nodes, their locations are set to an undefined value.
 *   - A node cannot be both a boundary node and a non-local node.
 */
class CBCD_NodeIndex
{
public:
  /// Default constructor
  CBCD_NodeIndex() = default;

  /// Direct assign core value
  constexpr CBCD_NodeIndex(const std::uint64_t& value) : value_(value) {}

  /**
   * Construct a non-boundary node index.
   * \param index Index into the corresponding bank. Cannot exceed 2^60 - 1.
   * \param is_outgoing Flag indicating if the node corresponds to an outgoing face.
   * \param is_local Flag indicating if the index is in a local bank.
   */
  CBCD_NodeIndex(std::uint64_t index, bool is_outgoing, bool is_local)
  {
    if (index >= (std::uint64_t(1) << 60) - 1)
      throw std::runtime_error("Cannot hold an index greater than 2^60.");
    SetInOut(is_outgoing);
    SetLocal(is_local);
    SetBoundary(false);
    SetIndex(index);
  }

  /**
   * Construct a boundary node index.
   * \param index Index into the corresponding bank. Cannot exceed 2^40 - 1.
   * \param is_outgoing Flag indicating if the node corresponds to an outgoing face.
   */
  CBCD_NodeIndex(std::uint64_t index, bool is_outgoing)
  {
    if (index >= (std::uint64_t(1) << 60) - 1)
      throw std::runtime_error("Cannot hold an index greater than 2^60.");
    SetInOut(is_outgoing);
    SetLocal(true);
    SetBoundary(true);
    SetIndex(index);
  }

  /// Check if the current node's index is undefined.
  constexpr bool IsUndefined() const noexcept
  {
    return value_ == std::numeric_limits<std::uint64_t>::max();
  }

  /// Check if the node corresponds to an outgoing face.
  constexpr bool IsOutgoing() const noexcept { return (value_ & inout_bit_mask) != 0; }

  /// Check if the node corresponds to a boundary face.
  constexpr bool IsBoundary() const noexcept { return (value_ & boundary_bit_mask) != 0; }

  /// Check if the node index corresponds to a local face.
  constexpr bool IsLocal() const noexcept { return (value_ & local_bit_mask) != 0; }

  /// Get the node index into the appropriate host/device buffer.
  constexpr std::uint64_t GetIndex() const noexcept { return value_ & index_bit_mask; }

  /// Get the node's core value (can be used with the node tracker map to retrieve a given node's
  /// index into the appropriate host/device buffer).
  constexpr std::uint64_t GetCoreValue() const noexcept { return value_; }

private:
  /// Incoming/outgoing bit
  /// First bit mask (``1`` followed by 63 zeros).
  static constexpr std::uint64_t inout_bit_mask = std::uint64_t(1) << (64 - 1);

  /// Encode the node as incoming or outgoing.
  constexpr void SetInOut(bool is_outgoing) noexcept
  {
    if (is_outgoing)
      value_ |= inout_bit_mask;
    else
      value_ &= ~inout_bit_mask;
  }

  /// Boundary bit
  /// Second bit mask (``01`` followed by 62 zeros).
  static constexpr std::uint64_t boundary_bit_mask = std::uint64_t(1) << (64 - 2);

  /// Encode the node as either a boundary node or non-boundary node.
  constexpr void SetBoundary(bool is_boundary) noexcept
  {
    if (is_boundary)
      value_ |= boundary_bit_mask;
    else
      value_ &= ~boundary_bit_mask;
  }

  /// Local bit
  /// Third bit mask (``001`` followed by 61 zeros).
  static constexpr std::uint64_t local_bit_mask = std::uint64_t(1) << (64 - 3);

  /// Encode the node as either a local or non-local node.
  constexpr void SetLocal(bool is_local) noexcept
  {
    if (is_local)
      value_ |= local_bit_mask;
    else
      value_ &= ~local_bit_mask;
  }

  // Index bits
  // Index bit mask (``1`` at the last 61 bits)
  static constexpr std::uint64_t index_bit_mask = (std::uint64_t(1) << (64 - 3)) - 1;

  /// Encode the node index.
  constexpr void SetIndex(const std::uint64_t& index) noexcept
  {
    value_ &= ~index_bit_mask;
    value_ |= (index & index_bit_mask);
  }

  /**
   * \brief Core value.
   * \details The value's bits contain the following information:
   *   - 1 bit for incoming/outgoing node
   *   - 1 bit for boundary or non-boundary node
   *   - 1 bit for local or non-local node
   *   - 61 bits for index into the appropriate host/device buffer
   */
  std::uint64_t value_ = std::numeric_limits<std::uint64_t>::max();
};
Collaborator

Same here. Factor the common implementation out and put it in the same shared file.
Then use inheritance to implement both CBCD_NodeIndex and AAHD_NodeIndex.
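
A rough sketch of the suggested factoring, assuming the bit layout from the quoted class (one in/out bit, one boundary bit, one local bit, 61 index bits). `NodeIndexBaseSketch` and `CBCD_NodeIndexSketch` are hypothetical names, and the range checks from the quoted constructors are omitted for brevity. The AAH variant's constructors are not shown in this conversation, so the sketch does not guess at them.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical shared base: owns the packed value and the read-only queries.
class NodeIndexBaseSketch
{
private:
  static constexpr std::uint64_t kUndefined = std::numeric_limits<std::uint64_t>::max();
  static constexpr std::uint64_t kInOutBit = std::uint64_t(1) << 63;
  static constexpr std::uint64_t kBoundaryBit = std::uint64_t(1) << 62;
  static constexpr std::uint64_t kLocalBit = std::uint64_t(1) << 61;
  static constexpr std::uint64_t kIndexMask = (std::uint64_t(1) << 61) - 1;

public:
  constexpr NodeIndexBaseSketch() = default;

  constexpr bool IsUndefined() const noexcept { return value_ == kUndefined; }
  constexpr bool IsOutgoing() const noexcept { return (value_ & kInOutBit) != 0; }
  constexpr bool IsBoundary() const noexcept { return (value_ & kBoundaryBit) != 0; }
  constexpr bool IsLocal() const noexcept { return (value_ & kLocalBit) != 0; }
  constexpr std::uint64_t GetIndex() const noexcept { return value_ & kIndexMask; }

protected:
  // Derived classes decide which flag combinations are legal for them.
  constexpr NodeIndexBaseSketch(std::uint64_t index,
                                bool is_outgoing,
                                bool is_boundary,
                                bool is_local) noexcept
    : value_((is_outgoing ? kInOutBit : 0) | (is_boundary ? kBoundaryBit : 0) |
             (is_local ? kLocalBit : 0) | (index & kIndexMask))
  {
  }

private:
  std::uint64_t value_ = kUndefined;
};

// CBC-flavored index: only adds its own constructors on top of the base.
class CBCD_NodeIndexSketch : public NodeIndexBaseSketch
{
public:
  constexpr CBCD_NodeIndexSketch() = default;

  /// Non-boundary node.
  constexpr CBCD_NodeIndexSketch(std::uint64_t index, bool is_outgoing, bool is_local) noexcept
    : NodeIndexBaseSketch(index, is_outgoing, false, is_local)
  {
  }

  /// Boundary node (flagged as local, as in the quoted class).
  constexpr CBCD_NodeIndexSketch(std::uint64_t index, bool is_outgoing) noexcept
    : NodeIndexBaseSketch(index, is_outgoing, true, true)
  {
  }
};
```

An AAH-specific class would derive from the same base and supply only its own constructors, so the masks and getters live in one place.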

/// Move constructor
CBCD_FLUDSPointerSet(CBCD_FLUDSPointerSet&&) = default;

/// Copy and move assignment operator
Collaborator

This is only the copy assignment operator.
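
For reference, a generic illustration of the distinction, using a hypothetical `PointerSetSketch` rather than the class under review: the copy and move assignment operators are separate declarations, so each gets its own doc comment.

```cpp
#include <cassert>
#include <utility>

// Hypothetical pointer-set type with each special member documented separately.
struct PointerSetSketch
{
  /// Default constructor
  PointerSetSketch() = default;

  /// Copy constructor
  PointerSetSketch(const PointerSetSketch&) = default;

  /// Move constructor
  PointerSetSketch(PointerSetSketch&&) = default;

  /// Copy assignment operator
  PointerSetSketch& operator=(const PointerSetSketch&) = default;

  /// Move assignment operator
  PointerSetSketch& operator=(PointerSetSketch&&) = default;

  double* data = nullptr;
};
```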
