
feat(hesai): add CUDA-accelerated point cloud decoder#421

Draft
k1832 wants to merge 2 commits into tier4:main from k1832:feat/core-cuda-decode

Conversation


@k1832 k1832 commented Mar 19, 2026

PR Type

  • New Feature

Related Links

Description

Add a GPU-accelerated decode path for Hesai LiDAR sensors using CUDA. The feature is:

  • Compile-time opt-in: Build with -DBUILD_CUDA=ON. When the CUDA toolkit is not found, the build silently falls back to CPU-only.
  • Runtime opt-in: Set the NEBULA_USE_CUDA=1 environment variable. When it is unset, the existing CPU path is used with zero overhead.

What it does

  • Processes an entire scan in a single batched CUDA kernel launch (launch_decode_hesai_scan_batch)
  • Uses pre-computed angle lookup tables (azimuth/elevation) uploaded to GPU once at initialization
  • Supports calibration-based and correction-based angle correctors
  • Currently validated on OT128 (Pandar128E4X) sensor

Files changed

| File | Change |
| --- | --- |
| hesai_cuda_kernels.cu | New CUDA kernel for batched point cloud decoding |
| hesai_cuda_decoder.hpp | GPU buffer management, angle LUT, device memory |
| hesai_decoder.hpp | Integration: GPU scan buffer, flush, result conversion |
| hesai_sensor.hpp | Expose max_scan_buffer_points() for GPU buffer sizing |
| angle_corrector_*.hpp | Expose angle LUT data for GPU upload |
| nebula_hesai_decoders/CMakeLists.txt | CUDA library target, toolkit detection |
| nebula_hesai/CMakeLists.txt | CUDA decoder test target |
| hesai_cuda_decoder_test.cpp | 5 GPU-vs-CPU equivalence tests |

Known limitations

  • GPU kernel does not set return_type field (always 0)
  • Scan boundary detection differs from CPU's ScanCutter, causing up to ~1850 points to shift between adjacent scans (out of ~72k per scan)

Review Procedure

Build (with CUDA)

colcon build --packages-up-to nebula_hesai \
  --cmake-args -DBUILD_CUDA=ON -DBUILD_TESTING=ON

Requires NVIDIA CUDA Toolkit (tested with CUDA 12.x). If the toolkit is not found, the build succeeds but CUDA support is silently disabled.

Running with CUDA enabled

The GPU decode path is gated by a runtime environment variable:

# Enable GPU decoding
export NEBULA_USE_CUDA=1

# Launch the driver node as usual — it will log "GPU scan batching enabled" on startup
ros2 launch nebula_hesai ...

# To disable (default), unset the variable
unset NEBULA_USE_CUDA

Test

# Run all tests (132 existing + 5 new CUDA tests)
source install/setup.bash
colcon test --packages-select nebula_hesai --ctest-args -V

# Or run CUDA tests only
./build/nebula_hesai/hesai_cuda_decoder_test_main

Test results

[==========] Running 5 tests from 1 test suite.
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence
[       OK ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence (21778 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty
[       OK ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty (388 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuFieldValidity
[       OK ] HesaiCudaDecoderTest.OT128_GpuFieldValidity (378 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts
[       OK ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts (369 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_IntensityExactMatch
[       OK ] HesaiCudaDecoderTest.OT128_IntensityExactMatch (17217 ms)
[  PASSED  ] 5 tests.

# Full suite
Summary: 137 tests, 0 errors, 0 failures, 0 skipped

Remarks

  • When CUDA is not compiled in (BUILD_CUDA=OFF), the 5 CUDA tests are compiled but skip at runtime via GTEST_SKIP(), so they do not break CPU-only CI.
  • Tolerances in the equivalence tests were derived from a single OT128 rosbag. See test file header for observed values.

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages follow the guidelines
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

k1832 added 2 commits March 19, 2026 14:57
Add a GPU decode path for Hesai LiDAR sensors, gated behind compile-time
BUILD_CUDA=ON and runtime NEBULA_USE_CUDA=1 environment variable.

The implementation includes:
- CUDA kernel for batched point cloud decoding (hesai_cuda_kernels.cu)
- Angle LUT upload and GPU scan buffer management in hesai_decoder.hpp
- GPU-vs-CPU equivalence tests for OT128 (Pandar128E4X) sensor

The GPU path processes an entire scan in a single kernel launch, using
pre-computed angle lookup tables and a sparse output buffer. When CUDA
is not available or NEBULA_USE_CUDA is unset, the existing CPU path is
used with zero overhead.
- Copyright year 2024 -> 2026 for new files
- Replace deprecated find_package(CUDA) with find_package(CUDAToolkit)
- Remove --expt-relaxed-constexpr flag (not needed)
- Remove unused per-packet kernel and launcher (dead code)
- Batch launcher returns bool; caller logs via NEBULA_LOG_STREAM
- Reorder CudaNebulaPoint fields for better memory packing
- Remove redundant is_multi_frame member; use n_frames > 1
- Make HesaiCudaDecoder destructor virtual
- Add int32_t range guarantee comment in angle corrector
