
feat(hesai): add CUDA-accelerated point cloud decoder#421

Draft
k1832 wants to merge 2 commits into tier4:main from k1832:feat/core-cuda-decode

Conversation


@k1832 k1832 commented Mar 19, 2026

PR Type

  • New Feature

Related Links

Description

Add a GPU-accelerated decode path for Hesai LiDAR sensors using CUDA. The feature is:

  • Compile-time opt-in: Build with -DBUILD_CUDA=ON. When the CUDA toolkit is not found, the build silently falls back to CPU-only.
  • Runtime opt-in: Set the NEBULA_USE_CUDA=1 environment variable. When it is unset, the existing CPU path is used with zero overhead.

What it does

  • Processes an entire scan in a single batched CUDA kernel launch (launch_decode_hesai_scan_batch)
  • Uses pre-computed angle lookup tables (azimuth/elevation) uploaded to GPU once at initialization
  • Supports calibration-based and correction-based angle correctors
  • Currently validated on OT128 (Pandar128E4X) sensor

Files changed

| File | Change |
| --- | --- |
| hesai_cuda_kernels.cu | New CUDA kernel for batched point cloud decoding |
| hesai_cuda_decoder.hpp | GPU buffer management, angle LUT, device memory |
| hesai_decoder.hpp | Integration: GPU scan buffer, flush, result conversion |
| hesai_sensor.hpp | Expose max_scan_buffer_points() for GPU buffer sizing |
| angle_corrector_*.hpp | Expose angle LUT data for GPU upload |
| nebula_hesai_decoders/CMakeLists.txt | CUDA library target, toolkit detection |
| nebula_hesai/CMakeLists.txt | CUDA decoder test target |
| hesai_cuda_decoder_test.cpp | 5 GPU-vs-CPU equivalence tests |

Known limitations

  • GPU kernel does not set return_type field (always 0)
  • Scan boundary detection differs from CPU's ScanCutter, causing up to ~1850 points to shift between adjacent scans (out of ~72k per scan)

Review Procedure

Build (with CUDA)

colcon build --packages-up-to nebula_hesai \
  --cmake-args -DBUILD_CUDA=ON -DBUILD_TESTING=ON

Requires NVIDIA CUDA Toolkit (tested with CUDA 12.x). If the toolkit is not found, the build succeeds but CUDA support is silently disabled.

Running with CUDA enabled

The GPU decode path is gated by a runtime environment variable:

# Enable GPU decoding
export NEBULA_USE_CUDA=1

# Launch the driver node as usual — it will log "GPU scan batching enabled" on startup
ros2 launch nebula_hesai ...

# To disable (default), unset the variable
unset NEBULA_USE_CUDA

Test

# Run all tests (132 existing + 5 new CUDA tests)
source install/setup.bash
colcon test --packages-select nebula_hesai --ctest-args -V

# Or run CUDA tests only
./build/nebula_hesai/hesai_cuda_decoder_test_main

Test results

[==========] Running 5 tests from 1 test suite.
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence
[       OK ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence (21778 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty
[       OK ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty (388 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuFieldValidity
[       OK ] HesaiCudaDecoderTest.OT128_GpuFieldValidity (378 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts
[       OK ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts (369 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_IntensityExactMatch
[       OK ] HesaiCudaDecoderTest.OT128_IntensityExactMatch (17217 ms)
[  PASSED  ] 5 tests.

# Full suite
Summary: 137 tests, 0 errors, 0 failures, 0 skipped

Remarks

  • When CUDA is not compiled in (BUILD_CUDA=OFF), the 5 CUDA tests are compiled but skip at runtime via GTEST_SKIP(), so they do not break CPU-only CI.
  • Tolerances in the equivalence tests were derived from a single OT128 rosbag. See test file header for observed values.

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages follow the guidelines
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

k1832 added 2 commits March 19, 2026 14:57
Add a GPU decode path for Hesai LiDAR sensors, gated behind compile-time
BUILD_CUDA=ON and runtime NEBULA_USE_CUDA=1 environment variable.

The implementation includes:
- CUDA kernel for batched point cloud decoding (hesai_cuda_kernels.cu)
- Angle LUT upload and GPU scan buffer management in hesai_decoder.hpp
- GPU-vs-CPU equivalence tests for OT128 (Pandar128E4X) sensor

The GPU path processes an entire scan in a single kernel launch, using
pre-computed angle lookup tables and a sparse output buffer. When CUDA
is not available or NEBULA_USE_CUDA is unset, the existing CPU path is
used with zero overhead.
- Copyright year 2024 -> 2026 for new files
- Replace deprecated find_package(CUDA) with find_package(CUDAToolkit)
- Remove --expt-relaxed-constexpr flag (not needed)
- Remove unused per-packet kernel and launcher (dead code)
- Batch launcher returns bool; caller logs via NEBULA_LOG_STREAM
- Reorder CudaNebulaPoint fields for better memory packing
- Remove redundant is_multi_frame member; use n_frames > 1
- Make HesaiCudaDecoder destructor virtual
- Add int32_t range guarantee comment in angle corrector
