Pointelligence is a repository for 3D point cloud research, featuring:
- The official implementation of PointCNN++ (CVPR 2026) -- a significant next evolution of PointCNN (NeurIPS 2018).
- Pointelligence-MLX -- the first complete port of PointCNN++ to Apple MLX, enabling native 3D point cloud inference and training on Apple Silicon with zero CUDA dependency.
MLX port built by AIFLOW LABS / RobotFlow Labs -- pioneering Apple Silicon ML infrastructure for robotics and spatial intelligence.
| | PyTorch + CUDA/Triton (Original) | MLX (Apple Silicon Port) |
|---|---|---|
| Hardware | NVIDIA GPU + CUDA toolkit | Any Mac with Apple Silicon (M1-M4) |
| Kernel backend | Triton kernels (Linux only) + CUDA C++ | Pure Python + MLX Metal backend |
| Build step | torch.utils.cpp_extension compilation | Zero compilation -- just pip install |
| Memory model | GPU memory transfers | Unified memory -- zero-copy CPU/GPU |
| Tests | PyTorch test suite | 344 tests, all passing |
| Training | Full pipeline | Full pipeline (verified loss convergence) |
| Location | Root directory | mlx_port/ directory |
Clone the repository with third-party submodules (FCGF and Pointcept) recursively:
git clone --recursive https://github.com/ant-research/pointelligence.git
cd pointelligence

For reproducibility, check out the following commits in the submodules:
# FCGF (examples/FCGF)
cd examples/FCGF && git checkout pointcnnpp-version && cd ../..
# Pointcept (examples/Pointcept)
cd examples/Pointcept && git checkout pointcnnpp-version && cd ../..

If you have already cloned without --recursive, run git submodule update --init --recursive to fetch the submodules.
Some operators are implemented in C++/CUDA as PyTorch extensions, which can be built and installed with the following commands:
cd extensions
pip install --no-build-isolation -e .

The same steps apply to the extension modules in examples/FCGF and in examples/Pointcept.
A complete port of PointCNN++ from PyTorch + Triton + CUDA to Apple MLX, enabling native 3D point cloud inference and training on M1/M2/M3/M4 Macs.
This is not a toy demo. It's a complete, tested, trainable port of a production research framework with custom sparse convolution kernels.
cd mlx_port
# Setup with uv (recommended)
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"
# Verify installation
python -c "import pointelligence_mlx; print(f'v{pointelligence_mlx.__version__}')"
# Output: v0.1.0
# Run full test suite
pytest tests/ -v
# Output: 344 passed

Alternative: pip install
cd mlx_port
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

| Kernel | Original | MLX Port | Purpose |
|---|---|---|---|
| MVMR | Triton + CUDA (warp shuffle, atomicAdd) | Blocked scatter-add with @mx.custom_function VJP | Core conv forward -- sparse matrix-vector multiply-reduce |
| VVOR | Triton + CUDA (warp shuffle, atomicAdd) | Blocked outer product + scatter-add with VJP | Core conv backward -- weight gradient computation |
| Indexed Distance | Triton (11 autotune configs) | Vectorized MLX gather + reduce | L2/Chebyshev distance for neighbor filtering |
| Indexed Segment Reduce | Triton (4 autotune configs) | Scatter-add/max/min with @mx.custom_function | Per-segment sum/mean/max/min with gather indirection |
| Large Segment Reduce | Triton (dual-path fast/slow) | Segment-ID scatter with @mx.custom_function | Contiguous segment reduction for norms and pooling |
Every kernel has a custom backward pass (@mx.custom_function + .vjp) enabling full gradient flow for training.
| Layer | What It Does |
|---|---|
| PointConv3d | Sparse 3D convolution on native point clouds via MVMR |
| RaggedLayerNorm / InstanceNorm / BatchNorm / GroupNorm | Normalization over variable-length batches |
| GlobalPool | Per-sample mean/max/min aggregation |
| max_pool3d | 3D max pooling with stride |
| voxelize_3d + build_triplets | Spatial indexing: radius search, grid hashing, kernel assignment |
| Downsample / Upsample | Resolution changes via grid sampling |
| MultiSequential | Sequential container for (features, metadata) tuples |
| MetaData | Ragged batch container with lazy triplet building |
| Component | What It Does |
|---|---|
| Grid Lookup | O(1) spatial hashing for 3D points |
| Radius Search | Grid-accelerated neighbor finding |
| Grid Sample | Voxel-based point cloud downsampling (4 modes) |
| Ragged Indexing | repeat_interleave, arrange_indices, cumsum_exclusive |
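As a rough illustration of the ragged indexing helpers, here is a NumPy sketch of cumsum_exclusive and repeat_interleave. The names come from the table above; the implementations are our own sketch of the semantics, not the library's code:

```python
import numpy as np

def cumsum_exclusive(sizes):
    """Exclusive prefix sum: the start offset of each ragged segment."""
    out = np.zeros(len(sizes), dtype=np.int64)
    np.cumsum(sizes[:-1], out=out[1:])
    return out

def repeat_interleave(values, repeats):
    """Repeat each value repeats[i] times, like np.repeat."""
    return np.repeat(values, repeats)

sizes = np.array([2, 7, 4])
offsets = cumsum_exclusive(sizes)                  # [0, 2, 9]
sample_ids = repeat_interleave(np.arange(3), sizes)
# sample_ids maps every packed point back to its sample:
# [0 0 1 1 1 1 1 1 1 2 2 2 2]
```

Together these two primitives convert between a per-sample view (sizes) and a flat packed view (offsets, sample ids), which is the backbone of ragged-batch processing.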
The key innovation is replacing CUDA/Triton's atomicAdd scatter pattern with MLX's functional arr.at[indices].add(values):
# CUDA/Triton: atomicAdd to shared output
tl.atomic_add(output_ptr + o_idx, accumulated_value)
# MLX: functional scatter-add (handles duplicate indices correctly)
output = output.at[o_idx].add(products)

Custom backward passes use @mx.custom_function with .vjp:
- MVMR backward: grad_weights via VVOR, grad_inputs via transposed MVMR
- VVOR backward: both gradients via MVMR with transposed arguments
- Late imports resolve the MVMR-VVOR circular dependency
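Why a dedicated scatter-add matters is easy to reproduce in NumPy, whose np.add.at mirrors the duplicate-index semantics of MLX's arr.at[idx].add(...):

```python
import numpy as np

# Plain fancy-index assignment silently drops duplicate contributions...
out = np.zeros(3)
idx = np.array([0, 0, 2])
vals = np.array([1.0, 2.0, 3.0])
out[idx] = out[idx] + vals      # index 0 keeps only the LAST write
# out -> [2., 0., 3.]  (the 1.0 was lost)

# ...whereas an unbuffered scatter-add accumulates all of them,
# matching arr.at[idx].add(vals) in MLX:
out2 = np.zeros(3)
np.add.at(out2, idx, vals)
# out2 -> [3., 0., 3.]
```

This is exactly the hazard that atomicAdd solves on CUDA and that the functional .at[].add() API solves in MLX.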
See mlx_port/prds/ for full technical specifications of each ported component (PRD-01 through PRD-12).
344 tests across 12 test modules, validated from a fresh install:
tests/test_smoke.py 34 tests Package imports, MLX env, stub verification
tests/test_harness.py 37 tests Tensor conversion, check_all_close, gradient check
tests/test_ragged_tensors.py 59 tests Cumsum, repeat_interleave, arrange_indices, MetaData
tests/test_indexed_distance.py 23 tests L2/Chebyshev distance, cross-framework vs PyTorch
tests/test_indexed_segment_reduce.py 36 tests SUM/MEAN/MAX/MIN forward + backward, C in {1..128}
tests/test_large_segment_reduce.py 26 tests 5 segment configs, float16, backward gradients
tests/test_mvmr.py 24 tests Forward correctness, backward VJP, production scale
tests/test_vvor.py 20 tests Forward correctness, backward VJP, MVMR integration
tests/test_grid_neighbors.py 29 tests Grid lookup, radius search vs scipy, grid sample
tests/test_core_layers.py 25 tests PointConv3d, GlobalPool, triplets, norms, downsample
tests/test_resnet.py 22 tests All 10 model variants, forward pass, BasicBlock/Bottleneck
tests/test_training.py 9 tests Optimizer creation, train step, loss convergence
─────────────────────────────────────────────────────────────────────────
TOTAL 344 tests All passing (verified from fresh /tmp install)
Tests compare MLX outputs against PyTorch references with operation-specific tolerances:
| Operation | Tolerance | Why |
|---|---|---|
| Distance (L2) | atol=1e-5 | Simple math, near-exact |
| Segment reduce (sum/mean) | atol=1e-4 | Accumulation order may differ |
| Segment reduce (max/min) | atol=1e-6 | Exact comparison |
| MVMR/VVOR | atol=1e-2 | Atomic scatter-add ordering |
| ResNet forward | atol=1e-3 | Deep composition |
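A comparison helper in this spirit might look like the following sketch. The tolerance values mirror the table above, but check_close and the TOLERANCES dict are illustrative names, not the library's actual harness API:

```python
import numpy as np

# Per-operation absolute tolerances, taken from the table above.
TOLERANCES = {
    "indexed_distance": 1e-5,
    "segment_reduce_sum": 1e-4,
    "segment_reduce_max": 1e-6,
    "mvmr": 1e-2,
    "resnet_forward": 1e-3,
}

def check_close(result, reference, op):
    """Assert MLX output matches the PyTorch reference within op's tolerance."""
    atol = TOLERANCES[op]
    if not np.allclose(result, reference, atol=atol):
        worst = np.max(np.abs(result - reference))
        raise AssertionError(f"{op}: max abs diff {worst:.2e} exceeds atol {atol:.0e}")

check_close(np.array([1.00001]), np.array([1.0]), "segment_reduce_sum")  # within 1e-4
```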
# Run with verbose output
cd mlx_port && pytest tests/ -v --tb=short
# Run specific kernel tests
pytest tests/test_mvmr.py -v
# Run with PyTorch cross-framework comparison
pytest tests/ -v -m "requires_torch"
# Benchmarks
pytest tests/ -v -m "benchmark"

10 ResNet variants for 3D point cloud classification:
| Model | Block | Layers | Use Case |
|---|---|---|---|
| resnet18 | BasicBlock | [2,2,2,2] | Fast prototyping, small datasets |
| resnet34 | BasicBlock | [3,4,6,3] | Balanced speed/accuracy |
| resnet50 | Bottleneck | [3,4,6,3] | Standard benchmark model |
| resnet101 | Bottleneck | [3,4,23,3] | High accuracy |
| resnet152 | Bottleneck | [3,8,36,3] | Maximum accuracy |
| resnext50_32x4d | Bottleneck | [3,4,6,3] | Grouped convolutions |
| resnext101_32x8d | Bottleneck | [3,4,23,3] | Large grouped model |
| resnext101_64x4d | Bottleneck | [3,4,23,3] | Wide grouped model |
| wide_resnet50_2 | Bottleneck | [3,4,6,3] | Double-width channels |
| wide_resnet101_2 | Bottleneck | [3,4,23,3] | Maximum capacity |
from pointelligence_mlx.models.resnet import resnet18, resnet50
# Classification on 40 classes (e.g., ModelNet40)
model = resnet18(num_classes=40, in_channels=1)
# With more features
model = resnet50(num_classes=40, in_channels=6)  # XYZ + normals

End-to-end training verified -- loss converges from ~1.8 to ~0.0 on synthetic data:
import mlx.core as mx
from pointelligence_mlx.models.resnet import resnet18
from pointelligence_mlx.training.trainer import create_optimizer, create_loss_and_grad_fn, train_step
from pointelligence_mlx.training.data import generate_synthetic_batch
# Create model and optimizer
model = resnet18(num_classes=10, in_channels=1)
optimizer = create_optimizer(lr=0.01, opt_type="sgd")
loss_and_grad_fn = create_loss_and_grad_fn(optimizer)
# Training loop
for epoch in range(50):
    x, points, sizes, grid_size, labels = generate_synthetic_batch(
        batch_size=4, points_per_sample=100, num_classes=10
    )
    loss = train_step(model, x, points, sizes, grid_size, labels, loss_and_grad_fn)
    mx.eval(loss)
    print(f"Epoch {epoch:3d}: loss={loss.item():.4f}")

create_optimizer supports SGD, Adam, and AdamW:

create_optimizer(lr=0.1, momentum=0.9, weight_decay=1e-4, opt_type="sgd")
create_optimizer(lr=0.001, opt_type="adam")
create_optimizer(lr=0.001, weight_decay=0.01, opt_type="adamw")

mlx_port/
  src/pointelligence_mlx/
    sparse_engines/               # 5 custom MLX kernels (Triton/CUDA replacements)
      mvmr.py                     # Sparse matrix-vector multiply-reduce + VJP
      vvor.py                     # Sparse vector-vector outer product reduce + VJP
      indexed_distance.py         # L2/L-inf pairwise distance via index pairs
      indexed_segment_reduce.py   # Per-segment reduction with gather + VJP
      large_segment_reduce.py     # Contiguous segment reduction + VJP
      metal/                      # Reserved for Metal shader sources
    internals/                    # Spatial data structures & ragged tensor utilities
      constants.py                # Cached tensor allocations (shifts, arange buffer)
      indexing.py                 # repeat_interleave, arrange_indices, cumsum
      grid_lookup.py              # Spatial hashing, LookupStruct, grid queries
      grid_sample.py              # Voxel-based point cloud downsampling
      neighbors.py                # Radius search, neighbor clipping, nearest neighbor
    layers/                       # Neural network layers (mlx.nn.Module)
      conv.py                     # PointConv3d, GeneralConv, conv_with_stride
      norm.py                     # RaggedBatch/Instance/Layer/GroupNorm
      pooling.py                  # GlobalPool, max_pool3d
      triplets.py                 # voxelize_3d, build_triplets, handle_stride
      downsample.py               # Grid-sample downsampling
      upsample.py                 # Learnable upsampling with cached triplets
      metadata.py                 # MetaData ragged batch container
      multi_sequential.py         # Multi-input sequential container
    models/                       # Complete classification models
      resnet.py                   # 10 ResNet/ResNeXt/WideResNet variants
    training/                     # Training pipeline
      trainer.py                  # train_step, create_optimizer, loss+grad
      data.py                     # Synthetic 3D shape generation
  tests/                          # 344 tests across 12 modules
  prds/                           # PRD-01 through PRD-12: technical specifications
| Requirement | Version |
|---|---|
| macOS | Apple Silicon (M1/M2/M3/M4) |
| Python | >= 3.10 |
| MLX | >= 0.31.0 |
| NumPy | >= 1.24.0 |
| SciPy | >= 1.10.0 |
Development extras (for testing):
uv pip install -e ".[dev]"  # Adds pytest, pytest-benchmark, torch

While we provide a suite of ready-to-use backbones, our framework is explicitly designed to facilitate the construction of custom network architectures from scratch. We review several key concepts below; combined with the reference implementations in the models directory, these resources are intended to help users quickly master the library's workflow.
The total number of points often varies significantly from one sample to another within a single batch. As illustrated below, the straightforward approach deals with this irregularity (e.g., samples having 2, 7, and 4 points) by forcing data into fixed-size dense tensors. While this satisfies the rigid structural requirements of standard frameworks, it overlooks the data's inherent sparsity. Smaller samples must be padded out with non-existent "ghost" data, squandering significant memory and compute cycles on empty space.
Scenario: A batch of 3 irregular samples with 2, 7, and 4 points respectively.
Legend:
[P] = Valid Point/Feature Data
[.] = "Ghost Data" (Padding/Wasted Memory)
+-----------------------------------------------------------+
| THE STRAIGHTFORWARD APPROACH: Fixed-Size Dense Tensor |
| Status: WASTEFUL. Forces data to match the largest dim. |
+-----------------------------------------------------------+
To create a uniform grid, every sample must be padded to match
the largest necessary dimension (at least 7).
Batch Memory Layout (Fixed grid):
Row 0 (Sample 1): [P][P][.][.][.][.][.]
Row 1 (Sample 2): [P][P][P][P][P][P][P]
Row 2 (Sample 3): [P][P][P][P][.][.][.]
^Valid^ ^Wasted^
VISUAL RESULT: Significant portions of memory are useless padding.
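The waste is easy to quantify for this toy batch:

```python
# Dense padding overhead for the batch above (sizes 2, 7, 4):
sizes = [2, 7, 4]
dense_slots = len(sizes) * max(sizes)   # 3 rows x 7 columns = 21 slots
valid = sum(sizes)                      # 13 real points
wasted = dense_slots - valid            # 8 ghost slots
print(f"{wasted / dense_slots:.0%} of the dense tensor is padding")
```

Even in this tiny example over a third of the tensor is ghost data, and the ratio worsens as the size spread between samples grows.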
Ragged tensors represent a dedicated solution to the inefficiency shown above. As visualized below, this format is explicitly designed to handle irregularity. Instead of maintaining separate, padded rows in a grid, the Ragged tensor stores the entire batch as a single, contiguous sequence containing only valid points. This packed data is managed by a lightweight auxiliary metadata structure that tracks the individual sample sizes. By eliminating padding entirely, this approach ensures processing applies only to actual geometric and feature data.
+-----------------------------------------------------------+
| THE DEDICATED SOLUTION: Ragged Tensor |
| Status: EFFICIENT. Stores only what exists. |
+-----------------------------------------------------------+
[A. Contiguous Data Storage]
Padding is eliminated. The entire batch is flattened into one
tightly packed sequence of valid data only:
Memory Layout:
[P][P] [P][P][P][P][P][P][P] [P][P][P][P]
\___/ \___________________/ \_________/
S1(2) S2(7) S3(4)
VISUAL RESULT: Zero wasted space.
[B. Auxiliary Metadata]
A small, separate structure tracks the individual sample sizes:
Sample Sizes: [ 2, 7, 4 ]
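In code, the packed-data-plus-sizes layout might look like this minimal NumPy sketch (an illustration of the concept, not the library's MetaData class):

```python
import numpy as np

# One packed array of valid points plus a sizes array; per-sample
# views are recovered from the prefix-sum offsets.
data = np.arange(13)                                 # 2 + 7 + 4 = 13 packed points
sizes = np.array([2, 7, 4])
offsets = np.concatenate([[0], np.cumsum(sizes)])    # [0, 2, 9, 13]

samples = [data[offsets[i]:offsets[i + 1]] for i in range(len(sizes))]
# samples[0] -> [0 1], samples[1] -> [2..8], samples[2] -> [9..12]
```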
We employ a voxel-based strategy using the
Because the
The
Unlike in image processing where target pixel locations are fixed, the spatial locations of upsampled points in a point cloud are not inherently known. Therefore, a recommended practice is to explicitly reuse the retained set of original, pre-downsampled points as the upsampling target. This approach is both efficient and unambiguous. Note that the retention and retrieval of these original points must be managed by the enclosing pipeline or calling procedure.
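A minimal sketch of that pattern, with hypothetical names (the enclosing pipeline caches each level's points so a later upsample can target them; the downsampling step here is a stand-in, not the library's grid sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
points_l0 = rng.random((100, 3))     # original-resolution points

# The pipeline retains each resolution level's points...
cache = [points_l0]
keep = np.arange(0, 100, 4)          # stand-in for a real grid-downsample step
points_l1 = points_l0[keep]
cache.append(points_l1)

# ...so that upsampling from level 1 targets the retained level-0 points:
upsample_targets = cache[-2]
```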
We opt for a fixed radius search over a fixed-number (K-Nearest Neighbors, or KNN) search, as its spatially-local receptive field is better suited for spatial learning, whereas KNN is often a choice imposed by architectural limitations. Given sets of "source" and "query" points—which may be identical—and a specified radius, the
As illustrated in the vertical flow diagram below, the connectivity of a batched point cloud is most efficiently represented as a unified list of (i, j) pairs:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 1 2 3 4 5 6 7 8 9 10 11 12
| | | | | | | | | | | | |
|-. |-. | |-.-. |-.-. |-.-. |-. |-. | |--.--.--. |--. |--. |--.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | |
v v v v | v v v v v v v v v v v v v | v v v v v v v v v v
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
The connections from the diagram above, flattened into two parallel arrays:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 0 1 1 2 3 3 3 4 4 4 5 5 5 6 6 7 7 8 9 9 9 9 10 10 11 11 12 12
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
+---------------------------------------------------------------------------------+
| THE UNIFIED LIST: (i, j) pairs |
| Status: EFFICIENT. A compact stream of all neighborhood edges in the batch. |
+---------------------------------------------------------------------------------+
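For reference, a brute-force O(N^2) construction of these pair arrays can be sketched in NumPy. This only illustrates the output format; the library's grid-accelerated radius search is far more efficient:

```python
import numpy as np

def radius_pairs(points, sizes, radius):
    """Per-sample all-pairs radius search, flattened into (i, j) arrays."""
    qi, nj = [], []
    offset = 0
    for n in sizes:
        p = points[offset:offset + n]
        # Full pairwise distance matrix within this sample (O(n^2)).
        d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
        i, j = np.nonzero(d <= radius)
        qi.append(i + offset)        # global query indices
        nj.append(j + offset)        # global neighbor indices
        offset += n
    return np.concatenate(qi), np.concatenate(nj)

pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0]])
qi, nj = radius_pairs(pts, [3], 0.5)
# pairs: (0,0) (0,1) (1,0) (1,1) (2,2) -- each point neighbors itself
```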
The full process of convolution on native points involves four steps: output-location generation, neighborhood search, convolution-triplet construction, and Matrix-Vector Multiplication and Reduction (MVMR). The actual convolution arithmetic occurs in the final MVMR stage; the preceding three steps structure the input data for that calculation.
This step defines the spatial centers for the convolution operations. Depending on the desired architectural effect, the output locations are generated in one of three ways:
- Standard Convolution ($stride = 1$): The input points serve directly as the output locations.
- Strided Convolution ($stride > 1$): Output locations are generated by downsampling the input points via $grid_sample_filter$ with a target grid size of $grid_size_{input} \times stride$ (see Downsampling). Note that the $grid_size$ of the output point cloud is $grid_size_{input} \times stride$.
- Transposed Convolution (Upconvolution): The output locations are explicitly set to the pre-calculated "upsampled" points, as described in the Upsampling section.
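For the strided case, a minimal NumPy sketch of voxel-grid downsampling is shown below. The library's grid_sample_filter supports several selection modes; this sketch simply keeps the first point encountered in each occupied voxel:

```python
import numpy as np

def grid_downsample(points, grid_size, stride=2):
    """Snap points to a coarser grid (grid_size * stride) and keep one per voxel."""
    voxels = np.floor(points / (grid_size * stride)).astype(np.int64)
    # return_index gives the first occurrence of each unique voxel row.
    _, keep = np.unique(voxels, axis=0, return_index=True)
    return points[np.sort(keep)]

pts = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0],   # same voxel as the first point
                [1.0, 0.0, 0.0]])
out = grid_downsample(pts, grid_size=0.25, stride=2)   # voxel edge 0.5
# two output locations remain: one per occupied voxel
```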
This step executes the neighbor finding process detailed in Neighborhood Computation and Representation by invoking the
The search radius is determined by the formula
- Inscribed Sphere: $radius_scaler = \frac{1}{2} \times receptive_field$. The search ball is the largest sphere that fits inside the ${receptive_field}^3$ cube.
- Equal Volume (Default): $radius_scaler = \sqrt[3]{\frac{3}{4 \pi}} \times receptive_field$. The search ball has the same volume as the ${receptive_field}^3$ cube.
- Circumscribed Sphere: $radius_scaler = \frac{\sqrt{3}}{2} \times receptive_field$. The search ball is the smallest sphere that encloses the ${receptive_field}^3$ cube.
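Evaluating the three options for a unit receptive field confirms their ordering and the equal-volume property:

```python
import math

# The three radius_scaler options, for receptive_field = 1:
receptive_field = 1.0
inscribed     = 0.5 * receptive_field                                  # 0.5
equal_volume  = (3.0 / (4.0 * math.pi)) ** (1.0 / 3.0) * receptive_field  # ~0.62
circumscribed = math.sqrt(3.0) / 2.0 * receptive_field                 # ~0.866

# Equal-volume check: a ball with this radius matches the cube's volume.
ball_volume = 4.0 / 3.0 * math.pi * equal_volume ** 3
# ball_volume equals receptive_field ** 3 up to floating-point error
```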
To perform convolution over the irregular structure captured by the neighborhood (i, j) pairs, each pair is further assigned a kernel index k, forming (i, j, k) triplets:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 1 2 3 4 5 6 7 8 9 10 11 12
| | | | | | | | | | | | |
|-. |-. | |-.-. |-.-. |-.-. |-. |-. | |--.--.--. |--. |--. |--.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | |
v v v v | v v v v v v v v v v v v v | v v v v v v v v v v
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | |
v v v v v v v v v v v v v v v v v v v v v v v v v v v v v
Kernel (k): 4 5 5 4 4 4 6 8 2 4 5 1 3 4 4 7 1 4 4 4 2 6 8 6 4 2 4 0 4
The connections from the diagram above, flattened into three parallel arrays:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 0 1 1 2 3 3 3 4 4 4 5 5 5 6 6 7 7 8 9 9 9 9 10 10 11 11 12 12
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
Kernel (k): 4 5 5 4 4 4 6 8 2 4 5 1 3 4 4 7 1 4 4 4 2 6 8 6 4 2 4 0 4
+---------------------------------------------------------------------------------+
| THE UNIFIED LIST: (i, j, k) triplets |
| Status: EFFICIENT. A compact stream of all computation edges in the batch. |
+---------------------------------------------------------------------------------+
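A NumPy reference of the computation these triplets drive can be sketched as follows (illustrative only; the real MVMR kernel is a blocked scatter-add with a custom VJP). For each triplet, the kernel-k weight matrix transforms the neighbor feature x[j], and the product accumulates into output row i:

```python
import numpy as np

def mvmr_reference(x, W, i_idx, j_idx, k_idx, n_out):
    """Dense reference of MVMR: out[i] += W[k] @ x[j] for each (i, j, k) triplet.

    x: (N_in, C_in) input features, W: (K, C_out, C_in) kernel weights.
    """
    out = np.zeros((n_out, W.shape[1]))
    # Per-edge matrix-vector products: W[k_idx] is (E, C_out, C_in), x[j_idx] is (E, C_in).
    products = np.einsum('eoc,ec->eo', W[k_idx], x[j_idx])
    np.add.at(out, i_idx, products)    # scatter-add by query index i
    return out

rng = np.random.default_rng(0)
x = rng.random((4, 3))            # 4 input points, 3 channels
W = rng.random((2, 5, 3))         # 2 kernel cells, 3 -> 5 channels
i_idx = np.array([0, 0, 1])       # two edges land on output point 0
j_idx = np.array([0, 1, 2])
k_idx = np.array([0, 1, 0])
out = mvmr_reference(x, W, i_idx, j_idx, k_idx, n_out=2)   # shape (2, 5)
```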
More specifically, to determine the kernel index
It is crucial to distinguish between
As detailed in the research paper of PointCNN++, the actual heavy lifting of the convolution arithmetic occurs in this final stage. To encapsulate this complexity, we provide a high-level
It is worth emphasizing that in this framework, feature tensors are first-class citizens. They are the primary carriers of the learned signal and the subject of all gradient backpropagation. The spatial coordinates, having served their purpose in generating the neighbor lists and triplets, are treated simply as "metadata" that guides the data-weight flow, rather than being part of the arithmetic computation itself.
Pointelligence is the repo for the official implementation of:
- PointCNN++: Performant Convolution on Native Points
Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li

@misc{li2025pointcnnperformantconvolutionnative,
  title={PointCNN++: Performant Convolution on Native Points},
  author={Lihan Li and Haofeng Zhong and Rui Bu and Mingchao Sun and Wenzheng Chen and Baoquan Chen and Yangyan Li},
  year={2025},
  eprint={2511.23227},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.23227},
}
To ensure they are tracked effectively, please submit feature requests and issue reports here rather than via email.
MIT -- see LICENSE for details.
MLX port by AIFLOW LABS / RobotFlow Labs
Pioneering Apple Silicon ML infrastructure for robotics and spatial intelligence