Pointelligence is a repository for 3D point cloud research, featuring:
- The official implementation of PointCNN++ (CVPR 2026) -- a significant next evolution of PointCNN (NeurIPS 2018).
- Pointelligence-MLX -- the first complete port of PointCNN++ to Apple MLX, enabling native 3D point cloud inference and training on Apple Silicon with zero CUDA dependency.
MLX port built by AIFLOW LABS / RobotFlow Labs -- pioneering Apple Silicon ML infrastructure for robotics and spatial intelligence.
| | PyTorch + CUDA/Triton (Original) | MLX (Apple Silicon Port) |
|---|---|---|
| Hardware | NVIDIA GPU + CUDA toolkit | Any Mac with Apple Silicon (M1-M4) |
| Kernel backend | Triton kernels (Linux only) + CUDA C++ | Pure Python + MLX Metal backend |
| Build step | torch.utils.cpp_extension compilation | Zero compilation -- just pip install |
| Memory model | GPU memory transfers | Unified memory -- zero-copy CPU/GPU |
| Tests | PyTorch test suite | 344 tests, all passing |
| Training | Full pipeline | Full pipeline (verified loss convergence) |
| Location | Root directory | mlx_port/ directory |
Clone the repository with third-party submodules (FCGF and Pointcept) recursively:
git clone --recursive https://github.com/ant-research/pointelligence.git
cd pointelligence

For reproducibility, check out the following commits in the submodules:
# FCGF (examples/FCGF)
cd examples/FCGF && git checkout pointcnnpp-version && cd ../..
# Pointcept (examples/Pointcept)
cd examples/Pointcept && git checkout pointcnnpp-version && cd ../..

If you have already cloned without --recursive, run git submodule update --init --recursive to fetch the submodules.
Some operators are implemented in C++/CUDA as PyTorch extensions, which can be built and installed with the following commands:
cd extensions
pip install --no-build-isolation -e .

The same steps apply to the extension modules in examples/FCGF and in examples/Pointcept.
A complete port of PointCNN++ from PyTorch + Triton + CUDA to Apple MLX, enabling native 3D point cloud inference and training on M1/M2/M3/M4 Macs.
This is not a toy demo. It's a complete, tested, trainable port of a production research framework with custom sparse convolution kernels.
cd mlx_port
# Setup with uv (recommended)
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"
# Verify installation
python -c "import pointelligence_mlx; print(f'v{pointelligence_mlx.__version__}')"
# Output: v0.1.0
# Run full test suite
pytest tests/ -v
# Output: 344 passed

Alternative: pip install
cd mlx_port
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

| Kernel | Original | MLX Port | Purpose |
|---|---|---|---|
| MVMR | Triton + CUDA (warp shuffle, atomicAdd) | Blocked scatter-add with @mx.custom_function VJP | Core conv forward -- sparse matrix-vector multiply-reduce |
| VVOR | Triton + CUDA (warp shuffle, atomicAdd) | Blocked outer product + scatter-add with VJP | Core conv backward -- weight gradient computation |
| Indexed Distance | Triton (11 autotune configs) | Vectorized MLX gather + reduce | L2/Chebyshev distance for neighbor filtering |
| Indexed Segment Reduce | Triton (4 autotune configs) | Scatter-add/max/min with @mx.custom_function | Per-segment sum/mean/max/min with gather indirection |
| Large Segment Reduce | Triton (dual-path fast/slow) | Segment-ID scatter with @mx.custom_function | Contiguous segment reduction for norms and pooling |
Every kernel has a custom backward pass (@mx.custom_function + .vjp) enabling full gradient flow for training.
| Layer | What It Does |
|---|---|
| PointConv3d | Sparse 3D convolution on native point clouds via MVMR |
| RaggedLayerNorm / InstanceNorm / BatchNorm / GroupNorm | Normalization over variable-length batches |
| GlobalPool | Per-sample mean/max/min aggregation |
| max_pool3d | 3D max pooling with stride |
| voxelize_3d + build_triplets | Spatial indexing: radius search, grid hashing, kernel assignment |
| Downsample / Upsample | Resolution changes via grid sampling |
| MultiSequential | Sequential container for (features, metadata) tuples |
| MetaData | Ragged batch container with lazy triplet building |
| Component | What It Does |
|---|---|
| Grid Lookup | O(1) spatial hashing for 3D points |
| Radius Search | Grid-accelerated neighbor finding |
| Grid Sample | Voxel-based point cloud downsampling (4 modes) |
| Ragged Indexing | repeat_interleave, arrange_indices, cumsum_exclusive |
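As a rough illustration of the ragged indexing helpers, here is a NumPy sketch of cumsum_exclusive and repeat_interleave. The names come from the table above; the implementations are our own sketch of the semantics, not the library's code:

```python
import numpy as np

def cumsum_exclusive(sizes):
    """Exclusive prefix sum: the start offset of each ragged segment."""
    out = np.zeros(len(sizes), dtype=np.int64)
    np.cumsum(sizes[:-1], out=out[1:])
    return out

def repeat_interleave(values, repeats):
    """Repeat each value repeats[i] times, like np.repeat."""
    return np.repeat(values, repeats)

sizes = np.array([2, 7, 4])
offsets = cumsum_exclusive(sizes)                  # [0, 2, 9]
sample_ids = repeat_interleave(np.arange(3), sizes)
# sample_ids maps every packed point back to its sample:
# [0 0 1 1 1 1 1 1 1 2 2 2 2]
```

Together these two primitives convert between a per-sample view (sizes) and a flat packed view (offsets, sample ids), which is the backbone of ragged-batch processing.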
The key innovation is replacing CUDA/Triton's atomicAdd scatter pattern with MLX's functional arr.at[indices].add(values):
# CUDA/Triton: atomicAdd to shared output
tl.atomic_add(output_ptr + o_idx, accumulated_value)
# MLX: functional scatter-add (handles duplicate indices correctly)
output = output.at[o_idx].add(products)

Custom backward passes use @mx.custom_function with .vjp:
- MVMR backward: grad_weights via VVOR, grad_inputs via transposed MVMR
- VVOR backward: both gradients via MVMR with transposed arguments
- Late imports resolve the MVMR-VVOR circular dependency
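Why a dedicated scatter-add matters is easy to reproduce in NumPy, whose np.add.at mirrors the duplicate-index semantics of MLX's arr.at[idx].add(...):

```python
import numpy as np

# Plain fancy-index assignment silently drops duplicate contributions...
out = np.zeros(3)
idx = np.array([0, 0, 2])
vals = np.array([1.0, 2.0, 3.0])
out[idx] = out[idx] + vals      # index 0 keeps only the LAST write
# out -> [2., 0., 3.]  (the 1.0 was lost)

# ...whereas an unbuffered scatter-add accumulates all of them,
# matching arr.at[idx].add(vals) in MLX:
out2 = np.zeros(3)
np.add.at(out2, idx, vals)
# out2 -> [3., 0., 3.]
```

This is exactly the hazard that atomicAdd solves on CUDA and that the functional .at[].add() API solves in MLX.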
See mlx_port/prds/ for full technical specifications of each ported component (PRD-01 through PRD-12).
344 tests across 12 test modules, validated from a fresh install:
tests/test_smoke.py 34 tests Package imports, MLX env, stub verification
tests/test_harness.py 37 tests Tensor conversion, check_all_close, gradient check
tests/test_ragged_tensors.py 59 tests Cumsum, repeat_interleave, arrange_indices, MetaData
tests/test_indexed_distance.py 23 tests L2/Chebyshev distance, cross-framework vs PyTorch
tests/test_indexed_segment_reduce.py 36 tests SUM/MEAN/MAX/MIN forward + backward, C in {1..128}
tests/test_large_segment_reduce.py 26 tests 5 segment configs, float16, backward gradients
tests/test_mvmr.py 24 tests Forward correctness, backward VJP, production scale
tests/test_vvor.py 20 tests Forward correctness, backward VJP, MVMR integration
tests/test_grid_neighbors.py 29 tests Grid lookup, radius search vs scipy, grid sample
tests/test_core_layers.py 25 tests PointConv3d, GlobalPool, triplets, norms, downsample
tests/test_resnet.py 22 tests All 10 model variants, forward pass, BasicBlock/Bottleneck
tests/test_training.py 9 tests Optimizer creation, train step, loss convergence
─────────────────────────────────────────────────────────────────────────
TOTAL 344 tests All passing (verified from fresh /tmp install)
Tests compare MLX outputs against PyTorch references with operation-specific tolerances:
| Operation | Tolerance | Why |
|---|---|---|
| Distance (L2) | atol=1e-5 | Simple math, near-exact |
| Segment reduce (sum/mean) | atol=1e-4 | Accumulation order may differ |
| Segment reduce (max/min) | atol=1e-6 | Exact comparison |
| MVMR/VVOR | atol=1e-2 | Atomic scatter-add ordering |
| ResNet forward | atol=1e-3 | Deep composition |
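A comparison helper in this spirit might look like the following sketch. The tolerance values mirror the table above, but check_close and the TOLERANCES dict are illustrative names, not the library's actual harness API:

```python
import numpy as np

# Per-operation absolute tolerances, taken from the table above.
TOLERANCES = {
    "indexed_distance": 1e-5,
    "segment_reduce_sum": 1e-4,
    "segment_reduce_max": 1e-6,
    "mvmr": 1e-2,
    "resnet_forward": 1e-3,
}

def check_close(result, reference, op):
    """Assert MLX output matches the PyTorch reference within op's tolerance."""
    atol = TOLERANCES[op]
    if not np.allclose(result, reference, atol=atol):
        worst = np.max(np.abs(result - reference))
        raise AssertionError(f"{op}: max abs diff {worst:.2e} exceeds atol {atol:.0e}")

check_close(np.array([1.00001]), np.array([1.0]), "segment_reduce_sum")  # within 1e-4
```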
# Run with verbose output
cd mlx_port && pytest tests/ -v --tb=short
# Run specific kernel tests
pytest tests/test_mvmr.py -v
# Run with PyTorch cross-framework comparison
pytest tests/ -v -m "requires_torch"
# Benchmarks
pytest tests/ -v -m "benchmark"

10 ResNet variants for 3D point cloud classification:
| Model | Block | Layers | Use Case |
|---|---|---|---|
| resnet18 | BasicBlock | [2,2,2,2] | Fast prototyping, small datasets |
| resnet34 | BasicBlock | [3,4,6,3] | Balanced speed/accuracy |
| resnet50 | Bottleneck | [3,4,6,3] | Standard benchmark model |
| resnet101 | Bottleneck | [3,4,23,3] | High accuracy |
| resnet152 | Bottleneck | [3,8,36,3] | Maximum accuracy |
| resnext50_32x4d | Bottleneck | [3,4,6,3] | Grouped convolutions |
| resnext101_32x8d | Bottleneck | [3,4,23,3] | Large grouped model |
| resnext101_64x4d | Bottleneck | [3,4,23,3] | Wide grouped model |
| wide_resnet50_2 | Bottleneck | [3,4,6,3] | Double-width channels |
| wide_resnet101_2 | Bottleneck | [3,4,23,3] | Maximum capacity |
from pointelligence_mlx.models.resnet import resnet18, resnet50
# Classification on 40 classes (e.g., ModelNet40)
model = resnet18(num_classes=40, in_channels=1)
# With more features
model = resnet50(num_classes=40, in_channels=6)  # XYZ + normals

End-to-end training verified -- loss converges from ~1.8 to ~0.0 on synthetic data:
import mlx.core as mx
from pointelligence_mlx.models.resnet import resnet18
from pointelligence_mlx.training.trainer import create_optimizer, create_loss_and_grad_fn, train_step
from pointelligence_mlx.training.data import generate_synthetic_batch
# Create model and optimizer
model = resnet18(num_classes=10, in_channels=1)
optimizer = create_optimizer(lr=0.01, opt_type="sgd")
loss_and_grad_fn = create_loss_and_grad_fn(optimizer)
# Training loop
for epoch in range(50):
    x, points, sizes, grid_size, labels = generate_synthetic_batch(
        batch_size=4, points_per_sample=100, num_classes=10
    )
    loss = train_step(model, x, points, sizes, grid_size, labels, loss_and_grad_fn)
    mx.eval(loss)
    print(f"Epoch {epoch:3d}: loss={loss.item():.4f}")

create_optimizer supports SGD, Adam, and AdamW:

create_optimizer(lr=0.1, momentum=0.9, weight_decay=1e-4, opt_type="sgd")
create_optimizer(lr=0.001, opt_type="adam")
create_optimizer(lr=0.001, weight_decay=0.01, opt_type="adamw")

mlx_port/
  src/pointelligence_mlx/
    sparse_engines/               # 5 custom MLX kernels (Triton/CUDA replacements)
      mvmr.py                     # Sparse matrix-vector multiply-reduce + VJP
      vvor.py                     # Sparse vector-vector outer product reduce + VJP
      indexed_distance.py         # L2/L-inf pairwise distance via index pairs
      indexed_segment_reduce.py   # Per-segment reduction with gather + VJP
      large_segment_reduce.py     # Contiguous segment reduction + VJP
      metal/                      # Reserved for Metal shader sources
    internals/                    # Spatial data structures & ragged tensor utilities
      constants.py                # Cached tensor allocations (shifts, arange buffer)
      indexing.py                 # repeat_interleave, arrange_indices, cumsum
      grid_lookup.py              # Spatial hashing, LookupStruct, grid queries
      grid_sample.py              # Voxel-based point cloud downsampling
      neighbors.py                # Radius search, neighbor clipping, nearest neighbor
    layers/                       # Neural network layers (mlx.nn.Module)
      conv.py                     # PointConv3d, GeneralConv, conv_with_stride
      norm.py                     # RaggedBatch/Instance/Layer/GroupNorm
      pooling.py                  # GlobalPool, max_pool3d
      triplets.py                 # voxelize_3d, build_triplets, handle_stride
      downsample.py               # Grid-sample downsampling
      upsample.py                 # Learnable upsampling with cached triplets
      metadata.py                 # MetaData ragged batch container
      multi_sequential.py         # Multi-input sequential container
    models/                       # Complete classification models
      resnet.py                   # 10 ResNet/ResNeXt/WideResNet variants
    training/                     # Training pipeline
      trainer.py                  # train_step, create_optimizer, loss+grad
      data.py                     # Synthetic 3D shape generation
  tests/                          # 344 tests across 12 modules
  prds/                           # PRD-01 through PRD-12: technical specifications
| Requirement | Version |
|---|---|
| macOS | Apple Silicon (M1/M2/M3/M4) |
| Python | >= 3.10 |
| MLX | >= 0.31.0 |
| NumPy | >= 1.24.0 |
| SciPy | >= 1.10.0 |
Development extras (for testing):
uv pip install -e ".[dev]"  # Adds pytest, pytest-benchmark, torch

While we provide a suite of ready-to-use backbones, our framework is explicitly designed to facilitate the construction of custom network architectures from scratch. We review several key concepts below; combined with the reference implementations in the models directory, these resources are intended to help users quickly master the library's workflow.
The total number of points often varies significantly from one sample to another within a single batch. As illustrated below, the straightforward approach deals with this irregularity (e.g., samples having 2, 7, and 4 points) by forcing data into fixed-size dense tensors. While this satisfies the rigid structural requirements of standard frameworks, it overlooks the data's inherent sparsity. Smaller samples must be padded out with non-existent "ghost" data, squandering significant memory and compute cycles on empty space.
Scenario: A batch of 3 irregular samples with 2, 7, and 4 points respectively.
Legend:
[P] = Valid Point/Feature Data
[.] = "Ghost Data" (Padding/Wasted Memory)
+-----------------------------------------------------------+
| THE STRAIGHTFORWARD APPROACH: Fixed-Size Dense Tensor |
| Status: WASTEFUL. Forces data to match the largest dim. |
+-----------------------------------------------------------+
To create a uniform grid, every sample must be padded to match
the largest necessary dimension (at least 7).
Batch Memory Layout (Fixed grid):
Row 0 (Sample 1): [P][P][.][.][.][.][.]
Row 1 (Sample 2): [P][P][P][P][P][P][P]
Row 2 (Sample 3): [P][P][P][P][.][.][.]
^Valid^ ^Wasted^
VISUAL RESULT: Significant portions of memory are useless padding.
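The waste is easy to quantify for this toy batch:

```python
# Dense padding overhead for the batch above (sizes 2, 7, 4):
sizes = [2, 7, 4]
dense_slots = len(sizes) * max(sizes)   # 3 rows x 7 columns = 21 slots
valid = sum(sizes)                      # 13 real points
wasted = dense_slots - valid            # 8 ghost slots
print(f"{wasted / dense_slots:.0%} of the dense tensor is padding")
```

Even in this tiny example over a third of the tensor is ghost data, and the ratio worsens as the size spread between samples grows.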
Ragged tensors represent a dedicated solution to the inefficiency shown above. As visualized below, this format is explicitly designed to handle irregularity. Instead of maintaining separate, padded rows in a grid, the Ragged tensor stores the entire batch as a single, contiguous sequence containing only valid points. This packed data is managed by a lightweight auxiliary metadata structure that tracks the individual sample sizes. By eliminating padding entirely, this approach ensures processing applies only to actual geometric and feature data.
+-----------------------------------------------------------+
| THE DEDICATED SOLUTION: Ragged Tensor |
| Status: EFFICIENT. Stores only what exists. |
+-----------------------------------------------------------+
[A. Contiguous Data Storage]
Padding is eliminated. The entire batch is flattened into one
tightly packed sequence of valid data only:
Memory Layout:
[P][P] [P][P][P][P][P][P][P] [P][P][P][P]
\___/ \___________________/ \_________/
S1(2) S2(7) S3(4)
VISUAL RESULT: Zero wasted space.
[B. Auxiliary Metadata]
A small, separate structure tracks the individual sample sizes:
Sample Sizes: [ 2, 7, 4 ]
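In code, the packed-data-plus-sizes layout might look like this minimal NumPy sketch (an illustration of the concept, not the library's MetaData class):

```python
import numpy as np

# One packed array of valid points plus a sizes array; per-sample
# views are recovered from the prefix-sum offsets.
data = np.arange(13)                                 # 2 + 7 + 4 = 13 packed points
sizes = np.array([2, 7, 4])
offsets = np.concatenate([[0], np.cumsum(sizes)])    # [0, 2, 9, 13]

samples = [data[offsets[i]:offsets[i + 1]] for i in range(len(sizes))]
# samples[0] -> [0 1], samples[1] -> [2..8], samples[2] -> [9..12]
```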
We employ a voxel-based strategy using the
Because the
The
Unlike in image processing where target pixel locations are fixed, the spatial locations of upsampled points in a point cloud are not inherently known. Therefore, a recommended practice is to explicitly reuse the retained set of original, pre-downsampled points as the upsampling target. This approach is both efficient and unambiguous. Note that the retention and retrieval of these original points must be managed by the enclosing pipeline or calling procedure.
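A minimal sketch of that pattern, with hypothetical names (the enclosing pipeline caches each level's points so a later upsample can target them; the downsampling step here is a stand-in, not the library's grid sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
points_l0 = rng.random((100, 3))     # original-resolution points

# The pipeline retains each resolution level's points...
cache = [points_l0]
keep = np.arange(0, 100, 4)          # stand-in for a real grid-downsample step
points_l1 = points_l0[keep]
cache.append(points_l1)

# ...so that upsampling from level 1 targets the retained level-0 points:
upsample_targets = cache[-2]
```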
We opt for a fixed radius search over a fixed-number (K-Nearest Neighbors, or KNN) search, as its spatially-local receptive field is better suited for spatial learning, whereas KNN is often a choice imposed by architectural limitations. Given sets of "source" and "query" points—which may be identical—and a specified radius, the
As illustrated in the vertical flow diagram below, the connectivity of a batched point cloud is most efficiently represented as a unified list of (i, j) pairs:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 1 2 3 4 5 6 7 8 9 10 11 12
| | | | | | | | | | | | |
|-. |-. | |-.-. |-.-. |-.-. |-. |-. | |--.--.--. |--. |--. |--.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | |
v v v v | v v v v v v v v v v v v v | v v v v v v v v v v
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
The connections from the diagram above, flattened into two parallel arrays:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 0 1 1 2 3 3 3 4 4 4 5 5 5 6 6 7 7 8 9 9 9 9 10 10 11 11 12 12
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
+---------------------------------------------------------------------------------+
| THE UNIFIED LIST: (i, j) pairs |
| Status: EFFICIENT. A compact stream of all neighborhood edges in the batch. |
+---------------------------------------------------------------------------------+
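For reference, a brute-force O(N^2) construction of these pair arrays can be sketched in NumPy. This only illustrates the output format; the library's grid-accelerated radius search is far more efficient:

```python
import numpy as np

def radius_pairs(points, sizes, radius):
    """Per-sample all-pairs radius search, flattened into (i, j) arrays."""
    qi, nj = [], []
    offset = 0
    for n in sizes:
        p = points[offset:offset + n]
        # Full pairwise distance matrix within this sample (O(n^2)).
        d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
        i, j = np.nonzero(d <= radius)
        qi.append(i + offset)        # global query indices
        nj.append(j + offset)        # global neighbor indices
        offset += n
    return np.concatenate(qi), np.concatenate(nj)

pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0]])
qi, nj = radius_pairs(pts, [3], 0.5)
# pairs: (0,0) (0,1) (1,0) (1,1) (2,2) -- each point neighbors itself
```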
The full process of convolution on native points involves four steps: output-location generation, neighborhood search, convolution-triplet construction, and Matrix-Vector Multiplication and Reduction (MVMR). The actual convolution arithmetic occurs in the final MVMR stage; the preceding three steps structure the input data for that calculation.
This step defines the spatial centers for the convolution operations. Depending on the desired architectural effect, the output locations are generated in one of three ways:
- Standard Convolution ($stride = 1$): The input points serve directly as the output locations.
- Strided Convolution ($stride > 1$): Output locations are generated by downsampling the input points via $grid_sample_filter$ with a target grid size of $grid_size_{input} \times stride$ (see Downsampling). Note that the $grid_size$ of the output point cloud is $grid_size_{input} \times stride$.
- Transposed Convolution (Upconvolution): The output locations are explicitly set to the pre-calculated "upsampled" points, as described in the Upsampling section.
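For the strided case, a minimal NumPy sketch of voxel-grid downsampling is shown below. The library's grid_sample_filter supports several selection modes; this sketch simply keeps the first point encountered in each occupied voxel:

```python
import numpy as np

def grid_downsample(points, grid_size, stride=2):
    """Snap points to a coarser grid (grid_size * stride) and keep one per voxel."""
    voxels = np.floor(points / (grid_size * stride)).astype(np.int64)
    # return_index gives the first occurrence of each unique voxel row.
    _, keep = np.unique(voxels, axis=0, return_index=True)
    return points[np.sort(keep)]

pts = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0],   # same voxel as the first point
                [1.0, 0.0, 0.0]])
out = grid_downsample(pts, grid_size=0.25, stride=2)   # voxel edge 0.5
# two output locations remain: one per occupied voxel
```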
This step executes the neighbor finding process detailed in Neighborhood Computation and Representation by invoking the
The search radius is determined by the formula
- Inscribed Sphere: $radius_scaler = \frac{1}{2} \times receptive_field$. The search ball is the largest sphere that fits inside the ${receptive_field}^3$ cube.
- Equal Volume (Default): $radius_scaler = \sqrt[3]{\frac{3}{4 \pi}} \times receptive_field$. The search ball has the same volume as the ${receptive_field}^3$ cube.
- Circumscribed Sphere: $radius_scaler = \frac{\sqrt{3}}{2} \times receptive_field$. The search ball is the smallest sphere that encloses the ${receptive_field}^3$ cube.
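Evaluating the three options for a unit receptive field confirms their ordering and the equal-volume property:

```python
import math

# The three radius_scaler options, for receptive_field = 1:
receptive_field = 1.0
inscribed     = 0.5 * receptive_field                                  # 0.5
equal_volume  = (3.0 / (4.0 * math.pi)) ** (1.0 / 3.0) * receptive_field  # ~0.62
circumscribed = math.sqrt(3.0) / 2.0 * receptive_field                 # ~0.866

# Equal-volume check: a ball with this radius matches the cube's volume.
ball_volume = 4.0 / 3.0 * math.pi * equal_volume ** 3
# ball_volume equals receptive_field ** 3 up to floating-point error
```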
To perform convolution over the irregular structure captured by the neighborhood (i, j) pairs, each pair is further assigned a kernel index k, forming (i, j, k) triplets:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 1 2 3 4 5 6 7 8 9 10 11 12
| | | | | | | | | | | | |
|-. |-. | |-.-. |-.-. |-.-. |-. |-. | |--.--.--. |--. |--. |--.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | |
v v v v | v v v v v v v v v v v v v | v v v v v v v v v v
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | |
v v v v v v v v v v v v v v v v v v v v v v v v v v v v v
Kernel (k): 4 5 5 4 4 4 6 8 2 4 5 1 3 4 4 7 1 4 4 4 2 6 8 6 4 2 4 0 4
The connections from the diagram above, flattened into three parallel arrays:
[-SAMPLE 1-] [-------------SAMPLE 2--------------] [-----------SAMPLE 3-----------]
Query (i): 0 0 1 1 2 3 3 3 4 4 4 5 5 5 6 6 7 7 8 9 9 9 9 10 10 11 11 12 12
Neighbor (j): 0 1 0 1 2 3 4 5 3 4 5 3 4 5 6 7 6 7 8 9 10 11 12 9 10 9 11 9 12
Kernel (k): 4 5 5 4 4 4 6 8 2 4 5 1 3 4 4 7 1 4 4 4 2 6 8 6 4 2 4 0 4
+---------------------------------------------------------------------------------+
| THE UNIFIED LIST: (i, j, k) triplets |
| Status: EFFICIENT. A compact stream of all computation edges in the batch. |
+---------------------------------------------------------------------------------+
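A NumPy reference of the computation these triplets drive can be sketched as follows (illustrative only; the real MVMR kernel is a blocked scatter-add with a custom VJP). For each triplet, the kernel-k weight matrix transforms the neighbor feature x[j], and the product accumulates into output row i:

```python
import numpy as np

def mvmr_reference(x, W, i_idx, j_idx, k_idx, n_out):
    """Dense reference of MVMR: out[i] += W[k] @ x[j] for each (i, j, k) triplet.

    x: (N_in, C_in) input features, W: (K, C_out, C_in) kernel weights.
    """
    out = np.zeros((n_out, W.shape[1]))
    # Per-edge matrix-vector products: W[k_idx] is (E, C_out, C_in), x[j_idx] is (E, C_in).
    products = np.einsum('eoc,ec->eo', W[k_idx], x[j_idx])
    np.add.at(out, i_idx, products)    # scatter-add by query index i
    return out

rng = np.random.default_rng(0)
x = rng.random((4, 3))            # 4 input points, 3 channels
W = rng.random((2, 5, 3))         # 2 kernel cells, 3 -> 5 channels
i_idx = np.array([0, 0, 1])       # two edges land on output point 0
j_idx = np.array([0, 1, 2])
k_idx = np.array([0, 1, 0])
out = mvmr_reference(x, W, i_idx, j_idx, k_idx, n_out=2)   # shape (2, 5)
```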
More specifically, to determine the kernel index
It is crucial to distinguish between
As detailed in the research paper of PointCNN++, the actual heavy lifting of the convolution arithmetic occurs in this final stage. To encapsulate this complexity, we provide a high-level
It is worth emphasizing that in this framework, feature tensors are first-class citizens. They are the primary carriers of the learned signal and the subject of all gradient backpropagation. The spatial coordinates, having served their purpose in generating the neighbor lists and triplets, are treated simply as "metadata" that guides the data-weight flow, rather than being part of the arithmetic computation itself.
Pointelligence is the repo for the official implementation of:
- PointCNN++: Performant Convolution on Native Points
Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li

@misc{li2025pointcnnperformantconvolutionnative,
  title={PointCNN++: Performant Convolution on Native Points},
  author={Lihan Li and Haofeng Zhong and Rui Bu and Mingchao Sun and Wenzheng Chen and Baoquan Chen and Yangyan Li},
  year={2025},
  eprint={2511.23227},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.23227},
}
To ensure they are tracked effectively, please submit feature requests and issue reports here rather than via email.
MIT -- see LICENSE for details.
MLX port by AIFLOW LABS / RobotFlow Labs
Pioneering Apple Silicon ML infrastructure for robotics and spatial intelligence