
Conversation


imperatormk commented Feb 1, 2026

Summary

Adds Metal GPU kernels for deformable convolution 2D (DCNv2) on Apple Silicon, plus several performance optimizations for the Metal backend.

Deformable Conv2D Kernels

  • deformable_im2col: forward pass with bilinear sampling at learned offsets
  • deformable_col2im: backward pass for input gradients
  • deformable_col2im_coord: backward pass for offset and mask gradients
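The core of `deformable_im2col` is bilinear sampling at fractional, learned offsets. As a rough CPU-side illustration (a hedged sketch only; `bilinear_sample` is a hypothetical name, not a function from this PR, and the Metal kernel is structured differently):

```rust
// Illustrative CPU reference for the bilinear sampling that
// deformable_im2col performs at each learned offset. Hypothetical
// helper, not the PR's Metal kernel.
fn bilinear_sample(input: &[f32], h: usize, w: usize, y: f32, x: f32) -> f32 {
    // Samples entirely outside the image contribute zero.
    if y <= -1.0 || y >= h as f32 || x <= -1.0 || x >= w as f32 {
        return 0.0;
    }
    let (y0, x0) = (y.floor(), x.floor());
    let (ly, lx) = (y - y0, x - x0);
    // Zero-padded read of one pixel.
    let get = |yy: i64, xx: i64| -> f32 {
        if yy < 0 || xx < 0 || yy >= h as i64 || xx >= w as i64 {
            0.0
        } else {
            input[yy as usize * w + xx as usize]
        }
    };
    let (y0i, x0i) = (y0 as i64, x0 as i64);
    // Weighted sum of the four neighboring pixels.
    (1.0 - ly) * (1.0 - lx) * get(y0i, x0i)
        + (1.0 - ly) * lx * get(y0i, x0i + 1)
        + ly * (1.0 - lx) * get(y0i + 1, x0i)
        + ly * lx * get(y0i + 1, x0i + 1)
}

fn main() {
    // 2x2 image [[0, 1], [2, 3]]: sampling at the center averages all four.
    let img = [0.0, 1.0, 2.0, 3.0];
    println!("{}", bilinear_sample(&img, 2, 2, 0.5, 0.5)); // 1.5
}
```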

Performance Optimizations

  • MPSGraph Conv2d: Uses Apple's native MPSGraph for regular conv2d (~7x faster for 3x3+ kernels)
  • Broadcast optimization: Fast path for the inner-dimension broadcast [B, N, C] + [C] (Linear layers ~2.7x faster)
  • Transpose optimization: Fast transpose_last2 kernel for the attention K^T pattern (~14x faster on contiguous tensors)
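The inner-dimension broadcast fast path is easy to see on the CPU: for a contiguous [B, N, C] tensor plus a [C] vector, the bias index of flat element i is simply i % C, so one linear pass suffices with no per-element stride arithmetic. A hedged sketch (`broadcast_add_inner` is a hypothetical reference, not the PR's Metal kernel):

```rust
// Sketch of the [B, N, C] + [C] broadcast fast path: because the
// broadcast dimension is innermost and the tensor is contiguous,
// the bias index is just i % C. Hypothetical CPU reference.
fn broadcast_add_inner(xs: &mut [f32], bias: &[f32]) {
    let c = bias.len();
    for (i, x) in xs.iter_mut().enumerate() {
        *x += bias[i % c];
    }
}

fn main() {
    // [B=1, N=2, C=3], flattened row-major.
    let mut xs = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0];
    broadcast_add_inner(&mut xs, &[1.0, 2.0, 3.0]);
    println!("{xs:?}"); // [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
}
```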

Features

  • Support for modulated deformable convolution (optional mask)
  • Configurable stride, padding, dilation, and offset groups
  • Float32 and Float16 variants
  • Atomic add for thread-safe gradient accumulation
  • MPSGraph auto-selected for larger kernels, falls back to im2col for 1x1
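The kernel-size dispatch described in the last bullet might look like the following sketch (hypothetical names and a standalone heuristic for illustration; the actual selection lives inside the Metal backend):

```rust
// Hypothetical routing mirroring the description: 1x1 convolutions
// take the im2col path (which degenerates to a plain matmul), while
// larger kernels go through MPSGraph. Not the PR's actual code.
#[derive(Debug, PartialEq)]
enum Conv2dPath {
    MpsGraph,
    Im2col,
}

fn pick_conv2d_path(kernel_h: usize, kernel_w: usize) -> Conv2dPath {
    if kernel_h == 1 && kernel_w == 1 {
        Conv2dPath::Im2col
    } else {
        Conv2dPath::MpsGraph
    }
}

fn main() {
    println!("{:?}", pick_conv2d_path(3, 3)); // MpsGraph
    println!("{:?}", pick_conv2d_path(1, 1)); // Im2col
}
```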

Benchmarks

| Operation | Before | After | Speedup |
|---|---|---|---|
| Conv2d 7x7 | 157ms | 21.6ms | 7.3x |
| Linear layer | 46ms | 17ms | 2.7x |
| Transpose last 2 dims | 20ms | 1.4ms | 14x |
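The last row measures swapping the final two dims of a contiguous [B, M, N] tensor. A naive CPU version, purely for reference (hypothetical; the PR's kernel does this on the GPU):

```rust
// Naive CPU reference for transpose_last2 on a contiguous [B, M, N]
// tensor: out[b][j][i] = xs[b][i][j]. Illustration only.
fn transpose_last2(xs: &[f32], b: usize, m: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0; xs.len()];
    for bi in 0..b {
        for i in 0..m {
            for j in 0..n {
                out[bi * m * n + j * m + i] = xs[bi * m * n + i * n + j];
            }
        }
    }
    out
}

fn main() {
    // [1, 2, 3] -> [1, 3, 2]: [[1,2,3],[4,5,6]] becomes [[1,4],[2,5],[3,6]].
    let xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    println!("{:?}", transpose_last2(&xs, 1, 2, 3)); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
}
```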

Use Cases

Enables models that rely on deformable convolutions to run on Apple Silicon:

  • DCNv2 (Deformable ConvNets v2)
  • Deformable DETR
  • BiRefNet (background removal)
  • Various object detection backbones

Origin

Deformable conv ported from mps-deform-conv (https://github.com/mpsops/mps-deform-conv), a standalone PyTorch MPS extension.

Test Plan

  • Added unit tests in candle-metal-kernels/src/tests.rs
  • Verified forward pass produces correct output shape
  • Tested with BiRefNet model inference
  • All 226 existing Metal tests pass

@imperatormk imperatormk changed the title Add Metal deformable convolution 2D kernels Metal: Add deformable conv2d and performance optimizations Feb 2, 2026
- Add graph caching to avoid recompilation on each call
- Use MPSGraphExecutable for pre-compiled graph execution
- Use MPSCommandBuffer for async GPU execution (non-blocking)
- Use direct MTLBuffer for zero-copy tensor data
- Add command_queue() getter to Commands struct

Benchmarks (M3 Max):
- 3x3 conv 256x256: 12.5ms -> 2.5ms (5x faster, matches PyTorch)
- 7x7 conv 256x256: 21ms -> 11ms (2x faster, matches PyTorch)
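The graph caching mentioned above amounts to keying pre-compiled executables by the convolution configuration, so repeated calls with identical shapes skip recompilation. A hedged sketch (`ConvKey` and `CompiledGraph` are illustrative stand-ins, not the PR's types; `CompiledGraph` plays the role of an MPSGraphExecutable handle):

```rust
use std::collections::HashMap;

// Cache key: everything that determines the compiled graph's shape.
#[derive(Clone, PartialEq, Eq, Hash)]
struct ConvKey {
    shape: (usize, usize, usize, usize), // (b, c, h, w)
    kernel: (usize, usize),
    stride: (usize, usize),
}

// Placeholder for a pre-compiled MPSGraphExecutable.
struct CompiledGraph;

struct GraphCache {
    graphs: HashMap<ConvKey, CompiledGraph>,
    compiles: usize, // counts actual compilations, for illustration
}

impl GraphCache {
    fn new() -> Self {
        Self { graphs: HashMap::new(), compiles: 0 }
    }

    fn get_or_compile(&mut self, key: ConvKey) -> &CompiledGraph {
        if !self.graphs.contains_key(&key) {
            self.compiles += 1; // compile once per unique configuration
        }
        self.graphs.entry(key).or_insert_with(|| CompiledGraph)
    }
}

fn main() {
    let mut cache = GraphCache::new();
    let key = ConvKey { shape: (1, 3, 256, 256), kernel: (3, 3), stride: (1, 1) };
    cache.get_or_compile(key.clone());
    cache.get_or_compile(key); // cache hit: no recompilation
    println!("{}", cache.compiles); // 1
}
```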
Cast MTLBuffer to AnyObject pointer type for correct Objective-C message passing.

bghira commented Feb 9, 2026

the linked mps-deform-conv does not seem to exist.

imperatormk (Author)

> the linked mps-deform-conv does not seem to exist.

@bghira
Copy link

bghira commented Feb 9, 2026

I still think you're causing problems by vibe-coding so many extensions and then going around GitHub pushing for their inclusion before they're (a) reviewed and (b) vetted. You've done this on so many repositories so far that I'm really wondering: what is your motivation? Why are you doing this?

imperatormk (Author)

I partially see your point, but this is not the place for that discussion.
