
Conversation


imperatormk commented Feb 1, 2026

Summary

Adds Metal GPU kernels for deformable convolution 2D (DCNv2) on Apple Silicon, plus several performance optimizations for the Metal backend.

Deformable Conv2D Kernels

  • deformable_im2col: forward pass with bilinear sampling at learned offsets
  • deformable_col2im: backward pass for input gradients
  • deformable_col2im_coord: backward pass for offset and mask gradients
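The core of `deformable_im2col` is bilinear sampling at fractional, learned offsets. As a rough CPU-side illustration (a hedged sketch only; `bilinear_sample` is a hypothetical name, not a function from this PR, and the Metal kernel is structured differently):

```rust
// Illustrative CPU reference for the bilinear sampling that
// deformable_im2col performs at each learned offset. Hypothetical
// helper, not the PR's Metal kernel.
fn bilinear_sample(input: &[f32], h: usize, w: usize, y: f32, x: f32) -> f32 {
    // Samples entirely outside the image contribute zero.
    if y <= -1.0 || y >= h as f32 || x <= -1.0 || x >= w as f32 {
        return 0.0;
    }
    let (y0, x0) = (y.floor(), x.floor());
    let (ly, lx) = (y - y0, x - x0);
    // Zero-padded read of one pixel.
    let get = |yy: i64, xx: i64| -> f32 {
        if yy < 0 || xx < 0 || yy >= h as i64 || xx >= w as i64 {
            0.0
        } else {
            input[yy as usize * w + xx as usize]
        }
    };
    let (y0i, x0i) = (y0 as i64, x0 as i64);
    // Weighted sum of the four neighboring pixels.
    (1.0 - ly) * (1.0 - lx) * get(y0i, x0i)
        + (1.0 - ly) * lx * get(y0i, x0i + 1)
        + ly * (1.0 - lx) * get(y0i + 1, x0i)
        + ly * lx * get(y0i + 1, x0i + 1)
}

fn main() {
    // 2x2 image [[0, 1], [2, 3]]: sampling at the center averages all four.
    let img = [0.0, 1.0, 2.0, 3.0];
    println!("{}", bilinear_sample(&img, 2, 2, 0.5, 0.5)); // 1.5
}
```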

Performance Optimizations

  • MPSGraph Conv2d: Uses Apple's native MPSGraph for regular conv2d (~7x faster for 3x3+ kernels)
  • Broadcast optimization: Fast path for the inner-dimension broadcast [B, N, C] + [C] (Linear layers ~2.7x faster)
  • Transpose optimization: Fast transpose_last2 kernel for the attention K^T pattern (~14x faster on contiguous tensors)
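The inner-dimension broadcast fast path is easy to see on the CPU: for a contiguous [B, N, C] tensor plus a [C] vector, the bias index of flat element i is simply i % C, so one linear pass suffices with no per-element stride arithmetic. A hedged sketch (`broadcast_add_inner` is a hypothetical reference, not the PR's Metal kernel):

```rust
// Sketch of the [B, N, C] + [C] broadcast fast path: because the
// broadcast dimension is innermost and the tensor is contiguous,
// the bias index is just i % C. Hypothetical CPU reference.
fn broadcast_add_inner(xs: &mut [f32], bias: &[f32]) {
    let c = bias.len();
    for (i, x) in xs.iter_mut().enumerate() {
        *x += bias[i % c];
    }
}

fn main() {
    // [B=1, N=2, C=3], flattened row-major.
    let mut xs = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0];
    broadcast_add_inner(&mut xs, &[1.0, 2.0, 3.0]);
    println!("{xs:?}"); // [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
}
```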

Features

  • Support for modulated deformable convolution (optional mask)
  • Configurable stride, padding, dilation, and offset groups
  • Float32 and Float16 variants
  • Atomic add for thread-safe gradient accumulation
  • MPSGraph auto-selected for larger kernels, falls back to im2col for 1x1
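The kernel-size dispatch described in the last bullet might look like the following sketch (hypothetical names and a standalone heuristic for illustration; the actual selection lives inside the Metal backend):

```rust
// Hypothetical routing mirroring the description: 1x1 convolutions
// take the im2col path (which degenerates to a plain matmul), while
// larger kernels go through MPSGraph. Not the PR's actual code.
#[derive(Debug, PartialEq)]
enum Conv2dPath {
    MpsGraph,
    Im2col,
}

fn pick_conv2d_path(kernel_h: usize, kernel_w: usize) -> Conv2dPath {
    if kernel_h == 1 && kernel_w == 1 {
        Conv2dPath::Im2col
    } else {
        Conv2dPath::MpsGraph
    }
}

fn main() {
    println!("{:?}", pick_conv2d_path(3, 3)); // MpsGraph
    println!("{:?}", pick_conv2d_path(1, 1)); // Im2col
}
```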

Benchmarks

| Operation | Before | After | Speedup |
|---|---|---|---|
| Conv2d 7x7 | 157ms | 21.6ms | 7.3x |
| Linear layer | 46ms | 17ms | 2.7x |
| Transpose last 2 dims | 20ms | 1.4ms | 14x |
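The last row measures swapping the final two dims of a contiguous [B, M, N] tensor. A naive CPU version, purely for reference (hypothetical; the PR's kernel does this on the GPU):

```rust
// Naive CPU reference for transpose_last2 on a contiguous [B, M, N]
// tensor: out[b][j][i] = xs[b][i][j]. Illustration only.
fn transpose_last2(xs: &[f32], b: usize, m: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0; xs.len()];
    for bi in 0..b {
        for i in 0..m {
            for j in 0..n {
                out[bi * m * n + j * m + i] = xs[bi * m * n + i * n + j];
            }
        }
    }
    out
}

fn main() {
    // [1, 2, 3] -> [1, 3, 2]: [[1,2,3],[4,5,6]] becomes [[1,4],[2,5],[3,6]].
    let xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    println!("{:?}", transpose_last2(&xs, 1, 2, 3)); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
}
```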

Use Cases

Enables models that rely on deformable convolutions to run on Apple Silicon:

  • DCNv2 (Deformable ConvNets v2)
  • Deformable DETR
  • BiRefNet (background removal)
  • Various object detection backbones

Origin

Deformable conv ported from mps-deform-conv (https://github.com/mpsops/mps-deform-conv), a standalone PyTorch MPS extension.

Test Plan

  • Added unit tests in candle-metal-kernels/src/tests.rs
  • Verified forward pass produces correct output shape
  • Tested with BiRefNet model inference
  • All 226 existing Metal tests pass

@imperatormk imperatormk changed the title Add Metal deformable convolution 2D kernels Metal: Add deformable conv2d and performance optimizations Feb 2, 2026
- Add graph caching to avoid recompilation on each call
- Use MPSGraphExecutable for pre-compiled graph execution
- Use MPSCommandBuffer for async GPU execution (non-blocking)
- Use direct MTLBuffer for zero-copy tensor data
- Add command_queue() getter to Commands struct

Benchmarks (M3 Max):
- 3x3 conv 256x256: 12.5ms -> 2.5ms (5x faster, matches PyTorch)
- 7x7 conv 256x256: 21ms -> 11ms (2x faster, matches PyTorch)
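The graph caching mentioned above amounts to keying pre-compiled executables by the convolution configuration, so repeated calls with identical shapes skip recompilation. A hedged sketch (`ConvKey` and `CompiledGraph` are illustrative stand-ins, not the PR's types; `CompiledGraph` plays the role of an MPSGraphExecutable handle):

```rust
use std::collections::HashMap;

// Cache key: everything that determines the compiled graph's shape.
#[derive(Clone, PartialEq, Eq, Hash)]
struct ConvKey {
    shape: (usize, usize, usize, usize), // (b, c, h, w)
    kernel: (usize, usize),
    stride: (usize, usize),
}

// Placeholder for a pre-compiled MPSGraphExecutable.
struct CompiledGraph;

struct GraphCache {
    graphs: HashMap<ConvKey, CompiledGraph>,
    compiles: usize, // counts actual compilations, for illustration
}

impl GraphCache {
    fn new() -> Self {
        Self { graphs: HashMap::new(), compiles: 0 }
    }

    fn get_or_compile(&mut self, key: ConvKey) -> &CompiledGraph {
        if !self.graphs.contains_key(&key) {
            self.compiles += 1; // compile once per unique configuration
        }
        self.graphs.entry(key).or_insert_with(|| CompiledGraph)
    }
}

fn main() {
    let mut cache = GraphCache::new();
    let key = ConvKey { shape: (1, 3, 256, 256), kernel: (3, 3), stride: (1, 1) };
    cache.get_or_compile(key.clone());
    cache.get_or_compile(key); // cache hit: no recompilation
    println!("{}", cache.compiles); // 1
}
```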
Cast MTLBuffer to AnyObject pointer type for correct Objective-C message passing.

bghira commented Feb 9, 2026

the linked mps-deform-conv does not seem to exist.

imperatormk (Author)

> the linked mps-deform-conv does not seem to exist.

@bghira
Copy link

bghira commented Feb 9, 2026

I still think you're causing problems by vibe-coding so many extensions and then going around GitHub pushing for their inclusion before they're (a) reviewed and (b) vetted. You've done this on so many repositories so far that I'm really wondering: what is your motivation? Why are you doing this?

imperatormk (Author)

I partially see your point, but this is not the place for that discussion.
