nki-conv3d

The first NKI Conv3d kernel for AWS Trainium.

Why

Video generation models (Wan2.1/2.2, CogVideoX, HunyuanVideo) use 3D VAEs built on Conv3d / CausalConv3d. AWS Trainium has no native Conv3d support: the NKI ecosystem only provides Conv1d. This gap blocks the 3D VAEs of these models from running on Trainium.

This repo fills that gap.

| NKI Kernel | Exists Before This Repo |
|---|---|
| Conv1d | ✅ (nki-library) |
| Conv2d | ❌ |
| Conv3d | ❌ → ✅ this repo |

Algorithm

Conv3d is decomposed into temporal slices of Conv2d, each computed via im2col + GEMM:

output[:, :, d, :, :] = Σ_{kd} Conv2d(input[:, :, d*s+kd, :, :], weight[:, :, kd, :, :])

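This decomposition is exact (not an approximation). Here s is the temporal stride, kd indexes the temporal kernel taps, and input is indexed after temporal padding; with a temporal dilation t, the slice index becomes d*s + kd*t. The host-side wrapper builds im2col matrices and pads all dimensions to multiples of 128, then a tiled NKI matmul kernel runs the GEMM via nisa.nc_matmul.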
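The decomposition can be checked on a small case with plain NumPy. The sketch below is not the code in `conv3d.py`: the function names are hypothetical, `a @ b` stands in for the NKI matmul, and padding, dilation, groups, and bias are left out. It only illustrates that summing per-slice Conv2d results (each computed via im2col and a zero-padded GEMM) reproduces a direct Conv3d.

```python
import numpy as np

TILE = 128  # the README states all GEMM dimensions are padded to multiples of 128


def im2col_2d(x, kh, kw, stride=1):
    """Unfold x (C, H, W) into a (C*kh*kw, H_out*W_out) matrix of patches."""
    c, h, w = x.shape
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, h_out * w_out), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # All positions that kernel tap (i, j) touches in channel ci.
                patch = x[ci, i:i + stride * h_out:stride, j:j + stride * w_out:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, h_out, w_out


def pad_to_tile(m, tile=TILE):
    """Zero-pad a 2D matrix so both dimensions are multiples of `tile`."""
    return np.pad(m, ((0, -m.shape[0] % tile), (0, -m.shape[1] % tile)))


def conv2d_gemm(x, w, stride=1):
    """Conv2d on one (C_in, H, W) slice via im2col + zero-padded GEMM."""
    c_out, _, kh, kw = w.shape
    cols, h_out, w_out = im2col_2d(x, kh, kw, stride)
    a = pad_to_tile(w.reshape(c_out, -1))  # (M, K), padded
    b = pad_to_tile(cols)                  # (K, N), padded
    out = (a @ b)[:c_out, :h_out * w_out]  # zero padding adds nothing to the sums
    return out.reshape(c_out, h_out, w_out)


def conv3d_by_slices(x, w, stride_t=1, stride_s=1):
    """Conv3d for x (C_in, D, H, W), w (C_out, C_in, kD, kH, kW) as a sum of Conv2d."""
    kd = w.shape[2]
    d_out = (x.shape[1] - kd) // stride_t + 1
    frames = []
    for d in range(d_out):
        # output[:, d] = sum over kd of Conv2d(input[:, d*s + kd], weight[:, :, kd])
        frames.append(sum(conv2d_gemm(x[:, d * stride_t + k], w[:, :, k], stride_s)
                          for k in range(kd)))
    return np.stack(frames, axis=1)  # (C_out, D_out, H_out, W_out)


def conv3d_direct(x, w, stride_t=1, stride_s=1):
    """Naive sliding-window Conv3d, used only as a cross-check."""
    c_out, _, kd, kh, kw = w.shape
    _, d, h, wd = x.shape
    shape = ((d - kd) // stride_t + 1, (h - kh) // stride_s + 1, (wd - kw) // stride_s + 1)
    out = np.zeros((c_out,) + shape)
    for od, oh, ow in np.ndindex(*shape):
        win = x[:, od * stride_t:od * stride_t + kd,
                oh * stride_s:oh * stride_s + kh,
                ow * stride_s:ow * stride_s + kw]
        out[:, od, oh, ow] = np.tensordot(w, win, axes=4)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 7, 6))
w = rng.standard_normal((4, 3, 3, 3, 3))
out = conv3d_by_slices(x, w, stride_t=1, stride_s=2)
assert out.shape == (4, 3, 3, 2)
assert np.allclose(out, conv3d_direct(x, w, stride_t=1, stride_s=2))
```

The two results agree up to floating-point rounding, since both compute the same sums in a different order. The zero padding only enlarges the matrices to tile-aligned shapes; the extra rows and columns are sliced away after the matmul.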

Files

| File | Description |
|---|---|
| `conv3d.py` | NKI kernel, the main deliverable |
| `conv3d_ref.py` | NumPy reference (im2col + matmul) for testing |
| `test_conv3d.py` | 138+ test cases across 3 layers |

Test Coverage

| Layer | Cases | Source |
|---|---|---|
| PyTorch standard | 12 | Adapted from `torch/testing/_internal/common_nn.py` |
| Wan2.1/2.2 VAE configs | 12 | Actual CausalConv3d shapes from `wan/modules/vae.py` |
| CogVideoX-5b VAE configs | 15 | From `THUDM/CogVideoX-5b` vae/config.json |
| HunyuanVideo VAE configs | 18 | From `tencent/HunyuanVideo` vae/config.json |
| BFloat16 precision | 12 | bf16-quantized inputs vs PyTorch bf16 |
| Dilation | 8 | Uniform, spatial-only, temporal-only, asymmetric |
| Grouped / depthwise | 15 | groups=2/4, depthwise, with stride/padding/bias |
| Edge cases | 7+ | Single channel, D=1, mixed strides, causal padding |

All tests compare against torch.nn.functional.conv3d as ground truth.
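When adding a new configuration to these layers, it helps to know the expected output shape before running the kernel. The helper below applies the output-size formula documented for `torch.nn.Conv3d` to each of the D, H, W axes. The function names are hypothetical and are not part of this repo.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis, per the torch.nn.Conv3d shape formula."""
    effective_kernel = dilation * (kernel - 1) + 1  # span covered by a dilated kernel
    return (size + 2 * padding - effective_kernel) // stride + 1


def conv3d_out_shape(in_shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0),
                     dilation=(1, 1, 1)):
    """(D_out, H_out, W_out) for an input of spatial shape (D, H, W)."""
    return tuple(conv_out_size(n, k, s, p, d)
                 for n, k, s, p, d in zip(in_shape, kernel, stride, padding, dilation))


# Stride-2 downsampling with a 3x3x3 kernel and padding 1.
print(conv3d_out_shape((8, 16, 16), (3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1)))
# -> (4, 8, 8)

# Causal case: D was pre-padded from 8 to 10 frames, so temporal padding is 0 here.
print(conv3d_out_shape((10, 16, 16), (3, 3, 3), padding=(0, 1, 1)))
# -> (8, 16, 16)
```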

Quick Start

Run reference tests (any machine, no NKI needed)

pip install numpy pytest torch
pytest test_conv3d.py -k "Ref" -v

Run NKI kernel tests via Docker (macOS / any machine)

neuronxcc only runs on x86_64 Linux. Use Docker:

docker build --platform linux/amd64 -t nki-conv3d .
docker run --platform linux/amd64 nki-conv3d

This uses nki.simulate_kernel — no Trainium hardware required.

Run NKI kernel tests directly (x86_64 Linux only)

pip install "neuronx-cc==2.*" numpy torch pytest \
    --extra-index-url=https://pip.repos.neuron.amazonaws.com
pytest test_conv3d.py -k "NKI" -v

Use in your model

from conv3d import conv3d

# Calls NKI tiled_matmul_kernel internally (CPU simulation, no hardware needed)
result = conv3d(input_np, weight_np, bias_np,
                stride=(1, 1, 1), padding=(1, 1, 1))

CausalConv3d (Wan2.1/2.2 VAE)

Wan's CausalConv3d applies asymmetric temporal padding (2*pad, 0) before calling standard Conv3d. This kernel handles the Conv3d part; causal padding is done at the Python wrapper level:

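import numpy as np
from conv3d_ref import conv3d_ref

# Simulate CausalConv3d with kernel (3,3,3) and padding=(1,1,1).
# Pad order is (N, C, D, H, W): D gets (2, 0) (front only), H and W get (1, 1).
input_causal = np.pad(input, ((0,0), (0,0), (2,0), (1,1), (1,1)), mode="constant")
# Padding is already applied above, so the convolution itself uses padding=(0,0,0).
output = conv3d_ref(input_causal, weight, stride=(1,1,1), padding=(0,0,0))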
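To see why front-only temporal padding makes the convolution causal, the idea can be reduced to a single temporal axis. The sketch below is a standalone illustration with hypothetical function names. It does not call `conv3d` or `conv3d_ref`, and it assumes stride 1 and no dilation.

```python
import numpy as np


def temporal_conv(x, w):
    """Valid cross-correlation along time: x (T,), w (k,) -> (T - k + 1,)."""
    k = len(w)
    return np.array([np.dot(x[t:t + k], w) for t in range(len(x) - k + 1)])


def causal_temporal_conv(x, w):
    """Pad k-1 zeros at the front only, so output[t] depends on x[t-k+1 .. t]."""
    return temporal_conv(np.pad(x, (len(w) - 1, 0)), w)


def symmetric_temporal_conv(x, w):
    """Usual 'same' padding for odd k: output[t] also depends on future frames."""
    p = len(w) // 2
    return temporal_conv(np.pad(x, (p, p)), w)


x = np.arange(8, dtype=float)
w = np.array([0.5, -1.0, 2.0])

# Perturb frame 5 and compare the outputs for frames 0..4.
x_future = x.copy()
x_future[5] += 10.0

causal_same = np.array_equal(causal_temporal_conv(x, w)[:5],
                             causal_temporal_conv(x_future, w)[:5])
symmetric_same = np.array_equal(symmetric_temporal_conv(x, w)[:5],
                                symmetric_temporal_conv(x_future, w)[:5])
print(causal_same, symmetric_same)  # -> True False
```

With causal padding the first five outputs are unchanged by the edit to frame 5. With symmetric padding, output 4 reads frame 5, so it changes.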

Roadmap

Related

aws-neuron/nki-library — Official NKI kernels (Conv1d, Flash Attention, RoPE, RMSNorm)
aws-neuron/nki-samples — NKI tutorials and examples
Wan-Video/Wan2.1 — Video generation model whose 3D VAE needs this kernel
neuronx-distributed-inference #57 — LTX-2 video model on Trainium (DiT only, no Conv3D)

License

Apache 2.0
