This repo contains two complete flows (run from the repo root):
- DPU dialect MLIR -> LLVM IR -> DPU binary -> host runner
- Triton -> LLVM IR -> DPU binary -> host runner
The goal is to be able to run on the UPMEM simulator from a clean checkout.
- `dpu/vec_add.mlir`: DPU dialect version of the vector add kernel.
- `dpu/vec_add_args.c`: defines `DPU_INPUT_ARGUMENTS` in the host section.
- `host/vec_add_host.c`: host runner for the DPU dialect flow.
- `dpu_min_test.py`: Triton kernel, now `C = A + B`.
- `dpu/triton_wrapper.ll`: auto-generated DPU entrypoint (`main`) that calls `add_kernel`.
- `dpu/triton_args.h` / `dpu/triton_args.c`: auto-generated DPU args struct + definition.
- `host/triton_args.h`: auto-generated host args struct.
- `host/vec_add_triton_host.c`: host runner for the Triton flow.
- `scripts/host_triton_ir.sh`: host-side script to emit Triton LLVM IR.
- `scripts/container_triton_run.sh`: container-side script to build/run on the simulator.
- `scripts/generate_triton_wrapper.py`: generates `triton_wrapper.ll` + args headers from the Triton IR.
- `scripts/host_triton_pack.py`: host-side packer that creates a self-contained artifact folder.
- `scripts/container_artifact_run.sh`: container-side build/run for artifacts (no per-kernel C edits).
- `scripts/use_local_deps.sh`: helper to link local deps or emit env exports.
- `third_party/`: placeholders for Triton + UPMEM LLVM (can be git submodules).
- `artifacts/`: generated Triton artifact folders (safe to delete/recreate).
Host laptop:
- Triton repo (suggested location): `third_party/triton`
- Python venv (example): `.venv/triton_env`, or any Python with Triton in `PYTHONPATH`
- UPMEM LLVM build (suggested location): `third_party/upmem_llvm/llvm-project/build`
You can override paths via environment variables:
`TRITON_PY`, `TRITON_SRC`, `UPMEM_OPT`, `TRITON_CACHE_DIR`. This repo also supports `third_party/` symlinks, but the recommended flow here is to set the exports directly (no symlinks required).
Container:
- UPMEM SDK installed (provides `dpu-upmem-dpurte-clang`, `dpu-pkg-config`)
- Repo root mounted at `/mnt/host_cwd`
These flows rely on DPU-specific patches. Using upstream Triton/LLVM will not
work. Use the provided branches under third_party/:
- Triton: `third_party/triton` on branch `prathamesh_triton_pim`
  - Build it locally:

        cd third_party/triton
        # optional venv
        python -m venv .venv
        source .venv/bin/activate
        pip install -r python/requirements.txt
        TRITON_BUILD_WITH_CLANG_LLD=true pip install -e . --no-build-isolation

  - Then set `TRITON_SRC` and (if using a venv) `TRITON_PY`.
- UPMEM LLVM: `third_party/upmem_llvm/llvm-project` (branch `main`)
  - Build `opt`/`mlir-opt` with the `dpu-legalize` pass:

        cd third_party/upmem_llvm/llvm-project
        cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON \
          -DLLVM_ENABLE_PROJECTS="mlir;llvm;lld" -DLLVM_TARGETS_TO_BUILD="host" \
          -B build llvm
        ninja -C build opt mlir-opt mlir-translate

  - Then set `UPMEM_OPT=third_party/upmem_llvm/llvm-project/build/bin/opt`.
From repo root:
export TRITON_SRC=/path/to/triton
export UPMEM_OPT=/path/to/upmem_llvm/llvm-project/build/bin/opt
# optional
# export TRITON_PY=/path/to/venv/bin/python
./scripts/host_triton_pack.py --artifact-dir ./artifacts/add --out-indices 2
./container/start_docker.sh
Inside the container:
cd /mnt/host_cwd
./scripts/container_artifact_run.sh --arg 3=10
Artifact flow with multiple DPUs (host + container):
# host
./scripts/host_triton_pack.py --artifact-dir ./artifacts/add --out-indices 2
./container/start_docker.sh
# container
ARTIFACT_DIR=/mnt/host_cwd/artifacts/add ./scripts/container_artifact_run.sh \
--nr-dpus 4 --arg 3=100
This bypasses the packer and runs the Triton compiler directly, then generates the wrapper/args. From repo root:
export TRITON_SRC=/path/to/triton
export UPMEM_OPT=/path/to/upmem_llvm/llvm-project/build/bin/opt
export TRITON_PY=/path/to/venv/bin/python
$TRITON_PY ./dpu_min_test.py --out bin/triton_add.ll
$TRITON_PY scripts/generate_triton_wrapper.py \
--ir bin/triton_add.ll \
--out dpu/triton_wrapper.ll \
--kernel add_kernel \
--dpu-args-h dpu/triton_args.h \
--dpu-args-c dpu/triton_args.c \
--host-args-h host/triton_args.h
./container/start_docker.sh
Inside the container:
cd /mnt/host_cwd
./scripts/container_triton_run.sh
`scripts/use_local_deps.sh` can create `third_party` symlinks and/or emit `export` lines for `TRITON_SRC`, `UPMEM_OPT`, `TRITON_PY`, `TRITON_CACHE_DIR`.
- Example (symlink + export in current shell):

      source scripts/use_local_deps.sh --link --triton /path/to/triton --upmem /path/to/upmem_llvm
This is the original DPU-dialect pipeline.
Purpose: convert DPU dialect to LLVM IR so the UPMEM toolchain can compile it.
third_party/upmem_llvm/llvm-project/build/bin/mlir-opt \
dpu/vec_add.mlir \
--convert-dpu-to-llvm \
-o bin/vec_add_llvm.mlir
third_party/upmem_llvm/llvm-project/build/bin/mlir-translate \
--mlir-to-llvmir \
bin/vec_add_llvm.mlir \
-o bin/vec_add.ll
Purpose: compile LLVM IR + host arguments into a DPU executable.
dpu-upmem-dpurte-clang -O2 -g0 -x ir -c /mnt/host_cwd/bin/vec_add.ll \
-o /mnt/host_cwd/bin/vec_add.o
dpu-upmem-dpurte-clang -O2 -g0 -c /mnt/host_cwd/dpu/vec_add_args.c \
-o /mnt/host_cwd/bin/vec_add_args.o
/upmem-sdk-2023.1.0/bin/llvm-objcopy --remove-section=.eh_frame \
/mnt/host_cwd/bin/vec_add.o
/upmem-sdk-2023.1.0/bin/llvm-objcopy --remove-section=.eh_frame \
/mnt/host_cwd/bin/vec_add_args.o
dpu-upmem-dpurte-clang \
/mnt/host_cwd/bin/vec_add.o \
/mnt/host_cwd/bin/vec_add_args.o \
-o /mnt/host_cwd/bin/vec_add
Purpose: move inputs to MRAM, launch the DPU, verify outputs.
cc -O2 -std=c11 -Wall -Wextra \
-o /mnt/host_cwd/bin/vec_add_host \
/mnt/host_cwd/host/vec_add_host.c \
`dpu-pkg-config --cflags --libs dpu` \
-DNR_DPUS=1
/mnt/host_cwd/bin/vec_add_host
# Optional: pass element count at runtime
/mnt/host_cwd/bin/vec_add_host 50
Note: this flow was validated up to 512 elements due to alignment/transfer constraints.
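The 8-byte alignment constraint on MRAM transfers can be sketched in a few lines. This is an illustration of the idea only; the helper name `align8` is hypothetical, not something defined in this repo:

```python
def align8(nbytes: int) -> int:
    """Round a byte count up to the next multiple of 8,
    the MRAM transfer granularity assumed by the host runners."""
    return (nbytes + 7) & ~7

# 51 int32 elements -> 204 bytes of payload, padded to a legal transfer size.
n_elems = 51
payload = n_elems * 4            # sizeof(int32_t)
transfer = align8(payload)       # 204 rounds up to 208
print(transfer)                  # 208
```

Passing `transfer` (rather than the raw `payload`) to the MRAM copy keeps offsets and lengths legal for the DPU runtime.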
This uses Triton to generate LLVM IR, then runs on the simulator using the same host/DPU wiring as above.
Purpose: compile the Triton kernel to LLVM IR in bin/triton_add.ll.
./scripts/host_triton_ir.sh
The script sets:
- `PYTHONPATH=third_party/triton/python`
- `TRITON_BACKENDS_IN_TREE=1`
- `TRITON_DPU=1`
- `TRITON_DPU_OPT=third_party/upmem_llvm/llvm-project/build/bin/opt`
- `TRITON_CACHE_DIR` (writable cache)
Overrides (optional):
- `TRITON_PY` (python path)
- `TRITON_SRC` (Triton repo path)
- `UPMEM_OPT` (opt path)
- `TRITON_CACHE_DIR`
- `OUT_LL` (output .ll path)
- `KERNEL_NAME` (default `add_kernel`)
Purpose: compile the Triton IR and run it on the simulator.
cd /mnt/host_cwd
./scripts/container_triton_run.sh
Overrides (optional):
NR_DPUS(default 1)
Triton emits only `add_kernel`. The DPU toolchain expects a `main` entrypoint
and uses `DPU_INPUT_ARGUMENTS` for parameters. The host script now generates:
- `dpu/triton_wrapper.ll` (DPU entrypoint)
- `dpu/triton_args.h` / `dpu/triton_args.c` (DPU args struct + definition)
- `host/triton_args.h` (host-side args struct)
Generation is done by parsing the kernel signature in `bin/triton_add.ll`:
- MRAM pointer args (`addrspace(255)`) are loaded from `DPU_INPUT_ARGUMENTS` and bitcast to the exact pointer type.
- Pointer args in `addrspace(1)` are passed as `null` (these are internal Triton args and should remain unused).
- Scalar args are loaded from `DPU_INPUT_ARGUMENTS` in order.
This keeps the wrapper and argument structs in sync with the Triton IR without manual editing.
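A rough Python sketch of that classification, assuming a simplified textual signature in LLVM 12's typed-pointer syntax (the real `generate_triton_wrapper.py` parses the IR more carefully; `classify_args` is a hypothetical helper, not the script's actual API):

```python
import re

def classify_args(define_line: str) -> list[str]:
    """Classify kernel args the way the wrapper generation rules above describe:
    addrspace(255) pointers come from DPU_INPUT_ARGUMENTS (bitcast to the exact
    type), addrspace(1) pointers are passed as null, scalars are loaded in order."""
    params = re.search(r"\((.*)\)", define_line).group(1)
    kinds = []
    for p in params.split(","):
        p = p.strip()
        if "addrspace(255)" in p:
            kinds.append("mram")      # MRAM pointer: load + bitcast
        elif "addrspace(1)" in p:
            kinds.append("null")      # internal Triton arg: pass null
        else:
            kinds.append("scalar")    # scalar: load from DPU_INPUT_ARGUMENTS
    return kinds

sig = ("define void @add_kernel(i32 addrspace(255)* %a, "
       "i32 addrspace(255)* %b, i32 addrspace(255)* %c, "
       "i32 %n, i8 addrspace(1)* %p0, i8 addrspace(1)* %p1)")
print(classify_args(sig))  # ['mram', 'mram', 'mram', 'scalar', 'null', 'null']
```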
This flow makes Triton the only user-facing surface. The host packs a self-contained artifact folder; the container compiles and runs it with a generic runner.
Purpose: generate kernel.ll, wrapper/args, and a host runner without manual C edits.
./scripts/host_triton_pack.py \
--artifact-dir ./artifacts/add \
--out-indices 2
Notes:
- `--out-indices` uses kernel arg indices (0-based).
- For the default `add_kernel` in `dpu_min_test.py`, the signature is:
  - args 0/1/2: MRAM pointers (A, B, C)
  - arg 3: scalar `n`
  - args 4/5: internal `addrspace(1)` pointers (ignored)
- You can supply your own Triton script with `--triton-script <file>`. The script must accept `--out <path>` and write the LLVM IR there.
- Default Triton script: `./dpu_min_test.py` if present, otherwise `$TRITON_SRC/python/triton/backends/dpu/dpu_min_test.py`.
- To set a default DPU count from the Triton script, you can use any of:
  - a kernel launch keyword: `kernel[grid](..., num_dpus=4)`
  - a `triton.Config(..., num_dpus=4)` inside `@triton.autotune`
  - a top-level literal: `TRITON_DPU_NUM_DPUS = 4`

  The generated host runner uses the first found in that order, unless `--nr-dpus` is passed at runtime. (Example: `dpu_min_test.py` sets `TRITON_DPU_NUM_DPUS = 4`.)
- Environment overrides: `TRITON_PY`, `TRITON_SRC`, `UPMEM_OPT`, `TRITON_CACHE_DIR`.
Artifact contents:
- `kernel.ll`: Triton LLVM IR
- `wrapper.ll`: DPU entrypoint wrapper
- `triton_args.h`/`.c`: DPU args struct + definition
- `host_args.h`: host args struct
- `host_runner.c`: auto-generated host runner
- `meta.json`: kernel signature + out indices
Purpose: compile the DPU binary and run with the auto-generated host runner.
cd /mnt/host_cwd
ARTIFACT_DIR=/mnt/host_cwd/artifacts/add ./scripts/container_artifact_run.sh \
--arg 3=10
`--arg 3=10` sets scalar kernel argument index 3 (the element count `n`) to 10.
Optional overrides:
- `--nr-dpus N` to request multiple DPUs (inputs are sharded by default; this overrides any `num_dpus` default from the Triton script).
- `--len IDX=N` to set element counts per pointer arg (default uses the single scalar arg if present, e.g. arg 3 for `n`).
- `--in IDX=PATH` to provide raw input data for a pointer arg.
- `--out IDX=PATH` to write raw output data for a pointer arg.

Note: the auto-generated host runner supports multiple DPUs and shards inputs by default.
Example (explicit lengths + output capture):
ARTIFACT_DIR=/mnt/host_cwd/artifacts/add ./scripts/container_artifact_run.sh \
--arg 3=10 --len 0=10 --len 1=10 --len 2=10 --out 2=/tmp/c.bin
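Default sharding amounts to splitting the element count roughly evenly across DPUs. The sketch below shows that idea only; it assumes a simple even split with the remainder spread over the leading DPUs, and the runner's actual scheme (e.g. any per-shard alignment padding) may differ:

```python
def shard_counts(n: int, nr_dpus: int) -> list[int]:
    """Split n elements across nr_dpus, giving one extra
    element to each leading DPU when n is not divisible."""
    base, extra = divmod(n, nr_dpus)
    return [base + (1 if i < extra else 0) for i in range(nr_dpus)]

print(shard_counts(10, 4))  # [3, 3, 2, 2]
```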
The UPMEM `opt` is LLVM 12, which cannot parse opaque pointers. The DPU backend
now auto-converts `ptr` to typed pointers and retries `opt` when needed.
Useful env flags:
- Skip legalization: `TRITON_DPU_SKIP_LEGALIZE=1`
- Force legalization errors: `TRITON_DPU_FORCE_LEGALIZE=1`
When legalization runs:
[dpu_legalize] using opt: third_party/upmem_llvm/llvm-project/build/bin/opt
- Ensure full IR: `tail -n 5 bin/triton_add.ll`
- Ensure `.eh_frame` stripped: `llvm-readelf -l bin/vec_add` (no `.eh_frame` in MRAM)
- `error: expected type` in `opt`: LLVM 12 rejected opaque pointers. Ensure you are using the in-tree Triton backend and let legalization run, or set `TRITON_DPU_SKIP_LEGALIZE=1`.
- `Cast between addresses of different address space is not supported`: you are compiling IR that still contains `addrspacecast`. Regenerate IR with the updated Triton DPU backend and re-run `host_triton_ir.sh`.
- `found end of file when expecting more instructions`: you redirected only the preview output. Use `dpu_min_test.py --out <file>` or `host_triton_ir.sh`.
- `DPU Error (invalid mram access)`: MRAM offsets/lengths must be 8-byte aligned. Keep the host-side align-to-8 logic and pass aligned sizes to transfers.
- `DPU Error (undefined symbol)` or `invalid memory symbol access`: ensure `triton_args.c` (Triton flow) or `vec_add_args.c` (DPU dialect flow) is linked into the DPU binary and `DPU_INPUT_ARGUMENTS` exists.
- `PermissionError` for `.triton/cache`: set `TRITON_CACHE_DIR` to a writable path.