yiyu-metax edited this page Dec 23, 2025 · 5 revisions

Welcome to the mcoplib wiki!

mcoplib

mcoplib is an operator library for AI network architectures, including LLMs, CNNs, DNNs, and CV models. It provides both CUDA-based and Triton-based operators, ranging from conventional operators for CV and CNN tasks to commonly used and fused operators for LLMs, such as those for Dense and MoE layers.

Currently, this project supports all custom CUDA operators required by vLLM-MetaX, all custom CUDA operators relied upon by SGLang, and all CUDA-based operators necessary for LMDeploy. Additionally, it accommodates customized CUDA operators for large enterprise scenarios and traditional CV-oriented CUDA operators. mcoplib provides core operators essential for inference and training of mainstream LLMs, including Qwen3, DeepSeek, Gemini, and GLM. The library further supports operators for multiple quantization formats, intra-machine and inter-machine cluster communication, typical fused operators, and K/V cache operators.

The mcoplib operator API is primarily categorized into three types:

  • Traditional CV and CNN operators expose C/C++ APIs.
  • Custom CUDA operators are exposed via Python APIs using pybind11.
  • Operators required by mainstream LLM inference frameworks (e.g., vLLM, SGLang) provide dynamic registration APIs for integration with PyTorch.

Framework

[Image: framework diagram]

Profiling

This tool enables operator profiling and performance debugging with a single line of code, avoiding repetitive and cumbersome torch.profiler calls throughout the code.

mxbench

mxbench is used for operator performance benchmarking. The performance metrics include:

  • Memory bandwidth (BW)
  • GPU Time
  • CPU Time
  • Data Size

Currently, only C/C++ external operator APIs are supported. Support for Python API operator performance benchmarking will be added in the future.

mxbench Performance Example

| NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| 16777216 | 64.000 MiB | 2624x | 219.242 us | 31.08% | 191.086 us | 1.66% | 87.799G | 702.396 GB/s | 38.11% | 2818x |

MetaX C280

| NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| 16777216 | 64.000 MiB | 2592x | 225.837 us | 23.57% | 193.315 us | 1.52% | 86.787G | 694.295 GB/s | 37.67% | 2806x |
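The derived columns in the two result rows above can be cross-checked from the raw measurements. The sketch below recomputes Elem/s and GlobalMem BW from NumElements, DataSize, and GPU Time. It assumes that each element is read once and written once, so the bytes moved are twice DataSize, and that GB means 10^9 bytes. Both assumptions are inferred from the reported numbers, not stated on this page, and `derived_metrics` is a hypothetical helper, not an mxbench function.

```python
MIB = 1024 ** 2  # bytes per MiB


def derived_metrics(num_elements, data_size_mib, gpu_time_us, accesses_per_element=2):
    """Recompute throughput metrics from raw benchmark measurements.

    accesses_per_element=2 is an assumption: one read plus one write per element.
    Returns (elements per second, bandwidth in GB/s with GB = 10^9 bytes).
    """
    gpu_time_s = gpu_time_us * 1e-6
    elem_per_s = num_elements / gpu_time_s
    bytes_moved = accesses_per_element * data_size_mib * MIB
    bandwidth_gb_s = bytes_moved / gpu_time_s / 1e9
    return elem_per_s, bandwidth_gb_s


# (NumElements, DataSize in MiB, GPU Time in us, BWUtil as a fraction), copied from the tables.
rows = [
    (16777216, 64.0, 191.086, 0.3811),
    (16777216, 64.0, 193.315, 0.3767),
]

for num_elements, size_mib, gpu_us, bw_util in rows:
    elem_s, bw = derived_metrics(num_elements, size_mib, gpu_us)
    # Peak bandwidth implied by the reported utilization (derived, not a quoted spec).
    implied_peak = bw / bw_util
    print(f"{elem_s / 1e9:.3f} G elem/s, {bw:.3f} GB/s, implied peak ~{implied_peak:.0f} GB/s")
```

Under these assumptions the recomputed values agree with the reported Elem/s and GlobalMem BW columns to within rounding, which suggests GPU Time, not CPU Time, is the denominator for both metrics.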
