Home

Welcome to the mcoplib wiki!

mcoplib

mcoplib is an operator library for AI network architectures, including LLMs, CNNs, DNNs, and CV models. It provides both CUDA-based and Triton-based operators, ranging from conventional operators for CV and CNN tasks to common and fused operators for LLMs, such as those for Dense and MoE layers.

Currently, this project supports all custom CUDA operators required by vLLM-MetaX, all custom CUDA operators relied upon by SGLang, and all CUDA-based operators necessary for LMDeploy. Additionally, it accommodates customized CUDA operators for large enterprise scenarios and traditional CV-oriented CUDA operators. mcoplib provides core operators essential for inference and training of mainstream LLMs, including Qwen3, DeepSeek, Gemini, and GLM. The library further supports operators for multiple quantization formats, intra-machine and inter-machine cluster communication, typical fused operators, and K/V cache operators.

The mcoplib operator API is primarily categorized into three types:

Traditional CV and CNN operators expose C/C++ APIs.
Custom CUDA operators are exposed via Python APIs using pybind11.
Operators required by mainstream LLM inference frameworks (e.g., vLLM, SGLang) provide dynamic registration APIs for integration with PyTorch.

Framework

Profiling

This tool enables operator profiling and performance debugging with a single line of code, avoiding repetitive and cumbersome torch.profiler calls.
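The page does not show the tool's actual interface, so the following is only a rough sketch of what a one-line profiling hook over torch.profiler could look like. The decorator name `profile_op` and its `row_limit` parameter are hypothetical and are not part of mcoplib; only the `torch.profiler` calls are real PyTorch APIs.

```python
import functools


def profile_op(func=None, *, row_limit=10):
    """Hypothetical decorator (not an mcoplib API): profile each call to `func`."""

    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            # Imported lazily so this module still loads where PyTorch is absent.
            import torch
            from torch.profiler import ProfilerActivity, profile

            activities = [ProfilerActivity.CPU]
            if torch.cuda.is_available():
                activities.append(ProfilerActivity.CUDA)

            with profile(activities=activities) as prof:
                result = f(*args, **kwargs)

            # Print the most expensive ops recorded during this call.
            print(
                prof.key_averages().table(
                    sort_by="self_cpu_time_total", row_limit=row_limit
                )
            )
            return result

        return wrapper

    # Support both @profile_op and @profile_op(row_limit=20).
    return decorate(func) if func is not None else decorate
```

Under these assumptions, adding `@profile_op` above an operator wrapper would be the single line needed to get a profiler table on every call, with the wrapped function's return value passed through unchanged.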

mxbench

mxbench is used for operator performance benchmarking. The performance metrics include:

Memory bandwidth (BW)
GPU Time
CPU Time
Data Size

Currently, only C/C++ external operator APIs are supported. Support for Python API operator performance benchmarking will be added in the future.

mxbench Performance Example

NumElements	DataSize	Samples	CPU Time	Noise	GPU Time	Noise	Elem/s	GlobalMem BW	BWUtil	Samples
16777216	64.000 MiB	2624x	219.242 us	31.08%	191.086 us	1.66%	87.799G	702.396 GB/s	38.11%	2818x

MetaX C280

NumElements	DataSize	Samples	CPU Time	Noise	GPU Time	Noise	Elem/s	GlobalMem BW	BWUtil	Samples
16777216	64.000 MiB	2592x	225.837 us	23.57%	193.315 us	1.52%	86.787G	694.295 GB/s	37.67%	2806x
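As a sanity check on how the columns relate, the sketch below recomputes Elem/s and GlobalMem BW from NumElements, DataSize, and GPU Time. It assumes the benchmarked operator reads and writes each element once, so that total traffic is twice DataSize; that assumption is inferred from the reported numbers and is not stated on this page. The helper name `derived_metrics` is illustrative.

```python
MIB = 1024 ** 2


def derived_metrics(num_elements, data_size_mib, gpu_time_us, passes=2):
    """Return (elements per second, bandwidth in GB/s) for one table row.

    `passes` is the assumed number of times the data crosses global memory
    (2 = one read plus one write); this is an inference, not a documented fact.
    """
    gpu_time_s = gpu_time_us * 1e-6
    elems_per_s = num_elements / gpu_time_s
    bytes_moved = passes * data_size_mib * MIB
    bandwidth_gb_s = bytes_moved / gpu_time_s / 1e9
    return elems_per_s, bandwidth_gb_s


rows = {
    "first table": (16777216, 64.0, 191.086),
    "MetaX C280": (16777216, 64.0, 193.315),
}
for label, row in rows.items():
    elems, bw = derived_metrics(*row)
    print(f"{label}: {elems / 1e9:.3f} G elem/s, {bw:.3f} GB/s")
```

Both rows reproduce the reported Elem/s and GlobalMem BW values to within the rounding of the GPU Time column, which supports the read-plus-write interpretation.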
