Home
Welcome to the mcoplib wiki!
mcoplib is an operator library for AI network architectures, including LLMs, CNNs, DNNs, and CV models. It provides both CUDA-based and Triton-based operators, ranging from conventional operators for CV and CNN tasks to commonly used and fused operators for LLMs, such as those for Dense and MoE layers.
Currently, this project supports all custom CUDA operators required by vLLM-MetaX, all custom CUDA operators relied upon by SGLang, and all CUDA-based operators necessary for LMDeploy. Additionally, it accommodates customized CUDA operators for large enterprise scenarios and traditional CV-oriented CUDA operators. mcoplib provides core operators essential for inference and training of mainstream LLMs, including Qwen3, DeepSeek, Gemini, and GLM. The library further supports operators for multiple quantization formats, intra-machine and inter-machine cluster communication, typical fused operators, and K/V cache operators.
The mcoplib operator API falls into three main categories:
- Traditional CV and CNN operators expose C/C++ APIs.
- Custom CUDA operators are exposed as Python APIs through pybind11.
- Operators required by mainstream LLM inference frameworks (e.g., vLLM, SGLang) provide dynamic registration APIs for integration with PyTorch.
## Profiling
This tool enables operator profiling and performance debugging with a single line of code, avoiding repetitive and cumbersome torch.profiler calls throughout the code.
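The "single line" idea can be illustrated with a decorator that wraps every call in a profiler context, so the `with` block does not have to be repeated at each call site. The sketch below is not mcoplib's API: `profiled`, `wall_timer`, and `vector_add` are hypothetical names, and a standard-library wall-clock timer stands in for torch.profiler so the example runs without PyTorch.

```python
import functools
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(records, label):
    """Stand-in profiler (assumption): appends (label, elapsed seconds) to records."""
    start = time.perf_counter()
    try:
        yield
    finally:
        records.append((label, time.perf_counter() - start))


def profiled(make_profiler):
    """Return a decorator that runs each call inside make_profiler(function_name)."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with make_profiler(fn.__name__):
                return fn(*args, **kwargs)
        return wrapper
    return decorate


records = []


@profiled(lambda name: wall_timer(records, name))  # the single added line
def vector_add(a, b):
    # Hypothetical operator used only for the demonstration.
    return [x + y for x, y in zip(a, b)]


result = vector_add([1, 2, 3], [4, 5, 6])
print(result, records[0][0])
```

Any context-manager factory can be passed to `profiled`, so a real profiler context could replace `wall_timer` without changing the decorated functions.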
mxbench is used for operator performance benchmarking. The performance metrics include:
- Memory bandwidth (BW)
- GPU Time
- CPU Time
- Data Size
Currently, only C/C++ external operator APIs are supported. Support for Python API operator performance benchmarking will be added in the future.
| NumElements | DataSize | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | Elem/s | GlobalMem BW | BWUtil | Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| 16777216 | 64.000 MiB | 2624x | 219.242 us | 31.08% | 191.086 us | 1.66% | 87.799G | 702.396 GB/s | 38.11% | 2818x |

| NumElements | DataSize | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | Elem/s | GlobalMem BW | BWUtil | Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| 16777216 | 64.000 MiB | 2592x | 225.837 us | 23.57% | 193.315 us | 1.52% | 86.787G | 694.295 GB/s | 37.67% | 2806x |
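The throughput columns can be cross-checked from NumElements, DataSize, and GPU Time. The sketch below uses the first result row above and makes two assumptions that are not stated in the source: that global-memory traffic is counted as one read plus one write of the data (`passes=2`), and that GB means 10^9 bytes. Both are inferred because they reproduce the reported figures. The `throughput` helper is a hypothetical name, and the implied peak bandwidth is derived from BWUtil rather than taken from the documentation.

```python
MIB = 1024 ** 2


def throughput(num_elements, data_size_bytes, gpu_time_s, passes=2):
    """Return (elements per second, bytes per second).

    passes=2 is an assumption: each element is read once and written once.
    """
    elems_per_s = num_elements / gpu_time_s
    bytes_per_s = passes * data_size_bytes / gpu_time_s
    return elems_per_s, bytes_per_s


# First result row: 16777216 elements, 64.000 MiB, GPU Time 191.086 us.
elems_per_s, bytes_per_s = throughput(16_777_216, 64 * MIB, 191.086e-6)

elems_g = elems_per_s / 1e9          # compare with the Elem/s column
bw_gb_s = bytes_per_s / 1e9          # compare with the GlobalMem BW column
implied_peak_gb_s = bw_gb_s / 0.3811  # BWUtil = 38.11% of the device peak

print(f"Elem/s: {elems_g:.3f}G")
print(f"GlobalMem BW: {bw_gb_s:.3f} GB/s")
print(f"Implied peak bandwidth: {implied_peak_gb_s:.0f} GB/s")
```

Under these assumptions the computed values agree with the reported 87.799G and 702.396 GB/s to within the rounding of the GPU Time column, and the BWUtil figure implies a peak of roughly 1.84 TB/s.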