KernelPort

KernelPort is a high-performance inference server that lets you run models from TensorFlow, PyTorch, ONNX, and TensorRT without giving up their native GPU kernels.

Bring your kernels with you. Serve them from one runtime.


Why KernelPort?

Most inference servers force you into a single runtime or a lowest-common-denominator format. KernelPort takes a different approach:

  • Native execution: TensorFlow runs with TensorFlow kernels. PyTorch runs with PyTorch kernels.
  • GPU-first design: Built for CUDA streams, batching, and memory reuse.
  • One server, many backends: ONNX Runtime, TensorRT, LibTorch, TensorFlow.
  • Production-grade: Dynamic batching, scheduling, backpressure, observability.

KernelPort is written in Rust for safety, performance, and predictable latency.


What KernelPort Is (and Is Not)

KernelPort is:

  • An inference orchestrator with pluggable execution backends
  • A way to consolidate heterogeneous model stacks into one serving plane
  • A system designed for low-latency, high-throughput GPU inference

KernelPort is not:

  • A training framework
  • A model compiler
  • A replacement for TensorFlow or PyTorch

Core Concepts

Native Backend Execution

KernelPort does not reimplement ML frameworks. Instead, it embeds and orchestrates their production runtimes:

Backend      Execution
PyTorch      LibTorch (TorchScript / compiled artifacts)
TensorFlow   TensorFlow C API (SavedModel)
ONNX         ONNX Runtime (CUDA / TensorRT EP)
TensorRT     Native TensorRT engines

This ensures you get:

  • Vendor-optimized kernels
  • Framework correctness
  • No silent performance regressions
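
To make this concrete, here is a minimal sketch of what a pluggable backend interface could look like. The trait and type names below are illustrative assumptions, not KernelPort's actual API:

    // Illustrative sketch -- trait and type names are assumptions,
    // not the actual kernelport backend API.

    /// A tensor handed to or returned from a backend (simplified).
    pub struct Tensor {
        pub shape: Vec<usize>,
        pub data: Vec<f32>,
    }

    /// Each execution backend (LibTorch, TensorFlow C API, ONNX Runtime,
    /// TensorRT) would implement this trait; the server only sees the trait.
    pub trait ExecutionBackend: Send + Sync {
        /// Human-readable backend name, e.g. "onnxruntime".
        fn name(&self) -> &str;

        /// Load a model artifact (SavedModel directory, TorchScript file,
        /// .onnx file, or serialized TensorRT engine).
        fn load(&mut self, artifact_path: &std::path::Path) -> Result<(), String>;

        /// Run one (possibly batched) inference using the backend's native kernels.
        fn infer(&self, inputs: &[Tensor]) -> Result<Vec<Tensor>, String>;
    }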

Dynamic Batching

Requests are dynamically batched per model and shape:

  • Configurable max batch size
  • Tight batching windows (ms-level)
  • Shape-aware grouping

This improves GPU utilization without sacrificing latency SLAs.
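
As a rough illustration of this policy (a sketch, not KernelPort's actual scheduler code; names and the channel-based plumbing are assumptions), the function below drains a request queue for at most one batching window, groups requests by input shape, and caps each batch at the configured maximum:

    // Illustrative dynamic-batching sketch in Rust.
    use std::collections::HashMap;
    use std::sync::mpsc::Receiver;
    use std::time::{Duration, Instant};

    /// A queued inference request (simplified).
    pub struct Request {
        pub shape: Vec<usize>,
        pub data: Vec<f32>,
    }

    /// Drain requests for up to `window`, then group them by input shape so
    /// each group can be dispatched as one batch (capped at `max_batch`,
    /// assumed to be >= 1).
    pub fn collect_batches(
        rx: &Receiver<Request>,
        max_batch: usize,
        window: Duration,
    ) -> Vec<Vec<Request>> {
        let deadline = Instant::now() + window;
        let mut by_shape: HashMap<Vec<usize>, Vec<Request>> = HashMap::new();

        loop {
            let remaining = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(remaining) {
                Ok(req) => by_shape.entry(req.shape.clone()).or_default().push(req),
                Err(_) => break, // window expired or sender closed
            }
            if by_shape.values().any(|group| group.len() >= max_batch) {
                break; // at least one batch is full: dispatch early
            }
        }

        // Split oversized groups so no batch exceeds max_batch.
        by_shape
            .into_values()
            .flat_map(|mut group| {
                let mut batches = Vec::new();
                while group.len() > max_batch {
                    batches.push(group.split_off(group.len() - max_batch));
                }
                batches.push(group);
                batches
            })
            .collect()
    }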


GPU-Aware Scheduling

KernelPort schedules work across:

  • GPUs (or MIG slices)
  • CUDA streams
  • Backend-specific execution contexts

The scheduler can be tuned for (see the sketch after this list):

  • Throughput
  • Latency
  • Fairness across models or tenants
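
A minimal sketch of how these tuning knobs and placement decisions might be expressed is shown below. The names are assumptions for illustration, not KernelPort's real configuration surface:

    // Illustrative scheduling knobs -- a sketch, not kernelport's actual config.

    /// What the scheduler optimizes for.
    pub enum SchedulingObjective {
        /// Pack batches aggressively to maximize GPU utilization.
        Throughput,
        /// Dispatch as soon as possible, even with small batches.
        Latency,
        /// Round-robin across models or tenants so no one starves.
        Fairness,
    }

    /// Where a unit of work can be placed.
    pub struct Placement {
        pub gpu_index: usize,   // physical GPU or MIG slice
        pub cuda_stream: usize, // stream within that device
    }

    /// Pick a placement by simple round-robin over devices and streams
    /// (assumes num_gpus and streams_per_gpu are >= 1). A real scheduler
    /// would also weigh queue depth, memory headroom, and the objective above.
    pub fn next_placement(counter: usize, num_gpus: usize, streams_per_gpu: usize) -> Placement {
        Placement {
            gpu_index: (counter / streams_per_gpu) % num_gpus,
            cuda_stream: counter % streams_per_gpu,
        }
    }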

Architecture (High Level)

Diagrams: KernelPort Architecture; KernelPort GPU Worker.


Local Development

CPU (macOS or Linux)

  • Install Rust components: rustup component add rustfmt clippy
  • Run checks: make check
  • Run the server: cargo run -p kernelport-server -- serve --device cpu
  • Run the ORT identity test (CPU): cargo test -p kernelport-backend-ort --test identity

CUDA (Linux + NVIDIA)

KernelPort supports CUDA via ONNX Runtime. For GPU inference:

  • Provide a CUDA-enabled ONNX Runtime shared library (build from source or use a CUDA-enabled package).
  • Set ORT_DYLIB_PATH to the CUDA-enabled libonnxruntime.so.
  • Build with the CUDA feature and select a device:
    • cargo run -p kernelport-server --features ort-cuda -- serve --device cuda:0
