GAZSI: A GPU-Accelerated Native Zero-Copy Framework for Socket-Based Inference Serving

A high-performance, GPU-accelerated HTTP server and packet-processing framework built on the NVIDIA DOCA SDK and GPUNetIO.

Overview

GAZSI enables direct GPU processing of network packets, bypassing the CPU on the data path to minimize latency. The system implements:

  • GPU-Direct HTTP Server: HTTP request/response handling entirely on GPU (a toy parsing sketch follows this list)
  • TensorRT Inference Integration: ML inference via lock-free CPU-GPU ring buffer
  • Dynamic Batching: Automatic request batching for improved throughput
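The on-GPU request handling can be illustrated with a toy standalone kernel. This is a minimal sketch, not the repository's actual parser in kernels/http_server.cu; parse_get_path and its calling convention are illustrative assumptions:

#include <cuda_runtime.h>
#include <cstdio>

/* Toy device-side parser: pull the path out of a request line such as
 * "GET /inference?d=hello HTTP/1.1". Returns the path length, or -1. */
__device__ int parse_get_path(const char *req, int len, char *path, int cap)
{
    if (len < 5 || req[0] != 'G' || req[1] != 'E' ||
        req[2] != 'T' || req[3] != ' ')
        return -1;                          /* not a GET request */
    int n = 0;
    for (int i = 4; i < len && req[i] != ' ' && n < cap - 1; ++i)
        path[n++] = req[i];
    path[n] = '\0';
    return n;
}

__global__ void demo(const char *req, int len)
{
    char path[64];
    if (parse_get_path(req, len, path, sizeof(path)) >= 0)
        printf("GPU parsed path: %s\n", path);
}

int main(void)
{
    static const char line[] = "GET /inference?d=hello HTTP/1.1\r\n";
    char *d;
    cudaMalloc(&d, sizeof(line));
    cudaMemcpy(d, line, sizeof(line), cudaMemcpyHostToDevice);
    demo<<<1, 1>>>(d, (int)sizeof(line) - 1);
    cudaDeviceSynchronize();                /* flush device printf */
    cudaFree(d);
    return 0;
}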

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Network Interface                        │
│                  (ConnectX-6/7 SmartNIC)                    │
└──────────────────────────┬──────────────────────────────────┘
                           │ GPUNetIO
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                        GPU Memory                            │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │ TCP Receive │─►│ HTTP Parser  │─►│ Response Builder │   │
│  │   Kernel    │  │   Kernel     │  │     Kernel       │   │
│  └─────────────┘  └──────────────┘  └──────────────────┘   │
│         │                                                    │
│         ▼         Ring Buffer (UVM)                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Inference Request Queue ◄──► CPU TensorRT Runner    │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
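The GPU-to-CPU handoff in the diagram hinges on the unified-memory ring. Below is a minimal sketch of a single-producer/single-consumer ring in CUDA managed memory; the Request layout, payload size, and producer/consumer split are illustrative assumptions, not the repository's definitions, and concurrent CPU/GPU access to managed memory requires a platform with concurrentManagedAccess:

#include <cuda_runtime.h>
#include <cuda/atomic>
#include <cstdio>
#include <cstring>
#include <new>

#define RING_SLOTS 128                      /* matches the 128-slot ring */

struct Request { char payload[64]; int len; };

struct Ring {
    Request slots[RING_SLOTS];
    cuda::atomic<unsigned, cuda::thread_scope_system> head{0}; /* GPU side */
    cuda::atomic<unsigned, cuda::thread_scope_system> tail{0}; /* CPU side */
};

/* GPU producer: claim the next slot, fill it, then publish head so the
 * CPU consumer can observe the completed write (release ordering). */
__global__ void produce(Ring *r, const char *msg, int len)
{
    unsigned h = r->head.load(cuda::memory_order_relaxed);
    if (h - r->tail.load(cuda::memory_order_acquire) == RING_SLOTS)
        return;                             /* ring full: drop in this toy */
    Request *q = &r->slots[h % RING_SLOTS];
    for (int i = 0; i < len; ++i)
        q->payload[i] = msg[i];
    q->len = len;
    r->head.store(h + 1, cuda::memory_order_release);
}

int main(void)
{
    Ring *r;
    cudaMallocManaged(&r, sizeof(Ring));    /* UVM: visible to CPU and GPU */
    new (r) Ring;                           /* construct the atomics */
    char *msg;
    cudaMallocManaged(&msg, 8);
    strcpy(msg, "hello");
    produce<<<1, 1>>>(r, msg, 6);

    /* CPU consumer: spin until the GPU publishes a slot. */
    unsigned t = r->tail.load(cuda::memory_order_relaxed);
    while (r->head.load(cuda::memory_order_acquire) == t) { /* spin */ }
    printf("got %d bytes: %s\n", r->slots[t % RING_SLOTS].len,
           r->slots[t % RING_SLOTS].payload);
    r->tail.store(t + 1, cuda::memory_order_release);

    cudaDeviceSynchronize();
    cudaFree(msg);
    cudaFree(r);
    return 0;
}

System-scope atomics (cuda::thread_scope_system) are what make the CPU spin loop safe here, which is consistent with the Compute Capability 7.0+ requirement listed below.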

Features

  • Zero-Copy Packet Processing: Direct NIC-to-GPU data path
  • Lock-Free Ring Buffer: Efficient CPU-GPU data exchange (128 slots)
  • Semaphore Signaling: Event-driven CPU-GPU synchronization
  • Dynamic Batch Inference: Up to 8 requests per batch (batching loop sketched after this list)
  • HTTP/1.1 Support: GET/POST request handling on GPU
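A hedged sketch of the batching policy on the CPU side: drain whatever is already queued, capped at the batch size, so a lone request is dispatched without delay while bursts amortize the inference launch. drain_one and run_tensorrt_batch are hypothetical stand-ins for the repository's ring pop and TensorRT wrapper; the stubs only make the sketch self-contained:

#include <cstdio>

#define MAX_BATCH 8                         /* "up to 8 requests per batch" */

struct Request { char payload[64]; int len; };

/* Stand-ins for the real ring pop and TensorRT enqueue; here they just
 * simulate 11 queued requests so the sketch runs standalone. */
static int pending = 11;

static bool drain_one(Request *out)
{
    if (pending == 0)
        return false;
    out->len = 0;
    --pending;
    return true;
}

static void run_tensorrt_batch(Request *reqs, int n)
{
    (void)reqs;
    printf("inference batch of %d\n", n);   /* prints 8, then 3 */
}

int main(void)
{
    Request batch[MAX_BATCH];
    while (pending > 0) {
        int n = 0;
        /* Greedily absorb queued requests up to the cap; a single
         * request still dispatches immediately with n == 1. */
        while (n < MAX_BATCH && drain_one(&batch[n]))
            ++n;
        if (n > 0)
            run_tensorrt_batch(batch, n);
    }
    return 0;
}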

Requirements

Hardware

  • NVIDIA GPU with Compute Capability 7.0+
  • NVIDIA ConnectX-6 or newer SmartNIC
  • 8GB+ System Memory

Software

  • Ubuntu 20.04/22.04
  • DOCA SDK 2.0+
  • CUDA 11.8+ / 12.x
  • GCC 9+

Building

The project builds as part of the DOCA SDK application framework:

# Create symlink in DOCA applications directory
sudo ln -s /path/to/GAZSI \
    /opt/mellanox/doca/applications/GAZSI

# Build
cd /opt/mellanox/doca/applications
sudo meson build -Denable_GAZSI=true
sudo ninja -C build

Usage

# Basic usage
sudo ./GAZSI -g <GPU_PCI> -n <NIC_PCI> -q <NUM_QUEUES>

# With HTTP server
sudo ./GAZSI -g E6:00.0 -n c1:00.0 -q 2 -s

# Options:
#   -g  GPU PCIe address
#   -n  NIC PCIe address
#   -q  Number of receive queues (1-4)
#   -s  Enable HTTP server mode

HTTP Endpoints

Endpoint             Method  Description
/index.              GET     Static index page
/contacts.           GET     Static contacts page
/inference?d=<text>  GET     TensorRT inference endpoint

Inference Example

# Send inference request
curl "http://10.0.0.6/inference?d=hello%20world"

# Response
{
  "input": "hello world",
  "tokens": 3,
  "embedding_sample": [0.123, -0.456, 0.789],
  "inference_time_us": 1234,
  "batch_size": 1
}

Project Structure

GAZSI/
├── main.c                      # Application entry point
├── kernels/
│   ├── tcp_rx.cu               # TCP packet receive kernel
│   ├── http_server.cu          # HTTP request/response handling
│   └── filters.cuh             # Packet filter functions
├── inference/
│   ├── tensorrt.cu             # TensorRT inference wrapper
│   ├── tensorrt.h              # TensorRT header
│   ├── ring_buffer.cu          # Lock-free ring buffer
│   └── ring_buffer.h           # Ring buffer header
├── doca/
│   ├── tcp.c                   # TCP queue initialization
│   ├── flow.c                  # DOCA Flow configuration
│   ├── common.h                # Common definitions
│   └── defines.h               # Constants and macros
├── tcp/
│   ├── session.c               # TCP session management
│   └── cpu_rss.c               # CPU RSS fallback
└── meson.build                 # Build configuration

Performance

Typical latency breakdown for inference requests:

Stage                      Time
T0→T1: Slot Allocation     ~10 us
T1→T2: GPU Write to UVM    ~5 us
T2→T3: CPU Detection       ~50 us
T3→T5: TensorRT Inference  ~500-2000 us
T5→T8: Response Sent       ~100 us
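The T2→T3 detection cost can be bounded with a simple UVM ping: the GPU performs a system-scope store and the CPU times how long the write takes to become visible. A minimal sketch under the same managed-memory assumptions as above; the measured interval also includes kernel-launch overhead, so it is an upper bound, and the repository's actual instrumentation points are not shown here:

#include <cuda_runtime.h>
#include <cuda/atomic>
#include <chrono>
#include <cstdio>
#include <new>

using sysflag = cuda::atomic<int, cuda::thread_scope_system>;

/* GPU side: a single release store, analogous to the T2 write to UVM. */
__global__ void ping(sysflag *f)
{
    f->store(1, cuda::memory_order_release);
}

int main(void)
{
    sysflag *f;
    cudaMallocManaged(&f, sizeof(*f));
    new (f) sysflag(0);                     /* construct the flag in UVM */

    auto t0 = std::chrono::steady_clock::now();
    ping<<<1, 1>>>(f);                      /* launch cost is included */
    while (f->load(cuda::memory_order_acquire) == 0) { /* T3: CPU spin */ }
    auto t1 = std::chrono::steady_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    printf("GPU write visible on CPU after ~%lld us (upper bound)\n", us);

    cudaDeviceSynchronize();
    cudaFree(f);
    return 0;
}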

License

BSD-3-Clause. See individual source files for details.

Acknowledgments

Built on NVIDIA DOCA SDK and GPUNetIO technology.
