
Focus 2: Performance Engineering #32

@rainliu

Description


Performance is not an afterthought—it's a core requirement for real-time communication. This focus area encompasses systematic benchmarking, profiling, and optimization across the entire stack.

Benchmarking Infrastructure

Before optimizing, we need to measure. A comprehensive benchmarking infrastructure is essential.

Planned benchmark suite:

```rust
// Using criterion for statistical rigor
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

fn bench_datachannel_throughput(c: &mut Criterion) {
    let mut group = c.benchmark_group("datachannel");

    // `dc` (a connected DataChannel) and `message` (a 64 KiB source buffer)
    // are assumed to be set up by the benchmark harness.
    for size in [64, 1024, 16384, 65536].iter() {
        group.throughput(Throughput::Bytes(*size as u64));
        group.bench_with_input(
            BenchmarkId::new("send", size),
            size,
            |b, &size| {
                b.iter(|| dc.send(&message[..size]));
            },
        );
    }
    group.finish();
}

fn bench_rtp_pipeline(c: &mut Criterion) {
    // `packet_bytes`, `packet`, `buffer`, and the SRTP `context` are likewise
    // assumed to be prepared in harness setup.
    c.bench_function("rtp_parse", |b| {
        b.iter(|| RtpPacket::unmarshal(&packet_bytes))
    });

    c.bench_function("rtp_marshal", |b| {
        b.iter(|| packet.marshal_to(&mut buffer))
    });

    c.bench_function("srtp_encrypt", |b| {
        b.iter(|| context.encrypt_rtp(&mut packet))
    });

    c.bench_function("srtp_decrypt", |b| {
        b.iter(|| context.decrypt_rtp(&mut packet))
    });
}

criterion_group!(benches, bench_datachannel_throughput, bench_rtp_pipeline);
criterion_main!(benches);
```

Benchmark categories:

| Category | Metrics | Tools |
|---|---|---|
| Throughput | Messages/sec, bytes/sec | criterion, custom |
| Latency | p50, p99, p999 | criterion, hdr_histogram |
| Memory | Allocations, peak usage | dhat, heaptrack |
| CPU | Cycles per operation | perf, flamegraph |
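
For the latency row, percentiles are the headline numbers. A std-only sketch of quantile extraction (hdr_histogram provides the same with bounded memory and error; this naive version sorts all samples):

```rust
/// Return the value at quantile `q` (0.0..=1.0) from recorded latencies.
/// Naive stand-in for a histogram: sorts the samples and indexes in.
fn value_at_quantile(samples: &mut [u64], q: f64) -> u64 {
    samples.sort_unstable();
    let idx = ((samples.len() as f64 - 1.0) * q).round() as usize;
    samples[idx]
}

fn main() {
    // Hypothetical nanosecond latencies from a benchmark run.
    let mut lat: Vec<u64> = (1..=1000).collect();
    println!(
        "p50={} p99={} p999={}",
        value_at_quantile(&mut lat, 0.50),
        value_at_quantile(&mut lat, 0.99),
        value_at_quantile(&mut lat, 0.999)
    );
}
```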

Profiling and Analysis

Profiling workflow:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Performance Analysis Workflow                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. Baseline         2. Profile           3. Analyze                       │
│   ┌─────────┐        ┌─────────┐          ┌─────────┐                       │
│   │ Run     │───────▶│ Collect │─────────▶│ Generate│                       │
│   │ Bench   │        │ Samples │          │ Reports │                       │
│   └─────────┘        └─────────┘          └─────────┘                       │
│        │                  │                    │                            │
│        ▼                  ▼                    ▼                            │
│   criterion          perf record          flamegraph                        │
│   results            + perf script        + hotspot analysis                │
│                                                                             │
│   4. Optimize         5. Validate          6. Document                      │
│   ┌─────────┐        ┌─────────┐          ┌─────────┐                       │
│   │ Apply   │───────▶│ Re-run  │─────────▶│ Record  │                       │
│   │ Changes │        │ Bench   │          │ Gains   │                       │
│   └─────────┘        └─────────┘          └─────────┘                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Profiling tools:

  • perf — Linux performance counters, CPU profiling
  • flamegraph — Visualize hot code paths
  • heaptrack — Memory allocation profiling
  • cargo-llvm-lines — Find generic code bloat by counting LLVM IR lines per monomorphized function
  • valgrind/cachegrind — Cache behavior analysis

DataChannel Optimization

WebRTC DataChannels are increasingly used for high-throughput applications. Optimization targets:

SCTP layer:

| Optimization | Description | Expected impact |
|---|---|---|
| Chunk batching | Combine small messages into fewer SCTP chunks | 20-40% less overhead |
| Zero-copy I/O | Avoid buffer copies in the send/receive path | Lower CPU usage |
| TSN tracking | Optimize sequence number management | Fewer memory allocations |
| Congestion control | Tune SCTP congestion parameters | More stable throughput |
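
The chunk-batching idea can be sketched as follows. This is a hypothetical illustration that packs user messages into MTU-sized buffers; the real SCTP chunk format adds per-chunk headers, which are omitted here:

```rust
/// Batch small messages into buffers of at most `mtu` bytes, so several
/// user messages can ride in one SCTP packet instead of one each.
fn batch_messages(messages: &[&[u8]], mtu: usize) -> Vec<Vec<u8>> {
    let mut batches: Vec<Vec<u8>> = Vec::new();
    let mut current: Vec<u8> = Vec::with_capacity(mtu);
    for msg in messages {
        // Start a new batch when the next message would overflow the MTU.
        if !current.is_empty() && current.len() + msg.len() > mtu {
            batches.push(std::mem::replace(&mut current, Vec::with_capacity(mtu)));
        }
        current.extend_from_slice(msg);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    let msgs: Vec<&[u8]> = vec![&[1; 400], &[2; 400], &[3; 400], &[4; 400]];
    // First three 400-byte messages fill one 1200-byte batch; the fourth
    // starts a second batch.
    let batches = batch_messages(&msgs, 1200);
    println!("{} batches", batches.len());
}
```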

Application layer:

  • Message framing optimization
  • Backpressure handling
  • Buffer pool for allocations
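
The buffer-pool item might look like this minimal sketch (names hypothetical, not the webrtc-rs API): buffers are taken from a free list and returned after use, so the steady-state hot path performs no allocations.

```rust
use std::collections::VecDeque;

/// A minimal buffer pool: reuse packet buffers instead of allocating per send.
struct BufferPool {
    free: VecDeque<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    fn new(capacity: usize, buf_size: usize) -> Self {
        let free = (0..capacity).map(|_| vec![0u8; buf_size]).collect();
        Self { free, buf_size }
    }

    /// Take a buffer from the pool, or allocate one if the pool is empty.
    fn get(&mut self) -> Vec<u8> {
        self.free
            .pop_front()
            .unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    /// Return a buffer to the pool for reuse, reset to full size.
    fn put(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        buf.resize(self.buf_size, 0);
        self.free.push_back(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new(4, 1500);
    let buf = pool.get();
    assert_eq!(buf.len(), 1500);
    pool.put(buf);
    println!("pool holds {} free buffers", pool.free.len());
}
```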

Performance targets:

| Metric | Baseline | Target | Notes |
|---|---|---|---|
| Throughput (reliable, ordered) | TBD | > 500 Mbps | Single channel |
| Throughput (unreliable) | TBD | > 1 Gbps | Best-effort |
| Latency (1 KB message) | TBD | < 1 ms | p99 |
| Messages/second | TBD | > 100K | Small messages |
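
A minimal way to measure the messages/second target, with a hypothetical helper; the `send` closure stands in for a real DataChannel send path:

```rust
use std::time::Instant;

/// Drive `send` with `msg` for `iters` iterations and report messages/sec.
fn messages_per_second(mut send: impl FnMut(&[u8]), msg: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        send(msg);
    }
    iters as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    // Stand-in sink instead of a connected DataChannel.
    let mut bytes_sent = 0u64;
    let rate = messages_per_second(|m| bytes_sent += m.len() as u64, &[0u8; 64], 100_000);
    println!("{:.0} msgs/sec ({} bytes total)", rate, bytes_sent);
}
```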

RTP/RTCP Pipeline Optimization

Media transport is latency-sensitive and high-volume.

Packet processing:

Incoming RTP Packet
        │
        ▼
┌───────────────┐
│ UDP Receive   │ ← Goal: zero-copy receive
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ SRTP Decrypt  │ ← Goal: hardware AES-NI
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ RTP Parse     │ ← Goal: minimal validation
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Interceptors  │ ← Goal: inline, no allocations
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Jitter Buffer │ ← Goal: lock-free, pre-allocated
└───────┬───────┘
        │
        ▼
    Application

Specific optimizations:

  • SIMD parsing — Use SIMD instructions for header parsing where beneficial
  • AES-NI — Ensure hardware acceleration for SRTP
  • Inline interceptors — Compile-time interceptor composition (already implemented via generics)
  • Pre-allocated buffers — Avoid per-packet allocations
  • Branch prediction — Optimize common code paths
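
The "minimal validation" goal for the RTP parse stage can be illustrated with a sketch over the fixed 12-byte header from RFC 3550. This is not the rtp crate's API; extension headers and CSRC lists are deliberately omitted:

```rust
/// The RTP fixed header fields (RFC 3550, section 5.1).
struct RtpHeader {
    version: u8,
    payload_type: u8,
    seq: u16,
    timestamp: u32,
    ssrc: u32,
}

/// Parse the 12-byte fixed header with minimal validation: a length check
/// and a version check are the only branches on the hot path.
fn parse_rtp(buf: &[u8]) -> Option<RtpHeader> {
    if buf.len() < 12 {
        return None;
    }
    let version = buf[0] >> 6;
    if version != 2 {
        return None;
    }
    Some(RtpHeader {
        version,
        payload_type: buf[1] & 0x7f,
        seq: u16::from_be_bytes([buf[2], buf[3]]),
        timestamp: u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]),
        ssrc: u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]),
    })
}

fn main() {
    let mut pkt = [0u8; 12];
    pkt[0] = 0x80; // version 2
    pkt[1] = 96;   // dynamic payload type
    pkt[2..4].copy_from_slice(&1000u16.to_be_bytes());
    let h = parse_rtp(&pkt).expect("valid header");
    println!("v={} pt={} seq={}", h.version, h.payload_type, h.seq);
}
```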

ICE Performance

Connection establishment time directly impacts user experience.

Optimization areas:

| Phase | Current | Target | Approach |
|---|---|---|---|
| Candidate gathering | TBD | < 100 ms | Parallel STUN queries |
| Connectivity checks | TBD | < 500 ms | Prioritized pair testing |
| DTLS handshake | TBD | < 200 ms | Session resumption |
| Total time-to-media | TBD | < 1 s | Combined optimizations |

Techniques:

  • Aggressive candidate nomination
  • Parallel connectivity checks
  • STUN response caching
  • Optimized candidate pair sorting
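
Candidate pair sorting follows the pair priority formula from RFC 8445, section 6.1.2.3. A small sketch (the candidate tuples here are made up for illustration):

```rust
/// Pair priority per RFC 8445 §6.1.2.3:
///   priority = 2^32 * min(G, D) + 2 * max(G, D) + (G > D ? 1 : 0)
/// where G and D are the controlling and controlled candidate priorities.
fn pair_priority(g: u32, d: u32) -> u64 {
    let (min, max) = (g.min(d) as u64, g.max(d) as u64);
    (1u64 << 32) * min + 2 * max + if g > d { 1 } else { 0 }
}

fn main() {
    // Hypothetical (controlling, controlled) candidate priorities.
    let mut pairs = vec![(100u32, 200u32), (300, 50), (250, 250)];
    // Check pairs in descending priority order.
    pairs.sort_by_key(|&(g, d)| std::cmp::Reverse(pair_priority(g, d)));
    println!("best pair: {:?}", pairs[0]);
}
```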

Memory Optimization

Real-time systems benefit from predictable memory behavior.

Goals:

  • Minimize allocations in hot paths
  • Use buffer pools for packet buffers
  • Pre-allocate data structures where possible
  • Reduce memory fragmentation

Tracking:

```rust
// Example: Using dhat for allocation profiling
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

#[test]
fn test_allocations_in_hot_path() {
    let _profiler = dhat::Profiler::new_heap();

    // Run hot path code (`packet` and `process_rtp_packet` are assumed
    // to come from the surrounding test setup)
    for _ in 0..10000 {
        process_rtp_packet(&packet);
    }

    // Analyze allocation count and sizes
    let stats = dhat::HeapStats::get();
    assert_eq!(stats.total_blocks, 0, "hot path must not allocate");
}
```

Continuous Performance Monitoring

CI integration:

  • Run benchmarks on every PR
  • Track performance regressions
  • Publish benchmark results
  • Alert on significant regressions
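
Regression alerting can be as simple as comparing each benchmark's mean against a stored baseline; a hypothetical sketch (names and numbers invented):

```rust
/// Flag a regression when the current mean is more than `threshold`
/// (e.g. 0.05 = 5%) slower than the stored baseline.
fn is_regression(baseline_ns: f64, current_ns: f64, threshold: f64) -> bool {
    current_ns > baseline_ns * (1.0 + threshold)
}

fn main() {
    let baseline = 1200.0; // ns/op from the last main-branch run
    for (name, current) in [("rtp_parse", 1210.0), ("srtp_encrypt", 1400.0)] {
        if is_regression(baseline, current, 0.05) {
            println!("REGRESSION: {name} {current:.0} ns/op (baseline {baseline:.0})");
        }
    }
}
```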

Planned dashboard metrics:

  • Throughput trends over time
  • Latency percentiles
  • Memory usage patterns
  • CPU efficiency

Labels: p1 high priority