Skip to content

Latest commit

 

History

History
587 lines (454 loc) · 16.2 KB

File metadata and controls

587 lines (454 loc) · 16.2 KB

RootStream Technical Architecture

This document explains subsystem structure and intended technical boundaries. It is not the source of truth for support status, roadmap scope, or current execution progress.

Use these documents for neighboring questions:

Design Philosophy

RootStream is built on one core principle: Use the kernel APIs directly. Every abstraction layer adds latency, complexity, and failure points. We bypass them all.

Why Direct Kernel Access?

The Traditional Stack (Broken)

Application
    ↓
Compositor (X11/Wayland)
    ↓
PipeWire
    ↓
xdg-desktop-portal
    ↓
Permission Dialog ← USER MUST CLICK EVERY TIME
    ↓
GStreamer/FFmpeg
    ↓
Encoder

Problems:

  • Each layer adds latency (estimated 2-10ms per layer)
  • Any layer can break
  • Wayland security model may require permissions
  • Compositor crashes can affect dependent layers

The RootStream Stack

/dev/dri/card0 (DRM)
    ↓
mmap() framebuffer
    ↓
VA-API encoder
    ↓
UDP socket

Benefits:

  • 3 layers instead of 7+
  • Uses kernel APIs (stable for 10+ years)
  • Reduced permission requirements (video group membership)
  • Reduced compositor dependencies
  • Latency targets exist, but benchmark-backed proof belongs in dedicated benchmark and performance docs rather than this architecture summary

Component Details

1. DRM/KMS Capture (drm_capture.c)

What is DRM? Direct Rendering Manager - the kernel subsystem that manages GPU access.

What is KMS? Kernel Mode Setting - the part of DRM that controls displays.

How We Use It:

// 1. Open DRM device
int fd = open("/dev/dri/card0", O_RDWR);

// 2. Query available displays
struct drm_mode_card_res resources;
ioctl(fd, DRM_IOCTL_MODE_GETRESOURCES, &resources);

// 3. Get framebuffer info
struct drm_mode_fb_cmd fb_info;
fb_info.fb_id = <active_framebuffer_id>;
ioctl(fd, DRM_IOCTL_MODE_GETFB, &fb_info);

// 4. Map framebuffer to memory
struct drm_mode_map_dumb map_request;
map_request.handle = fb_info.handle;
ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map_request);

// 5. mmap and read pixels
void *pixels = mmap(0, size, PROT_READ, MAP_SHARED, fd, map_request.offset);
memcpy(output, pixels, size);
munmap(pixels, size);

Why This Works:

  • Compositor writes to framebuffer
  • We read from same framebuffer
  • Kernel handles synchronization
  • No compositor involvement needed

Limitations:

  • Captures entire framebuffer (all windows)
  • Can't capture individual windows (requires compositor integration)
  • Ideal for fullscreen applications
  • For desktop streaming, captures all visible content

Performance (example measurements):

  • Capture time: ~1-2ms (direct memory copy)
  • No GPU→CPU transfer overhead (framebuffer in system RAM)
  • Zero-copy optimizations possible with proper configuration

2. VA-API Encoding (vaapi_encoder.c)

What is VA-API? Video Acceleration API - hardware video encoding/decoding interface.

Current acceleration codepaths in tree:

  • Intel: All modern integrated + discrete GPUs
  • AMD: AMDGPU driver (GCN 1.0+)
  • NVIDIA: NVIDIA-oriented code exists in-tree, but the current public support story is still being reconciled

How It Works:

// 1. Initialize VA-API
VADisplay display = vaGetDisplayDRM(drm_fd);
vaInitialize(display, &major, &minor);

// 2. Create encoding configuration
VAConfigAttrib attrib;
attrib.type = VAConfigAttribRateControl;
attrib.value = VA_RC_CBR;  // Constant bitrate
vaCreateConfig(display, VAProfileH264High, VAEntrypointEncSlice, 
               &attrib, 1, &config_id);

// 3. Create surfaces (render targets)
vaCreateSurfaces(display, VA_RT_FORMAT_YUV420, width, height,
                 surfaces, num_surfaces, NULL, 0);

// 4. Create encoding context
vaCreateContext(display, config_id, width, height, VA_PROGRESSIVE,
                surfaces, num_surfaces, &context_id);

// 5. Upload frame and encode
vaBeginPicture(display, context_id, surface);
// ... set encoding parameters ...
vaEndPicture(display, context_id);
vaSyncSurface(display, surface);

// 6. Get encoded data
vaMapBuffer(display, coded_buffer_id, &output_data);

Encoding Pipeline:

  1. RGB framebuffer → NV12 colorspace conversion
  2. Upload to VA surface
  3. Hardware H.264 encoding
  4. Download encoded bitstream

Current Limitations (TODO):

  • Colorspace conversion is simplified
  • Need proper RGB→NV12 with SSE/AVX
  • H.264 encoding parameters need tuning
  • Missing: SPS/PPS parameter generation
  • Missing: Rate control optimization

Performance (example measurements on specific hardware):

  • Intel UHD 730: ~8-12ms encode time (1080p60)
  • AMD RX 6600: ~6-10ms encode time (1080p60)
  • CPU usage: <5% (hardware encoder offload)

Actual encode time varies by GPU model, driver version, and encode parameters.

3. Network Protocol (network.c)

Protocol Design:

┌─────────────────────────────────────┐
│ Packet Header (16 bytes)            │
├─────────────────────────────────────┤
│ Magic: 0x524F4F54 ("ROOT")         │  4 bytes
│ Version: 1                          │  1 byte
│ Type: VIDEO/AUDIO/INPUT/CONTROL     │  1 byte
│ Sequence: packet counter            │  2 bytes
│ Timestamp: unix time (ms)           │  4 bytes
│ Payload Size: data length           │  4 bytes
│ Checksum: simple validation         │  2 bytes
├─────────────────────────────────────┤
│ Payload (variable, max 1384 bytes) │
└─────────────────────────────────────┘

Why UDP?

  • TCP can add significant latency due to retransmission on packet loss
  • For real-time streaming, dropped frames are preferable to delayed frames
  • UDP provides fine-grained control over packet handling
  • Drawback: No built-in reliability; application must handle packet loss

MTU Consideration:

  • Ethernet MTU: 1500 bytes
  • IP header: 20 bytes
  • UDP header: 8 bytes
  • Our header: 16 bytes
  • Available payload: 1456 bytes
  • We use 1384 to leave room for overhead

Packet Types:

  • PACKET_VIDEO (0x01): Encoded video frame
  • PACKET_AUDIO (0x02): Encoded audio (not implemented)
  • PACKET_INPUT (0x03): Keyboard/mouse events
  • PACKET_CONTROL (0x04): Connection control

4. Client-Side Loop (Decode + Present)

The client’s job is to receive, decrypt, decode, and present frames with minimal buffering.

UDP recv → decrypt → decode (VA-API/NVDEC) → present (SDL2 or DRM/KMS)
           ↑                                 ↓
           input events ← capture (uinput) ← send to host

Design Notes:

  • No deep buffers: we drop late frames rather than queue them.
  • Clock-aware: future work will use timestamps to align video/audio.
  • Input-first: input events are sent immediately and should bypass queues.

Planned Client Components:

  • Decode backend: VA-API decode for Intel/AMD, NVDEC for NVIDIA.
  • Presentation: SDL2 for convenience, DRM/KMS for lowest latency.
  • Input: uinput injection on host, raw input capture on client.

5. Latency Instrumentation

RootStream instruments the host loop to make regression detection easy.

Host stages:

  • Capture → Encode → Send
  • Samples are recorded per frame in a fixed ring buffer.
  • Periodic summaries print p50/p95/p99 in microseconds.

Client stages (planned):

  • Receive → Decode → Present

This mirrors the roadmap requirement for deterministic latency reporting across both sides of the stream.

Error Handling:

  • Checksum validates payload integrity
  • Sequence number detects packet loss
  • No retransmission (drop and continue)
  • Future: Simple FEC (forward error correction)

Future Improvements:

  • Adaptive bitrate based on packet loss
  • Congestion control (LEDBAT-style)
  • Multicast for multiple clients
  • QUIC for better reliability without latency

4. Input Injection (input.c)

What is uinput? Kernel module for creating virtual input devices. Used by:

  • Emulators (mapping controllers)
  • Remote desktop tools
  • Accessibility software

How We Use It:

// 1. Open uinput device
int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);

// 2. Configure device capabilities
ioctl(fd, UI_SET_EVBIT, EV_KEY);     // Keyboard events
ioctl(fd, UI_SET_EVBIT, EV_REL);     // Relative mouse
ioctl(fd, UI_SET_KEYBIT, BTN_LEFT);  // Mouse buttons
ioctl(fd, UI_SET_RELBIT, REL_X);     // Mouse X axis

// 3. Setup device metadata
struct uinput_setup setup;
strcpy(setup.name, "RootStream Virtual Mouse");
setup.id.bustype = BUS_USB;
setup.id.vendor = 0x1234;
setup.id.product = 0x5678;
ioctl(fd, UI_DEV_SETUP, &setup);

// 4. Create the device
ioctl(fd, UI_DEV_CREATE);
// Device now appears in /dev/input/eventX

// 5. Emit events
struct input_event ev;
ev.type = EV_KEY;
ev.code = KEY_A;
ev.value = 1;  // Press
write(fd, &ev, sizeof(ev));

// Sync event (flush)
ev.type = EV_SYN;
ev.code = SYN_REPORT;
write(fd, &ev, sizeof(ev));

Why This Works:

  • Applications see it as a real device
  • Works on X11, Wayland, console (!)
  • No need for X11/Wayland-specific APIs
  • Games can't detect it's virtual

Input Event Flow:

  1. Client captures keyboard/mouse
  2. Encode into input_event_pkt_t
  3. Send via UDP
  4. Host receives packet
  5. Write to uinput device
  6. Kernel forwards to active application

Security Note:

  • uinput requires /dev/uinput access
  • Usually requires root or input group
  • We use video group (user already needs it for DRM)
  • Could be abused for keylogging - don't run untrusted code

Performance Analysis

Note: These are example measurements from specific test configurations. Actual performance varies by hardware, drivers, network conditions, and system load.

Example Latency Breakdown (1080p60, LAN)

Capture:

  • DRM query: ~0.1ms
  • mmap: ~0.2ms
  • memcpy: 1-2ms
  • Total: ~2ms

Encoding (VA-API, Intel UHD 730):

  • Color conversion: 2-3ms
  • VA-API upload: ~1ms
  • Hardware encode: 8-12ms
  • Download: ~1ms
  • Total: ~12-17ms

Network (LAN, gigabit ethernet):

  • Packetization: ~0.1ms
  • UDP send: ~0.1ms
  • Network transit: 1-5ms (varies by network)
  • Receive: ~0.1ms
  • Total: ~1-5ms

Decoding (client, estimated, VA-API):

  • VA-API decode: 5-8ms
  • Display: 1-2ms
  • Total: ~6-10ms

Input (reverse path, estimated):

  • Capture: ~0.1ms
  • Network: 1-5ms
  • uinput: ~0.1ms
  • Total: ~1-5ms

Total End-to-End Latency (estimated):

  • Best case: ~20ms (optimal conditions, local network)
  • Typical: 25-30ms (home network, typical conditions)
  • Worst case: 40ms+ (network congestion, Wi-Fi interference)

These measurements are from Intel i5-11400 + Intel UHD 730 on gigabit LAN. Your results will vary based on hardware, network, and configuration.

Example CPU Usage

At 1080p60 (Intel i5-11400 with VA-API):

  • Capture: 1-2%
  • Color conversion: 2-3%
  • Encoding overhead: <1%
  • Network: <1%
  • Total: ~5-8%

Hardware encoder does most work; software encoder (x264) would use 40-60% CPU.

Example Memory Usage

  • Frame buffers: 8MB (4 surfaces × 2MB)
  • Encoding buffers: 2MB
  • Network buffers: 4MB
  • Code + stack: <1MB
  • Total: ~15MB

Compare to Steam (500MB+), Sunshine (200MB+).

Network Bandwidth

At different qualities:

Resolution FPS Bitrate Data/min
1080p 60 10 Mbps 75 MB
1080p 30 5 Mbps 37 MB
1440p 60 15 Mbps 112 MB
4K 30 20 Mbps 150 MB

Note: H.265 would halve these at same quality.

Security Considerations

What We Access

  • /dev/dri/card* - GPU framebuffer (read)
  • /dev/dri/renderD* - Hardware encoder (read/write)
  • /dev/uinput - Virtual input creation (write)
  • Network sockets - UDP (send/receive)

Permissions Required

  • User must be in video group (for DRM access)
  • uinput module must be loaded
  • No root required after setup

Attack Surface

Theoretical Attacks:

  1. Screen capture - We can read framebuffer
    • Mitigation: User explicitly runs this
  2. Input injection - We can inject keystrokes
    • Mitigation: Only when user connects
  3. Network - Unencrypted UDP
    • Mitigation: Use on trusted networks
    • TODO: Add TLS/DTLS

Not Vulnerable To:

  • Buffer overflows (we validate sizes)
  • Code injection (no dynamic code)
  • Privilege escalation (no setuid)

TODO Security Improvements:

  • Encryption (TLS/DTLS for network)
  • Authentication (client must prove identity)
  • Access control (whitelist allowed clients)

Comparison to Alternatives

vs Steam Remote Play

What Steam Does:

  • Uses PipeWire on Wayland
  • Uses Desktop Duplication API on Windows
  • NVFBC on NVIDIA (but disabled for consumers)
  • Software encoding fallback

Why We're Different:

  • Direct DRM (no PipeWire)
  • Linux-only (optimized for one platform)
  • Works on consumer hardware
  • Simpler codebase

vs Sunshine/Moonlight

What Sunshine Does:

  • Multiple capture backends (KMS, X11, Wayland)
  • NVIDIA GameStream protocol
  • More features (HDR, multi-monitor)

Why We're Different:

  • Single, optimized path (DRM only)
  • Custom protocol (not GameStream)
  • Minimal dependencies
  • Proof of concept vs production software

vs VNC/RDP

Traditional Remote Desktop:

  • Designed for productivity, not gaming
  • High latency (50-100ms)
  • Low framerate (10-30 FPS)
  • No hardware encoding

We're Optimized For Games:

  • Low latency (20-30ms)
  • High framerate (60+ FPS)
  • Hardware encoding
  • Minimal compression artifacts

Future Work

Short Term (v0.2)

  1. Client Implementation

    • VA-API decoder
    • SDL2 or DRM display
    • Input capture
  2. Color Conversion

    • Proper RGB→NV12
    • SIMD optimization (SSE4/AVX2)
  3. H.264 Parameter Tuning

    • Proper SPS/PPS
    • Rate control optimization

Medium Term (v0.3)

  1. NVENC Support

    • Direct NVENC API (not VA-API wrapper)
    • Better quality than VA-API wrapper
  2. Audio Streaming

    • ALSA direct capture
    • Opus encoding
    • Synchronized with video
  3. Multi-Monitor

    • Capture specific display
    • Multi-display client

Long Term (v1.0)

  1. H.265/HEVC

    • Better compression
    • Lower bandwidth
    • Requires newer hardware
  2. Adaptive Bitrate

    • Monitor packet loss
    • Adjust quality dynamically
    • Maintain smooth framerate
  3. Multi-Client

    • Support multiple viewers
    • Each gets own stream
  4. Security

    • TLS encryption
    • Client authentication
    • Certificate pinning

Contributing

Code Style

  • C99 standard
  • 4 spaces (no tabs)
  • Max 100 chars per line
  • Comments explain "why", not "what"

Testing

Current testing is manual. Need:

  • Unit tests (capture, encode, network)
  • Integration tests (full pipeline)
  • Performance benchmarks
  • Stress tests

Performance Goals

  • Latency: <25ms average
  • CPU: <10% on modern hardware
  • Memory: <20MB
  • FPS: Match display refresh rate

References

Documentation

Similar Projects

Learning Resources


Questions? Found a bug? Want to contribute?

Open an issue or submit a PR!