This document explains subsystem structure and intended technical boundaries. It is not the source of truth for support status, roadmap scope, or current execution progress.
Use these documents for neighboring questions:
- Support status: docs/SUPPORT_MATRIX.md
- Supported product scope: docs/PRODUCT_CORE.md
- Current execution work: docs/microtasks.md
- Claims evidence: docs/audits/claims_audit.md
- Architectural boundary rules: docs/architecture/BOUNDARY_RULES.md
- Observability and logging: docs/OBSERVABILITY.md
- Performance baselines: docs/PERFORMANCE.md
- Terminology: docs/GLOSSARY.md
RootStream is built on one core principle: Use the kernel APIs directly. Every abstraction layer adds latency, complexity, and failure points. We bypass them all.
Application
↓
Compositor (X11/Wayland)
↓
PipeWire
↓
xdg-desktop-portal
↓
Permission Dialog ← USER MUST CLICK EVERY TIME
↓
GStreamer/FFmpeg
↓
Encoder
Problems:
- Each layer adds latency (estimated 2-10ms per layer)
- Any layer can break
- Wayland security model may require permissions
- Compositor crashes can affect dependent layers
/dev/dri/card0 (DRM)
↓
mmap() framebuffer
↓
VA-API encoder
↓
UDP socket
Benefits:
- 3 layers instead of 7+
- Uses kernel APIs (stable for 10+ years)
- Reduced permission requirements (video group membership)
- Reduced compositor dependencies
- Latency targets exist, but benchmark-backed proof belongs in dedicated benchmark and performance docs rather than this architecture summary
What is DRM? Direct Rendering Manager - the kernel subsystem that manages GPU access.
What is KMS? Kernel Mode Setting - the part of DRM that controls displays.
How We Use It:
// 1. Open DRM device
int fd = open("/dev/dri/card0", O_RDWR);
// 2. Query available displays (zero the struct so the count fields start at 0)
struct drm_mode_card_res resources = {0};
ioctl(fd, DRM_IOCTL_MODE_GETRESOURCES, &resources);
// 3. Get framebuffer info
struct drm_mode_fb_cmd fb_info = {0};
fb_info.fb_id = active_framebuffer_id; // placeholder: ID of the framebuffer on the active CRTC
ioctl(fd, DRM_IOCTL_MODE_GETFB, &fb_info);
// 4. Map framebuffer to memory
struct drm_mode_map_dumb map_request = {0};
map_request.handle = fb_info.handle;
ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map_request);
// 5. mmap and read pixels (rows are pitch bytes apart)
size_t size = (size_t)fb_info.pitch * fb_info.height;
void *pixels = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, map_request.offset);
memcpy(output, pixels, size);
munmap(pixels, size);
Why This Works:
- Compositor writes to framebuffer
- We read from same framebuffer
- Kernel handles synchronization
- No compositor involvement needed
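The copy in step 5 treats the framebuffer as one contiguous block. Drivers often pad each row for alignment, so a consumer that wants tightly packed pixels has to copy row by row. A minimal sketch, assuming a single-plane framebuffer whose row stride is the `pitch` field of `struct drm_mode_fb_cmd`; `copy_framebuffer_rows` is a hypothetical helper, not a function from the RootStream tree:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a mapped framebuffer into a tightly packed output buffer.
 * `pitch` is the byte distance between rows in the source and may be larger
 * than width * bytes_per_pixel because of row padding.
 * Returns the number of bytes written to `dst`. */
static size_t copy_framebuffer_rows(uint8_t *dst, const uint8_t *src,
                                    uint32_t width, uint32_t height,
                                    uint32_t pitch, uint32_t bytes_per_pixel)
{
    size_t row_bytes = (size_t)width * bytes_per_pixel;

    assert(dst != NULL && src != NULL);
    assert(pitch >= row_bytes);

    if (pitch == row_bytes) {
        /* No padding: one bulk copy is enough. */
        memcpy(dst, src, row_bytes * height);
    } else {
        /* Padded rows: copy only the visible part of each row. */
        for (uint32_t y = 0; y < height; y++)
            memcpy(dst + (size_t)y * row_bytes, src + (size_t)y * pitch, row_bytes);
    }
    return row_bytes * height;
}
```

The length passed to `mmap()` would then be `pitch * height`, while the packed output needs only `width * bytes_per_pixel * height` bytes.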
Limitations:
- Captures entire framebuffer (all windows)
- Can't capture individual windows (requires compositor integration)
- Ideal for fullscreen applications
- For desktop streaming, captures all visible content
Performance (example measurements):
- Capture time: ~1-2ms (direct memory copy)
- No GPU→CPU transfer overhead when the framebuffer lives in system RAM (e.g. integrated GPUs)
- Zero-copy optimizations possible with proper configuration
What is VA-API? Video Acceleration API - hardware video encoding/decoding interface.
Current acceleration codepaths in tree:
- Intel: All modern integrated + discrete GPUs
- AMD: AMDGPU driver (GCN 1.0+)
- NVIDIA: NVIDIA-oriented code exists in-tree, but the current public support story is still being reconciled
How It Works:
// 1. Initialize VA-API
VADisplay display = vaGetDisplayDRM(drm_fd);
vaInitialize(display, &major, &minor);
// 2. Create encoding configuration
VAConfigAttrib attrib;
attrib.type = VAConfigAttribRateControl;
attrib.value = VA_RC_CBR; // Constant bitrate
vaCreateConfig(display, VAProfileH264High, VAEntrypointEncSlice,
&attrib, 1, &config_id);
// 3. Create surfaces (render targets)
vaCreateSurfaces(display, VA_RT_FORMAT_YUV420, width, height,
surfaces, num_surfaces, NULL, 0);
// 4. Create encoding context
vaCreateContext(display, config_id, width, height, VA_PROGRESSIVE,
surfaces, num_surfaces, &context_id);
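// 5. Upload frame and encode
vaBeginPicture(display, context_id, surface);
// ... create sequence/picture/slice parameter buffers ...
vaRenderPicture(display, context_id, param_buffers, num_param_buffers); // submit them to the encoder
vaEndPicture(display, context_id);
vaSyncSurface(display, surface);
// 6. Get encoded data
vaMapBuffer(display, coded_buffer_id, &output_data);
// ... copy the bitstream out ...
vaUnmapBuffer(display, coded_buffer_id);
Encoding Pipeline: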
- RGB framebuffer → NV12 colorspace conversion
- Upload to VA surface
- Hardware H.264 encoding
- Download encoded bitstream
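The first pipeline step, RGB → NV12, can be sketched as a scalar reference conversion. The source does not state the framebuffer pixel format or the colour matrix, so this sketch assumes a 32-bit framebuffer with byte order B, G, R, X (XRGB8888 on little-endian) and BT.601 limited-range coefficients; `bgrx_to_nv12` is a hypothetical name. It is meant as a correctness baseline, not the SIMD version listed under the limitations:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a packed 32-bit framebuffer (bytes B, G, R, X) to NV12.
 * width and height must be even. y_plane holds width * height bytes;
 * uv_plane holds width * height / 2 bytes of interleaved U, V pairs. */
static void bgrx_to_nv12(const uint8_t *src, size_t src_pitch,
                         uint8_t *y_plane, uint8_t *uv_plane,
                         int width, int height)
{
    assert(src != NULL && y_plane != NULL && uv_plane != NULL);
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);

    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            int sum_r = 0, sum_g = 0, sum_b = 0;

            /* One luma sample per pixel of the 2x2 block. */
            for (int dy = 0; dy < 2; dy++) {
                for (int dx = 0; dx < 2; dx++) {
                    const uint8_t *p = src + (size_t)(y + dy) * src_pitch
                                           + (size_t)(x + dx) * 4;
                    int b = p[0], g = p[1], r = p[2];
                    y_plane[(size_t)(y + dy) * width + (x + dx)] =
                        (uint8_t)(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
                    sum_r += r;
                    sum_g += g;
                    sum_b += b;
                }
            }

            /* One chroma pair per block, computed from the averaged colour.
             * The +128 offset is added before the shift so the shifted
             * value is never negative. */
            int r = (sum_r + 2) / 4, g = (sum_g + 2) / 4, b = (sum_b + 2) / 4;
            uint8_t *uv = uv_plane + (size_t)(y / 2) * width + x;
            uv[0] = (uint8_t)((-38 * r - 74 * g + 112 * b + 128 + (128 << 8)) >> 8);
            uv[1] = (uint8_t)((112 * r - 94 * g - 18 * b + 128 + (128 << 8)) >> 8);
        }
    }
}
```

With these coefficients white maps to Y=235, U=V=128 and pure red to Y=82, U=90, V=240, which is a quick sanity check for any optimized replacement.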
Current Limitations (TODO):
- Colorspace conversion is simplified
- Need proper RGB→NV12 with SSE/AVX
- H.264 encoding parameters need tuning
- Missing: SPS/PPS parameter generation
- Missing: Rate control optimization
Performance (example measurements on specific hardware):
- Intel UHD 730: ~8-12ms encode time (1080p60)
- AMD RX 6600: ~6-10ms encode time (1080p60)
- CPU usage: <5% (hardware encoder offload)
Actual encode time varies by GPU model, driver version, and encode parameters.
Protocol Design:
┌─────────────────────────────────────┐
│ Packet Header (16 bytes) │
├─────────────────────────────────────┤
│ Magic: 0x524F4F54 ("ROOT") │ 4 bytes
│ Version: 1 │ 1 byte
│ Type: VIDEO/AUDIO/INPUT/CONTROL │ 1 byte
│ Sequence: packet counter │ 2 bytes
│ Timestamp: unix time (ms) │ 4 bytes
│ Payload Size: data length │ 4 bytes
│ Checksum: simple validation │ 2 bytes
├─────────────────────────────────────┤
│ Payload (variable, max 1384 bytes) │
└─────────────────────────────────────┘
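To make the layout concrete, here is a sketch that writes and parses the header field by field. The type and function names (`rs_header_t`, `rs_header_write`, `rs_header_read`) and the big-endian byte order are assumptions for illustration; the source does not specify them. Note that the per-field widths in the diagram add up to 18 bytes, not the 16 stated in its title, so the in-tree layout presumably differs in at least one width; this sketch follows the per-field widths as listed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RS_MAGIC 0x524F4F54u      /* "ROOT" */
#define RS_HEADER_WIRE_SIZE 18    /* 4 + 1 + 1 + 2 + 4 + 4 + 2, per the diagram */

/* Hypothetical in-memory view of the header. */
typedef struct {
    uint32_t magic;
    uint8_t  version;
    uint8_t  type;
    uint16_t sequence;
    uint32_t timestamp_ms;
    uint32_t payload_size;
    uint16_t checksum;
} rs_header_t;

static void put_u16(uint8_t *out, uint16_t v)
{
    out[0] = (uint8_t)(v >> 8);
    out[1] = (uint8_t)(v & 0xFF);
}

static void put_u32(uint8_t *out, uint32_t v)
{
    put_u16(out, (uint16_t)(v >> 16));
    put_u16(out + 2, (uint16_t)(v & 0xFFFF));
}

static uint16_t get_u16(const uint8_t *in)
{
    return (uint16_t)((in[0] << 8) | in[1]);
}

static uint32_t get_u32(const uint8_t *in)
{
    return ((uint32_t)get_u16(in) << 16) | get_u16(in + 2);
}

/* Serialize explicitly instead of memcpy'ing the struct, so compiler
 * padding and host endianness cannot leak onto the wire. */
static size_t rs_header_write(const rs_header_t *h, uint8_t *out)
{
    assert(h != NULL && out != NULL);
    put_u32(out, h->magic);
    out[4] = h->version;
    out[5] = h->type;
    put_u16(out + 6, h->sequence);
    put_u32(out + 8, h->timestamp_ms);
    put_u32(out + 12, h->payload_size);
    put_u16(out + 16, h->checksum);
    return RS_HEADER_WIRE_SIZE;
}

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong. */
static int rs_header_read(const uint8_t *in, size_t len, rs_header_t *h)
{
    assert(in != NULL && h != NULL);
    if (len < RS_HEADER_WIRE_SIZE || get_u32(in) != RS_MAGIC)
        return -1;
    h->magic = RS_MAGIC;
    h->version = in[4];
    h->type = in[5];
    h->sequence = get_u16(in + 6);
    h->timestamp_ms = get_u32(in + 8);
    h->payload_size = get_u32(in + 12);
    h->checksum = get_u16(in + 16);
    return 0;
}
```

Rejecting packets on a magic mismatch before touching any other field keeps stray UDP traffic out of the decode path.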
Why UDP?
- TCP can add significant latency due to retransmission on packet loss
- For real-time streaming, dropped frames are preferable to delayed frames
- UDP provides fine-grained control over packet handling
- Drawback: No built-in reliability; application must handle packet loss
MTU Consideration:
- Ethernet MTU: 1500 bytes
- IP header: 20 bytes
- UDP header: 8 bytes
- Our header: 16 bytes
- Available payload: 1456 bytes
- We use 1384 to leave room for overhead
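The budget above can be written down as constants, together with the chunking it implies. The source does not describe how an encoded frame larger than one payload is split, so the scheme below (fixed-size chunks, last one short) and the names `rs_packet_count` and `rs_chunk_bounds` are assumptions:

```c
#include <assert.h>
#include <stddef.h>

#define RS_ETHERNET_MTU  1500
#define RS_IP_HEADER     20
#define RS_UDP_HEADER    8
#define RS_PACKET_HEADER 16
#define RS_MAX_PAYLOAD   1384

/* Bytes left for payload in one Ethernet frame: 1500 - 20 - 8 - 16 = 1456. */
#define RS_PAYLOAD_BUDGET \
    (RS_ETHERNET_MTU - RS_IP_HEADER - RS_UDP_HEADER - RS_PACKET_HEADER)

/* Number of packets needed to carry frame_size bytes (ceiling division). */
static size_t rs_packet_count(size_t frame_size)
{
    return (frame_size + RS_MAX_PAYLOAD - 1) / RS_MAX_PAYLOAD;
}

/* Offset and length of chunk `index` within the frame.
 * Returns 0 on success, -1 if index is past the last chunk. */
static int rs_chunk_bounds(size_t frame_size, size_t index,
                           size_t *offset, size_t *length)
{
    size_t start, remaining;

    assert(offset != NULL && length != NULL);
    if (index >= rs_packet_count(frame_size))
        return -1;

    start = index * RS_MAX_PAYLOAD;
    remaining = frame_size - start;
    *offset = start;
    *length = remaining < RS_MAX_PAYLOAD ? remaining : RS_MAX_PAYLOAD;
    return 0;
}
```

For example, a 20,000-byte encoded frame needs 15 packets: 14 full payloads of 1384 bytes and a final one of 624 bytes.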
Packet Types:
- PACKET_VIDEO (0x01): Encoded video frame
- PACKET_AUDIO (0x02): Encoded audio (not implemented)
- PACKET_INPUT (0x03): Keyboard/mouse events
- PACKET_CONTROL (0x04): Connection control
The client’s job is to receive, decrypt, decode, and present frames with minimal buffering.
UDP recv → decrypt → decode (VA-API/NVDEC) → present (SDL2 or DRM/KMS)
↑ ↓
input events ← capture (uinput) ← send to host
Design Notes:
- No deep buffers: we drop late frames rather than queue them.
- Clock-aware: future work will use timestamps to align video/audio.
- Input-first: input events are sent immediately and should bypass queues.
Planned Client Components:
- Decode backend: VA-API decode for Intel/AMD, NVDEC for NVIDIA.
- Presentation: SDL2 for convenience, DRM/KMS for lowest latency.
- Input: uinput injection on host, raw input capture on client.
RootStream instruments the host loop to make regression detection easy.
Host stages:
- Capture → Encode → Send
- Samples are recorded per frame in a fixed ring buffer.
- Periodic summaries print p50/p95/p99 in microseconds.
Client stages (planned):
- Receive → Decode → Present
This mirrors the roadmap requirement for deterministic latency reporting across both sides of the stream.
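A fixed ring buffer with percentile summaries, as described for the host stages, might look like the following. This is a sketch under assumptions: the ring size of 256, the nearest-rank percentile method, and all names (`lat_ring_t`, `lat_ring_push`, `lat_ring_percentile`) are illustrative and not taken from the RootStream source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LAT_RING_SIZE 256

typedef struct {
    uint32_t samples_us[LAT_RING_SIZE];
    size_t   next;   /* slot the next sample is written to */
    size_t   count;  /* number of valid samples, at most LAT_RING_SIZE */
} lat_ring_t;

/* Record one per-frame sample, overwriting the oldest once the ring is full. */
static void lat_ring_push(lat_ring_t *r, uint32_t us)
{
    assert(r != NULL);
    r->samples_us[r->next] = us;
    r->next = (r->next + 1) % LAT_RING_SIZE;
    if (r->count < LAT_RING_SIZE)
        r->count++;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (pct in 1..100) over the samples currently held.
 * Sorts a copy so the ring keeps its insertion order. Returns 0 when empty. */
static uint32_t lat_ring_percentile(const lat_ring_t *r, unsigned pct)
{
    uint32_t sorted[LAT_RING_SIZE];
    size_t rank;

    assert(r != NULL && pct >= 1 && pct <= 100);
    if (r->count == 0)
        return 0;

    /* Until the ring wraps, valid samples occupy slots 0..count-1. */
    memcpy(sorted, r->samples_us, r->count * sizeof(sorted[0]));
    qsort(sorted, r->count, sizeof(sorted[0]), cmp_u32);
    rank = (r->count * pct + 99) / 100;   /* ceil(count * pct / 100), 1-based */
    return sorted[rank - 1];
}
```

One ring per stage (capture, encode, send) keeps memory use constant, and the sort only runs when a periodic summary is printed, not on the per-frame path.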
Error Handling:
- Checksum validates payload integrity
- Sequence number detects packet loss
- No retransmission (drop and continue)
- Future: Simple FEC (forward error correction)
Future Improvements:
- Adaptive bitrate based on packet loss
- Congestion control (LEDBAT-style)
- Multicast for multiple clients
- QUIC for better reliability without latency
What is uinput? Kernel module for creating virtual input devices. Used by:
- Emulators (mapping controllers)
- Remote desktop tools
- Accessibility software
How We Use It:
// 1. Open uinput device
int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);
// 2. Configure device capabilities
ioctl(fd, UI_SET_EVBIT, EV_KEY); // Key and button events
ioctl(fd, UI_SET_EVBIT, EV_REL); // Relative mouse
ioctl(fd, UI_SET_KEYBIT, KEY_A); // Every key we emit must be enabled first
ioctl(fd, UI_SET_KEYBIT, BTN_LEFT); // Mouse buttons
ioctl(fd, UI_SET_RELBIT, REL_X); // Mouse X axis
ioctl(fd, UI_SET_RELBIT, REL_Y); // Mouse Y axis
// 3. Setup device metadata
struct uinput_setup setup;
memset(&setup, 0, sizeof(setup));
strncpy(setup.name, "RootStream Virtual Mouse", UINPUT_MAX_NAME_SIZE - 1);
setup.id.bustype = BUS_USB;
setup.id.vendor = 0x1234;
setup.id.product = 0x5678;
ioctl(fd, UI_DEV_SETUP, &setup);
// 4. Create the device
ioctl(fd, UI_DEV_CREATE);
// Device now appears in /dev/input/eventX
// 5. Emit events
struct input_event ev;
memset(&ev, 0, sizeof(ev)); // Timestamp is left zero; the kernel stamps the event
ev.type = EV_KEY;
ev.code = KEY_A;
ev.value = 1; // Press
write(fd, &ev, sizeof(ev));
// Sync event (flush)
ev.type = EV_SYN;
ev.code = SYN_REPORT;
ev.value = 0;
write(fd, &ev, sizeof(ev));
Why This Works:
- Applications see it as a real device
- Works on X11, Wayland, console (!)
- No need for X11/Wayland-specific APIs
- Games can't detect it's virtual
Input Event Flow:
- Client captures keyboard/mouse
- Encode into input_event_pkt_t
- Send via UDP
- Host receives packet
- Write to uinput device
- Kernel forwards to active application
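The encode and receive steps of this flow can be sketched as a pair of functions. The source names `input_event_pkt_t` but does not show its layout, so the struct below (`input_pkt_sketch_t`, mirroring the type/code/value fields of `struct input_event`), the 8-byte big-endian wire format, and both function names are assumptions for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INPUT_PKT_WIRE_SIZE 8

/* Hypothetical payload: the three fields uinput needs, without a timestamp. */
typedef struct {
    uint16_t type;   /* e.g. EV_KEY or EV_REL */
    uint16_t code;   /* e.g. KEY_A or REL_X */
    int32_t  value;  /* 1/0 for press/release, signed delta for motion */
} input_pkt_sketch_t;

/* Client side: serialize in big-endian order. Returns bytes written. */
static size_t input_pkt_encode(const input_pkt_sketch_t *p, uint8_t *out)
{
    uint32_t v;

    assert(p != NULL && out != NULL);
    v = (uint32_t)p->value;   /* well-defined modulo-2^32 conversion */
    out[0] = (uint8_t)(p->type >> 8);
    out[1] = (uint8_t)(p->type & 0xFF);
    out[2] = (uint8_t)(p->code >> 8);
    out[3] = (uint8_t)(p->code & 0xFF);
    out[4] = (uint8_t)(v >> 24);
    out[5] = (uint8_t)((v >> 16) & 0xFF);
    out[6] = (uint8_t)((v >> 8) & 0xFF);
    out[7] = (uint8_t)(v & 0xFF);
    return INPUT_PKT_WIRE_SIZE;
}

/* Host side: parse a payload. Returns 0 on success, -1 if it is too short. */
static int input_pkt_decode(const uint8_t *in, size_t len, input_pkt_sketch_t *p)
{
    uint32_t v;

    assert(in != NULL && p != NULL);
    if (len < INPUT_PKT_WIRE_SIZE)
        return -1;
    p->type = (uint16_t)((in[0] << 8) | in[1]);
    p->code = (uint16_t)((in[2] << 8) | in[3]);
    v = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16)
      | ((uint32_t)in[6] << 8) | in[7];
    /* Map back to a signed value without relying on implementation-defined
     * unsigned-to-signed conversion. */
    p->value = (v <= (uint32_t)INT32_MAX)
             ? (int32_t)v
             : (int32_t)(v - 0x80000000u) - INT32_MAX - 1;
    return 0;
}
```

On the host, the decoded fields would be copied into a `struct input_event` and written to the uinput fd, followed by an `EV_SYN`/`SYN_REPORT` event.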
Security Note:
- uinput requires /dev/uinput access
- Usually requires root or input group
- We use video group (user already needs it for DRM)
- Could be abused for keylogging - don't run untrusted code
Note: These are example measurements from specific test configurations. Actual performance varies by hardware, drivers, network conditions, and system load.
Capture:
- DRM query: ~0.1ms
- mmap: ~0.2ms
- memcpy: 1-2ms
- Total: ~2ms
Encoding (VA-API, Intel UHD 730):
- Color conversion: 2-3ms
- VA-API upload: ~1ms
- Hardware encode: 8-12ms
- Download: ~1ms
- Total: ~12-17ms
Network (LAN, gigabit ethernet):
- Packetization: ~0.1ms
- UDP send: ~0.1ms
- Network transit: 1-5ms (varies by network)
- Receive: ~0.1ms
- Total: ~1-5ms
Decoding (client, estimated, VA-API):
- VA-API decode: 5-8ms
- Display: 1-2ms
- Total: ~6-10ms
Input (reverse path, estimated):
- Capture: ~0.1ms
- Network: 1-5ms
- uinput: ~0.1ms
- Total: ~1-5ms
Total End-to-End Latency (estimated):
- Best case: ~20ms (optimal conditions, local network)
- Typical: 25-30ms (home network, typical conditions)
- Worst case: 40ms+ (network congestion, Wi-Fi interference)
These measurements are from Intel i5-11400 + Intel UHD 730 on gigabit LAN. Your results will vary based on hardware, network, and configuration.
At 1080p60 (Intel i5-11400 with VA-API):
- Capture: 1-2%
- Color conversion: 2-3%
- Encoding overhead: <1%
- Network: <1%
- Total: ~5-8%
Hardware encoder does most work; software encoder (x264) would use 40-60% CPU.
- Frame buffers: 8MB (4 surfaces × 2MB)
- Encoding buffers: 2MB
- Network buffers: 4MB
- Code + stack: <1MB
- Total: ~15MB
Compare to Steam (500MB+), Sunshine (200MB+).
At different qualities:
| Resolution | FPS | Bitrate | Data/min |
|---|---|---|---|
| 1080p | 60 | 10 Mbps | 75 MB |
| 1080p | 30 | 5 Mbps | 37.5 MB |
| 1440p | 60 | 15 Mbps | 112.5 MB |
| 4K | 30 | 20 Mbps | 150 MB |
Note: H.265 typically needs roughly half the bitrate of H.264 at comparable quality, so these figures would drop accordingly.
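The Data/min column follows directly from the bitrate: megabits per second, times 60 seconds, divided by 8 bits per byte. A small sketch of that arithmetic, assuming decimal megabytes and a constant bitrate; `mb_per_minute` is a hypothetical helper:

```c
#include <assert.h>

/* Megabytes transferred per minute at a constant bitrate given in Mbps. */
static double mb_per_minute(double mbps)
{
    assert(mbps >= 0.0);
    return mbps * 60.0 / 8.0;
}
```

For example, `mb_per_minute(10.0)` gives 75 MB and `mb_per_minute(20.0)` gives 150 MB. Packet headers add a little on top of this, since the bitrate covers only the encoded stream.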
- /dev/dri/card* - GPU framebuffer (read)
- /dev/dri/renderD* - Hardware encoder (read/write)
- /dev/uinput - Virtual input creation (write)
- Network sockets - UDP (send/receive)
- User must be in video group (for DRM access)
- uinput module must be loaded
- No root required after setup
Theoretical Attacks:
- Screen capture - We can read framebuffer
  - Mitigation: User explicitly runs this
- Input injection - We can inject keystrokes
  - Mitigation: Only when user connects
- Network - Unencrypted UDP
  - Mitigation: Use on trusted networks
  - TODO: Add TLS/DTLS
Not Vulnerable To:
- Buffer overflows (we validate sizes)
- Code injection (no dynamic code)
- Privilege escalation (no setuid)
TODO Security Improvements:
- Encryption (TLS/DTLS for network)
- Authentication (client must prove identity)
- Access control (whitelist allowed clients)
What Steam Does:
- Uses PipeWire on Wayland
- Uses Desktop Duplication API on Windows
- NVFBC on NVIDIA (but disabled for consumers)
- Software encoding fallback
Why We're Different:
- Direct DRM (no PipeWire)
- Linux-only (optimized for one platform)
- Works on consumer hardware
- Simpler codebase
What Sunshine Does:
- Multiple capture backends (KMS, X11, Wayland)
- NVIDIA GameStream protocol
- More features (HDR, multi-monitor)
Why We're Different:
- Single, optimized path (DRM only)
- Custom protocol (not GameStream)
- Minimal dependencies
- Proof of concept vs production software
Traditional Remote Desktop:
- Designed for productivity, not gaming
- High latency (50-100ms)
- Low framerate (10-30 FPS)
- No hardware encoding
We're Optimized For Games:
- Low latency (20-30ms)
- High framerate (60+ FPS)
- Hardware encoding
- Minimal compression artifacts
- Client Implementation
  - VA-API decoder
  - SDL2 or DRM display
  - Input capture
- Color Conversion
  - Proper RGB→NV12
  - SIMD optimization (SSE4/AVX2)
- H.264 Parameter Tuning
  - Proper SPS/PPS
  - Rate control optimization
- NVENC Support
  - Direct NVENC API (not VA-API wrapper)
  - Better quality than VA-API wrapper
- Audio Streaming
  - ALSA direct capture
  - Opus encoding
  - Synchronized with video
- Multi-Monitor
  - Capture specific display
  - Multi-display client
- H.265/HEVC
  - Better compression
  - Lower bandwidth
  - Requires newer hardware
- Adaptive Bitrate
  - Monitor packet loss
  - Adjust quality dynamically
  - Maintain smooth framerate
- Multi-Client
  - Support multiple viewers
  - Each gets own stream
- Security
  - TLS encryption
  - Client authentication
  - Certificate pinning
- C99 standard
- 4 spaces (no tabs)
- Max 100 chars per line
- Comments explain "why", not "what"
Current testing is manual. Need:
- Unit tests (capture, encode, network)
- Integration tests (full pipeline)
- Performance benchmarks
- Stress tests
- Latency: <25ms average
- CPU: <10% on modern hardware
- Memory: <20MB
- FPS: Match display refresh rate
Questions? Found a bug? Want to contribute?
Open an issue or submit a PR!