RootStream Technical Architecture

This document explains subsystem structure and intended technical boundaries. It is not the source of truth for support status, roadmap scope, or current execution progress.

Use these documents for neighboring questions:

Support status: docs/SUPPORT_MATRIX.md
Supported product scope: docs/PRODUCT_CORE.md
Current execution work: docs/microtasks.md
Claims evidence: docs/audits/claims_audit.md
Architectural boundary rules: docs/architecture/BOUNDARY_RULES.md
Observability and logging: docs/OBSERVABILITY.md
Performance baselines: docs/PERFORMANCE.md
Terminology: docs/GLOSSARY.md

Design Philosophy

RootStream is built on one core principle: use the kernel APIs directly. Every abstraction layer adds latency, complexity, and failure points. We bypass them all.

Why Direct Kernel Access?

The Traditional Stack (Broken)

Application
    ↓
Compositor (X11/Wayland)
    ↓
PipeWire
    ↓
xdg-desktop-portal
    ↓
Permission Dialog ← USER MUST CLICK EVERY TIME
    ↓
GStreamer/FFmpeg
    ↓
Encoder

Problems:

Each layer adds latency (estimated 2-10ms per layer)
Any layer can break
Wayland's security model typically requires a portal permission prompt for screen capture
Compositor crashes can affect dependent layers

The RootStream Stack

/dev/dri/card0 (DRM)
    ↓
mmap() framebuffer
    ↓
VA-API encoder
    ↓
UDP socket

Benefits:

4 stages instead of 7 layers
Uses kernel APIs (stable for 10+ years)
Reduced permission requirements (video group membership)
Reduced compositor dependencies
Latency targets exist, but benchmark-backed proof belongs in dedicated benchmark and performance docs rather than this architecture summary

Component Details

1. DRM/KMS Capture (`drm_capture.c`)

What is DRM? Direct Rendering Manager - the kernel subsystem that manages GPU access.

What is KMS? Kernel Mode Setting - the part of DRM that controls displays.

How We Use It:

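// 1. Open DRM device
int fd = open("/dev/dri/card0", O_RDWR);

// 2. Query available displays (with zeroed counts the kernel only reports sizes)
struct drm_mode_card_res resources = {0};
ioctl(fd, DRM_IOCTL_MODE_GETRESOURCES, &resources);

// 3. Get framebuffer info
// active_fb_id is a placeholder for the fb_id of the active CRTC.
// The kernel returns handle == 0 unless the caller is DRM master or has CAP_SYS_ADMIN.
struct drm_mode_fb_cmd fb_info = {0};
fb_info.fb_id = active_fb_id;
ioctl(fd, DRM_IOCTL_MODE_GETFB, &fb_info);

// 4. Map framebuffer to memory
struct drm_mode_map_dumb map_request = {0};
map_request.handle = fb_info.handle;
ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map_request);

// 5. mmap and read pixels (pitch is the row stride in bytes)
size_t size = (size_t)fb_info.pitch * fb_info.height;
void *pixels = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, map_request.offset);
if (pixels != MAP_FAILED) {
    memcpy(output, pixels, size);
    munmap(pixels, size);
}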
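The single memcpy in step 5 copies whatever the driver laid out, including any padding at the end of each row. A minimal sketch of a pitch-aware copy follows, assuming the consumer wants tightly packed rows; `copy_packed_rows` and its parameters are hypothetical names for illustration, not functions from `drm_capture.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a framebuffer whose rows are `src_pitch` bytes
 * apart into a tightly packed destination buffer. Returns bytes written. */
size_t copy_packed_rows(uint8_t *dst, const uint8_t *src,
                        uint32_t width, uint32_t height,
                        uint32_t bytes_per_pixel, uint32_t src_pitch)
{
    size_t row_bytes = (size_t)width * bytes_per_pixel;

    /* The pitch can be larger than a row (padding) but never smaller. */
    assert(src_pitch >= row_bytes);

    for (uint32_t y = 0; y < height; y++) {
        /* Source rows advance by the pitch, destination rows by the row size. */
        memcpy(dst + (size_t)y * row_bytes,
               src + (size_t)y * src_pitch,
               row_bytes);
    }
    return row_bytes * height;
}
```

When the pitch equals `width * bytes_per_pixel`, this degenerates to the same bytes a single memcpy would produce, so the row loop only costs anything on padded layouts.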

Why This Works:

Compositor writes to framebuffer
We read from same framebuffer
Kernel tracks which framebuffer is currently being scanned out
No compositor involvement needed

Limitations:

Captures entire framebuffer (all windows)
Can't capture individual windows (requires compositor integration)
Ideal for fullscreen applications
For desktop streaming, captures all visible content

Performance (example measurements):

Capture time: ~1-2ms (direct memory copy)
No GPU→CPU transfer overhead when the framebuffer lives in system RAM (typical for integrated GPUs)
Zero-copy optimizations possible with proper configuration

2. VA-API Encoding (`vaapi_encoder.c`)

What is VA-API? Video Acceleration API - hardware video encoding/decoding interface.

Current acceleration codepaths in tree:

Intel: All modern integrated + discrete GPUs
AMD: AMDGPU driver (GCN 1.0+)
NVIDIA: NVIDIA-oriented code exists in-tree, but the current public support story is still being reconciled

How It Works:

// 1. Initialize VA-API
int major, minor;
VADisplay display = vaGetDisplayDRM(drm_fd);
vaInitialize(display, &major, &minor);

// 2. Create encoding configuration
VAConfigAttrib attrib;
attrib.type = VAConfigAttribRateControl;
attrib.value = VA_RC_CBR;  // Constant bitrate
vaCreateConfig(display, VAProfileH264High, VAEntrypointEncSlice, 
               &attrib, 1, &config_id);

// 3. Create surfaces (render targets)
vaCreateSurfaces(display, VA_RT_FORMAT_YUV420, width, height,
                 surfaces, num_surfaces, NULL, 0);

// 4. Create encoding context
vaCreateContext(display, config_id, width, height, VA_PROGRESSIVE,
                surfaces, num_surfaces, &context_id);

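// 5. Upload frame and encode
vaBeginPicture(display, context_id, surface);
// ... submit sequence/picture/slice parameter buffers via vaRenderPicture() ...
vaEndPicture(display, context_id);
vaSyncSurface(display, surface);

// 6. Get encoded data (a linked list of coded buffer segments)
VACodedBufferSegment *segment = NULL;
vaMapBuffer(display, coded_buffer_id, (void **)&segment);
// ... copy segment->size bytes from segment->buf, then follow segment->next ...
vaUnmapBuffer(display, coded_buffer_id);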

Encoding Pipeline:

RGB framebuffer → NV12 colorspace conversion
Upload to VA surface
Hardware H.264 encoding
Download encoded bitstream
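The first pipeline step, RGB to NV12, can be sketched in scalar C as below. This is an illustration under stated assumptions rather than the code in `vaapi_encoder.c`: it assumes the framebuffer is BGRX8888 (bytes B, G, R, X per pixel), uses the common BT.601 limited-range integer coefficients, and requires even dimensions. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BT.601 limited-range luma from 8-bit RGB (integer approximation). */
static uint8_t rgb_to_y(int r, int g, int b)
{
    return (uint8_t)(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
}

/* Chroma: the +32768 term is the 128 offset pre-multiplied by 256, which also
 * keeps the sum non-negative before the shift. */
static uint8_t rgb_to_u(int r, int g, int b)
{
    return (uint8_t)((-38 * r - 74 * g + 112 * b + 128 + 32768) >> 8);
}

static uint8_t rgb_to_v(int r, int g, int b)
{
    return (uint8_t)((112 * r - 94 * g - 18 * b + 128 + 32768) >> 8);
}

/* Convert BGRX8888 (assumed layout) to NV12. y_plane needs width*height bytes,
 * uv_plane needs width*height/2 bytes of interleaved U,V samples. */
void bgrx_to_nv12(const uint8_t *src, uint32_t src_pitch,
                  uint32_t width, uint32_t height,
                  uint8_t *y_plane, uint8_t *uv_plane)
{
    assert(width % 2 == 0 && height % 2 == 0);

    for (uint32_t y = 0; y < height; y += 2) {
        for (uint32_t x = 0; x < width; x += 2) {
            int sum_r = 0, sum_g = 0, sum_b = 0;

            /* One luma sample per pixel; accumulate RGB over the 2x2 block. */
            for (uint32_t dy = 0; dy < 2; dy++) {
                for (uint32_t dx = 0; dx < 2; dx++) {
                    const uint8_t *p = src + (size_t)(y + dy) * src_pitch
                                           + (size_t)(x + dx) * 4;
                    int b = p[0], g = p[1], r = p[2];
                    y_plane[(size_t)(y + dy) * width + (x + dx)] = rgb_to_y(r, g, b);
                    sum_r += r;
                    sum_g += g;
                    sum_b += b;
                }
            }

            /* One U,V pair per 2x2 block, from the rounded block average. */
            int r = (sum_r + 2) / 4, g = (sum_g + 2) / 4, b = (sum_b + 2) / 4;
            size_t uv = (size_t)(y / 2) * width + x;
            uv_plane[uv]     = rgb_to_u(r, g, b);
            uv_plane[uv + 1] = rgb_to_v(r, g, b);
        }
    }
}
```

A scalar loop like this is a reasonable correctness reference when a SIMD version is written later, since both should produce identical bytes for the same input.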

Current Limitations (TODO):

Colorspace conversion is simplified
Need proper RGB→NV12 with SSE/AVX
H.264 encoding parameters need tuning
Missing: SPS/PPS parameter generation
Missing: Rate control optimization

Performance (example measurements on specific hardware):

Intel UHD 730: ~8-12ms encode time (1080p60)
AMD RX 6600: ~6-10ms encode time (1080p60)
CPU usage: <5% (hardware encoder offload)

Actual encode time varies by GPU model, driver version, and encode parameters.

3. Network Protocol (`network.c`)

Protocol Design:

┌─────────────────────────────────────┐
│ Packet Header (16 bytes)            │
├─────────────────────────────────────┤
│ Magic: 0x524F4F54 ("ROOT")         │  4 bytes
│ Version: 1                          │  1 byte
│ Type: VIDEO/AUDIO/INPUT/CONTROL     │  1 byte
│ Sequence: packet counter            │  2 bytes
│ Timestamp: unix time (ms)           │  4 bytes
│ Payload Size: data length           │  4 bytes
│ Checksum: simple validation         │  2 bytes
├─────────────────────────────────────┤
│ Payload (variable, max 1384 bytes) │
└─────────────────────────────────────┘
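One way to put the header on the wire is to serialize it field by field, which keeps struct padding and host endianness out of the packet. The sketch below follows the field widths listed in the diagram; those widths add up to 18 bytes rather than the 16 in the box title, so the constant here is 18 and one of the widths may differ in the real `network.c`. Big-endian byte order, the `rs_` names, and the field names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RS_MAGIC       0x524F4F54u   /* "ROOT" */
#define RS_HEADER_LEN  18            /* sum of the field widths in the diagram */

/* In-memory view of the header; names are illustrative. */
typedef struct {
    uint32_t magic;
    uint8_t  version;
    uint8_t  type;
    uint16_t sequence;
    uint32_t timestamp_ms;   /* 32 bits of milliseconds wrap about every 49.7 days */
    uint32_t payload_size;
    uint16_t checksum;
} rs_header;

static void put_u16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v; }
static void put_u32(uint8_t *p, uint32_t v) { put_u16(p, (uint16_t)(v >> 16)); put_u16(p + 2, (uint16_t)v); }
static uint16_t get_u16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }
static uint32_t get_u32(const uint8_t *p) { return ((uint32_t)get_u16(p) << 16) | get_u16(p + 2); }

/* Write the header in big-endian order (assumed) at fixed offsets. */
void rs_header_pack(const rs_header *h, uint8_t out[RS_HEADER_LEN])
{
    assert(h != NULL && out != NULL);
    put_u32(out + 0, h->magic);
    out[4] = h->version;
    out[5] = h->type;
    put_u16(out + 6, h->sequence);
    put_u32(out + 8, h->timestamp_ms);
    put_u32(out + 12, h->payload_size);
    put_u16(out + 16, h->checksum);
}

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong. */
int rs_header_unpack(const uint8_t *in, size_t len, rs_header *h)
{
    if (len < RS_HEADER_LEN || get_u32(in) != RS_MAGIC)
        return -1;
    h->magic        = RS_MAGIC;
    h->version      = in[4];
    h->type         = in[5];
    h->sequence     = get_u16(in + 6);
    h->timestamp_ms = get_u32(in + 8);
    h->payload_size = get_u32(in + 12);
    h->checksum     = get_u16(in + 16);
    return 0;
}
```

Checking the magic before reading any other field lets the receiver discard stray datagrams cheaply.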

Why UDP?

TCP can add significant latency due to retransmission on packet loss
For real-time streaming, dropped frames are preferable to delayed frames
UDP provides fine-grained control over packet handling
Drawback: No built-in reliability; application must handle packet loss

MTU Consideration:

Ethernet MTU: 1500 bytes
IP header: 20 bytes
UDP header: 8 bytes
Our header: 16 bytes
Available payload: 1456 bytes
We use 1384 to leave room for overhead
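The budget above is plain subtraction, and the same constants determine how many packets an encoded frame needs. A small sketch using the figures from this list; the constant and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    RS_ETH_MTU     = 1500,  /* Ethernet MTU */
    RS_IPV4_HEADER = 20,    /* IPv4 header without options */
    RS_UDP_HEADER  = 8,
    RS_HEADER      = 16,    /* header size as stated in this section */
    RS_MAX_PAYLOAD = 1384   /* payload limit chosen by the protocol */
};

/* Largest payload that still fits in one unfragmented Ethernet frame. */
size_t rs_payload_budget(void)
{
    return RS_ETH_MTU - RS_IPV4_HEADER - RS_UDP_HEADER - RS_HEADER;
}

/* Number of packets needed for one encoded frame (ceiling division). */
size_t rs_packets_for_frame(size_t frame_bytes)
{
    assert(frame_bytes <= (size_t)-1 - RS_MAX_PAYLOAD);  /* avoid overflow */
    return (frame_bytes + RS_MAX_PAYLOAD - 1) / RS_MAX_PAYLOAD;
}
```

For example, a 10 Mbps stream at 60 FPS averages about 20,833 bytes per frame, which `rs_packets_for_frame` maps to 16 packets; keyframes are larger and need proportionally more.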

Packet Types:

PACKET_VIDEO (0x01): Encoded video frame
PACKET_AUDIO (0x02): Encoded audio (not implemented)
PACKET_INPUT (0x03): Keyboard/mouse events
PACKET_CONTROL (0x04): Connection control

4. Client-Side Loop (Decode + Present)

The client’s job is to receive, decrypt, decode, and present frames with minimal buffering.

Video: UDP recv → decrypt → decode (VA-API/NVDEC) → present (SDL2 or DRM/KMS)
Input: raw input capture (client) → send to host → inject (uinput on host)

Design Notes:

No deep buffers: we drop late frames rather than queue them.
Clock-aware: future work will use timestamps to align video/audio.
Input-first: input events are sent immediately and should bypass queues.
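The "no deep buffers" rule can be illustrated with a single-slot mailbox: the receiver always overwrites the waiting frame, so the presenter only ever sees the newest one. This is a single-threaded sketch with hypothetical names; a real client would need a mutex or atomics between the receive and present threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t frame_id;   /* id of the frame currently waiting */
    bool     pending;    /* true while a frame waits for presentation */
    uint64_t dropped;    /* frames overwritten before they were shown */
} latest_frame_slot;

/* Receive side: a new decoded frame replaces any frame still waiting. */
void slot_publish(latest_frame_slot *s, uint32_t frame_id)
{
    assert(s != NULL);
    if (s->pending)
        s->dropped++;        /* the older frame is late; drop it */
    s->frame_id = frame_id;
    s->pending = true;
}

/* Present side: take the newest frame, or return false if none is waiting. */
bool slot_take(latest_frame_slot *s, uint32_t *frame_id)
{
    assert(s != NULL && frame_id != NULL);
    if (!s->pending)
        return false;
    *frame_id = s->frame_id;
    s->pending = false;
    return true;
}
```

With a queue depth of one, the worst-case added latency is bounded by a single frame interval, at the cost of visible drops when the presenter falls behind.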

Planned Client Components:

Decode backend: VA-API decode for Intel/AMD, NVDEC for NVIDIA.
Presentation: SDL2 for convenience, DRM/KMS for lowest latency.
Input: uinput injection on host, raw input capture on client.

5. Latency Instrumentation

RootStream instruments the host loop to make regression detection easy.

Host stages:

Capture → Encode → Send
Samples are recorded per frame in a fixed ring buffer.
Periodic summaries print p50/p95/p99 in microseconds.

Client stages (planned):

Receive → Decode → Present

This mirrors the roadmap requirement for deterministic latency reporting across both sides of the stream.
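A fixed ring buffer with percentile summaries can be sketched as follows. The capacity, the names, and the nearest-rank percentile definition are assumptions for illustration; the source does not specify which percentile method the host loop uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LAT_RING_CAPACITY 256   /* assumed size */

typedef struct {
    uint32_t samples_us[LAT_RING_CAPACITY];
    size_t   count;   /* valid samples, at most LAT_RING_CAPACITY */
    size_t   next;    /* slot the next sample overwrites */
} latency_ring;

/* Record one per-frame sample, overwriting the oldest once the ring is full. */
void latency_record(latency_ring *r, uint32_t us)
{
    assert(r != NULL);
    r->samples_us[r->next] = us;
    r->next = (r->next + 1) % LAT_RING_CAPACITY;
    if (r->count < LAT_RING_CAPACITY)
        r->count++;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (pct in 1..100); returns 0 when the ring is empty. */
uint32_t latency_percentile(const latency_ring *r, unsigned pct)
{
    uint32_t sorted[LAT_RING_CAPACITY];

    assert(pct >= 1 && pct <= 100);
    if (r->count == 0)
        return 0;

    /* Sort a copy so recording order in the ring is left untouched. */
    memcpy(sorted, r->samples_us, r->count * sizeof(sorted[0]));
    qsort(sorted, r->count, sizeof(sorted[0]), cmp_u32);

    size_t rank = (pct * r->count + 99) / 100;   /* ceil(pct/100 * count) */
    return sorted[rank - 1];
}
```

Sorting a 256-entry copy once per periodic summary is cheap compared with per-frame work, so the hot path only pays for the single store in `latency_record`.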

Network Error Handling:

Checksum validates payload integrity
Sequence number detects packet loss
No retransmission (drop and continue)
Future: Simple FEC (forward error correction)
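The checksum and sequence rules above can be sketched in a few lines. The source only describes the checksum as "simple validation", so the RFC 1071 Internet checksum used here is an assumed stand-in, and both function names are hypothetical. The sequence check relies on 16-bit wraparound arithmetic so that a counter rolling over from 65535 to 0 is not reported as a massive loss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum (assumed stand-in for the 2-byte header field).
 * Suitable for packet-sized inputs; the 32-bit sum cannot overflow for them. */
uint16_t checksum16(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;

    assert(data != NULL || len == 0);

    for (; i + 1 < len; i += 2)                 /* sum 16-bit big-endian words */
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                                /* odd trailing byte, zero-padded */
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)                           /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Packets missing between the expected and received sequence numbers.
 * Returns -1 if the packet is older than expected (duplicate or reordered). */
int seq_lost_count(uint16_t expected, uint16_t received)
{
    uint16_t ahead = (uint16_t)(received - expected);   /* wraps modulo 65536 */

    if (ahead >= 0x8000u)
        return -1;
    return (int)ahead;
}
```

A receiver following the drop-and-continue policy would add the returned count to its loss statistics, set `expected` to `received + 1`, and ignore packets for which the function returns -1.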

Future Network Improvements:

Adaptive bitrate based on packet loss
Congestion control (LEDBAT-style)
Multicast for multiple clients
QUIC for reliable delivery without TCP's head-of-line blocking

6. Input Injection (`input.c`)

What is uinput? Kernel module for creating virtual input devices. Used by:

Emulators (mapping controllers)
Remote desktop tools
Accessibility software

How We Use It:

// 1. Open uinput device
int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);

// 2. Configure device capabilities
ioctl(fd, UI_SET_EVBIT, EV_KEY);     // Key and button events
ioctl(fd, UI_SET_EVBIT, EV_REL);     // Relative mouse
ioctl(fd, UI_SET_KEYBIT, KEY_A);     // Each key emitted later must be enabled here
ioctl(fd, UI_SET_KEYBIT, BTN_LEFT);  // Mouse buttons
ioctl(fd, UI_SET_RELBIT, REL_X);     // Mouse X axis
ioctl(fd, UI_SET_RELBIT, REL_Y);     // Mouse Y axis

// 3. Setup device metadata
struct uinput_setup setup;
memset(&setup, 0, sizeof(setup));    // Clear fields we do not set
snprintf(setup.name, sizeof(setup.name), "RootStream Virtual Mouse");
setup.id.bustype = BUS_USB;
setup.id.vendor = 0x1234;
setup.id.product = 0x5678;
ioctl(fd, UI_DEV_SETUP, &setup);

// 4. Create the device
ioctl(fd, UI_DEV_CREATE);
// Device now appears in /dev/input/eventX

// 5. Emit events
struct input_event ev;
memset(&ev, 0, sizeof(ev));          // Timestamp is ignored by uinput
ev.type = EV_KEY;
ev.code = KEY_A;
ev.value = 1;  // Press
write(fd, &ev, sizeof(ev));

// Sync event (flush)
ev.type = EV_SYN;
ev.code = SYN_REPORT;
ev.value = 0;
write(fd, &ev, sizeof(ev));

Why This Works:

Applications see it as a real device
Works on X11, Wayland, console (!)
No need for X11/Wayland-specific APIs
Games receive the same evdev events they would get from physical hardware

Input Event Flow:

Client captures keyboard/mouse
Encode into input_event_pkt_t
Send via UDP
Host receives packet
Write to uinput device
Kernel forwards to active application
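The "encode" and "write to uinput" steps need a wire format for input events. The layout of `input_event_pkt_t` is not shown in this document, so the sketch below uses a hypothetical 8-byte payload that mirrors the type/code/value triple of the kernel's `struct input_event` without its timestamp. Names and byte order are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_INPUT_PKT_LEN 8

/* Hypothetical payload: same triple as struct input_event, minus the time. */
typedef struct {
    uint16_t type;    /* e.g. EV_KEY or EV_REL */
    uint16_t code;    /* e.g. KEY_A or REL_X */
    int32_t  value;   /* press/release state or relative motion */
} example_input_pkt;

/* Client side: serialize in big-endian order (assumed). */
void example_input_encode(const example_input_pkt *e,
                          uint8_t out[EXAMPLE_INPUT_PKT_LEN])
{
    uint32_t v;

    assert(e != NULL && out != NULL);
    memcpy(&v, &e->value, sizeof(v));   /* reinterpret the signed value's bits */
    out[0] = (uint8_t)(e->type >> 8);
    out[1] = (uint8_t)e->type;
    out[2] = (uint8_t)(e->code >> 8);
    out[3] = (uint8_t)e->code;
    out[4] = (uint8_t)(v >> 24);
    out[5] = (uint8_t)(v >> 16);
    out[6] = (uint8_t)(v >> 8);
    out[7] = (uint8_t)v;
}

/* Host side: rebuild the triple before writing it to the uinput device. */
void example_input_decode(const uint8_t in[EXAMPLE_INPUT_PKT_LEN],
                          example_input_pkt *e)
{
    uint32_t v;

    assert(in != NULL && e != NULL);
    e->type = (uint16_t)((in[0] << 8) | in[1]);
    e->code = (uint16_t)((in[2] << 8) | in[3]);
    v = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
        ((uint32_t)in[6] << 8)  |  (uint32_t)in[7];
    memcpy(&e->value, &v, sizeof(v));
}
```

Relative mouse motion is signed, which is why the value field round-trips through its bit pattern instead of being cast.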

Security Note:

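uinput requires /dev/uinput access
Usually requires root or input group
We use video group (user already needs it for DRM); on most distributions this needs a udev rule granting that group access to /dev/uinput
Any process with uinput access can inject arbitrary input - don't run untrusted code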

Performance Analysis

Note: These are example measurements from specific test configurations. Actual performance varies by hardware, drivers, network conditions, and system load.

Example Latency Breakdown (1080p60, LAN)

Capture:

DRM query: ~0.1ms
mmap: ~0.2ms
memcpy: 1-2ms
Total: ~2ms

Encoding (VA-API, Intel UHD 730):

Color conversion: 2-3ms
VA-API upload: ~1ms
Hardware encode: 8-12ms
Download: ~1ms
Total: ~12-17ms

Network (LAN, gigabit ethernet):

Packetization: ~0.1ms
UDP send: ~0.1ms
Network transit: 1-5ms (varies by network)
Receive: ~0.1ms
Total: ~1-5ms

Decoding (client, estimated, VA-API):

VA-API decode: 5-8ms
Display: 1-2ms
Total: ~6-10ms

Input (reverse path, estimated):

Capture: ~0.1ms
Network: 1-5ms
uinput: ~0.1ms
Total: ~1-5ms

Total End-to-End Latency (estimated):

Best case: ~20ms (optimal conditions, local network)
Typical: 25-30ms (home network, typical conditions)
Worst case: 40ms+ (network congestion, Wi-Fi interference)

These measurements are from Intel i5-11400 + Intel UHD 730 on gigabit LAN. Your results will vary based on hardware, network, and configuration.
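The end-to-end figures can be cross-checked by summing the per-stage totals of the video path. The sketch below does that arithmetic with the stage totals from this breakdown, in microseconds; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned    min_us;
    unsigned    max_us;
} stage_budget;

/* Stage totals from the example breakdown, in microseconds. */
static const stage_budget video_path[] = {
    { "capture",  2000,  2000 },
    { "encode",  12000, 17000 },
    { "network",  1000,  5000 },
    { "decode",   6000, 10000 },
};

#define VIDEO_PATH_STAGES (sizeof(video_path) / sizeof(video_path[0]))

/* Sum the lower and upper bounds of every stage. */
void sum_budget(const stage_budget *stages, size_t n,
                unsigned *min_us, unsigned *max_us)
{
    assert(stages != NULL && min_us != NULL && max_us != NULL);
    *min_us = 0;
    *max_us = 0;
    for (size_t i = 0; i < n; i++) {
        *min_us += stages[i].min_us;
        *max_us += stages[i].max_us;
    }
}
```

The stage totals sum to roughly 21-34 ms, which brackets the typical figure quoted above; the 40ms+ worst case corresponds to network transit exceeding the 5 ms assumed here.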

Example CPU Usage

At 1080p60 (Intel i5-11400 with VA-API):

Capture: 1-2%
Color conversion: 2-3%
Encoding overhead: <1%
Network: <1%
Total: ~3-7%

Hardware encoder does most work; software encoder (x264) would use 40-60% CPU.

Example Memory Usage

Frame buffers: 8MB (4 surfaces × 2MB)
Encoding buffers: 2MB
Network buffers: 4MB
Code + stack: <1MB
Total: ~15MB

Compare to Steam (500MB+), Sunshine (200MB+).

Network Bandwidth

At different qualities:

| Resolution | FPS | Bitrate | Data/min |
|------------|-----|---------|----------|
| 1080p | 60 | 10 Mbps | 75 MB |
| 1080p | 30 | 5 Mbps | 37.5 MB |
| 1440p | 60 | 15 Mbps | 112.5 MB |
| 4K | 30 | 20 Mbps | 150 MB |

Note: H.265 can roughly halve these bitrates at similar quality, depending on content and encoder.
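The Data/min column follows directly from the bitrate: megabits per second times 60 seconds, divided by 8 bits per byte. A small sketch of that arithmetic, with hypothetical function names, also gives the average encoded frame size used when sizing network buffers.

```c
#include <assert.h>

/* Megabytes transferred per minute at a constant bitrate in Mbps. */
double mb_per_minute(double mbps)
{
    return mbps * 60.0 / 8.0;
}

/* Average encoded frame size in bytes at a given bitrate and frame rate. */
double avg_frame_bytes(double mbps, double fps)
{
    assert(fps > 0.0);
    return mbps * 1e6 / 8.0 / fps;
}
```

For instance, `mb_per_minute(10.0)` gives 75 and `avg_frame_bytes(10.0, 60.0)` gives about 20,833 bytes.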

Security Considerations

What We Access

/dev/dri/card* - GPU framebuffer (read)
/dev/dri/renderD* - Hardware encoder (read/write)
/dev/uinput - Virtual input creation (write)
Network sockets - UDP (send/receive)

Permissions Required

User must be in video group (for DRM access)
uinput module must be loaded
No root required after setup
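Whether these requirements are met can be probed at startup by attempting to open each device node and reporting the errno. The helper below is a hypothetical sketch, not part of RootStream; it only opens and closes the path it is given.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 0 if `path` can be opened with `flags`, otherwise the errno value.
 * EACCES usually points at a missing group membership, ENOENT at a missing
 * device node or an unloaded kernel module. */
int probe_device(const char *path, int flags)
{
    assert(path != NULL);

    int fd = open(path, flags);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

A setup check might call `probe_device("/dev/dri/card0", O_RDWR)` and `probe_device("/dev/uinput", O_WRONLY)` and print a hint based on the returned code.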

Attack Surface

Theoretical Attacks:

Screen capture - We can read framebuffer
- Mitigation: User explicitly runs this
Input injection - We can inject keystrokes
- Mitigation: Only when user connects
Network - Unencrypted UDP
- Mitigation: Use on trusted networks
- TODO: Add DTLS (TLS itself does not run over UDP)

Designed To Resist:

Buffer overflows (we validate sizes)
Code injection (no dynamic code)
Privilege escalation (no setuid)
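Size validation for a received datagram comes down to three checks made before any payload byte is read. The sketch below shows one way to order them; the function name and parameters are hypothetical, and the header length and payload limit are passed in rather than hard-coded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Accept a datagram only if the payload length claimed in its header is
 * within limits and matches the number of bytes actually received. */
bool packet_sizes_valid(size_t datagram_len, size_t header_len,
                        uint32_t claimed_payload, uint32_t max_payload)
{
    assert(header_len > 0);

    if (datagram_len < header_len)
        return false;                      /* truncated header */
    if (claimed_payload > max_payload)
        return false;                      /* claim exceeds protocol limit */

    /* Safe subtraction: datagram_len >= header_len was checked above. */
    return datagram_len - header_len == claimed_payload;
}
```

Comparing against the received length, rather than trusting the header field, is what prevents a forged payload size from driving a copy past the end of the receive buffer.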

TODO Security Improvements:

Encryption (DTLS for the UDP transport)
Authentication (client must prove identity)
Access control (whitelist allowed clients)

Comparison to Alternatives

vs Steam Remote Play

What Steam Does:

Uses PipeWire on Wayland
Uses Desktop Duplication API on Windows
NVFBC on NVIDIA (but disabled for consumers)
Software encoding fallback

Why We're Different:

Direct DRM (no PipeWire)
Linux-only (optimized for one platform)
Works on consumer hardware
Simpler codebase

vs Sunshine/Moonlight

What Sunshine Does:

Multiple capture backends (KMS, X11, Wayland)
NVIDIA GameStream protocol
More features (HDR, multi-monitor)

Why We're Different:

Single, optimized path (DRM only)
Custom protocol (not GameStream)
Minimal dependencies
Proof of concept, whereas Sunshine is production software

vs VNC/RDP

Traditional Remote Desktop:

Designed for productivity, not gaming
High latency (50-100ms)
Low framerate (10-30 FPS)
No hardware encoding

We're Optimized For Games:

Low latency (20-30ms)
High framerate (60+ FPS)
Hardware encoding
Minimal compression artifacts

Future Work

Short Term (v0.2)

Client Implementation
- VA-API decoder
- SDL2 or DRM display
- Input capture
Color Conversion
- Proper RGB→NV12
- SIMD optimization (SSE4/AVX2)
H.264 Parameter Tuning
- Proper SPS/PPS
- Rate control optimization

Medium Term (v0.3)

NVENC Support
- Direct NVENC API (not VA-API wrapper)
- Better quality than VA-API wrapper
Audio Streaming
- ALSA direct capture
- Opus encoding
- Synchronized with video
Multi-Monitor
- Capture specific display
- Multi-display client

Long Term (v1.0)

H.265/HEVC
- Better compression
- Lower bandwidth
- Requires newer hardware
Adaptive Bitrate
- Monitor packet loss
- Adjust quality dynamically
- Maintain smooth framerate
Multi-Client
- Support multiple viewers
- Each gets own stream
Security
- DTLS encryption
- Client authentication
- Certificate pinning

Contributing

Code Style

C99 standard
4 spaces (no tabs)
Max 100 chars per line
Comments explain "why", not "what"

Testing

Current testing is manual. Need:

Unit tests (capture, encode, network)
Integration tests (full pipeline)
Performance benchmarks
Stress tests

Performance Goals

Latency: <25ms average
CPU: <10% on modern hardware
Memory: <20MB
FPS: Match display refresh rate

References

Documentation

Similar Projects

Moonlight - NVIDIA GameStream client
Sunshine - GameStream host
RustDesk - Remote desktop

Learning Resources

Questions? Found a bug? Want to contribute?

Open an issue or submit a PR!
