desktop-env is a cutting-edge, real-time desktop environment designed specifically for desktop-based machine learning development: perfect for creating agents, world models, and more.
In the realm of open-source agent research, three critical components are often missing:
- Open-Source Environments
- Open-Source Data
- Open-Source Research Codebases & Repositories
desktop-env is here to fill these gaps:
- Open-Source Environment: Provides a rich, desktop-based environment identical to what humans use daily.
- Data Recorder: Includes a built-in screen, audio, timestamp, and keyboard/mouse recorder to capture and utilize real human desktop interactions.
- Future Research Collaboration: Plans are underway to foster open-source research in a new repository. Any kind of open-source contribution is always welcome.
- Real-Time Performance: Achieves sub-1 ms latency in screen capture.
- High-Frequency Capture: Supports over 144 FPS screen recording with minimal CPU/GPU load.
- Utilizes Windows APIs (`DXGI`/`WGC`) and the powerful GStreamer framework, which differs fundamentally from `PIL.ImageGrab`, `mss`, and similar libraries.
- Authentic Desktop Interaction: Work within the exact desktop environment used by real users.
- Screen: Capture your monitor screen; specify monitor index, window name, and framerate.
- Keyboard/Mouse: Capture and inject keyboard and mouse events.
- Window: Get the active window's name, bounding box, and handle (`hWnd`).
Supported Operating Systems:
- Windows: Full support with optimized performance using Direct3D11
- macOS: Full support using AVFoundation for screen capture
- Linux: Basic support (work in progress)
Since the Recorder utilizes desktop-env under the hood, it is far more efficient than other existing Python-based screen recorders.
- Run it by typing `python3 examples/recorder.py FILE_LOCATION` and stop it with `Ctrl+C`.
- Near-0% CPU/GPU load, comparable to commercial screen recording / broadcasting software, since it utilizes Windows APIs (`DXGI`/`WGC`) and the powerful GStreamer framework under the hood.
- Screen, audio, and timestamps are recorded all at once in a Matroska (`.mkv`) container, with timestamps stored as a video subtitle track. Keyboard, mouse, and window data are recorded all at once in an `event.jsonl` file.
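The `event.jsonl` file follows the JSON Lines convention: one JSON object per line, each describing a single event. Below is a minimal parsing sketch; the field names are illustrative assumptions, not the exact schema desktop-env emits, so inspect a real recording to confirm them:

```python
import json
from pathlib import Path

# Hypothetical sample of two recorded events (field names are assumptions,
# not the real desktop-env schema).
sample = (
    '{"event_type": "keyboard", "event_data": "a", "event_time": 1700000000.01}\n'
    '{"event_type": "mouse", "event_data": [120, 340], "event_time": 1700000000.05}\n'
)
path = Path("event.jsonl")
path.write_text(sample)

# Parse line by line; each line is an independent JSON document.
events = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
print(len(events), events[0]["event_type"])  # -> 2 keyboard
```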
For more detail, run `python3 examples/recorder.py --help`!
```
Usage: recorder.py [OPTIONS] FILE_LOCATION

Arguments:
  * file_location  TEXT  The location of the output file, use `.mkv` extension.
                         [default: None] [required]

Options:
  --record-audio / --no-record-audio          Whether to record audio
                                              [default: record-audio]
  --record-video / --no-record-video          Whether to record video
                                              [default: record-video]
  --record-timestamp / --no-record-timestamp  Whether to record timestamp
                                              [default: record-timestamp]
  --window-name TEXT                          The name of the window to capture,
                                              substring of window name is
                                              supported [default: None]
  --monitor-idx INTEGER                       The index of the monitor to
                                              capture [default: None]
  --help                                      Show this message and exit.
```
desktop-env outperforms other screen capture libraries:

| Library | Avg. Time per Frame | Relative Speed |
|---|---|---|
| **desktop-env** | 5.7 ms | 1× (fastest) |
| `pyscreenshot` | 33 ms | 5.8× slower |
| `PIL` | 34 ms | 6.0× slower |
| `MSS` | 37 ms | 6.5× slower |
| `PyQt5` | 137 ms | 24× slower |

Measured on an i5-11400 with a GTX 1650. Beyond the per-frame time, CPU/GPU resource usage is also significantly lower.
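The per-frame numbers above can be reproduced with a simple timing harness. Here is a sketch using a stand-in capture function; swap in the real grab call of each library (e.g. from `mss` or `PIL.ImageGrab`) to benchmark it on your own machine:

```python
import time

def avg_frame_time_ms(capture, n_frames: int = 100) -> float:
    """Average wall-clock time per call to `capture`, in milliseconds."""
    start = time.perf_counter()
    for _ in range(n_frames):
        capture()
    return (time.perf_counter() - start) / n_frames * 1000.0

# Stand-in for a real screen grab; sleeps ~2 ms to simulate capture work.
def dummy_capture():
    time.sleep(0.002)

per_frame = avg_frame_time_ms(dummy_capture, n_frames=20)
print(f"{per_frame:.1f} ms per frame")
```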
Demo video: `owa-ztype-demo.mp4`
For more details with self-contained, runnable source code, see `examples/typing_agent`.
For full runnable scripts, see `scripts/minimal_example.py` and `scripts/main.py`.
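In the example below, the `on_frame_arrived` comment mentions per-frame latency, which can be derived by comparing the frame's capture timestamp against the current clock. A minimal sketch, assuming the timestamp is a UNIX epoch in nanoseconds (an assumption: check the actual `FrameStamped` definition for the real field name and unit):

```python
import time

def latency_ms(frame_timestamp_ns: int) -> float:
    """Milliseconds between a (hypothetical) nanosecond capture timestamp and now."""
    return (time.time_ns() - frame_timestamp_ns) / 1e6

# Simulate a frame captured ~5 ms ago.
simulated_ts = time.time_ns() - 5_000_000
print(f"latency: {latency_ms(simulated_ts):.1f} ms")
```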
```python
import time

from desktop_env import Desktop, DesktopArgs
from desktop_env.msg import FrameStamped
from desktop_env.windows_capture import construct_pipeline


def on_frame_arrived(frame: FrameStamped):
    # Frame arrived at {frame.timestamp}, latency: {latency} ms, frame shape: {frame.shape}
    pass


def on_event(event):
    # event_type='{event.type}' event_data={event.data} event_time={event.time} device_name='{event.device}'
    # title='{event.title}' rect={event.rect} hWnd={event.hWnd}
    pass


if __name__ == "__main__":
    args = DesktopArgs(
        submodules=[
            {
                "module": "desktop_env.windows_capture.WindowsCapture",
                "args": {
                    "on_frame_arrived": on_frame_arrived,
                    "pipeline_description": construct_pipeline(
                        window_name=None,  # you may specify a substring of the window name
                        monitor_idx=None,  # you may specify the monitor index
                        framerate="60/1",
                    ),
                },
            },
            {"module": "desktop_env.window_publisher.WindowPublisher", "args": {"callback": on_event}},
            {
                "module": "desktop_env.control_publisher.ControlPublisher",
                "args": {"keyboard_callback": on_event, "mouse_callback": on_event},
            },
        ]
    )
    desktop = Desktop.from_args(args)
    try:
        # Option 1: Start the pipeline in the current thread (blocking)
        # desktop.start()

        # Option 2: Start the pipeline in a separate thread (non-blocking)
        desktop.start_free_threaded()
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        desktop.stop()
        desktop.join()
        desktop.close()
```

Prerequisites: Install poetry first. See the Poetry Installation Guide.
```shell
# 1. Install GStreamer and dependencies via conda
conda install -c conda-forge pygobject gst-python -y
# pygobject: Python bindings for GObject-based libraries such as GTK+, GStreamer, WebKitGTK+, GLib, GIO, and many more
# gst-python: the `python` plugin, a loader for plugins written in Python
conda install -c conda-forge gstreamer gst-plugins-base gst-plugins-good gst-plugins-bad gst-plugins-ugly -y

# 2. Install desktop-env
poetry install --with windows
```

Install the custom plugin by configuring an environment variable:

```powershell
$env:GST_PLUGIN_PATH = (Join-Path -Path $pwd -ChildPath "custom_plugin")
echo $env:GST_PLUGIN_PATH
```
```shell
# 1. Install GStreamer and dependencies via brew
brew install gstreamer gst-plugins-base gst-plugins-good gst-plugins-bad gst-plugins-ugly pkg-config gobject-introspection

# 2. Install desktop-env with macOS dependencies
poetry install --with macos
```

Install the custom plugin by configuring the environment variable, as in the Windows guide.
Notes:
- Installing `pygobject` with `pip` on Windows causes the error:
  ```
  ..\meson.build:31:9: ERROR: Dependency 'gobject-introspection-1.0' is required but not found.
  ```
- On macOS, if you encounter permission issues with `brew`, you might need to fix permissions:
  ```shell
  sudo chown -R $(whoami) $(brew --prefix)/*
  ```

After installation, verify it with the following commands:
```shell
# Check GStreamer version (should be >= 1.24.6)
$ conda list gst-*
# packages in environment at C:\Users\...\miniconda3\envs\agent:
#
# Name               Version    Build       Channel
gst-plugins-bad      1.24.6     he11079b_0  conda-forge
gst-plugins-base     1.24.6     hb0a98b8_0  conda-forge
gst-plugins-good     1.24.6     h3b23867_0  conda-forge
gst-plugins-ugly     1.24.6     ha7af72c_0  conda-forge
gstreamer            1.24.6     h5006eae_0  conda-forge

# Verify Direct3D11 plugin (Windows)
$ gst-inspect-1.0.exe d3d11
```

```shell
# Check GStreamer version
$ gst-inspect-1.0 --version
gst-inspect-1.0 version 1.24.6

# Verify AVFoundation plugin (macOS)
$ gst-inspect-1.0 avfvideosrc
Plugin Details:
  Name                     avfvideosrc
  Description              AVFoundation video source
  Filename                 /opt/homebrew/lib/gstreamer-1.0/libgstavfvideosrc.so
  Version                  1.24.6
  License                  LGPL
  Source module            gst-plugins-good
  Binary package           GStreamer Good Plug-ins source release
```

- Validate overall modality matching in multi-monitor settings
- Implement a remote desktop control demo that wraps Desktop and exposes a network interface over UDP/TCP, HTTP/WebSocket, etc.
- Support various video formats besides raw RGBA (JPEG, H.264, ...)
- Add multi-OS support (Linux & macOS)
- Implement language interfaces to support desktop agents written in various languages