Skip to content

24kchengYe/desktop-controller-skill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

!Visitors

Desktop Controller — AI Computer Use for Claude Code

The open-source alternative to OpenAI Codex's computer use. Control any Windows app — native desktop, web, and Electron — with a single AI skill. Like OpenAI's playwright-interactive, but with native Win32 desktop app support that Codex can't do.

License: MIT

What GPT-5.4 does in the cloud, this skill does locally. OpenAI's GPT-5.4 introduced "computer use" — the ability to control computers via Playwright and mouse/keyboard commands. Their playwright-interactive Codex skill enables visual debugging of web and Electron apps. We go further: our dual-engine approach adds native Windows desktop automation (Win32 API) alongside Playwright, letting you control apps like WeChat, DingTalk, and QQ that browser-based solutions simply can't reach.

Why This Exists

OpenAI Codex Claude Code + This Skill
Web/Electron apps Playwright Playwright
Native desktop apps Not possible Win32 API
Chat apps (WeChat, DingTalk, QQ) Not possible Full support
Visual feedback js_repl screenshots Screenshots + Claude Vision
QA workflow Manual checklist Automated QA checklist
Viewport testing Manual Device presets (iPhone, iPad, Pixel...)
DOM inspection Via DevTools Built-in inspector
Platform Cloud only Local (your machine)
Cost Codex subscription Free & open source

Demo

$ python scripts/desktop_control.py list-apps
Supported applications:
  weixin       WeChat                process=Weixin       mode=win32   search=Ctrl+F
  wxwork       WeCom (企业微信)        process=WXWork       mode=win32   search=Ctrl+F
  dingtalk     DingTalk (钉钉)        process=DingTalk     mode=win32   search=Ctrl+K
  feishu       Feishu/Lark (飞书)     process=Feishu       mode=win32   search=Ctrl+K
  qq           QQ                    process=QQ           mode=win32   search=Ctrl+F
  telegram     Telegram              process=Telegram     mode=win32   search=Ctrl+K
  slack        Slack                 process=slack        mode=win32   search=Ctrl+K
  teams        Microsoft Teams       process=ms-teams     mode=win32   search=Ctrl+E

$ python scripts/desktop_control.py find-window --app weixin
{"found": true, "process": "Weixin", "left": 856, "top": 185, "width": 462, "height": 640}

$ python scripts/desktop_control.py send-message --app weixin --contact "文件传输助手" --message "Hello from AI 🤖"
✓ Message sent to 文件传输助手 via WeChat

Architecture

┌──────────────────────────────────────────────────────────┐
│               Desktop Controller Skill                    │
├───────────────┬────────────────────┬─────────────────────┤
│  Win32 Engine │  Playwright Engine │  Visual Feedback    │
│  (Native Apps)│  (Web/Electron)    │  (Screenshot + AI)  │
├───────────────┼────────────────────┼─────────────────────┤
│ FindWindow    │ page.click()       │ Screen Capture      │
│ SendKeys      │ page.fill()        │ Window Capture      │
│ SetCursorPos  │ page.goto()        │ → Claude Vision     │
│ mouse_event   │ page.screenshot()  │ → State Verify      │
│ Clipboard     │ page.evaluate()    │ → Auto Retry        │
│ GetWindowRect │ DOM Inspection     │ → QA Checklist      │
└───────────────┴────────────────────┴─────────────────────┘

Supported Applications

Native Desktop Apps (Win32 Engine)

App Process Search Key Status
WeChat (微信) Weixin Ctrl+F Tested & Verified
WeCom (企业微信) WXWork Ctrl+F Ready
DingTalk (钉钉) DingTalk Ctrl+K Ready
Feishu/Lark (飞书) Feishu Ctrl+K Ready
QQ QQ Ctrl+F Ready
Telegram Telegram Ctrl+K Ready
Slack slack Ctrl+K Ready
Microsoft Teams ms-teams Ctrl+E Ready

Web & Electron Apps (Playwright Engine)

Any website or Electron app — VS Code, Notion, Discord, Figma, and more.

Quick Start

Natural Language (via Claude Code)

Just tell Claude Code what you want:

"Send a WeChat message to 张三 saying 你好"
"给张三发钉钉消息说明天开会"
"Take a screenshot of my DingTalk window"
"Open https://example.com and click the login button"
"Run QA checklist on http://localhost:3000"
"Test my site on iPhone 14 viewport"
"Inspect all buttons on this page"
"帮我操控电脑自动发消息"

Command Line — Win32 Engine

# Send a message via any chat app
python scripts/desktop_control.py send-message --app weixin --contact "张三" --message "你好"

# Screenshot an app window
python scripts/desktop_control.py screenshot --app weixin --output wechat.png

# Full screen screenshot
python scripts/desktop_control.py screenshot --output screen.png

# Click at coordinates
python scripts/desktop_control.py click --app weixin --x 500 --y 400

# Type text
python scripts/desktop_control.py type --app dingtalk --text "Hello World"

# Find a window
python scripts/desktop_control.py find-window --app feishu

# List all supported apps
python scripts/desktop_control.py list-apps

Command Line — Playwright Engine

# Screenshot a web page
python scripts/playwright_control.py web-screenshot --url "https://example.com" --output page.png

# Click an element by CSS selector
python scripts/playwright_control.py web-click --url "https://example.com" --selector "#login-btn"

# Fill a form field
python scripts/playwright_control.py web-fill --url "https://example.com" --selector "input[name=email]" --text "test@example.com"

# Inspect DOM elements
python scripts/playwright_control.py web-inspect --url "https://example.com" --selector "button"

# Evaluate JavaScript
python scripts/playwright_control.py web-eval --url "https://example.com" --js "return document.title"

# Test with mobile viewport
python scripts/playwright_control.py viewport --device "iPhone 14" --url "https://example.com" --output mobile.png

# Run automated QA checklist
python scripts/playwright_control.py qa-checklist --url "http://localhost:3000"

Key Features

1. Dual-Engine Automation

  • Win32 Engine: Controls native Windows apps that no browser automation can reach
  • Playwright Engine: Full DOM access, CSS selectors, JavaScript evaluation

2. Visual Feedback Loop (like OpenAI's approach)

Execute Action → Screenshot → Claude Analyzes → Verify Success → Next Action
         ↑                                              │
         └──────────── Retry if Failed ←────────────────┘

3. Automated QA Checklist

One command runs functional, visual, viewport, and performance checks:

  • Page load verification
  • Broken link detection
  • Console error capture
  • Desktop + mobile screenshots
  • Horizontal overflow check
  • Performance timing (DOMContentLoaded, Load)

4. Device Viewport Testing

Built-in presets: iPhone 14, iPhone 14 Pro Max, iPad, Pixel 7, Desktop, Desktop HD, 4K

5. Unicode/CJK First-Class Support

Chinese text handling built-in via Unicode code point arrays — no encoding issues.

Technical Insights

Why Win32 Mouse Click Matters

The #1 discovery: after searching for a contact in chat apps, the message input area does NOT receive keyboard focus automatically. You must physically click on it using Win32 SetCursorPos + mouse_event. This single insight makes the difference between a working and broken automation.

Clipboard Safety Pattern

Windows clipboard can be locked by other processes. Always:

  1. Clipboard.Clear() before SetText()
  2. Retry up to 5 times with 300ms delay
  3. 100ms pause between Clear and Set

Installation

# Clone to Claude Code skills directory
git clone https://github.com/24kchengYe/desktop-controller-skill ~/.claude/skills/desktop-controller

# For Playwright features (optional)
cd ~/.claude/skills/desktop-controller
npm install playwright
npx playwright install chromium

Prerequisites

  • Windows OS with PowerShell
  • Python 3.8+
  • Node.js 18+ (for Playwright features)
  • Target apps running and logged in

Extending

Add new apps in scripts/app_registry.py:

"my_app": {
    "name": "My App",
    "aliases": ["myapp"],
    "process": "MyApp",
    "mode": "win32",
    "search_key": "^f",           # Ctrl+F
    "input_area": {"x_ratio": 0.65, "y_ratio_from_bottom": 0.12},
    ...
}

Related Projects

  • OpenAI Codex Skills — OpenAI's skill catalog including playwright-interactive
  • OpenAI Codex — OpenAI's coding agent (our skill brings similar computer-use capabilities to Claude Code)
  • Playwright — The browser automation framework powering our web engine
  • Claude Code — Anthropic's CLI coding agent

Star History

If this project helps you, please give it a star! It helps others discover it.

License

MIT — free for personal and commercial use.

About

AI Computer Use for Claude Code — The open-source alternative to OpenAI Codex's playwright-interactive. Dual-engine: Win32 API + Playwright. Control WeChat, DingTalk, Feishu, QQ, Slack, Teams, and any web/Electron app. Automated QA, viewport testing, visual feedback loops.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages