An experimental reinforcement learning project to train an autonomous agent to play Clash Royale via computer vision and automated input (no private game APIs). The agent observes the game through screen capture of a BlueStacks emulator window, converts pixels into a structured state, and executes actions using ADB shell commands.
We are in an early tooling phase, validating the capture, preprocessing, and input pipelines before implementing the full Gymnasium environment and PPO training loop.
- Screen Capture (Windows): Using `windows-capture` (event-driven) for the BlueStacks window. `DXcam` retained as fallback.
- Standard Frame Resolution: All frames cropped then downscaled to 480x854 for deterministic, lower-compute downstream processing.
- ADB Control: Pure Python `adb-shell` library (no external adb binary) for launching the game and issuing shell/input commands.
- Template Matching: OpenCV template matching prototype for card detection proof-of-concept.
- Environment Config: `.env`-driven configuration for crop offsets, window name, paths, and device connection.
The agent uses a Structured MLP Policy with a shared card encoder to efficiently process the 53-dimensional state vector. This architecture is designed to handle the distinct types of information present in the game state.
Architecture Breakdown:
- Global Processor: Processes the 13-dimensional global state, which includes elixir, time, tower health, and game phase indicators.
- Shared Card Encoder: Processes the 10-dimensional features for each of the four cards in the player's hand. This shared encoder allows the model to learn a single representation for cards that can be applied to any card in any slot.
- Fusion Layer: Combines the representations from the global processor and the card encoder into a single, fused representation.
- Final Decision Layer: Produces action logits for the `MultiDiscrete([4, 32, 18])` action space, which corresponds to the card slot, x-coordinate, and y-coordinate of the deployment.
This structured approach allows the model to learn more efficiently by processing the global and card-specific features separately before combining them to make a final decision.
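The dimensions above add up as 13 global features plus 4 cards of 10 features each, giving the 53-dimensional state. The sketch below shows one way to slice such a vector and apply a single shared encoder to every card slot. It is an illustration only: the ordering (global features first, then the four card rows) and the helper names `split_state`, `encode_hand`, and `split_logits` are assumptions, not taken from the project code.

```python
from typing import Callable, List, Sequence, Tuple

GLOBAL_DIM = 13   # elixir, time, tower health, game phase indicators
CARD_DIM = 10     # features per card
HAND_SIZE = 4     # cards in the player's hand
STATE_DIM = GLOBAL_DIM + HAND_SIZE * CARD_DIM  # 13 + 4 * 10 = 53
ACTION_DIMS = (4, 32, 18)  # card slot, x-coordinate, y-coordinate


def split_state(state: Sequence[float]) -> Tuple[List[float], List[List[float]]]:
    """Split a flat state into global features and one row per card.

    Assumes global features come first, followed by the card rows.
    """
    if len(state) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} features, got {len(state)}")
    global_part = list(state[:GLOBAL_DIM])
    cards = [
        list(state[GLOBAL_DIM + i * CARD_DIM: GLOBAL_DIM + (i + 1) * CARD_DIM])
        for i in range(HAND_SIZE)
    ]
    return global_part, cards


def encode_hand(
    cards: Sequence[Sequence[float]],
    encoder: Callable[[Sequence[float]], List[float]],
) -> List[List[float]]:
    """Apply the same encoder to every slot, which is what weight sharing means."""
    return [encoder(card) for card in cards]


def split_logits(logits: Sequence[float]) -> List[List[float]]:
    """Split a flat logit vector into one group per action component."""
    if len(logits) != sum(ACTION_DIMS):
        raise ValueError(f"expected {sum(ACTION_DIMS)} logits, got {len(logits)}")
    groups, start = [], 0
    for size in ACTION_DIMS:
        groups.append(list(logits[start:start + size]))
        start += size
    return groups
```

Because `encode_hand` reuses one `encoder` for all four rows, anything the encoder learns about a card applies regardless of which slot the card occupies.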
| Script | Purpose |
|---|---|
| `windows-capture-testing.py` | One-shot frame grab for a specified BlueStacks window. |
| `windows-capture-with required resolution.py` | Capture → crop → resize → save normalized 480p frame. |
| `downscale.py` | Standalone image downscaling utility (defaults 480x854). |
| `Card_Template_matching_Example.py` | OpenCV template match demo for card detection latency check. |
| `py_adb.py` | Interactive adb-shell client + auto launch of Clash Royale intent. |
| `frame-extract.py` | Extract frames from recorded gameplay videos at intervals for dataset creation. |
| `.env.example` | Template environment variables (window name, crop, device IP/port, asset paths). |
| `requirements.txt` | Minimal dependency list for current tooling layer. |
Create a `.env` based on `.env.example`:

    WINDOW_NAME="BlueStacks App Player 1"
    CROP_LEFT=657
    CROP_RIGHT=657
    TARGET_WIDTH=480
    TARGET_HEIGHT=854
    ADB_DEVICE_IP=127.0.0.1
    ADB_DEVICE_PORT=5555
    CARD_IMAGE_PATH=path/to/card.png
    GAME_STATE_IMAGE_PATH=path/to/state.png
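As a rough illustration of how these values might be consumed, the sketch below reads the capture-related variables from the process environment and derives the crop rectangle. It assumes the `.env` file has already been loaded into the environment by whatever loader the scripts use. `CaptureConfig`, `load_capture_config`, and `crop_box` are hypothetical names, and the 1920x1080 frame size in the usage note is an assumption chosen only for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class CaptureConfig:
    window_name: str
    crop_left: int
    crop_right: int
    target_width: int
    target_height: int


def load_capture_config(env: Mapping[str, str] = os.environ) -> CaptureConfig:
    """Read capture settings, falling back to the values in .env.example."""
    return CaptureConfig(
        window_name=env.get("WINDOW_NAME", "BlueStacks App Player 1"),
        crop_left=int(env.get("CROP_LEFT", "657")),
        crop_right=int(env.get("CROP_RIGHT", "657")),
        target_width=int(env.get("TARGET_WIDTH", "480")),
        target_height=int(env.get("TARGET_HEIGHT", "854")),
    )


def crop_box(frame_width: int, frame_height: int,
             cfg: CaptureConfig) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) after trimming the side margins."""
    left = cfg.crop_left
    right = frame_width - cfg.crop_right
    if right <= left:
        raise ValueError("crop offsets exceed the frame width")
    return left, 0, right, frame_height
```

For an assumed 1920x1080 frame, `crop_box` yields a 606x1080 region. Its aspect ratio (about 0.561) is close to that of the 480x854 target (about 0.562), so the subsequent resize introduces very little distortion.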
| Layer | Description |
|---|---|
| Capture & Preprocess | Event-driven window capture → crop → standard 480p tensor. |
| CV Extraction | YOLO (troops), OCR (tower health), pixel counting (elixir), template match (cards). |
| State Assembly | Handcrafted feature vector (global + unit + tower + hand features). |
| Action Space | Hierarchical MultiDiscrete: [card_slot, x_tile, y_tile] with action masking. |
| RL Core | PPO (Stable-Baselines3) with composite reward (terminal + shaping). |
| Network | Multi-modal (CNN for spatial grid + MLP for scalars, dual actor/critic heads). |
| Scaling (Future) | Parallel rollouts & self-play (Ray RLlib) after single-instance stability. |
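The `[card_slot, x_tile, y_tile]` action in the table eventually has to become screen coordinates for an ADB tap. The sketch below maps a tile index to the pixel at the centre of that tile and formats an Android `input tap` shell command. The arena rectangle is a hypothetical parameter (the real bounds depend on the device resolution and are not given here), and `tile_to_pixel` and `tap_command` are illustrative names. Selecting the card slot would need a separate tap and is not shown.

```python
from typing import Tuple

GRID_X, GRID_Y = 32, 18  # tile counts from MultiDiscrete([4, 32, 18])


def tile_to_pixel(x_tile: int, y_tile: int,
                  arena: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """Map a tile index to the pixel at the centre of that tile.

    `arena` is (left, top, right, bottom) in device pixels. The caller
    must supply it; no real arena bounds are assumed here.
    """
    if not (0 <= x_tile < GRID_X and 0 <= y_tile < GRID_Y):
        raise ValueError("tile index outside the action grid")
    left, top, right, bottom = arena
    px = left + (x_tile + 0.5) * (right - left) / GRID_X
    py = top + (y_tile + 0.5) * (bottom - top) / GRID_Y
    return round(px), round(py)


def tap_command(x: int, y: int) -> str:
    """Format the Android shell command for a single tap."""
    return f"input tap {x} {y}"
```

Targeting tile centres rather than corners keeps taps away from tile boundaries, which leaves room for small coordinate offsets without landing on a neighbouring tile.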
- Integrate capture + downscale + template match into a single prototype pipeline script.
- Add timing benchmarks (capture latency, preprocessing ms, template match ms).
- Introduce structured logging & error handling wrappers.
- Draft minimal `ClashRoyaleEnv` scaffold (reset/step placeholders) once the pipeline is stable.
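For the timing benchmarks listed above, a small helper along the following lines could collect per-stage latencies. This is a sketch using only the standard library; `StageTimer` and the stage names are hypothetical and not part of the existing scripts.

```python
import time
from contextlib import contextmanager
from statistics import mean
from typing import Dict, Iterator, List


class StageTimer:
    """Collect wall-clock samples (in milliseconds) per named pipeline stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = {}

    @contextmanager
    def measure(self, stage: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the sample even if the timed block raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples.setdefault(stage, []).append(elapsed_ms)

    def summary(self) -> Dict[str, float]:
        """Return the mean latency in milliseconds for each stage."""
        return {stage: mean(values) for stage, values in self.samples.items()}
```

Each pipeline step would be wrapped as `with timer.measure("capture"): ...`, and `timer.summary()` then gives mean milliseconds for capture, preprocessing, and template matching.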
The agent uses PaddleOCR 2.7.3 with ONNX Runtime for high-speed digit extraction from elixir and tower health regions.
- Original PaddleOCR 3.x: ~800ms total (150ms elixir + 650ms towers)
- PaddleOCR 2.7.3 Standard: ~437ms total (62.51ms × 7 calls)
- ONNX Runtime Sequential: ~301ms total (43.01ms × 7 calls)
- ONNX Runtime Parallel (7 workers): ~91ms total ⚡ (3.84x speedup)
Total speedup: 8.8x faster than original (800ms → 91ms)
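The totals above can be rechecked with a few lines of arithmetic. The snippet uses only the figures listed in this section; the variable names are illustrative.

```python
CALLS = 7  # one OCR call per ROI

paddle3_total_ms = 800.0        # original PaddleOCR 3.x
paddle2_per_call_ms = 62.51     # PaddleOCR 2.7.3 standard
onnx_seq_per_call_ms = 43.01    # ONNX Runtime, sequential
onnx_parallel_total_ms = 91.0   # ONNX Runtime, 7 workers

paddle2_total_ms = paddle2_per_call_ms * CALLS
onnx_seq_total_ms = onnx_seq_per_call_ms * CALLS


def speedup(baseline_ms: float, new_ms: float) -> float:
    """Ratio of baseline latency to new latency."""
    return baseline_ms / new_ms


print(f"PaddleOCR 2.7.3 total: {paddle2_total_ms:.1f} ms")
print(f"ONNX sequential total: {onnx_seq_total_ms:.1f} ms")
print(f"3.x -> ONNX parallel:  {speedup(paddle3_total_ms, onnx_parallel_total_ms):.1f}x")
print(f"ONNX seq -> parallel:  {speedup(onnx_seq_total_ms, onnx_parallel_total_ms):.1f}x")
```

The overall ratio reproduces the 8.8x figure. Dividing the listed sequential ONNX total by the parallel total gives roughly 3.3x, so the quoted 3.84x presumably comes from a baseline or run that is not listed here.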
Run the setup script to download and convert PaddleOCR models to ONNX format:

    .\setup_paddleocr2_onnx.ps1

This script will:
- Download PP-OCRv3 detection, PP-OCRv4 recognition, and v2.0 classification models
- Extract and convert them to ONNX format using paddle2onnx
- Save to `inference/det_onnx/`, `inference/rec_onnx/`, and `inference/cls_onnx/`
Requirements:
- PaddleOCR 2.7.3 (not 3.x)
- NumPy 1.26.4 (for imgaug compatibility)
- paddle2onnx 2.0.2rc3 (install via `paddlex --install paddle2onnx`)
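Since the pins above are strict, a quick check of the installed versions can save debugging time. The sketch below uses the standard library's `importlib.metadata`. The distribution names (`paddleocr`, `numpy`, `paddle2onnx`) are assumed to match the installed package names, and `check_pins` is a hypothetical helper.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Pinned versions from the requirements list above.
PINNED = {
    "paddleocr": "2.7.3",
    "numpy": "1.26.4",
    "paddle2onnx": "2.0.2rc3",
}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Return {package: problem} for every missing or mismatched pin."""
    problems: Dict[str, str] = {}
    for name, wanted in PINNED.items():
        found = lookup(name)
        if found is None:
            problems[name] = f"not installed (want {wanted})"
        elif found != wanted:
            problems[name] = f"found {found}, want {wanted}"
    return problems
```

An empty dictionary from `check_pins()` means all three pins are satisfied in the active environment.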
The OCR system processes 7 ROIs in parallel:
- 1 Elixir counter: Single digit (0-10)
- 6 Tower health displays: 3-4 digits (friendly + enemy king/princess towers)
A `ThreadPoolExecutor` with 7 workers runs the ONNX inference calls concurrently, one per ROI. This cuts the total from ~301ms to ~91ms; the threads can overlap because ONNX Runtime's inference runs in thread-safe native (C++) code.
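The fan-out over the seven ROIs can be sketched as follows. The `recognize` callable stands in for the real ONNX-backed recognizer, which is not shown here, and the ROI names are hypothetical labels for the elixir counter and the six tower displays.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, TypeVar

Crop = TypeVar("Crop")

# Hypothetical ROI labels: 1 elixir counter + 6 tower health displays.
ROI_NAMES = [
    "elixir",
    "friendly_king", "friendly_princess_left", "friendly_princess_right",
    "enemy_king", "enemy_princess_left", "enemy_princess_right",
]


def read_all_rois(
    crops: Dict[str, Crop],
    recognize: Callable[[Crop], str],
    workers: int = 7,
) -> Dict[str, str]:
    """Run `recognize` on every ROI crop concurrently and return text per ROI."""
    names = list(crops)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so names and texts line up.
        texts = list(pool.map(recognize, (crops[name] for name in names)))
    return dict(zip(names, texts))
```

Threads only help here if `recognize` spends its time outside the Python interpreter lock, as native inference calls typically do. A pure-Python recognizer would see little benefit.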
To set up the environment and run the helper scripts:

    python -m venv .venv
    .\.venv\Scripts\activate  # Windows PowerShell
    pip install -r requirements.txt
    cp HelperScripts/.env.example .env
    # Edit .env with correct window name & paths

    # Set up ONNX models for fast OCR
    .\setup_paddleocr2_onnx.ps1

    # Test helper scripts
    python HelperScripts/windows-capture-testing.py
    python "HelperScripts/windows-capture-with required resolution.py"
    python HelperScripts/py_adb.py

- Coordinate jitter & variable tap delays.
- Avoid deterministic frame pacing for action issuance.
- Optional throttling & humanized randomization for non-critical actions.
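Coordinate jitter and variable delays like those listed above could look roughly like this. All numeric defaults (jitter radius, base delay, spread) are placeholder assumptions, and `jitter_point` and `tap_delay` are hypothetical helper names.

```python
import random
from typing import Optional, Tuple


def jitter_point(x: int, y: int, radius_px: int = 6,
                 rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Offset a tap target by up to `radius_px` pixels on each axis."""
    rng = rng or random.Random()
    dx = rng.randint(-radius_px, radius_px)
    dy = rng.randint(-radius_px, radius_px)
    return x + dx, y + dy


def tap_delay(base_s: float = 0.25, spread_s: float = 0.15,
              rng: Optional[random.Random] = None) -> float:
    """Return a delay in seconds drawn from [base_s, base_s + spread_s]."""
    rng = rng or random.Random()
    return base_s + rng.uniform(0.0, spread_s)
```

Passing a seeded `random.Random` makes a run reproducible for debugging, while the default unseeded generator keeps timing non-deterministic in normal use.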
This README will expand as the project transitions from tooling prototypes to the formal RL environment and training stack.