GitHub - droidrun/ios-portal

A comprehensive iOS automation portal that provides HTTP API access to iOS device UI state extraction and automated interactions.

Overview

The Droidrun iOS Portal is a specialized iOS application that runs UI tests to expose device automation capabilities through a RESTful HTTP API. It consists of two main components:

Portal App (droidrun-ios-portal): A minimal SwiftUI application that serves as the host
Portal Server (droidrun-ios-portalUITests): XCTest-based HTTP server providing automation APIs

Architecture

The portal leverages iOS XCTest framework and XCUITest capabilities to:

Extract UI state information (accessibility trees, screenshots)
Perform automated interactions (taps, swipes, text input)
Launch and manage applications
Handle device-level inputs

Key Components

DroidrunPortalServer: XCTest class that runs an HTTP server on port 6643
DroidrunPortalHandler: HTTP route handler defining the REST API endpoints
DroidrunPortalTools: Core automation engine implementing device interactions
AccessibilityTree: UI state extraction and compression utilities

API Reference

Device Information

GET `/`

Returns basic device information and description.

Response:

{
  "description": "Device description string"
}

Vision & State Extraction

GET `/vision/state`

Retrieves current phone state including active app and keyboard status.

Response:

{
  "activity": "com.example.app - Screen Title",
  "keyboardShown": false
}

GET `/vision/a11y`

Extracts the accessibility tree of the current UI state.

Response:

{
  "accessibilityTree": "Compressed accessibility tree string"
}

GET `/vision/screenshot`

Captures a screenshot of the current screen.

Response: PNG image data (Content-Type: image/png)

App Management

POST `/inputs/launch`

Launches an application by bundle identifier.

Request Body:

{
  "bundleIdentifier": "com.example.app"
}

Response:

{
  "message": "opened com.example.app"
}

Gesture Automation

POST `/gestures/tap`

Performs tap gestures on screen coordinates.

Request Body:

{
  "rect": "{{x,y},{width,height}}",
  "count": 1,
  "longPress": false
}

Response:

{
  "message": "tapped element"
}

POST `/gestures/swipe`

Performs swipe gestures from specified coordinates.

Request Body:

{
  "x": 100.0,
  "y": 200.0,
  "dir": "up"
}

Supported directions: up, down, left, right

Response:

{
  "message": "swiped"
}

Input Automation

POST `/inputs/type`

Enters text into a focused input field.

Request Body:

{
  "rect": "{{x,y},{width,height}}",
  "text": "Hello World"
}

Response:

{
  "message": "entered text"
}

POST `/inputs/key`

Presses device hardware keys.

Request Body:

{
  "key": 0
}

Supported keys:

0: Home button
4: Action button
5: Camera button

Response:

{
  "message": "pressed key"
}

Features

UI State Extraction

Accessibility Tree: Compressed representation of the UI hierarchy with memory addresses removed
Screenshots: PNG format screen captures
App State: Current application context and keyboard status

Automation Capabilities

App Launching: Launch any installed app by bundle identifier
Touch Interactions: Single taps, double taps, long presses
Gesture Recognition: Swipe gestures in four directions
Text Input: Automated typing with keyboard handling
Hardware Keys: Device button presses

Smart Features

App Management: Automatic app switching and state management
Keyboard Detection: Intelligent keyboard presence detection
Focus Management: Ensures proper element focus for text input
Error Handling: Comprehensive error reporting and validation

Usage

Prerequisites

iOS device or simulator
Xcode with XCTest capabilities
Network access to the device

Running the Portal

Build and run the portal app on the target iOS device
The XCTest suite will automatically start the HTTP server on port 6643
The server will continue running until the test session ends

Client Integration

The portal is designed to work with automation agents that can:

Send HTTP requests to the portal endpoints
Process accessibility tree data for UI understanding
Coordinate multiple automation actions
Handle screenshot analysis for visual verification

Example Client Usage

import requests

# Get device info
response = requests.get('http://device-ip:6643/')
device_info = response.json()

# Take screenshot
screenshot = requests.get('http://device-ip:6643/vision/screenshot')
with open('screenshot.png', 'wb') as f:
    f.write(screenshot.content)

# Get accessibility tree
a11y = requests.get('http://device-ip:6643/vision/a11y').json()
print(a11y['accessibilityTree'])

# Launch app
requests.post('http://device-ip:6643/inputs/launch', 
              json={'bundleIdentifier': 'com.apple.mobilesafari'})

# Perform tap
requests.post('http://device-ip:6643/gestures/tap',
              json={'rect': '{{100,200},{50,50}}', 'count': 1})

A fully working example

import asyncio
from typing import List, Dict, Any, Tuple
from droidrun import IOSTools, DroidAgent
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CompleteIOSTools(IOSTools):
    """Complete implementation of IOSTools with all required abstract methods."""

    def _set_context(self, ctx):
        """Set the workflow context (required by DroidAgent)."""
        self._ctx = ctx

    async def get_date(self) -> str:
        """Get the current date and time on iOS device."""
        try:
            import requests

            date_url = f"{self.url}/system/date"
            response = requests.get(date_url)
            if response.status_code == 200:
                return response.json().get("date", "Unknown")
            else:
                # Fallback to returning current system time
                import datetime

                return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        except Exception:
            import datetime

            return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    async def get_apps(self, include_system: bool = True) -> List[Dict[str, Any]]:
        """Get installed apps with bundle identifier and name."""
        # Use await because we override list_packages to be async
        packages = await self.list_packages(include_system_apps=include_system)
        # Convert to format expected by the method
        return [{"package": pkg, "label": pkg.split(".")[-1]} for pkg in packages]

    def _extract_element_coordinates_by_index(self, index: int) -> Tuple[int, int]:
        """Extract center coordinates from an element by its index."""
        if not self.clickable_elements_cache:
            raise ValueError("No UI elements cached. Call get_state first.")

        # Find element with the given index
        for element in self.clickable_elements_cache:
            if element.get("index") == index:
                center_x = element.get("center_x")
                center_y = element.get("center_y")
                if center_x is not None and center_y is not None:
                    return (int(center_x), int(center_y))

        raise ValueError(f"No element found with index {index}")

    async def input_text(self, text: str, index: int = -1, clear: bool = False) -> str:
        """
        Input text on the iOS device.

        Args:
            text: Text to input. Can contain spaces, newlines, and special characters including non-ASCII.
            index: Element index to input text into (optional, -1 means use last tapped element)
            clear: Whether to clear existing text before input (not currently supported for iOS)

        Returns:
            Result message
        """
        try:
            import requests
            import time

            # If index is provided and valid, tap on that element first
            if index >= 0:
                await self.tap_by_index(index)

            # Note: clear parameter is not currently supported by iOS portal API
            # Future enhancement could add support for clearing text

            # Use the last tapped element's rect if available, otherwise use a default
            rect = self.last_tapped_rect if self.last_tapped_rect else "0,0,100,100"

            type_url = f"{self.url}/inputs/type"
            payload = {"rect": rect, "text": text}

            response = requests.post(type_url, json=payload)
            if response.status_code == 200:
                time.sleep(0.5)  # Wait for text input to complete
                return f"Text input completed: {text[:50]}{'...' if len(text) > 50 else ''}"
            else:
                return f"Error: Failed to input text. HTTP {response.status_code}"

        except Exception as e:
            return f"Error sending text input: {str(e)}"

    async def tap_on_index(self, index: int) -> str:
        """Alias for tap_by_index."""
        return await self.tap_by_index(index)

    def _format_elements(self, elements: List[Dict[str, Any]]) -> str:
        """Format elements for LLM consumption."""
        lines = []
        for elem in elements:
            idx = elem.get("index")
            text = elem.get("text", "")
            type_ = elem.get("type", "")
            lines.append(f"[{idx}] {type_} '{text}'")
        return "\n".join(lines)

    # Overrides for IOSTools sync methods to make them async for DroidAgent compatibility

    async def get_state(self):
        # Call parent sync method
        state_dict = super().get_state()

        a11y_tree = state_dict.get("a11y_tree", [])
        phone_state = state_dict.get("phone_state", {})

        formatted_text = self._format_elements(a11y_tree)
        focused_text = ""  # No focus info for now

        return formatted_text, focused_text, a11y_tree, phone_state

    async def tap_by_index(self, index: int) -> str:
        return super().tap_by_index(index)

    async def swipe(
        self, start_x: int, start_y: int, end_x: int, end_y: int, duration_ms: int = 300
    ) -> bool:
        return super().swipe(start_x, start_y, end_x, end_y, duration_ms)

    async def drag(
        self,
        start_x: int,
        start_y: int,
        end_x: int,
        end_y: int,
        duration_ms: int = 3000,
    ) -> bool:
        return super().drag(start_x, start_y, end_x, end_y, duration_ms)

    async def back(self) -> str:
        try:
            return super().back()
        except NotImplementedError:
            return "Error: Back button is not supported on iOS"

    async def press_key(self, keycode: int) -> str:
        return super().press_key(keycode)

    async def start_app(self, package: str, activity: str = "") -> str:
        return super().start_app(package, activity)

    async def take_screenshot(self) -> Tuple[str, bytes]:
        return super().take_screenshot()

    async def list_packages(self, include_system_apps: bool = False) -> List[str]:
        return super().list_packages(include_system_apps)

    async def get_memory(self) -> List[str]:
        return super().get_memory()

    async def complete(self, success: bool, reason: str = "") -> None:
        return super().complete(success, reason)


async def main():
    from droidrun import load_llm
    import os

    GEMINI_API_KEY = ""

    os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY

    tools = CompleteIOSTools(
        url="http://localhost:6643",
    )

    llm = load_llm("GoogleGenAI", model="gemini-2.5-flash")

    agent = DroidAgent(
        goal="Open Settings and check WiFi",
        tools=tools,
        llms=llm,  # Provide LLM instance
    )

    result = await agent.run()
    print(f"\n✅ Result: {result}")


if __name__ == "__main__":
    asyncio.run(main())

Technical Details

Dependencies

FlyingFox: HTTP server framework for Swift
XCTest: iOS testing framework for UI automation
SwiftUI: User interface framework

Server Configuration

Port: 6643 (configurable)
Protocol: HTTP/1.1
Content Types: JSON, PNG images
Threading: Async/await support

Coordinate System

Uses iOS coordinate system (points, not pixels)
Rectangle format: "{{x,y},{width,height}}"
Swipe coordinates specify starting points

Limitations

Requires iOS testing environment to run
Limited to apps accessible through XCUITest
Network access required for remote operation
Some system-level interactions may be restricted

Security Considerations

The portal provides full device automation access
Should only be used in controlled testing environments
Network access should be restricted to trusted clients
Consider implementing authentication for production use

Contributing

This project is part of the larger Droidrun automation framework. Contributions should focus on:

Enhanced UI state extraction
Additional gesture support
Improved error handling
Performance optimizations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Note: This is the iOS portal component of the Droidrun framework. For complete automation workflows, integrate with the corresponding agent component that orchestrates automation tasks using this portal's API.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
droidrun-ios-portal.xcodeproj		droidrun-ios-portal.xcodeproj
droidrun-ios-portal		droidrun-ios-portal
droidrun-ios-portalUITests		droidrun-ios-portalUITests
static		static
LICENSE		LICENSE
README.md		README.md
device.sh		device.sh
simulator.sh		simulator.sh

License

droidrun/ios-portal

Folders and files

Latest commit

History

Repository files navigation

Overview

Architecture

Key Components

API Reference

Device Information

GET /

Vision & State Extraction

GET /vision/state

GET /vision/a11y

GET /vision/screenshot

App Management

POST /inputs/launch

Gesture Automation

POST /gestures/tap

POST /gestures/swipe

Input Automation

POST /inputs/type

POST /inputs/key

Features

UI State Extraction

Automation Capabilities

Smart Features

Usage

Prerequisites

Running the Portal

Client Integration

Example Client Usage

A fully working example

Technical Details

Dependencies

Server Configuration

Coordinate System

Limitations

Security Considerations

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Uh oh!

Languages

GET `/`

GET `/vision/state`

GET `/vision/a11y`

GET `/vision/screenshot`

POST `/inputs/launch`

POST `/gestures/tap`

POST `/gestures/swipe`

POST `/inputs/type`

POST `/inputs/key`

Packages