A comprehensive iOS automation portal that provides HTTP API access to iOS device UI state extraction and automated interactions.
The Droidrun iOS Portal is a specialized iOS application that runs UI tests to expose device automation capabilities through a RESTful HTTP API. It consists of two main components:
- Portal App (
droidrun-ios-portal): A minimal SwiftUI application that serves as the host - Portal Server (
droidrun-ios-portalUITests): XCTest-based HTTP server providing automation APIs
The portal leverages iOS XCTest framework and XCUITest capabilities to:
- Extract UI state information (accessibility trees, screenshots)
- Perform automated interactions (taps, swipes, text input)
- Launch and manage applications
- Handle device-level inputs
- DroidrunPortalServer: XCTest class that runs an HTTP server on port 6643
- DroidrunPortalHandler: HTTP route handler defining the REST API endpoints
- DroidrunPortalTools: Core automation engine implementing device interactions
- AccessibilityTree: UI state extraction and compression utilities
Returns basic device information and description.
Response:
{
"description": "Device description string"
}Retrieves current phone state including active app and keyboard status.
Response:
{
"activity": "com.example.app - Screen Title",
"keyboardShown": false
}Extracts the accessibility tree of the current UI state.
Response:
{
"accessibilityTree": "Compressed accessibility tree string"
}Captures a screenshot of the current screen.
Response: PNG image data (Content-Type: image/png)
Launches an application by bundle identifier.
Request Body:
{
"bundleIdentifier": "com.example.app"
}Response:
{
"message": "opened com.example.app"
}Performs tap gestures on screen coordinates.
Request Body:
{
"rect": "{{x,y},{width,height}}",
"count": 1,
"longPress": false
}Response:
{
"message": "tapped element"
}Performs swipe gestures from specified coordinates.
Request Body:
{
"x": 100.0,
"y": 200.0,
"dir": "up"
}Supported directions: up, down, left, right
Response:
{
"message": "swiped"
}Enters text into a focused input field.
Request Body:
{
"rect": "{{x,y},{width,height}}",
"text": "Hello World"
}Response:
{
"message": "entered text"
}Presses device hardware keys.
Request Body:
{
"key": 0
}Supported keys:
0: Home button4: Action button5: Camera button
Response:
{
"message": "pressed key"
}- Accessibility Tree: Compressed representation of the UI hierarchy with memory addresses removed
- Screenshots: PNG format screen captures
- App State: Current application context and keyboard status
- App Launching: Launch any installed app by bundle identifier
- Touch Interactions: Single taps, double taps, long presses
- Gesture Recognition: Swipe gestures in four directions
- Text Input: Automated typing with keyboard handling
- Hardware Keys: Device button presses
- App Management: Automatic app switching and state management
- Keyboard Detection: Intelligent keyboard presence detection
- Focus Management: Ensures proper element focus for text input
- Error Handling: Comprehensive error reporting and validation
- iOS device or simulator
- Xcode with XCTest capabilities
- Network access to the device
- Build and run the portal app on the target iOS device
- The XCTest suite will automatically start the HTTP server on port 6643
- The server will continue running until the test session ends
The portal is designed to work with automation agents that can:
- Send HTTP requests to the portal endpoints
- Process accessibility tree data for UI understanding
- Coordinate multiple automation actions
- Handle screenshot analysis for visual verification
import requests
# Get device info
response = requests.get('http://device-ip:6643/')
device_info = response.json()
# Take screenshot
screenshot = requests.get('http://device-ip:6643/vision/screenshot')
with open('screenshot.png', 'wb') as f:
f.write(screenshot.content)
# Get accessibility tree
a11y = requests.get('http://device-ip:6643/vision/a11y').json()
print(a11y['accessibilityTree'])
# Launch app
requests.post('http://device-ip:6643/inputs/launch',
json={'bundleIdentifier': 'com.apple.mobilesafari'})
# Perform tap
requests.post('http://device-ip:6643/gestures/tap',
json={'rect': '{{100,200},{50,50}}', 'count': 1})import asyncio
from typing import List, Dict, Any, Tuple
from droidrun import IOSTools, DroidAgent
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class CompleteIOSTools(IOSTools):
"""Complete implementation of IOSTools with all required abstract methods."""
def _set_context(self, ctx):
"""Set the workflow context (required by DroidAgent)."""
self._ctx = ctx
async def get_date(self) -> str:
"""Get the current date and time on iOS device."""
try:
import requests
date_url = f"{self.url}/system/date"
response = requests.get(date_url)
if response.status_code == 200:
return response.json().get("date", "Unknown")
else:
# Fallback to returning current system time
import datetime
return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
except Exception:
import datetime
return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
async def get_apps(self, include_system: bool = True) -> List[Dict[str, Any]]:
"""Get installed apps with bundle identifier and name."""
# Use await because we override list_packages to be async
packages = await self.list_packages(include_system_apps=include_system)
# Convert to format expected by the method
return [{"package": pkg, "label": pkg.split(".")[-1]} for pkg in packages]
def _extract_element_coordinates_by_index(self, index: int) -> Tuple[int, int]:
"""Extract center coordinates from an element by its index."""
if not self.clickable_elements_cache:
raise ValueError("No UI elements cached. Call get_state first.")
# Find element with the given index
for element in self.clickable_elements_cache:
if element.get("index") == index:
center_x = element.get("center_x")
center_y = element.get("center_y")
if center_x is not None and center_y is not None:
return (int(center_x), int(center_y))
raise ValueError(f"No element found with index {index}")
async def input_text(self, text: str, index: int = -1, clear: bool = False) -> str:
"""
Input text on the iOS device.
Args:
text: Text to input. Can contain spaces, newlines, and special characters including non-ASCII.
index: Element index to input text into (optional, -1 means use last tapped element)
clear: Whether to clear existing text before input (not currently supported for iOS)
Returns:
Result message
"""
try:
import requests
import time
# If index is provided and valid, tap on that element first
if index >= 0:
await self.tap_by_index(index)
# Note: clear parameter is not currently supported by iOS portal API
# Future enhancement could add support for clearing text
# Use the last tapped element's rect if available, otherwise use a default
rect = self.last_tapped_rect if self.last_tapped_rect else "0,0,100,100"
type_url = f"{self.url}/inputs/type"
payload = {"rect": rect, "text": text}
response = requests.post(type_url, json=payload)
if response.status_code == 200:
time.sleep(0.5) # Wait for text input to complete
return f"Text input completed: {text[:50]}{'...' if len(text) > 50 else ''}"
else:
return f"Error: Failed to input text. HTTP {response.status_code}"
except Exception as e:
return f"Error sending text input: {str(e)}"
async def tap_on_index(self, index: int) -> str:
"""Alias for tap_by_index."""
return await self.tap_by_index(index)
def _format_elements(self, elements: List[Dict[str, Any]]) -> str:
"""Format elements for LLM consumption."""
lines = []
for elem in elements:
idx = elem.get("index")
text = elem.get("text", "")
type_ = elem.get("type", "")
lines.append(f"[{idx}] {type_} '{text}'")
return "\n".join(lines)
# Overrides for IOSTools sync methods to make them async for DroidAgent compatibility
async def get_state(self):
# Call parent sync method
state_dict = super().get_state()
a11y_tree = state_dict.get("a11y_tree", [])
phone_state = state_dict.get("phone_state", {})
formatted_text = self._format_elements(a11y_tree)
focused_text = "" # No focus info for now
return formatted_text, focused_text, a11y_tree, phone_state
async def tap_by_index(self, index: int) -> str:
return super().tap_by_index(index)
async def swipe(
self, start_x: int, start_y: int, end_x: int, end_y: int, duration_ms: int = 300
) -> bool:
return super().swipe(start_x, start_y, end_x, end_y, duration_ms)
async def drag(
self,
start_x: int,
start_y: int,
end_x: int,
end_y: int,
duration_ms: int = 3000,
) -> bool:
return super().drag(start_x, start_y, end_x, end_y, duration_ms)
async def back(self) -> str:
try:
return super().back()
except NotImplementedError:
return "Error: Back button is not supported on iOS"
async def press_key(self, keycode: int) -> str:
return super().press_key(keycode)
async def start_app(self, package: str, activity: str = "") -> str:
return super().start_app(package, activity)
async def take_screenshot(self) -> Tuple[str, bytes]:
return super().take_screenshot()
async def list_packages(self, include_system_apps: bool = False) -> List[str]:
return super().list_packages(include_system_apps)
async def get_memory(self) -> List[str]:
return super().get_memory()
async def complete(self, success: bool, reason: str = "") -> None:
return super().complete(success, reason)
async def main():
from droidrun import load_llm
import os
GEMINI_API_KEY = ""
os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY
tools = CompleteIOSTools(
url="http://localhost:6643",
)
llm = load_llm("GoogleGenAI", model="gemini-2.5-flash")
agent = DroidAgent(
goal="Open Settings and check WiFi",
tools=tools,
llms=llm, # Provide LLM instance
)
result = await agent.run()
print(f"\n✅ Result: {result}")
if __name__ == "__main__":
asyncio.run(main())
- FlyingFox: HTTP server framework for Swift
- XCTest: iOS testing framework for UI automation
- SwiftUI: User interface framework
- Port: 6643 (configurable)
- Protocol: HTTP/1.1
- Content Types: JSON, PNG images
- Threading: Async/await support
- Uses iOS coordinate system (points, not pixels)
- Rectangle format:
"{{x,y},{width,height}}" - Swipe coordinates specify starting points
- Requires iOS testing environment to run
- Limited to apps accessible through XCUITest
- Network access required for remote operation
- Some system-level interactions may be restricted
- The portal provides full device automation access
- Should only be used in controlled testing environments
- Network access should be restricted to trusted clients
- Consider implementing authentication for production use
This project is part of the larger Droidrun automation framework. Contributions should focus on:
- Enhanced UI state extraction
- Additional gesture support
- Improved error handling
- Performance optimizations
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This is the iOS portal component of the Droidrun framework. For complete automation workflows, integrate with the corresponding agent component that orchestrates automation tasks using this portal's API.