Control any platform with Skills

Agent Skills are a format for extending AI coding agents with specialized capabilities. Midscene provides Agent Skills that let AI coding tools (like Claude Code, Cline, etc.) drive UI automation through CLI commands — no MCP server setup required.

Unlike MCP integration, Skills work by running CLI commands directly in the terminal. The AI agent acts as the brain: it takes screenshots, analyzes the UI, and decides which actions to perform next.

Supported platforms

| Skill | Package | CLI command | Description |
| --- | --- | --- | --- |
| Browser Automation | @midscene/web | npx @midscene/web | Headless Chrome via Puppeteer, opens new browser tabs |
| Chrome Bridge Automation | @midscene/web | npx @midscene/web --bridge | User's own Chrome browser, preserves cookies and sessions |
| Desktop Computer Automation | @midscene/computer | npx @midscene/computer | macOS, Windows, Linux desktop control |
| Android Device Automation | @midscene/android | npx @midscene/android | Android device control via ADB |
| iOS Device Automation | @midscene/ios | npx @midscene/ios | iOS device control via WebDriverAgent |

Installation

Make sure Node.js is installed, then run:

# General installation
npx skills add web-infra-dev/midscene-skills

# Claude Code
npx skills add web-infra-dev/midscene-skills -a claude-code

# OpenClaw
npx skills add web-infra-dev/midscene-skills -a openclaw

Skills repository: github.com/web-infra-dev/midscene-skills

Model configuration

Midscene skills require a vision model with strong visual grounding capabilities. Configure the following environment variables — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):

MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"

For supported models and configuration details, see Model strategy and Common model configuration.
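As a sketch, the variables above can be written to a .env file in the project root, which Midscene loads automatically from the current working directory (all values below are placeholders, not real credentials):

```shell
# Create a .env file in the current working directory.
# Midscene reads it automatically on startup; values are placeholders.
cat > .env <<'EOF'
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
EOF
```

Keep .env out of version control (e.g. add it to .gitignore) so the API key is not committed.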

Use skills

In your AI chat assistant, invoke a skill with a natural-language instruction such as:

Open the photo app and tell me what the first photo in the album is.

Example: Coding Agent self-verifies after writing code

In this example, we ask Claude Code to develop an Electron Todo app, and after writing the code, it uses the desktop-computer-automation Skill to launch the app, interact with the UI, and take screenshots to verify the feature works as expected — no manual intervention or test scripts needed.

Prompt:

Build an Electron Todo app with add, toggle, and delete functionality.
After development, launch the app and verify with desktop automation: add 3 todos, check one off, delete one, and take a screenshot to confirm the final state is correct.

The coding agent autonomously completes the entire workflow: write the Todo component → launch the Electron app → connect to the desktop → take screenshots to understand the UI → interact via natural language → take screenshots to verify results. The developer only describes the intent, and Skills give the agent the ability to "see the screen and move the mouse", letting it verify its own code just like a human would.

More use cases

Skills go beyond local desktop testing. By combining different Skills, you can cover a wide range of automation scenarios:

  • Desktop app testing — Verify functionality of Electron, Qt, WPF and other desktop applications
  • Remote computer control — Operate applications on remote machines via remote desktop connections for remote ops and debugging
  • Mobile app testing — Use @midscene/android and @midscene/ios Skills to test mobile apps on real devices or simulators
  • Cross-app workflows — Chain operations across multiple apps, e.g. fetch data from browser → paste into Excel → take screenshot and send to Slack
  • CI/CD integration — Run desktop automation in headless mode on Linux CI via Xvfb, no physical display needed
  • Daily task automation — Batch form filling, scheduled screenshot monitoring, automatic file organization, etc.
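For the CI/CD scenario, a Linux CI step might look like the following sketch. It assumes xvfb (and its xvfb-run wrapper) is installed on the runner; the instruction string is illustrative, not a fixed CLI syntax:

```shell
# Hypothetical CI step: run the desktop-automation skill under a
# virtual X display, so no physical screen is needed.
# Assumes the xvfb package is installed on the Linux runner.
xvfb-run --auto-servernum \
  npx @midscene/computer "open the app under test and take a screenshot"
```

This is a configuration fragment for a CI pipeline; the model environment variables from the previous section must also be available in the CI job for the skill to work.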

More

Please refer to the Skills Repository for more details.