Automate with scripts in YAML
In many cases, developers write automation scripts only to run a few smoke tests, such as checking that certain content appears or that a key user path is accessible. Maintaining a heavyweight test project for this is unnecessary.
Midscene offers a way to perform automation using .yaml files, which helps you focus on the script itself rather than the testing framework. This allows any team member to write automation scripts without needing to learn any API.
Here is an example. By reading its content, you should be able to understand how it works.
```yaml
web:
  url: https://www.bing.com

tasks:
  - name: Search for weather
    flow:
      - ai: Search for "today's weather"
      - sleep: 3000

  - name: Check results
    flow:
      - aiAssert: The results show weather information
```
Sample Project
You can find a sample project that uses YAML scripts for automation here:
Set up API keys for model
Set your model configuration in environment variables. Refer to Model strategy for more details.
```bash
export MIDSCENE_MODEL_BASE_URL="https://replace-with-your-model-service-url/v1"
export MIDSCENE_MODEL_API_KEY="replace-with-your-api-key"
export MIDSCENE_MODEL_NAME="replace-with-your-model-name"
export MIDSCENE_MODEL_FAMILY="replace-with-your-model-family"
```
For more configuration details, please refer to Model strategy and Model configuration.
To execute YAML workflows from the command line, install the Midscene CLI. See Command line tools for installation guidance, .env usage, and details on the midscene runner.
Script file structure
A script file uses YAML to describe an automation job: the target to be manipulated (like a webpage or an Android app) and the series of steps to perform.
A standard .yaml script file includes a web, android, or ios section to configure the environment, an optional agent section to configure AI agent behavior, and a tasks section to define the automation tasks.
```yaml
web:
  url: https://www.bing.com

# The tasks section defines the series of steps to be executed
tasks:
  - name: Search for weather
    flow:
      - ai: Search for "today's weather"
      - sleep: 3000
      - aiAssert: The results show weather information
```
The agent part
The agent section configures AI agent behavior and test report options. All fields are optional.
```yaml
# AI agent configuration
agent:
  # Test identifier, used for reporting and cache identification, optional
  testId: <string>
  # Report group name, optional
  groupName: <string>
  # Report group description, optional
  groupDescription: <string>
  # Whether to generate test reports, optional, defaults to true
  generateReport: <boolean>
  # Whether to automatically print report messages, optional, defaults to true
  autoPrintReportMsg: <boolean>
  # Custom report file name, optional
  reportFileName: <string>
  # Maximum AI replanning cycle limit, optional, defaults to 20 (40 for UI-TARS models)
  replanningCycleLimit: <number>
  # Background knowledge to send to the AI model when calling aiAct, optional
  # (the legacy alias `aiActionContext` still works, but avoid it in new scripts)
  aiActContext: <string>
  # Cache configuration, optional
  cache:
    # Cache strategy, optional, values: 'read-only' | 'read-write' | 'write-only'
    strategy: <string>
    # Cache ID, required
    id: <string>
```
Agent Configuration Notes
- Applicable environments: Web, iOS, and Android environments all support agent configuration.
- testId priority: CLI parameter > YAML `agent.testId` > filename.
- aiActContext: provides background knowledge to the AI model, such as how to handle popups or an introduction to the business. The legacy alias `aiActionContext` remains for backward compatibility but should not be used in new scripts.
- Cache configuration: for detailed usage, refer to the Caching documentation.
Usage example
```yaml
# agent configuration, applies to all environments
agent:
  testId: "checkout-test"
  groupName: "E2E Test Suite"
  groupDescription: "Complete checkout flow testing"
  generateReport: true
  autoPrintReportMsg: false
  reportFileName: "checkout-report"
  replanningCycleLimit: 30
  aiActContext: "If any popup appears, click agree. If login page appears, skip it."
  cache:
    id: "checkout-cache"
    strategy: "read-write"

# iOS environment configuration
ios:
  launch: https://www.bing.com
  wdaPort: 8100

# Or Android environment configuration
android:
  deviceId: s4ey59
  launch: https://www.bing.com

tasks:
  - name: Search for weather
    flow:
      - ai: Search for "today's weather"
      - aiAssert: The results show weather information
```
The web part
```yaml
web:
  # The URL to visit, required. If `serve` is provided, provide the relative path.
  url: <url>
  # Serve a local path as a static server, optional.
  serve: <root-directory>
  # The browser user agent, optional.
  userAgent: <ua>
  # The browser viewport width, optional, defaults to 1280.
  viewportWidth: <width>
  # The browser viewport height, optional, defaults to 960.
  viewportHeight: <height>
  # The browser's device pixel ratio, optional, defaults to 1.
  deviceScaleFactor: <scale>
  # Path to a JSON-format browser cookie file, optional.
  cookie: <path-to-cookie-file>
  # The strategy for waiting for network idle, optional.
  waitForNetworkIdle:
    # The timeout in milliseconds, optional, defaults to 2000ms.
    timeout: <ms>
    # Whether to continue on timeout, optional, defaults to true.
    continueOnNetworkIdleError: <boolean>
  # The path to the JSON file for outputting aiQuery/aiAssert results, optional.
  output: <path-to-output-file>
  # Whether to save log content to a JSON file, optional, defaults to `false`.
  # If true, saves to `unstableLogContent.json`. If a string, saves to the specified path.
  # The log content structure may change in the future.
  unstableLogContent: <boolean | path-to-unstable-log-file>
  # Whether to restrict page navigation to the current tab, optional, defaults to true.
  forceSameTabNavigation: <boolean>
  # The bridge mode, optional, defaults to false. Can be 'newTabWithUrl' or 'currentTab'. See below for more details.
  bridgeMode: false | 'newTabWithUrl' | 'currentTab'
  # Whether to close newly created tabs when the bridge disconnects, optional, defaults to false.
  closeNewTabsAfterDisconnect: <boolean>
  # Whether to ignore HTTPS certificate errors, optional, defaults to false.
  acceptInsecureCerts: <boolean>
  # Custom Chrome launch arguments (Puppeteer only, not supported in bridge mode), optional.
  # Use this to customize Chrome browser behavior, such as disabling third-party cookie blocking.
  # ⚠️ Security Warning: Some arguments (e.g., --no-sandbox, --disable-web-security) may reduce browser security.
  # Use only in controlled testing environments.
  chromeArgs:
    - '--disable-features=ThirdPartyCookiePhaseout'
    - '--disable-features=SameSiteByDefaultCookies'
    - '--window-size=1920,1080'
```
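As an illustration, a concrete `web` section combining a few of these options might look like this (all paths and values are placeholders):

```yaml
web:
  serve: ./dist                   # serve a local build directory as a static site
  url: index.html                 # relative path, since `serve` is set
  viewportWidth: 1920
  viewportHeight: 1080
  cookie: ./fixtures/cookies.json # hypothetical cookie file exported from a logged-in session
  output: ./output/results.json   # aiQuery/aiAssert results are written here
  waitForNetworkIdle:
    timeout: 5000
    continueOnNetworkIdleError: false
```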
The android part
```yaml
android:
  # The device ID, optional, defaults to the first connected device.
  deviceId: <device-id>
  # The launch URL, optional, defaults to the device's current page.
  launch: <url>
  # The path to the JSON file for outputting aiQuery/aiAssert results, optional.
  output: <path-to-output-file>

  # All other options supported by the AndroidDevice constructor,
  # for example: androidAdbPath, remoteAdbHost, remoteAdbPort,
  # imeStrategy, displayId, autoDismissKeyboard, keyboardDismissStrategy,
  # screenshotResizeScale, alwaysRefreshScreenInfo, etc.
  # See the AndroidDevice constructor documentation for the complete list.
```
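For instance, the constructor pass-through makes remote setups possible. A sketch, assuming a reachable remote adb server (host and port are placeholders):

```yaml
android:
  deviceId: emulator-5554
  remoteAdbHost: 192.168.1.100  # hypothetical remote adb server
  remoteAdbPort: 5037
  launch: https://www.example.com
```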
runAdbShell - Execute ADB Shell Commands
Execute ADB shell commands on Android devices.
```yaml
android:
  deviceId: 'test-device'

tasks:
  - name: Clear app data
    flow:
      - runAdbShell: 'pm clear com.example.app'

  - name: Get battery info
    flow:
      - runAdbShell: 'dumpsys battery'
```
Common ADB shell commands:
- `pm clear <package>` - Clear app data
- `dumpsys battery` - Get battery information
- `dumpsys window` - Get window information
- `settings get secure android_id` - Get the device ID
- `input keyevent <keycode>` - Send key events
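For example, a key event from the list above can be combined with an assertion in a flow (keycode 4 is the Android BACK key):

```yaml
tasks:
  - name: Go back and verify
    flow:
      - runAdbShell: 'input keyevent 4'
      - aiAssert: The previous page is visible
```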
launch - Launch App or URL
Launch Android apps or open URLs.
```yaml
android:
  deviceId: 'test-device'

tasks:
  - name: Launch Settings app
    flow:
      - launch:
          uri: com.android.settings

  - name: Open webpage
    flow:
      - launch:
          uri: https://www.example.com
```
The ios part
```yaml
ios:
  # WebDriverAgent port, optional, defaults to 8100.
  wdaPort: <port>
  # WebDriverAgent host address, optional, defaults to localhost.
  wdaHost: <host>
  # Whether to auto-dismiss the keyboard, optional, defaults to false.
  autoDismissKeyboard: <boolean>
  # Launch URL or app bundle ID, optional, defaults to the device's current page.
  launch: <url-or-bundle-id>
  # The path to the JSON file for outputting aiQuery/aiAssert results, optional.
  output: <path-to-output-file>
  # Whether to save log content to a JSON file, optional, defaults to `false`.
  # If true, saves to `unstableLogContent.json`. If a string, saves to the specified path.
  # The log content structure may change in the future.
  unstableLogContent: <boolean | path-to-unstable-log-file>

  # All other options supported by the IOSDevice constructor.
  # See the IOSDevice constructor documentation for the complete list.
```
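A minimal concrete `ios` section might look like this (the output path is a placeholder; `com.apple.Preferences` is the bundle ID of the Settings app):

```yaml
ios:
  wdaHost: localhost
  wdaPort: 8100
  launch: com.apple.Preferences
  output: ./output/ios-results.json
```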
runWdaRequest - Execute WebDriverAgent API Requests
Execute WebDriverAgent API requests directly on iOS devices.
```yaml
ios:
  launch: 'com.apple.mobilesafari'

tasks:
  - name: Press home button via WDA
    flow:
      - runWdaRequest:
          method: POST
          endpoint: /session/test/wda/pressButton
          data:
            name: home

  - name: Get device information
    flow:
      - runWdaRequest:
          method: GET
          endpoint: /wda/device/info
```
Parameters:
- `method` (string, required): the HTTP method (GET, POST, DELETE, etc.)
- `endpoint` (string, required): the WebDriverAgent API endpoint
- `data` (any, optional): the request body

Common WebDriverAgent endpoints:
- `/wda/screen` - Get screen information
- `/wda/device/info` - Get device information
- `/session/{sessionId}/wda/pressButton` - Press hardware buttons
- `/session/{sessionId}/wda/apps/launch` - Launch apps
- `/session/{sessionId}/wda/apps/activate` - Activate apps
launch - Launch App or URL
Launch iOS apps or open URLs.
```yaml
ios:
  wdaPort: 8100

tasks:
  - name: Launch Settings app
    flow:
      - launch:
          uri: com.apple.Preferences

  - name: Open webpage
    flow:
      - launch:
          uri: https://www.example.com
```
The tasks part
The tasks part is an array that defines the steps of the script. Remember to add a - before each step to indicate it's an array item.
The interfaces in the flow section are almost identical to the API, with some differences in parameter nesting levels.
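For example, an option that is passed as a second argument in the JavaScript API becomes a sibling key of the action in YAML. A sketch (the prompt text is a placeholder):

```yaml
# JavaScript API: await agent.aiTap('the login button', { deepThink: true })
# YAML: the option sits at the same indentation level as the action key
- aiTap: the login button
  deepThink: true
```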
```yaml
tasks:
  - name: <name>
    continueOnError: <boolean> # Optional, whether to continue to the next task on error, defaults to false.
    flow:
      # Auto Planning (.ai)
      # ----------------
      # Perform an interaction. `ai` is a shorthand for `aiAct`.
      - ai: <prompt>
        cacheable: <boolean> # Optional, whether to cache the result of this call when the [caching feature](./caching.mdx) is enabled. Defaults to true.

      # Same as `ai`. In earlier versions this was also written as `aiAction`; both names are supported.
      - aiAct: <prompt>
        cacheable: <boolean> # Optional, defaults to true.

      # Instant Action (.aiTap, .aiHover, .aiInput, .aiKeyboardPress, .aiScroll)
      # ----------------
      # Tap an element described by a prompt.
      - aiTap: <prompt>
        deepThink: <boolean> # Optional, whether to use deepThink to precisely locate the element. Defaults to false.
        xpath: <xpath> # Optional, the xpath of the target element. If provided, Midscene tries this xpath first, before the cache and the AI model. Defaults to empty.
        cacheable: <boolean> # Optional, defaults to true.

      # Hover over an element described by a prompt.
      - aiHover: <prompt>
        deepThink: <boolean> # Optional, defaults to false.
        xpath: <xpath> # Optional, defaults to empty.
        cacheable: <boolean> # Optional, defaults to true.

      # Input text into an element described by a prompt.
      - aiInput: <final text content of the input>
        locate: <prompt>
        deepThink: <boolean> # Optional, defaults to false.
        xpath: <xpath> # Optional, defaults to empty.
        cacheable: <boolean> # Optional, defaults to true.

      # Press a key (e.g., Enter, Tab, Escape) on an element described by a prompt.
      - aiKeyboardPress: <key>
        locate: <prompt>
        deepThink: <boolean> # Optional, defaults to false.
        xpath: <xpath> # Optional, defaults to empty.
        cacheable: <boolean> # Optional, defaults to true.

      # Scroll globally or on an element described by a prompt.
      - aiScroll:
        direction: 'down' # Or 'up' | 'left' | 'right'. Defaults to 'down'.
        scrollType: 'singleAction' # Or 'scrollToBottom' | 'scrollToTop' | 'scrollToRight' | 'scrollToLeft'. Defaults to 'singleAction'.
        distance: <number> # Optional, the scroll distance in pixels. Use null to let Midscene decide automatically.
        locate: <prompt> # Optional, the element to scroll on.
        deepThink: <boolean> # Optional, defaults to false.
        xpath: <xpath> # Optional, defaults to empty.
        cacheable: <boolean> # Optional, defaults to true.

      # Log the current screenshot with a description in the report file.
      - recordToReport: <title> # Optional, the title of the screenshot. Defaults to 'untitled'.
        content: <content> # Optional, the description of the screenshot.

      # Data Extraction
      # ----------------
      # Perform a query that returns a JSON object.
      - aiQuery: <prompt> # Remember to describe the format of the result in the prompt.
        name: <name> # The key for the query result in the JSON output.

      # More APIs
      # ----------------
      # Wait for a condition to be met, with a timeout (in ms, optional, defaults to 30000).
      - aiWaitFor: <prompt>
        timeout: <ms>

      # Perform an assertion.
      - aiAssert: <prompt>
        errorMessage: <error-message> # Optional, the error message to print if the assertion fails.
        name: <name> # Optional, a name for the assertion, used as a key in the JSON output.

      # Wait for a specified amount of time.
      - sleep: <ms>

      # Execute a piece of JavaScript code in the web page context.
      - javascript: <javascript>
        name: <name> # Optional, a name for the return value, used as a key in the JSON output.

  - name: <name>
    flow:
      # ...
```
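As an illustration of how `name` keys map to the JSON output, here is a sketch that writes query and assertion results to the file configured by `output` (the URL, names, and path are placeholders):

```yaml
web:
  url: https://news.ycombinator.com
  output: ./output/results.json   # hypothetical output path

tasks:
  - name: Collect stories
    continueOnError: true
    flow:
      - aiQuery: The top 3 story titles, as a JSON array of strings
        name: topStories
      - aiAssert: The page shows a list of stories
        name: hasStories
```

The `topStories` and `hasStories` keys then appear in `./output/results.json`.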
Prompting with images
For steps whose prompt accepts images (see the API reference), you can attach reference images by replacing the string prompt with an object that contains:
- prompt: the text prompt.
- images: (optional) the reference images used in the prompt. Each image needs a name and a url.
- convertHttpImage2Base64: (optional) converts HTTP image links to Base64 before sending them to the model, which is useful when the link is not publicly accessible.
Image URLs can be local paths, Base64 strings, or remote links. When a remote link is not accessible to the model, set convertHttpImage2Base64: true so that Midscene downloads the image and sends the Base64 string to the model instead.
For interactions like aiTap, aiHover, aiDoubleClick, aiRightClick, put the text and images in the locate field.
```yaml
tasks:
  - name: Verify branding
    flow:
      - aiHover:
        locate:
          prompt: Move the cursor to the region containing the GitHub logo.
          images:
            - name: GitHub logo
              url: https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png
              convertHttpImage2Base64: true
      - aiTap:
        locate:
          prompt: Tap the region containing the GitHub logo.
          images:
            - name: GitHub logo
              url: https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png
              convertHttpImage2Base64: true
```
For insight steps like aiAsk, aiQuery, aiBoolean, aiNumber, aiString, and aiAssert, you can set the prompt and images fields directly.
```yaml
tasks:
  - name: Verify branding
    flow:
      - aiAssert:
        prompt: Check whether the image appears on the page.
        images:
          - name: target logo
            url: https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png
            convertHttpImage2Base64: true
```