Automate with scripts in YAML

In most cases, developers write automation scripts only to run a few smoke tests, such as checking that certain content appears or that a key user path is reachable. In such situations, maintaining a large test project is unnecessary.

Midscene offers a way to perform automation using .yaml files, which helps you focus on the script itself rather than the testing framework. This allows any team member to write automation scripts without needing to learn any API.

Here is an example; reading it should be enough to understand how it works.

web:
  url: https://www.bing.com

tasks:
  - name: Search for weather
    flow:
      - ai: Search for "today's weather"
      - sleep: 3000

  - name: Check results
    flow:
      - aiAssert: The results show weather information

Sample Project

You can find a sample project that uses YAML scripts for automation here:

Set up API keys for the model

Set your model configuration through environment variables. You may refer to Model strategy for more details.

export MIDSCENE_MODEL_BASE_URL="https://replace-with-your-model-service-url/v1"
export MIDSCENE_MODEL_API_KEY="replace-with-your-api-key"
export MIDSCENE_MODEL_NAME="replace-with-your-model-name"
export MIDSCENE_MODEL_FAMILY="replace-with-your-model-family"

For more configuration details, please refer to Model strategy and Model configuration.

To execute YAML workflows from the command line, install the Midscene CLI. See Command line tools for installation guidance, .env usage, and details on the midscene runner.
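As a minimal sketch (assuming the CLI is published as @midscene/cli and exposes a midscene binary; confirm the package name and flags on the Command line tools page):

```shell
# Install the Midscene CLI globally, or run it ad hoc with npx
npm install -g @midscene/cli

# Execute a YAML script; model settings are read from the
# environment or from a .env file in the working directory
midscene ./search-weather.yaml
```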

Script file structure

A script file uses YAML to describe automation tasks: it defines the target to be manipulated (such as a webpage or an Android app) and the series of steps to perform.

A standard .yaml script file includes a web, android, or ios section to configure the environment, an optional agent section to configure AI agent behavior, and a tasks section to define the automation tasks.

web:
  url: https://www.bing.com

# The tasks section defines the series of steps to be executed
tasks:
  - name: Search for weather
    flow:
      - ai: Search for "today's weather"
      - sleep: 3000
      - aiAssert: The results show weather information

The agent part

The agent section configures AI agent behavior and test report options. All fields are optional.

# AI agent configuration
agent:
  # Test identifier, used for reporting and cache identification, optional
  testId: <string>

  # Report group name, optional
  groupName: <string>

  # Report group description, optional
  groupDescription: <string>

  # Whether to generate test reports, optional, defaults to true
  generateReport: <boolean>

  # Whether to automatically print report messages, optional, defaults to true
  autoPrintReportMsg: <boolean>

  # Custom report file name, optional
  reportFileName: <string>

  # Maximum AI replanning cycle limit, optional, defaults to 20 (40 for UI-TARS model)
  replanningCycleLimit: <number>

  # Background knowledge to send to the AI model when calling aiAct, optional
  aiActContext: <string>
  # Legacy alias (aiActionContext) remains for backward compatibility, but avoid using it in new scripts

  # Cache configuration, optional
  cache:
    # Cache strategy, optional, values: 'read-only' | 'read-write' | 'write-only'
    strategy: <string>
    # Cache ID, required
    id: <string>

Agent Configuration Notes
  • Applicable environments: Web, iOS, and Android environments all support agent configuration
  • testId priority: CLI parameter > YAML agent.testId > filename
  • aiActContext: Provides background knowledge to the AI model, such as how to handle popups or a brief introduction to the business. The legacy alias aiActionContext remains for backward compatibility but should not be used in new scripts.
  • Cache configuration: For detailed usage, refer to the Caching documentation

Usage example

# agent configuration, applies to all environments
agent:
  testId: "checkout-test"
  groupName: "E2E Test Suite"
  groupDescription: "Complete checkout flow testing"
  generateReport: true
  autoPrintReportMsg: false
  reportFileName: "checkout-report"
  replanningCycleLimit: 30
  aiActContext: "If any popup appears, click agree. If login page appears, skip it."
  cache:
    id: "checkout-cache"
    strategy: "read-write"

# iOS environment configuration
ios:
  launch: https://www.bing.com
  wdaPort: 8100

# Or Android environment configuration
android:
  deviceId: s4ey59
  launch: https://www.bing.com

tasks:
  - name: Search for weather
    flow:
      - ai: Search for "today's weather"
      - aiAssert: The results show weather information

The web part

web:
  # The URL to visit, required. If `serve` is set, use a path relative to the served directory.
  url: <url>

  # Serve a local path as a static server, optional.
  serve: <root-directory>

  # The browser user agent, optional.
  userAgent: <ua>

  # The browser viewport width, optional, defaults to 1280.
  viewportWidth: <width>

  # The browser viewport height, optional, defaults to 960.
  viewportHeight: <height>

  # The browser's device pixel ratio, optional, defaults to 1.
  deviceScaleFactor: <scale>

  # Path to a JSON format browser cookie file, optional.
  cookie: <path-to-cookie-file>

  # The strategy for waiting for network idle, optional.
  waitForNetworkIdle:
    # The timeout in milliseconds, optional, defaults to 2000ms.
    timeout: <ms>
    # Whether to continue on timeout, optional, defaults to true.
    continueOnNetworkIdleError: <boolean>

  # The path to the JSON file for outputting aiQuery/aiAssert results, optional.
  output: <path-to-output-file>

  # Whether to save log content to a JSON file, optional, defaults to `false`. If true, saves to `unstableLogContent.json`. If a string, saves to the specified path. The log content structure may change in the future.
  unstableLogContent: <boolean | path-to-unstable-log-file>

  # Whether to restrict page navigation to the current tab, optional, defaults to true.
  forceSameTabNavigation: <boolean>

  # The bridge mode, optional, defaults to false. Can be 'newTabWithUrl' or 'currentTab'. See below for more details.
  bridgeMode: false | 'newTabWithUrl' | 'currentTab'

  # Whether to close newly created tabs when the bridge disconnects, optional, defaults to false.
  closeNewTabsAfterDisconnect: <boolean>

  # Whether to ignore HTTPS certificate errors, optional, defaults to false.
  acceptInsecureCerts: <boolean>

  # Custom Chrome launch arguments (Puppeteer only, not supported in bridge mode), optional.
  # Use this to customize Chrome browser behavior, such as disabling third-party cookie blocking.
  # ⚠️ Security Warning: Some arguments (e.g., --no-sandbox, --disable-web-security) may reduce browser security.
  # Use only in controlled testing environments.
  chromeArgs:
    - '--disable-features=ThirdPartyCookiePhaseout'
    - '--disable-features=SameSiteByDefaultCookies'
    - '--window-size=1920,1080'
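For example, a minimal web section that serves a local directory and writes extracted data to a file could look like this (paths are illustrative):

```yaml
web:
  # Serve ./public as a static site and open index.html from it
  serve: ./public
  url: index.html

  viewportWidth: 1440
  viewportHeight: 900

  # Write aiQuery / aiAssert results to this JSON file
  output: ./output/result.json
```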

The android part

android:
  # The device ID, optional, defaults to the first connected device.
  deviceId: <device-id>

  # The launch URL, optional, defaults to the device's current page.
  launch: <url>

  # The path to the JSON file for outputting aiQuery/aiAssert results, optional.
  output: <path-to-output-file>

  # All other options supported by the AndroidDevice constructor
  # For example: androidAdbPath, remoteAdbHost, remoteAdbPort,
  # imeStrategy, displayId, autoDismissKeyboard, keyboardDismissStrategy,
  # screenshotResizeScale, alwaysRefreshScreenInfo, etc.
  # See the AndroidDevice constructor documentation for the complete list

View Complete Android Configuration Options

YAML scripts now support all configuration options from the AndroidDevice constructor. For the complete list of options, please refer to AndroidDevice Constructor in Android Integration Documentation.

Android Platform-Specific Actions

runAdbShell - Execute ADB Shell Commands

Execute ADB shell commands on Android devices.

android:
  deviceId: 'test-device'

tasks:
  - name: Clear app data
    flow:
      - runAdbShell: 'pm clear com.example.app'

  - name: Get battery info
    flow:
      - runAdbShell: 'dumpsys battery'

Common ADB Shell Commands:

  • pm clear <package> - Clear app data
  • dumpsys battery - Get battery information
  • dumpsys window - Get window information
  • settings get secure android_id - Get device ID
  • input keyevent <keycode> - Send key events
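Combining these, a short flow that sends the hardware Back key (keycode 4 is KEYCODE_BACK) might look like:

```yaml
tasks:
  - name: Press the Back button
    flow:
      # KEYCODE_BACK
      - runAdbShell: 'input keyevent 4'
      - sleep: 1000
```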

launch - Launch App or URL

Launch Android apps or open URLs.

android:
  deviceId: 'test-device'

tasks:
  - name: Launch Settings app
    flow:
      - launch:
          uri: com.android.settings

  - name: Open webpage
    flow:
      - launch:
          uri: https://www.example.com

The ios part

ios:
  # WebDriverAgent port, optional, defaults to 8100.
  wdaPort: <port>

  # WebDriverAgent host address, optional, defaults to localhost.
  wdaHost: <host>

  # Whether to auto dismiss keyboard, optional, defaults to false.
  autoDismissKeyboard: <boolean>

  # Launch URL or app bundle ID, optional, defaults to the device's current page.
  launch: <url-or-bundle-id>

  # The path to the JSON file for outputting aiQuery/aiAssert results, optional.
  output: <path-to-output-file>

  # Whether to save log content to a JSON file, optional, defaults to `false`. If true, saves to `unstableLogContent.json`. If a string, saves to the specified path. The log content structure may change in the future.
  unstableLogContent: <boolean | path-to-unstable-log-file>

  # All other options supported by the IOSDevice constructor
  # See the IOSDevice constructor documentation for the complete list

View Complete iOS Configuration Options

YAML scripts now support all configuration options from the IOSDevice constructor. For the complete list of options, please refer to IOSDevice constructor in the iOS API reference.

iOS Platform-Specific Actions

runWdaRequest - Execute WebDriverAgent API Requests

Execute WebDriverAgent API requests directly on iOS devices.

ios:
  launch: 'com.apple.mobilesafari'

tasks:
  - name: Press home button via WDA
    flow:
      - runWdaRequest:
          method: POST
          endpoint: /session/test/wda/pressButton
          data:
            name: home

  - name: Get device information
    flow:
      - runWdaRequest:
          method: GET
          endpoint: /wda/device/info

Parameters:

  • method (string, required): HTTP method (GET, POST, DELETE, etc.)
  • endpoint (string, required): WebDriverAgent API endpoint
  • data (any, optional): Request body data

Common WebDriverAgent Endpoints:

  • /wda/screen - Get screen information
  • /wda/device/info - Get device information
  • /session/{sessionId}/wda/pressButton - Press hardware buttons
  • /session/{sessionId}/wda/apps/launch - Launch apps
  • /session/{sessionId}/wda/apps/activate - Activate apps

launch - Launch App or URL

Launch iOS apps or open URLs.

ios:
  wdaPort: 8100

tasks:
  - name: Launch Settings app
    flow:
      - launch:
          uri: com.apple.Preferences

  - name: Open webpage
    flow:
      - launch:
          uri: https://www.example.com

The tasks part

The tasks part is an array that defines the steps of the script. Remember to prefix each step with a `-` to mark it as an array item.

The interfaces in the flow section are almost identical to the API, with some differences in parameter nesting levels.

tasks:
  - name: <name>
    continueOnError: <boolean> # Optional, whether to continue to the next task on error, defaults to false.
    flow:
      # Auto Planning (.ai)
      # ----------------

      # Perform an interaction. `ai` is a shorthand for `aiAct`.
      - ai: <prompt>
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # This usage is the same as `ai`.
      # Note: In earlier versions, this was also written as `aiAction`. The current version supports both names.
      - aiAct: <prompt>
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # Instant Action (.aiTap, .aiHover, .aiInput, .aiKeyboardPress, .aiScroll)
      # ----------------

      # Tap an element described by a prompt.
      - aiTap: <prompt>
        deepThink: <boolean> # Optional, whether to use deepThink to precisely locate the element. Defaults to False.
        xpath: <xpath> # Optional, the xpath of the target element for the operation. If provided, Midscene will prioritize this xpath to find the element before using the cache and the AI model. Defaults to empty.
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # Hover over an element described by a prompt.
      - aiHover: <prompt>
        deepThink: <boolean> # Optional, whether to use deepThink to precisely locate the element. Defaults to False.
        xpath: <xpath> # Optional, the xpath of the target element for the operation. If provided, Midscene will prioritize this xpath to find the element before using the cache and the AI model. Defaults to empty.
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # Input text into an element described by a prompt.
      - aiInput: <final text content of the input>
        locate: <prompt>
        deepThink: <boolean> # Optional, whether to use deepThink to precisely locate the element. Defaults to False.
        xpath: <xpath> # Optional, the xpath of the target element for the operation. If provided, Midscene will prioritize this xpath to find the element before using the cache and the AI model. Defaults to empty.
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # Press a key (e.g., Enter, Tab, Escape) on an element described by a prompt.
      - aiKeyboardPress: <key>
        locate: <prompt>
        deepThink: <boolean> # Optional, whether to use deepThink to precisely locate the element. Defaults to False.
        xpath: <xpath> # Optional, the xpath of the target element for the operation. If provided, Midscene will prioritize this xpath to find the element before using the cache and the AI model. Defaults to empty.
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # Scroll globally or on an element described by a prompt.
      - aiScroll:
        direction: 'down' # or 'up' | 'left' | 'right'. Defaults to 'down'.
        scrollType: 'singleAction' # or 'scrollToBottom' | 'scrollToTop' | 'scrollToRight' | 'scrollToLeft'. Defaults to 'singleAction'.
        distance: <number> # Optional, the scroll distance in pixels. Use null to let Midscene decide automatically.
        locate: <prompt> # Optional, the element to scroll on.
        deepThink: <boolean> # Optional, whether to use deepThink to precisely locate the element. Defaults to False.
        xpath: <xpath> # Optional, the xpath of the target element for the operation. If provided, Midscene will prioritize this xpath to find the element before using the cache and the AI model. Defaults to empty.
        cacheable: <boolean> # Optional, whether to cache the result of this API call when the [caching feature](./caching.mdx) is enabled. Defaults to True.

      # Log the current screenshot with a description in the report file.
      - recordToReport: <title> # Optional, the title of the screenshot. If not provided, the title will be 'untitled'.
        content: <content> # Optional, the description of the screenshot.

      # Data Extraction
      # ----------------

      # Perform a query that returns a JSON object.
      - aiQuery: <prompt> # Remember to describe the format of the result in the prompt.
        name: <name> # The key for the query result in the JSON output.

      # More APIs
      # ----------------

      # Wait for a condition to be met, with a timeout (in ms, optional, defaults to 30000).
      - aiWaitFor: <prompt>
        timeout: <ms>

      # Perform an assertion.
      - aiAssert: <prompt>
        errorMessage: <error-message> # Optional, the error message to print if the assertion fails.
        name: <name> # Optional, give the assertion a name, which will be used as a key in the JSON output.

      # Wait for a specified amount of time.
      - sleep: <ms>

      # Execute a piece of JavaScript code in the web page context.
      - javascript: <javascript>
        name: <name> # Optional, assign a name to the return value, which will be used as a key in the JSON output.

  - name: <name>
    flow:
      # ...
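Putting several of these steps together, a task that extracts data and asserts on it could look like this (the site and prompts are illustrative):

```yaml
web:
  url: https://www.bing.com
  # Query results keyed by `name` are written to this file
  output: ./output/weather.json

tasks:
  - name: Search and extract
    flow:
      - ai: Search for "today's weather"
      - aiWaitFor: The search results have loaded
        timeout: 10000
      - aiQuery: >
          The weather shown in the results,
          as { city: string, temperature: string }
        name: weather
      - aiAssert: The results show weather information
        name: weather-visible
```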

Prompting with images

For steps whose prompt accepts images (see the API reference), you can attach reference images by replacing the string prompt with an object that contains:

  • prompt: The text prompt.
  • images: (Optional) The reference images used in the prompt. Each image needs a name and a url.
  • convertHttpImage2Base64: (Optional) Converts HTTP image links to Base64 before sending them to the model, which is useful when the link is not publicly accessible.

Image URLs can be local paths, Base64 strings, or remote links. When an image link is not accessible to the model, set convertHttpImage2Base64: true so that Midscene downloads the image and sends its Base64 string to the model.

For interactions like aiTap, aiHover, aiDoubleClick, aiRightClick, put the text and images in the locate field.

tasks:
  - name: Verify branding
    flow:
      - aiHover:
          locate:
            prompt: Move the cursor to the region containing the GitHub logo.
            images:
              - name: GitHub logo
                url: https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png
            convertHttpImage2Base64: true

      - aiTap:
          locate:
            prompt: Tap the region containing the GitHub logo.
            images:
              - name: GitHub logo
                url: https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png
            convertHttpImage2Base64: true

For insight steps like aiAsk, aiQuery, aiBoolean, aiNumber, aiString, and aiAssert, you can set the prompt and images fields directly.

tasks:
  - name: Verify branding
    flow:
      - aiAssert:
          prompt: Check whether the image appears on the page.
          images:
            - name: target logo
              url: https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png
          convertHttpImage2Base64: true