# API reference (Common)

## Constructors

Midscene ships multiple agents tuned for specific automation environments. Every constructor takes the target page/device plus a shared options bag (reporting, caching, AI configuration, hooks), and then layers on platform-only switches such as navigation guards in browsers or ADB wiring on Android. Use the sections below to review the import path and platform-specific parameters for each agent:

* In Puppeteer, use [PuppeteerAgent](/web-api-reference.md#puppeteer-agent)
* In Playwright, use [PlaywrightAgent](/web-api-reference.md#playwright-agent)
* In Bridge Mode, use [AgentOverChromeBridge](/web-api-reference.md#chrome-bridge-agent)
* On Android, use [Android API reference](/android-api-reference.md)
* On iOS, use [iOS API reference](/ios-api-reference.md)
* For GUI agents integrating with your own interface, refer to [Custom Interface Agent](/integrate-with-any-interface.md)

### Parameters

All agents share these base options (a combined example follows the list):

* `generateReport: boolean`: If true, a report file will be generated. (Default: true)
* `reportFileName: string`: The name of the report file. (Default: generated by midscene)
* `autoPrintReportMsg: boolean`: If true, report messages will be printed. (Default: true)
* `cacheId: string | undefined`: If provided, this cacheId will be used to save or match the cache. (Default: undefined, means cache feature is disabled)
* `aiActContext: string`: Some background knowledge that should be sent to the AI model when calling `agent.aiAct()`, like 'close the cookie consent dialog first if it exists' (Default: undefined). Previously exposed as `aiActionContext`; the legacy name is still accepted for backward compatibility.
* `replanningCycleLimit: number`: The maximum number of `aiAct` replanning cycles. Default is 20 (40 for UI-TARS models). Prefer setting this via the agent option; reading `MIDSCENE_REPLANNING_CYCLE_LIMIT` is only for backward compatibility.
* `waitAfterAction: number`: Wait time in milliseconds after each action execution. This allows the UI to settle and stabilize before the next action. Default is 300ms.
* `onTaskStartTip: (tip: string) => void | Promise<void>`: Optional hook that fires before each execution task begins with a human-readable summary of the task (Default: undefined)
* `outputFormat: 'single-html' | 'html-and-external-assets'`: Controls how the report is generated. `'single-html'` (default) embeds all screenshots as base64 in a single HTML file. `'html-and-external-assets'` saves screenshots as separate PNG files in a subdirectory, which is useful when report files become too large. **Note**: When using `'html-and-external-assets'`, the report must be served via an HTTP server or CDN URL; it cannot be opened directly using the `file://` protocol, because browser CORS (Cross-Origin Resource Sharing) restrictions prevent loading local images from relative paths under that protocol. For local testing, navigate to the report directory and start a simple HTTP server with one of these commands:
  * Using Node.js: `npx serve`
  * Using Python: `python -m http.server` or `python3 -m http.server`
  * Then access the report via `http://localhost:3000` (or the port shown in the terminal).
* `screenshotShrinkFactor: number`: Controls the scaling ratio of screenshots to reduce the image size sent to the AI model, thereby reducing token consumption. The default value is 1 (no scaling). If set to 2, the width and height of the screenshot will be halved, and the area will be reduced to a quarter of the original. You can adjust this value based on your actual situation to find the best balance between image clarity and token consumption.
  * For mobile devices, setting `screenshotShrinkFactor` to 2 can reduce token consumption while maintaining clarity, but it is not recommended to set it higher than 3, as this may cause the image to be too blurry and affect the AI model's understanding.
  * For web pages, if the content is complex or contains a lot of details, it is not recommended to set `screenshotShrinkFactor` to avoid overly blurry screenshots. Additionally, if you want higher clarity for web page screenshots, you can configure Puppeteer or Playwright's `deviceScaleFactor` to 2, which will allow Puppeteer or Playwright to render the page as if it were a high-definition screen.
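
Below is a minimal sketch combining several of these shared options, shown with `PuppeteerAgent` and assuming an existing Puppeteer `page` object; the same options apply to every agent, and the specific values are illustrative only:

```typescript
const agent = new PuppeteerAgent(page, {
  generateReport: true,
  reportFileName: 'login-flow', // otherwise a name is generated by Midscene
  cacheId: 'login-flow-cache', // enables the caching feature under this id
  aiActContext: 'Close the cookie consent dialog first if it exists',
  replanningCycleLimit: 10,
  waitAfterAction: 500, // give the UI 500ms to settle after each action
  outputFormat: 'single-html',
  onTaskStartTip: (tip) => console.log(`[midscene] ${tip}`),
});
```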

### Custom model configuration

Use `modelConfig: Record<string, string | number>` to configure models directly in code instead of environment variables.

> If `modelConfig` is provided at agent initialization, **all system environment variables for model config are ignored**. Only the values in this object are used.
> The keys/values you can set here are the same ones listed in [Model configuration](/model-config.md). You can also see explanation in [Model strategy](/model-strategy.md).

**Basic Example (single model for all intents):**

```typescript
const agent = new PuppeteerAgent(page, {
  modelConfig: {
    MIDSCENE_MODEL_NAME: 'qwen3-vl-plus',
    MIDSCENE_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    MIDSCENE_MODEL_API_KEY: 'sk-...',
    MIDSCENE_MODEL_FAMILY: 'qwen3-vl'
  }
});
```

**Configure different models for different task types (via intent-specific keys):**

```typescript
const agent = new PuppeteerAgent(page, {
  modelConfig: {
    // default
    MIDSCENE_MODEL_NAME: 'qwen3-vl-plus',
    MIDSCENE_MODEL_API_KEY: 'sk-default-key',
    MIDSCENE_MODEL_BASE_URL: '.....',
    MIDSCENE_MODEL_FAMILY: 'qwen3-vl',

    // planning intent
    MIDSCENE_PLANNING_MODEL_NAME: 'gpt-5.1',
    MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
    MIDSCENE_PLANNING_MODEL_BASE_URL: '...',
    
    // insight intent
    MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
    MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key'
  }
});
```

### Custom OpenAI client

`createOpenAIClient: (openai, options) => Promise<OpenAI | undefined>` lets you wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.

**Parameter Description:**

* `openai: OpenAI` - The base OpenAI client instance created by Midscene with all necessary configurations (API key, base URL, proxy, etc.)
* `options: Record<string, unknown>` - OpenAI initialization options, including:
  * `baseURL?: string` - API endpoint URL
  * `apiKey?: string` - API key
  * `dangerouslyAllowBrowser: boolean` - Always true in Midscene
  * Other OpenAI configuration options

**Return Value:**

* Return the wrapped OpenAI client instance, or `undefined` to use the original instance

**Example (LangSmith Integration):**

```typescript
import { wrapOpenAI } from 'langsmith/wrappers';

const agent = new PuppeteerAgent(page, {
  createOpenAIClient: async (openai, options) => {
    // Wrap with LangSmith for planning tasks
    if (options.baseURL?.includes('planning')) {
      return wrapOpenAI(openai, {
        metadata: { task: 'planning' }
      });
    }

    // Return the original client for other tasks
    return openai;
  }
});
```

**Note:** For LangSmith and Langfuse integration, we recommend using the environment-variable approach documented in [Model configuration](/model-config.md#using-langsmith), which requires no `createOpenAIClient` code. If you provide a custom client wrapper function, it will override the auto-integration behavior from environment variables.

## Interaction methods

Below are the main APIs available for the various Agents in Midscene.

:::info Auto Planning vs. Instant Action

In Midscene, you can choose to use either auto planning or instant action.

* `agent.ai()` is for Auto Planning: Midscene automatically plans the steps and executes them. It is smarter and closer to the typical AI-agent style, but it may be slower and relies heavily on the quality of the AI model.
* `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, `agent.aiScroll()`, `agent.aiDoubleClick()`, `agent.aiRightClick()` are for Instant Action: Midscene directly performs the specified action, and the AI model is only responsible for basic tasks such as locating elements. It is faster and more reliable when you are certain about the action you want to perform (see the sketch below).

:::
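
To make the contrast concrete, here is a sketch of the same goal written in both styles; the prompts and page are illustrative:

```typescript
// Auto Planning: one instruction, and Midscene plans and executes the steps
await agent.ai('Type "Headphones" into the search box, then click the search button');

// Instant Action: you script each step; the AI model only locates the elements
await agent.aiInput('The search box', { value: 'Headphones' });
await agent.aiTap('The search button');
```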

### `agent.aiAct()` or `agent.ai()`

This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.

:::info Backward Compatibility

This method was previously named `aiAction()` in earlier versions. The current version supports both names for backward compatibility. We recommend using the new `aiAct()` method for code consistency.

:::

* Type

```typescript
function aiAct(
  prompt: string,
  options?: {
    cacheable?: boolean;
    deepThink?: 'unset' | true | false;
    deepLocate?: boolean;
    fileChooserAccept?: string | string[];
    abortSignal?: AbortSignal;
  },
): Promise<void>;
function ai(prompt: string): Promise<void>; // shorthand form
```

* Parameters:

  * `prompt: string` - A natural language description of the UI steps.
  * `options?: Object` - Optional, a configuration object containing:
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.
    * `deepThink?: 'unset' | true | false` - Whether to enable deep thinking during planning when the model supports it (depends on `MIDSCENE_MODEL_FAMILY`). Default is `'unset'` (same as omitting) and follows the model provider's default strategy. When explicitly set to `true` or `false`, it overrides the `MIDSCENE_MODEL_REASONING_ENABLED` environment variable. [Learn more about deepThink](/model-strategy.md#about-the-deepthink-option-in-aiact).
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). False by default.
    * `fileChooserAccept?: string | string[]` - When a file chooser pops up, specify the file path(s) to accept. Can be a single file path or an array of paths. Only available in web pages (Playwright, Puppeteer).
      * **Note**: If the file input does not support multiple files (no `multiple` attribute) but multiple files are provided, an error will be thrown.
      * **Note**: If a file chooser is triggered but no `fileChooserAccept` parameter is provided, the file chooser will be ignored and the page can continue to operate normally.
    * `abortSignal?: AbortSignal` - An optional [AbortSignal](https://developer.mozilla.org/en-US/docs/Web/API/AbortSignal) that allows you to abort the `aiAct` execution. When the signal is triggered, Midscene will stop the current planning loop and throw an error. This is useful for implementing timeouts or user-initiated cancellations.

* Return Value:

  * Returns a Promise that resolves to void when all steps are completed; if execution fails, an error is thrown.

* Examples:

```typescript
// Basic usage
await agent.aiAct(
  'Type "JavaScript" into the search box, then click the search button',
);

// Using the shorthand .ai form
await agent.ai(
  'Click the login button at the top of the page, then enter "test@example.com" in the username field',
);


// Using abortSignal to set a timeout
const controller = new AbortController();
setTimeout(() => controller.abort('timeout'), 30000); // 30s timeout
await agent.aiAct('Fill in the form and submit', {
  abortSignal: controller.signal,
});

// For complex tasks, you can enable the deepThink parameter
await agent.aiAct('Complete the GitHub account registration form. The region must be set to "Canada". Make sure no fields on the form are missed and all form fields pass validation. Just fill in the form fields without actually submitting the registration. Finally, return the actual content filled in the form fields', { deepThink: true });
```
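
If the planned steps may trigger a file chooser, you can also pass `fileChooserAccept` (web only); this is a sketch with a placeholder file path:

```typescript
await agent.aiAct('Click the "Upload attachment" button and confirm the upload', {
  fileChooserAccept: './attachment.pdf',
});
```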

:::tip

Under the hood, Midscene uses the AI model to split the instruction into a series of steps (a.k.a. "Planning"). It then executes these steps sequentially. If Midscene determines that the actions cannot be performed, an error will be thrown.

For optimal results, please provide clear and detailed instructions for `agent.aiAct()`.

Related Documentation:

* [Model strategy](/model-strategy.md)

:::

### `agent.aiTap()`

Tap something.

* Type

```typescript
function aiTap(locate: string | Object, options?: Object): Promise<void>;
```

* Parameters:

  * `locate: string | Object` - A natural language description of the element to tap, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.
    * `fileChooserAccept?: string | string[]` - When a file chooser pops up, specify the file path(s) to accept. Can be a single file path or an array of paths. Only available in web pages (Playwright, Puppeteer).
      * **Note**: If the file input does not support multiple files (no `multiple` attribute) but multiple files are provided, an error will be thrown.
      * **Note**: If a file chooser is triggered but no `fileChooserAccept` parameter is provided, the file chooser will be ignored and the page can continue to operate normally.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiTap('The login button at the top of the page');

// Use deepLocate feature to precisely locate the element
await agent.aiTap('The login button at the top of the page', {
  deepLocate: true,
});

// File upload: tap the upload button and select files
await agent.aiTap('Choose file button', { fileChooserAccept: ['./document.pdf'] });
await agent.aiTap('Upload images', { fileChooserAccept: ['./image1.jpg', './image2.png'] });
```

### `agent.aiHover()`

> Only available in web pages, not available in Android.

Move mouse over something.

* Type

```typescript
function aiHover(locate: string | Object, options?: Object): Promise<void>;
```

* Parameters:

  * `locate: string | Object` - A natural language description of the element to hover over, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiHover('The version number of the current page');
```

### `agent.aiInput()`

Input text into something.

* Type

```typescript
// Recommended: locate first, then options with value
function aiInput(
  locate: string | Object,
  opt: {
    value: string | number;
    deepLocate?: boolean;
    xpath?: string;
    cacheable?: boolean;
    autoDismissKeyboard?: boolean;
    mode?: 'replace' | 'clear' | 'typeOnly';
  },
): Promise<void>;

// Backward compatible (legacy)
function aiInput(
  value: string | number,
  locate: string | Object,
  options?: Object,
): Promise<void>;
```

* Parameters:

  **Recommended usage:**

  * `locate: string | Object` - A natural language description of the element to input text into, or [prompting with images](#prompting-with-images).
  * `opt: Object` - Configuration object containing:
    * `value: string | number` - **Required**. The text content to input.
      * When `mode` is `'replace'`: The text will replace all existing content in the input field.
      * When `mode` is `'typeOnly'`: The text will be typed directly without clearing the field first.
      * When `mode` is `'clear'`: The text is ignored and the input field will be cleared.
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.
    * `autoDismissKeyboard?: boolean` - If true, the keyboard will be dismissed after inputting text; only available on Android/iOS. (Default: true)
    * `mode?: 'replace' | 'clear' | 'typeOnly'` - Input mode. (Default: 'replace')
      * `'replace'`: Clear the input field first, then input the text.
      * `'typeOnly'`: Type the value directly without clearing the field first.
      * `'clear'`: Clear the input field without entering new text.

  **Backward compatible usage (deprecated but still supported):**

  * `value: string | number` - The text content to input.
  * `locate: string | Object` - A natural language description of the element, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional configuration object. The type is the same as the `opt` type in the recommended usage.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
// Recommended
await agent.aiInput('The search input box', { value: 'Hello World' });

// Backward compatible (not recommended)
await agent.aiInput('Hello World', 'The search input box');
```
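
The `mode` option can be sketched as follows; the element descriptions are illustrative:

```typescript
// Type the value without clearing the field first
await agent.aiInput('The comment box', { value: ' Thanks!', mode: 'typeOnly' });

// Clear the field; `value` is ignored in 'clear' mode but is still required
await agent.aiInput('The search input box', { value: '', mode: 'clear' });
```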

:::note Signature Update

We recently updated the `aiInput` API signature to put the locate prompt as the first parameter, making the parameter order more intuitive. The legacy signature `aiInput(value, locate, options)` is still fully compatible, but new code should use the recommended signature.

:::

### `agent.aiKeyboardPress()`

Press a keyboard key.

* Type

```typescript
// Recommended: locate first, then options with keyName
function aiKeyboardPress(
  locate: string | Object,
  opt: {
    keyName: string;
    deepLocate?: boolean;
    xpath?: string;
    cacheable?: boolean;
  },
): Promise<void>;

// Backward compatible (legacy)
function aiKeyboardPress(
  key: string,
  locate?: string | Object,
  options?: Object,
): Promise<void>;
```

* Parameters:

  **Recommended usage:**

  * `locate: string | Object` - A natural language description of the element to press the key on, or [prompting with images](#prompting-with-images).
  * `opt: Object` - Configuration object containing:
    * `keyName: string` - **Required**. The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key Combination is not supported. Refer to the [complete list of supported key names in our source code](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/us-keyboard-layout.ts) for available values.
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.

  **Backward compatible usage (deprecated but still supported):**

  * `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key Combination is not supported.
  * `locate?: string | Object` - Optional, a natural language description of the element, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional configuration object. The type is the same as the `opt` type in the recommended usage.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
// Recommended
await agent.aiKeyboardPress('The search input box', { keyName: 'Enter' });

// Backward compatible (not recommended)
await agent.aiKeyboardPress('Enter', 'The search input box');
```

:::note Signature Update

We recently updated the `aiKeyboardPress` API signature to put the locate prompt as the first parameter, making the parameter order more intuitive. The legacy signature `aiKeyboardPress(key, locate, options)` is still fully compatible, but new code should use the recommended signature.

:::

### `agent.aiScroll()`

Scroll a page or an element.

* Type

```typescript
// Recommended: locate first, then options with scroll parameters
function aiScroll(
  locate: string | Object | undefined,
  opt: {
    scrollType?: 'singleAction' | 'scrollToBottom' | 'scrollToTop' | 'scrollToRight' | 'scrollToLeft';
    direction?: 'down' | 'up' | 'left' | 'right';
    distance?: number | null;
    deepLocate?: boolean;
    xpath?: string;
    cacheable?: boolean;
  },
): Promise<void>;

// Backward compatible (legacy)
function aiScroll(
  scrollParam: PlanningActionParamScroll,
  locate?: string | Object,
  options?: Object,
): Promise<void>;
```

* Parameters:

  **Recommended usage:**

  * `locate: string | Object | undefined` - A natural language description of the element to scroll on, or [prompting with images](#prompting-with-images). If not provided or undefined, Midscene will perform scroll on the current mouse position.
  * `opt: Object` - Configuration object containing:
    * `scrollType?: 'singleAction' | 'scrollToBottom' | 'scrollToTop' | 'scrollToRight' | 'scrollToLeft'` - The scroll behavior. Defaults to `singleAction`.
    * `direction?: 'down' | 'up' | 'right' | 'left'` - The direction to scroll. Defaults to `down`. Only effective when `scrollType` is `singleAction`. On both Android and Web, the direction refers to which part of the page's content will be revealed: for example, when the direction is `down`, hidden content at the bottom of the page gradually comes into view from the bottom of the screen.
    * `distance?: number | null` - Optional, the distance to scroll in px. Use `null` to allow Midscene to decide automatically.
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.

  **Backward compatible usage (deprecated but still supported):**

  * `scrollParam: PlanningActionParamScroll` - The scroll parameter (contains scrollType, direction, distance).
  * `locate?: string | Object` - Optional, a natural language description of the element, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional configuration object. The type is the same as the `opt` type in the recommended usage.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
// Recommended
await agent.aiScroll('The form panel', {
  scrollType: 'singleAction',
  direction: 'up',
  distance: 100,
});

// Backward compatible (not recommended)
await agent.aiScroll(
  { scrollType: 'singleAction', direction: 'up', distance: 100 },
  'The form panel',
);
```
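
For whole-page style scrolling you can use a `scrollType` other than `singleAction`; a sketch, following the parameter descriptions above:

```typescript
// Scroll at the current position (no locate prompt) all the way to the bottom
await agent.aiScroll(undefined, { scrollType: 'scrollToBottom' });
```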

:::note Signature Update

We recently updated the `aiScroll` API signature to put the locate prompt as the first parameter, making the parameter order more intuitive. The legacy signature `aiScroll(scrollParam, locate, options)` is still fully compatible, but new code should use the recommended signature.

:::

### `agent.aiDoubleClick()`

Double-click on an element.

* Type

```typescript
function aiDoubleClick(locate: string | Object, options?: Object): Promise<void>;
```

* Parameters:

  * `locate: string | Object` - A natural language description of the element to double-click on, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiDoubleClick('The file name at the top of the page');

// Use deepLocate feature to precisely locate the element
await agent.aiDoubleClick('The file name at the top of the page', {
  deepLocate: true,
});
```

### `agent.aiRightClick()`

> Only available in web pages, not available in Android.

Right-click on an element. Note that Midscene cannot interact with the browser's native context menu after right-clicking; this API is mainly intended for elements that handle the right-click event themselves.

* Type

```typescript
function aiRightClick(locate: string | Object, options?: Object): Promise<void>;
```

* Parameters:

  * `locate: string | Object` - A natural language description of the element to right-click on, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiRightClick('The file name at the top of the page');

// Use deepLocate feature to precisely locate the element
await agent.aiRightClick('The file name at the top of the page', {
  deepLocate: true,
});
```

## Data extraction

### `agent.aiAsk()`

Ask the AI model any question about the current page. It returns the AI model's answer as a string.

* Type

```typescript
function aiAsk(prompt: string | Object, options?: Object): Promise<string>;
```

* Parameters:

  * `prompt: string | Object` - A natural language description of the question, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
    * `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.

* Return Value:

  * Returns a `Promise<string>` that resolves to the answer from the AI model.

* Examples:

```typescript
const result = await agent.aiAsk('What should I do to test this page?');
console.log(result); // Output the answer from the AI model
```

Besides `aiAsk`, you can also use `aiQuery` to extract structured data from the UI.

### `agent.aiQuery()`

This method allows you to extract structured data from the current page. Simply define the expected format (e.g., string, number, JSON, or an array) in `dataDemand`, and Midscene will return a result that matches the format.

* Type

```typescript
function aiQuery<T>(dataDemand: string | Object, options?: Object): Promise<T>;
```

* Parameters:

  * `dataDemand: string | Object` - A description of the expected data and its return format.
  * `options?: Object` - Optional, a configuration object containing:
    * `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
    * `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.

* Return Value:

  * Returns any valid basic type, such as string, number, JSON, array, etc.
  * Just describe the format in `dataDemand`, and Midscene will return a matching result.

* Examples:

```typescript
const dataA = await agent.aiQuery({
  time: 'The date and time displayed in the top-left corner as a string',
  userInfo: 'User information in the format {name: string}',
  tableFields: 'An array of table field names, string[]',
  tableDataRecord:
    'Table records in the format {id: string, [fieldName]: string}[]',
});

// You can also describe the expected return format using a string:

// dataB will be an array of strings
const dataB = await agent.aiQuery('string[], list of task names');

// dataC will be an array of objects
const dataC = await agent.aiQuery(
  '{name: string, age: string}[], table data records',
);

// Use domIncluded feature to extract invisible attributes
const dataD = await agent.aiQuery(
  '{name: string, age: string, avatarUrl: string}[], table data records',
  { domIncluded: true },
);
```
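
Since `aiQuery` is generic, you can also pass a type argument to get a typed result. A sketch, where the interface and field names are assumptions about your page:

```typescript
interface ProductRow {
  name: string;
  price: number;
}

// The type argument only affects TypeScript typing; the AI still relies on
// the format description in the prompt
const rows = await agent.aiQuery<ProductRow[]>(
  '{name: string, price: number}[], product names and prices in the table',
);
rows.forEach((row) => console.log(row.name, row.price));
```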

### `agent.aiBoolean()`

Extract a boolean value from the UI.

* Type

```typescript
function aiBoolean(prompt: string | Object, options?: Object): Promise<boolean>;
```

* Parameters:
  * `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
    * `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.

* Return Value:

  * Returns a `Promise<boolean>` that resolves to the boolean value returned by the AI model.

* Examples:

```typescript
const boolA = await agent.aiBoolean('Whether there is a login dialog');

// Use domIncluded feature to extract invisible attributes
const boolB = await agent.aiBoolean('Whether the login button has a link', {
  domIncluded: true,
});
```

### `agent.aiNumber()`

Extract a number value from the UI.

* Type

```typescript
function aiNumber(prompt: string | Object, options?: Object): Promise<number>;
```

* Parameters:
  * `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
    * `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.

* Return Value:

  * Returns a `Promise<number>` that resolves to the number value returned by the AI model.

* Examples:

```typescript
const numberA = await agent.aiNumber('The remaining points of the account');

// Use domIncluded feature to extract invisible attributes
const numberB = await agent.aiNumber(
  'The value of the remaining points element',
  { domIncluded: true },
);
```

### `agent.aiString()`

Extract a string value from the UI.

* Type

```typescript
function aiString(prompt: string | Object, options?: Object): Promise<string>;
```

* Parameters:
  * `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
    * `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.

* Return Value:

  * Returns a `Promise<string>` that resolves to the string value returned by the AI model.

* Examples:

```typescript
const stringA = await agent.aiString('The first item in the list');

// Use domIncluded feature to extract invisible attributes
const stringB = await agent.aiString('The link of the first item in the list', {
  domIncluded: true,
});
```

## More APIs

### `agent.aiAssert()`

Specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional `errorMsg` and a detailed reason generated by the AI.

* Type

```typescript
function aiAssert(
  assertion: string | Object, 
  errorMsg?: string, 
  options?: Object
): Promise<void>;
```

* Parameters:

  * `assertion: string | Object` - The assertion described in natural language, or [prompting with images](#prompting-with-images).
  * `errorMsg?: string` - An optional error message to append if the assertion fails.
  * `options?: Object` - Optional, a configuration object containing:
    * `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
    * `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.

* Return Value:

  * Returns a Promise that resolves to void if the assertion passes; if it fails, an error is thrown with `errorMsg` and additional AI-provided information.

* Example:

```typescript
await agent.aiAssert('The price of "Sauce Labs Onesie" is 7.99');
```
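
A sketch with a custom `errorMsg` appended on failure (the assertion text is illustrative):

```typescript
await agent.aiAssert(
  'The shopping cart badge shows 2 items',
  'Cart count did not update after adding items',
);
```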

:::tip

Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine `.aiQuery` with standard JavaScript assertions instead of using `.aiAssert`.

For example, you might replace the above code with:

```typescript
const items = await agent.aiQuery(
  '{name: string, price: number}[], return product names and prices',
);
const onesieItem = items.find((item) => item.name === 'Sauce Labs Onesie');
expect(onesieItem).toBeTruthy();
expect(onesieItem.price).toBe(7.99);
```

:::

### `agent.aiLocate()`

Locate an element using natural language.

* Type

```typescript
function aiLocate(
  locate: string | Object,
  options?: Object,
): Promise<{
  rect: {
    left: number;
    top: number;
    width: number;
    height: number;
  };
  center: [number, number];
  dpr: number; // device pixel ratio
}>;
```

* Parameters:

  * `locate: string | Object` - A natural language description of the element to locate, or [prompting with images](#prompting-with-images).
  * `options?: Object` - Optional, a configuration object containing:
    * `deepLocate?: boolean` - Whether to enable [Deep Locate](#deep-locate-deeplocate). Previously named `deepThink`, now renamed to `deepLocate`. False by default.
    * `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](/caching.md). True by default.

* Return Value:

  * Returns a `Promise` that resolves to an object describing the located element: its `rect`, `center` coordinates, and device pixel ratio (`dpr`).

* Examples:

```typescript
const locateInfo = await agent.aiLocate(
  'The login button at the top of the page',
);
console.log(locateInfo);
```

### `agent.aiWaitFor()`

Wait until a specified condition, described in natural language, becomes true. Considering the cost of AI calls, the check interval will not exceed the specified `checkIntervalMs`.

* Type

```typescript
function aiWaitFor(
  assertion: string,
  options?: {
    timeoutMs?: number;
    checkIntervalMs?: number;
  },
): Promise<void>;
```

* Parameters:

  * `assertion: string` - The condition described in natural language.
  * `options?: object` - An optional configuration object containing:
    * `timeoutMs?: number` - Maximum window in milliseconds for starting a new check (default: 15000). If the previous evaluation began before this window closes, Midscene keeps checking; otherwise it stops with a timeout.
    * `checkIntervalMs?: number` - Interval for checking in milliseconds (default: 3000).

* Return Value:

  * Returns a Promise that resolves to void if the condition is met; if not, an error is thrown when the timeout is reached.

* Examples:

```typescript
// Basic usage
await agent.aiWaitFor(
  'There is at least one headphone information displayed on the interface',
);

// Using custom options
await agent.aiWaitFor('The shopping cart icon shows a quantity of 2', {
  timeoutMs: 30000, // Wait for 30 seconds
  checkIntervalMs: 5000, // Check every 5 seconds
});
```

:::tip

Given the time consumption of AI services, `.aiWaitFor` might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.

:::

### `agent.runYaml()`

Execute an automation script written in YAML. Only the `tasks` part of the script is parsed and executed, and it returns the results of all `.aiQuery` calls within the script.

* Type

```typescript
function runYaml(yamlScriptContent: string): Promise<{ result: any }>;
```

* Parameters:

  * `yamlScriptContent: string` - The YAML-formatted script content.

* Return Value:

  * Returns a Promise that resolves to an object with a `result` property containing the results of all `.aiQuery` calls.

* Example:

```typescript
const { result } = await agent.runYaml(`
tasks:
  - name: search weather
    flow:
      - ai: input 'weather today' in input box, click search button
      - sleep: 3000

  - name: query weather
    flow:
      - aiQuery: "the result shows the weather info, {description: string}"
`);
console.log(result);
```

:::tip

For more information about YAML scripts, please refer to [Automate with Scripts in YAML](/automate-with-scripts-in-yaml.md).

:::

### `agent.setAIActContext()`

Set the background knowledge that should be sent to the AI model when calling `agent.aiAct()` or `agent.ai()`. This will override the previous setting.

For instant action type APIs, like `aiTap()`, this setting will not take effect.

* Type

```typescript
function setAIActContext(aiActContext: string): void;
```

* Parameters:

  * `aiActContext: string` - The background knowledge that should be sent to the AI model. The deprecated `aiActionContext` name is still accepted.

* Example:

```typescript
await agent.setAIActContext('Close the cookie consent dialog first if it exists');
```

:::note

`agent.setAIActionContext()` is deprecated. Please use `agent.setAIActContext()` instead. The deprecated method remains as an alias for compatibility.

:::

### `agent.evaluateJavaScript()`

> Only available in web pages, not available in Android.

Evaluate a JavaScript expression in the web page context.

* Type

```typescript
function evaluateJavaScript(script: string): Promise<any>;
```

* Parameters:

  * `script: string` - The JavaScript expression to evaluate.

* Return Value:

  * Returns the result of the JavaScript expression.

* Example:

```typescript
const result = await agent.evaluateJavaScript('document.title');
console.log(result);
```

### `agent.recordToReport()`

Log the current screenshot with a description in the report file.

* Type

```typescript
function recordToReport(title?: string, options?: Object): Promise<void>;
```

* Parameters:

  * `title?: string` - Optional, the title of the screenshot; if not provided, the title will be 'untitled'.
  * `options?: Object` - Optional, a configuration object containing:
    * `content?: string` - The description of the screenshot.

* Return Value:

  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.recordToReport('Login page', {
  content: 'User A',
});
```

### `agent.freezePageContext()`

Freeze the current page context, allowing all subsequent operations to reuse the same page snapshot without retrieving the page state repeatedly. This significantly improves performance when executing a large number of concurrent operations.

Some notes:

* Usually, you do not need to use this method, unless you are certain that "context retrieval" is the bottleneck of your test script.
* You need to call `agent.unfreezePageContext()` promptly afterwards to restore the real-time page state.
* Do not use this method around interaction operations; the AI model would be unable to perceive the latest page state, which can cause confusing errors.

* Type

```typescript
function freezePageContext(): Promise<void>;
```

* Return Value:

  * `Promise<void>`

* Examples:

```typescript
// Freeze the page context
await agent.freezePageContext();

// Some queries...
const results = await Promise.all([
  agent.aiQuery('Username input box value'),
  agent.aiQuery('Password input box value'),
  agent.aiLocate('Login button'),
]);
console.log(results);

// Unfreeze the page context, subsequent operations will use real-time page state
await agent.unfreezePageContext();
```

:::tip

In the report, operations using frozen context will display a 🧊 icon in the Insight tab.

:::

### `agent.unfreezePageContext()`

Unfreeze the page context, restoring the use of the real-time page state.

* Type

```typescript
function unfreezePageContext(): Promise<void>;
```

* Return Value:

  * `Promise<void>`

### `agent._unstableLogContent()`

Retrieve the log content from the report file. The structure of the log object may change in future versions.

* Type

```typescript
function _unstableLogContent(): Object;
```

* Return Value:

  * Returns an object that contains the log content.

* Examples:

```typescript
const logContent = agent._unstableLogContent();
console.log(logContent);
```

## Deep Locate (deepLocate)

`deepLocate` is an optional parameter available on all APIs that require element location (`aiTap`, `aiHover`, `aiInput`, `aiKeyboardPress`, `aiScroll`, `aiDoubleClick`, `aiRightClick`, `aiLocate`, etc.).

When enabled, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. This is especially useful when the target element is small or hard to distinguish from its surroundings. For newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less noticeable in most cases, so enable it only when needed.

* **Default**: `false`
* **Parameter renamed**: Previously named `deepThink`, renamed to `deepLocate` since v1.5.1. The old `deepThink` parameter remains compatible.

```typescript
// Enable deepLocate to precisely locate hard-to-identify elements
await agent.aiTap('Shopping cart icon in the top right', { deepLocate: true });
```

:::note

`deepThink` in `aiAct()` and `deepLocate` here are completely different features:

* `deepLocate`: Controls the **precision of element location** by making Midscene run two locate passes to improve accuracy.
* `deepThink` in `aiAct()`: Controls the **reasoning strategy during planning** (toggles AI reasoning capability).

[See aiAct deepThink documentation](/model-strategy.md#about-the-deepthink-option-in-aiact)

:::

## Prompting with images

You can use images as supplements in the prompt to describe things that cannot be expressed in natural language.

When prompting with images, the format of the prompt parameters is as follows:

```javascript
{
  // Prompt text, in which the images can be referred to by name
  prompt: string,
  // The images referred to in the prompt text
  images?: {
    // Image name, corresponding to the name referred to in the prompt text
    name: string,
    // Image URL; can be a local image path, a Base64 string, or an http link
    url: string
  }[]
  // When convertHttpImage2Base64 is true, http image links will be converted to
  // Base64 encoding before being sent to the LLM. This is useful when the image
  // links are not publicly accessible.
  convertHttpImage2Base64?: boolean
}
```

* Example 1: use an image to specify the element to tap.

```javascript
await agent.aiTap({
  prompt: 'The specific logo',
  images: [
    {
      name: 'The specific logo',
      url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
    },
  ],
});
```

* Example 2: use images to assert the page content.

```javascript
await agent.aiAssert({
  prompt: 'Whether there is a specific logo on the page.',
  images: [
    {
      name: 'The specific logo',
      url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
    },
  ],
});
```
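
* Example 3: use `convertHttpImage2Base64` when the image link is not publicly accessible. This is a sketch; the URL below is a placeholder.

```javascript
await agent.aiAssert({
  prompt: 'Whether there is a specific logo on the page.',
  images: [
    {
      name: 'The specific logo',
      // Assumed internal link that the model provider cannot fetch directly
      url: 'https://internal.example.com/assets/logo.png',
    },
  ],
  // Convert the http link to Base64 before sending it to the model
  convertHttpImage2Base64: true,
});
```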

**Notes on Image Size**

When prompting with images, pay attention to your AI model provider's requirements on image size and dimensions. Images that are too large (for example, exceeding 10 MB) or too small (for example, fewer than 10 pixels) may cause errors when the model is invoked. Refer to the documentation of the AI model provider you are using for the exact limits.

## Properties

### `.reportFile`

The path to the report file.
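
For example, you can log the path after a run (the exact file name is generated by Midscene):

```typescript
console.log(agent.reportFile); // e.g. ./midscene_run/report/<generated-name>.html
```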

## Report Merging Tool

When running multiple automation workflows, each agent generates its own report file. The `ReportMergingTool` can merge multiple automation reports into a single report for unified viewing and management.

### Use cases

* Running multiple workflows in a suite and need one consolidated report
* Cross-platform automation (for example, Web and Android) that needs a unified result
* CI/CD pipelines that require a summarized automation report

### `new ReportMergingTool()`

Create a report merging tool instance.

* Example:

```typescript
import { ReportMergingTool } from '@midscene/core/report';

const reportMergingTool = new ReportMergingTool();
```

### `.append()`

Add an automation report to the list to be merged, typically right after each workflow finishes.

* Type

```typescript
function append(reportInfo: ReportFileWithAttributes): void;
```

* Parameters:

  * `reportInfo: ReportFileWithAttributes` - Report information object containing:
    * `reportFilePath: string` - Path to the report file, usually `agent.reportFile`
    * `reportAttributes: object` - Report attributes
      * `testId: string` - Unique identifier for the automation workflow
      * `testTitle: string` - Automation workflow title
      * `testDescription: string` - Automation workflow description
      * `testDuration: number` - Automation execution duration (in milliseconds)
      * `testStatus: 'passed' | 'failed' | 'timedOut' | 'skipped' | 'interrupted'` - Automation status

* Return Value:

  * `void`

* Example:

```typescript
// Add a report in your test hooks
afterEach((ctx) => {
  let workflowStatus = 'passed';
  if (ctx.task.result?.state === 'fail') {
    workflowStatus = 'failed';
  }

  reportMergingTool.append({
    reportFilePath: agent.reportFile as string,
    reportAttributes: {
      testId: ctx.task.name,
      testTitle: ctx.task.name,
      testDescription: 'Automation workflow description',
      testDuration: Date.now() - startTime,
      testStatus: workflowStatus,
    },
  });
});
```

### `.mergeReports()`

Merge all added reports into a single HTML file.

* Type

```typescript
function mergeReports(
  reportFileName?: 'AUTO' | string,
  opts?: {
    rmOriginalReports?: boolean;
    overwrite?: boolean;
  },
): string | null;
```

* Parameters:

  * `reportFileName?: 'AUTO' | string` - Name of the merged report file
    * Defaults to `'AUTO'`, which will generate a file name automatically
    * You can also provide a custom name (no `.html` suffix needed)
  * `opts?: object` - Optional configuration object
    * `rmOriginalReports?: boolean` - Whether to delete the original report files after merging (default: `false`)
    * `overwrite?: boolean` - Whether to overwrite when the target file already exists (default: `false`)

* Return Value:

  * Returns the merged report file path when successful
  * Returns `null` if there are fewer than two reports to merge

* Examples:

```typescript
// Basic usage with an auto-generated file name
afterAll(() => {
  reportMergingTool.mergeReports();
});

// Specify a custom file name
afterAll(() => {
  reportMergingTool.mergeReports('my-automation-report');
});

// Merge and delete the original reports
afterAll(() => {
  reportMergingTool.mergeReports('my-automation-report', {
    rmOriginalReports: true,
  });
});

// Overwrite an existing report file
afterAll(() => {
  reportMergingTool.mergeReports('my-automation-report', {
    overwrite: true,
  });
});
```

### `.clear()`

Clear the list of reports to be merged. Use this if you need to reuse the same instance for multiple merge operations.

* Type

```typescript
function clear(): void;
```

* Return Value:

  * `void`

* Example:

```typescript
reportMergingTool.mergeReports('first-batch');
reportMergingTool.clear(); // Clear the list
// Continue adding new reports...
```

### Full example

Below is a complete example of using `ReportMergingTool` in a Vitest suite:

```typescript
import { describe, it, beforeEach, afterEach, afterAll } from 'vitest';
import { AndroidAgent, AndroidDevice } from '@midscene/android';
import { ReportMergingTool } from '@midscene/core/report';

describe('Android settings automation', () => {
  let device: AndroidDevice;
  let agent: AndroidAgent;
  let startTime: number;
  const reportMergingTool = new ReportMergingTool();

  beforeEach(async (ctx) => {
    // Create and connect the device first (replace with your actual device id from `adb devices`)
    device = new AndroidDevice('<your-device-id>');
    await device.connect();

    startTime = Date.now();
    agent = new AndroidAgent(device, {
      groupName: ctx.task.name,
    });
  });

  afterEach((ctx) => {
    let workflowStatus: 'passed' | 'failed' | 'timedOut' | 'skipped' | 'interrupted' = 'passed';
    if (ctx.task.result?.state === 'pass') {
      workflowStatus = 'passed';
    } else if (ctx.task.result?.state === 'skip') {
      workflowStatus = 'skipped';
    } else if (ctx.task.result?.errors?.[0]?.message.includes('timed out')) {
      workflowStatus = 'timedOut';
    } else {
      workflowStatus = 'failed';
    }

    // Add the report to the merge list
    reportMergingTool.append({
      reportFilePath: agent.reportFile as string,
      reportAttributes: {
        testId: ctx.task.name,
        testTitle: ctx.task.name,
        testDescription: 'Automation workflow description',
        testDuration: Date.now() - startTime,
        testStatus: workflowStatus,
      },
    });
  });

  afterAll(() => {
    // Merge all automation reports
    reportMergingTool.mergeReports('android-settings-automation-report');
  });

  it('toggle WLAN', async () => {
    await agent.aiAct('Find and open WLAN settings');
    await agent.aiAct('Toggle WLAN once');
  });

  it('toggle Bluetooth', async () => {
    await agent.aiAct('Find and open Bluetooth settings');
    await agent.aiAct('Toggle Bluetooth once');
  });
});
```

:::tip

The merged report is saved under the `midscene_run/report` directory. Open the generated HTML file in your browser to review the workflows.

:::
