API reference (Common)
Constructors
Midscene ships multiple agents tuned for specific automation environments. Every constructor takes the target page/device plus a shared options bag (reporting, caching, AI configuration, hooks), and then layers on platform-only switches such as navigation guards in browsers or ADB wiring on Android. Use the sections below to review the import path and platform-specific parameters for each agent:
- In Puppeteer, use PuppeteerAgent
- In Playwright, use PlaywrightAgent
- In Bridge Mode, use AgentOverChromeBridge
- On Android, use Android API reference
- On iOS, use iOS API reference
- For GUI agents integrating with your own interface, refer to Custom Interface Agent
Parameters
All agents share these base options:
- `generateReport: boolean`: If true, a report file will be generated. (Default: true)
- `reportFileName: string`: The name of the report file. (Default: generated by Midscene)
- `autoPrintReportMsg: boolean`: If true, report messages will be printed. (Default: true)
- `cacheId: string | undefined`: If provided, this cacheId will be used to save or match the cache. (Default: undefined, which means the cache feature is disabled)
- `aiActContext: string`: Some background knowledge that should be sent to the AI model when calling `agent.aiAct()`, like 'close the cookie consent dialog first if it exists'. (Default: undefined) Previously exposed as `aiActionContext`; the legacy name is still accepted for backward compatibility.
- `replanningCycleLimit: number`: The maximum number of `aiAct` replanning cycles. Default is 20 (40 for UI-TARS models). Prefer setting this via the agent option; reading `MIDSCENE_REPLANNING_CYCLE_LIMIT` is only kept for backward compatibility.
- `onTaskStartTip: (tip: string) => void | Promise<void>`: Optional hook that fires before each execution task begins, with a human-readable summary of the task. (Default: undefined)
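A minimal sketch of passing these shared options, using the Puppeteer agent as an example (the page setup is standard Puppeteer; other agents accept the same option bag):

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// only the shared options listed above are shown here
const agent = new PuppeteerAgent(page, {
  generateReport: true,
  cacheId: 'example-flow',
  aiActContext: 'close the cookie consent dialog first if it exists',
  onTaskStartTip: (tip) => console.log(`[midscene] ${tip}`),
});
```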
Custom model configuration
Use modelConfig: Record<string, string | number> to configure models directly in code instead of environment variables.
If `modelConfig` is provided at agent initialization, all system environment variables for model config are ignored; only the values in this object are used. The keys and values you can set here are the same ones listed in Model configuration. See also the explanation in Model strategy.
Basic Example (single model for all intents):
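A minimal sketch, assuming the standard keys documented in Model configuration (the model name and endpoint below are placeholders):

```typescript
const agent = new PuppeteerAgent(page, {
  modelConfig: {
    // same keys as the environment variables in Model configuration
    MIDSCENE_MODEL_NAME: 'qwen3-vl-plus',             // placeholder model name
    OPENAI_BASE_URL: 'https://your-provider.example/v1', // placeholder endpoint
    OPENAI_API_KEY: 'sk-...',
  },
});
```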
Configure different models for different task types (via intent-specific keys):
Custom OpenAI client
createOpenAIClient: (openai, options) => Promise<OpenAI | undefined> lets you wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.
Parameter Description:
- `openai: OpenAI` - The base OpenAI client instance created by Midscene with all necessary configurations (API key, base URL, proxy, etc.)
- `options: Record<string, unknown>` - OpenAI initialization options, including:
  - `baseURL?: string` - API endpoint URL
  - `apiKey?: string` - API key
  - `dangerouslyAllowBrowser: boolean` - Always true in Midscene
  - Other OpenAI configuration options
Return Value:
- Return the wrapped OpenAI client instance, or `undefined` to use the original instance
Example (LangSmith Integration):
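A rough sketch of wrapping the client with LangSmith's `wrapOpenAI` helper (assumes the `langsmith` package is installed and the LangSmith environment variables are configured):

```typescript
import { wrapOpenAI } from 'langsmith/wrappers';

const agent = new PuppeteerAgent(page, {
  createOpenAIClient: async (openai) => {
    // wrap the base client so every model call is traced in LangSmith
    return wrapOpenAI(openai);
  },
});
```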
Note: For LangSmith and Langfuse integration, we recommend using the environment-variable approach documented in Model configuration, which requires no createOpenAIClient code. If you provide a custom client wrapper function, it will override the auto-integration behavior from environment variables.
Interaction methods
Below are the main APIs available for the various Agents in Midscene.
In Midscene, you can choose to use either auto planning or instant action.
- `agent.ai()` is for Auto Planning: Midscene automatically plans the steps and executes them. It is smarter and closer to the typical style of AI agents, but it may be slower and relies heavily on the quality of the AI model.
- `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, `agent.aiScroll()`, `agent.aiDoubleClick()`, and `agent.aiRightClick()` are for Instant Action: Midscene directly performs the specified action, while the AI model only handles basic tasks such as locating the element. This is faster and more reliable when you are certain about the action you want to perform.
agent.aiAct() or .ai()
This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.
This method was named aiAction() in earlier versions. Both names remain supported for backward compatibility; we recommend using the new aiAct() method for code consistency.
- Type
- Parameters:
  - `prompt: string` - A natural language description of the UI steps.
  - `options?: Object` - Optional, a configuration object containing:
    - `cacheable?: boolean` - Whether this call is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a Promise that resolves to void when all steps are completed; if execution fails, an error is thrown.
- Examples:
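A minimal sketch (the page content and wording of the prompts are illustrative):

```typescript
// auto planning: Midscene splits this into steps and executes them
await agent.aiAct('type "Headphones" in the search box, press Enter, then click the first result');

// the shorthand .ai() is equivalent
await agent.ai('close the promotion popup if it appears');
```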
Under the hood, Midscene uses the AI model to split the instruction into a series of steps (a.k.a. "Planning"), then executes them sequentially. If Midscene determines that the actions cannot be performed, an error is thrown.
For optimal results, please provide clear and detailed instructions for agent.aiAct().
Related Documentation:
agent.aiTap()
Tap something.
- Type
- Parameters:
  - `locate: string | Object` - A natural language description of the element to tap, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
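A minimal sketch (the element descriptions are illustrative):

```typescript
await agent.aiTap('the login button at the top right');

// with options
await agent.aiTap('the "Agree" button in the cookie banner', { deepThink: true });
```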
agent.aiHover()
Only available in web pages, not available on Android.
Move the mouse over something.
- Type
- Parameters:
  - `locate: string | Object` - A natural language description of the element to hover over, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
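A minimal sketch (the element description is illustrative):

```typescript
await agent.aiHover('the user avatar in the navigation bar');
```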
agent.aiInput()
Input text into something.
- Type
- Parameters:
  - `text: string` - The text content to input.
    - When `mode` is `'replace'`: the text replaces all existing content in the input field.
    - When `mode` is `'append'`: the text is appended to the existing content.
    - When `mode` is `'clear'`: the text is ignored and the input field is cleared.
  - `locate: string | Object` - A natural language description of the element to input text into, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
    - `autoDismissKeyboard?: boolean` - If true, the keyboard will be dismissed after inputting text. Only available on Android/iOS. (Default: true)
    - `mode?: 'replace' | 'clear' | 'append'` - Input mode. (Default: 'replace')
      - `'replace'`: Clear the input field first, then input the text.
      - `'append'`: Append the text to the existing content without clearing.
      - `'clear'`: Clear the input field without entering new text.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
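A minimal sketch (the texts and element descriptions are illustrative):

```typescript
await agent.aiInput('Headphones', 'the search box');

// append to the existing content instead of replacing it
await agent.aiInput(' with noise cancellation', 'the search box', { mode: 'append' });
```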
agent.aiKeyboardPress()
Press a keyboard key.
- Type
- Parameters:
  - `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key combinations are not supported. Refer to the complete list of supported key names in our source code for available values.
  - `locate?: string | Object` - Optional, a natural language description of the element to press the key on, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
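A minimal sketch (the element description is illustrative):

```typescript
await agent.aiKeyboardPress('Enter', 'the search box');
```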
agent.aiScroll()
Scroll a page or an element.
- Type
- Parameters:
  - `scrollParam: PlanningActionParamScroll` - The scroll parameter:
    - `direction?: 'down' | 'up' | 'right' | 'left'` - The direction to scroll. Defaults to `down`. On both Android and Web, the direction refers to which part of the page's content will appear on the screen. For example, when the direction is `down`, the hidden content at the bottom of the page gradually reveals itself from the bottom of the screen upwards.
    - `scrollType?: 'singleAction' | 'scrollToBottom' | 'scrollToTop' | 'scrollToRight' | 'scrollToLeft'` - The scroll behavior. Defaults to `singleAction`.
    - `distance?: number | null` - Optional, the distance to scroll in px. Use `null` to let Midscene decide automatically.
  - `locate?: string | Object` - Optional, a natural language description of the element to scroll on, or prompting with images. If not provided, Midscene scrolls at the current mouse position.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
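A minimal sketch (the distances and element descriptions are illustrative):

```typescript
// scroll the whole page to the bottom
await agent.aiScroll({ direction: 'down', scrollType: 'scrollToBottom' });

// scroll a specific element down by roughly 500px
await agent.aiScroll(
  { direction: 'down', scrollType: 'singleAction', distance: 500 },
  'the product list',
);
```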
agent.aiDoubleClick()
Double-click on an element.
- Type
- Parameters:
  - `locate: string | Object` - A natural language description of the element to double-click on, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
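A minimal sketch (the element description is illustrative):

```typescript
await agent.aiDoubleClick('the file named "report.pdf" in the list');
```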
agent.aiRightClick()
Only available in web pages, not available on Android.
Right-click on an element. Note that Midscene cannot interact with the browser's native context menu after right-clicking, so this interface is usually used for elements that handle the right-click event themselves.
- Type
- Parameters:
  - `locate: string | Object` - A natural language description of the element to right-click on, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
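A minimal sketch (the element description is illustrative):

```typescript
await agent.aiRightClick('the task card titled "Draft"');
```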
deepThink feature
The deepThink feature makes Midscene issue two locating requests to improve accuracy. It is useful when the model finds it hard to distinguish the element from its surroundings. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious, so enable it only when needed.
Data extraction
agent.aiAsk()
Ask the AI model any question about the current page. It returns the AI model's answer as a string.
- Type
- Parameters:
  - `prompt: string | Object` - A natural language description of the question, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only visible elements are sent. False by default.
    - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. True by default.
- Return Value:
  - Returns a Promise that resolves to the answer from the AI model.
- Examples:
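A minimal sketch (the question is illustrative):

```typescript
const answer = await agent.aiAsk('What is the name of the product currently shown on this page?');
console.log(answer); // a plain string returned by the AI model
```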
Besides aiAsk, you can also use aiQuery to extract structured data from the UI.
agent.aiQuery()
This method allows you to extract structured data from the current page. Simply define the expected format (e.g., string, number, JSON, or an array) in `dataDemand`, and Midscene will return a result that matches the format.
- Type
- Parameters:
  - `dataDemand: T` - A description of the expected data and its return format.
  - `options?: Object` - Optional, a configuration object containing:
    - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only visible elements are sent. False by default.
    - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. True by default.
- Return Value:
  - Returns any valid basic type, such as string, number, JSON, array, etc.
  - Just describe the format in `dataDemand`, and Midscene will return a matching result.
- Examples:
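A minimal sketch; the `dataDemand` descriptions are illustrative:

```typescript
// describe the expected shape in natural language
const names = await agent.aiQuery('string[], the names of the products in the list');

// or describe an object shape field by field
const info = await agent.aiQuery({
  title: 'the page title, string',
  itemCount: 'the number of items in the list, number',
});
```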
agent.aiBoolean()
Extract a boolean value from the UI.
- Type
- Parameters:
  - `prompt: string | Object` - A natural language description of the expected value, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only visible elements are sent. False by default.
    - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. True by default.
- Return Value:
  - Returns a `Promise<boolean>` that resolves to the boolean value returned by the AI.
- Examples:
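A minimal sketch (the prompt is illustrative):

```typescript
const hasLoggedIn = await agent.aiBoolean('Is the user currently logged in (an avatar is shown in the header)?');
```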
agent.aiNumber()
Extract a number value from the UI.
- Type
- Parameters:
  - `prompt: string | Object` - A natural language description of the expected value, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only visible elements are sent. False by default.
    - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. True by default.
- Return Value:
  - Returns a `Promise<number>` that resolves to the number value returned by the AI.
- Examples:
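A minimal sketch (the prompt is illustrative):

```typescript
const itemCount = await agent.aiNumber('How many items are in the shopping cart?');
```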
agent.aiString()
Extract a string value from the UI.
- Type
- Parameters:
  - `prompt: string | Object` - A natural language description of the expected value, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only visible elements are sent. False by default.
    - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. True by default.
- Return Value:
  - Returns a `Promise<string>` that resolves to the string value returned by the AI.
- Examples:
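A minimal sketch (the prompt is illustrative):

```typescript
const firstTaskName = await agent.aiString('the name of the first task in the list');
```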
More APIs
agent.aiAssert()
Specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional errorMsg and a detailed reason generated by the AI.
- Type
- Parameters:
  - `assertion: string | Object` - The assertion described in natural language, or prompting with images.
  - `errorMsg?: string` - An optional error message to append if the assertion fails.
  - `options?: Object` - Optional, a configuration object containing:
    - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only visible elements are sent. False by default.
    - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. True by default.
- Return Value:
  - Returns a Promise that resolves to void if the assertion passes; if it fails, an error is thrown with `errorMsg` and additional AI-provided information.
- Example:
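A minimal sketch (the assertion text is illustrative):

```typescript
await agent.aiAssert('The shopping cart badge shows "2"', 'cart badge should show two items');
```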
Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine .aiQuery with standard JavaScript assertions instead of using .aiAssert.
For example, you might replace the above code with:
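A hedged sketch of the aiQuery-based alternative, using Node's built-in `assert` (swap in your test framework's assertion if you prefer):

```typescript
import assert from 'node:assert';

const cartBadge = await agent.aiQuery('string, the text shown on the shopping cart badge');
assert.strictEqual(cartBadge, '2', 'cart badge should show two items');
```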
agent.aiLocate()
Locate an element using natural language.
- Type
- Parameters:
  - `locate: string | Object` - A natural language description of the element to locate, or prompting with images.
  - `options?: Object` - Optional, a configuration object containing:
    - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element, which can improve accuracy. False by default. With newer models (e.g. Qwen3 / Doubao 1.6 / Gemini 3), the gain is less obvious.
    - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
    - `cacheable?: boolean` - Whether this step is cacheable when the caching feature is enabled. True by default.
- Return Value:
  - Returns a Promise that resolves to a locate info object once the element is located.
- Examples:
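A minimal sketch (the element description is illustrative; see the Type above for the exact fields of the locate info object):

```typescript
const located = await agent.aiLocate('the search box at the top of the page');
console.log(located); // position information of the located element
```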
agent.aiWaitFor()
Wait until a specified condition, described in natural language, becomes true. Considering the cost of AI calls, the check interval will not be shorter than the specified checkIntervalMs.
- Type
- Parameters:
  - `assertion: string` - The condition described in natural language.
  - `options?: object` - An optional configuration object containing:
    - `timeoutMs?: number` - Maximum window in milliseconds for starting a new check (default: 15000). If the previous evaluation began before this window closes, Midscene keeps checking; otherwise it stops with a timeout.
    - `checkIntervalMs?: number` - Interval between checks in milliseconds (default: 3000).
- Return Value:
  - Returns a Promise that resolves to void if the condition is met; if not, an error is thrown when the timeout is reached.
- Examples:
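A minimal sketch (the condition and timings are illustrative):

```typescript
await agent.aiWaitFor('there is at least one search result visible on the page', {
  timeoutMs: 30 * 1000,
  checkIntervalMs: 5 * 1000,
});
```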
Given the time consumption of AI services, .aiWaitFor might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.
agent.runYaml()
Execute an automation script written in YAML. Only the tasks part of the script is parsed and executed, and it returns the results of all .aiQuery calls within the script.
- Type
- Parameters:
  - `yamlScriptContent: string` - The YAML-formatted script content.
- Return Value:
  - Returns an object with a `result` property that includes the results of all `.aiQuery` calls.
- Example:
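A rough sketch; the YAML keywords follow Automate with Scripts in YAML, and the flow content is illustrative:

```typescript
const { result } = await agent.runYaml(`
tasks:
  - name: search
    flow:
      - ai: type "Headphones" in the search box and press Enter
      - aiQuery: "string[], the names of the products in the result list"
        name: productNames
`);
// results of the .aiQuery calls, keyed as described in the YAML docs
console.log(result);
```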
For more information about YAML scripts, please refer to Automate with Scripts in YAML.
agent.setAIActContext()
Set the background knowledge that should be sent to the AI model when calling agent.aiAct() or agent.ai(). This will override the previous setting.
For instant action type APIs, like aiTap(), this setting will not take effect.
- Type
- Parameters:
  - `aiActContext: string` - The background knowledge that should be sent to the AI model. The deprecated `aiActionContext` name is still accepted.
- Example:
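A minimal sketch (the context and prompt texts are illustrative):

```typescript
await agent.setAIActContext('close the cookie consent dialog first if it exists');
await agent.aiAct('log in with the test account');
```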
agent.setAIActionContext() is deprecated. Please use agent.setAIActContext() instead. The deprecated method remains as an alias for compatibility.
agent.evaluateJavaScript()
Only available in web pages, not available on Android.
Evaluate a JavaScript expression in the web page context.
- Type
- Parameters:
  - `script: string` - The JavaScript expression to evaluate.
- Return Value:
  - Returns the result of the JavaScript expression.
- Example:
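A minimal sketch (the expression is illustrative):

```typescript
const title = await agent.evaluateJavaScript('document.title');
console.log(title);
```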
agent.recordToReport()
Log the current screenshot with a description in the report file.
- Type
- Parameters:
  - `title?: string` - Optional, the title of the screenshot. If not provided, the title will be 'untitled'.
  - `options?: Object` - Optional, a configuration object containing:
    - `content?: string` - The description of the screenshot.
- Return Value:
  - Returns a `Promise<void>`
- Examples:
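A minimal sketch (the title and description are illustrative):

```typescript
await agent.recordToReport('after login', {
  content: 'screenshot taken right after the login form was submitted',
});
```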
agent.freezePageContext()
Freeze the current page context, allowing all subsequent operations to reuse the same page snapshot without retrieving the page state repeatedly. This significantly improves performance when executing a large number of concurrent operations.
Some notes:
- Usually, you do not need to use this method, unless you are certain that "context retrieval" is the bottleneck of your test script.
- You need to call `agent.unfreezePageContext()` in time to restore the real-time page state.
- Do not call this method during interaction operations; it would prevent the AI model from perceiving the latest page state, causing confusing errors.
- Type
- Return Value:
  - Returns a `Promise<void>`
- Examples:
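A rough sketch of the intended usage: freeze the context, run several read-only queries against the same snapshot, then unfreeze (the prompts are illustrative):

```typescript
await agent.freezePageContext();

// these queries reuse the same page snapshot
const [title, price] = await Promise.all([
  agent.aiString('the product title'),
  agent.aiNumber('the product price'),
]);

await agent.unfreezePageContext();
```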
In the report, operations using frozen context will display a 🧊 icon in the Insight tab.
agent.unfreezePageContext()
Unfreezes the page context, restoring the use of real-time page state.
- Type
- Return Value:
  - Returns a `Promise<void>`
agent._unstableLogContent()
Retrieve the log content from the report file. The structure of the log object may change in future versions.
- Type
- Return Value:
  - Returns an object that contains the log content.
- Examples:
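A minimal sketch; keep in mind the structure of the log object may change in future versions:

```typescript
const logContent = await agent._unstableLogContent();
console.log(JSON.stringify(logContent, null, 2));
```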
Prompting with images
You can use images as supplements in the prompt to describe things that cannot be expressed in natural language.
When prompting with images, the format of the prompt parameters is as follows:
- Example 1: use images to inspect the tap position.
- Example 2: use images to assert the page content.
Notes on Image Size
When prompting with images, pay attention to the AI model provider's requirements on image size and dimensions. Images that are too large (for example, exceeding 10 MB) or too small (for example, smaller than 10 pixels) may cause errors when the model is invoked. Refer to the documentation of the AI model provider you are using for the exact limits.
Properties
.reportFile
The path to the report file.
Report Merging Tool
When running multiple automation workflows, each agent generates its own report file. The ReportMergingTool can merge multiple automation reports into a single report for unified viewing and management.
Use cases
- Running multiple workflows in a suite and need one consolidated report
- Cross-platform automation (for example, Web and Android) that needs a unified result
- CI/CD pipelines that require a summarized automation report
new ReportMergingTool()
Create a report merging tool instance.
- Example:
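A minimal sketch; the import path below is an assumption, adjust it to the Midscene package you already use:

```typescript
import { ReportMergingTool } from '@midscene/web'; // hypothetical import path

const reportMergingTool = new ReportMergingTool();
```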
.append()
Add an automation report to the list to be merged, typically right after each workflow finishes.
- Type
- Parameters:
  - `reportInfo: ReportFileWithAttributes` - Report information object containing:
    - `reportFilePath: string` - Path to the report file, usually `agent.reportFile`
    - `reportAttributes: object` - Report attributes:
      - `testId: string` - Unique identifier for the automation workflow
      - `testTitle: string` - Automation workflow title
      - `testDescription: string` - Automation workflow description
      - `testDuration: number` - Automation execution duration (in milliseconds)
      - `testStatus: 'passed' | 'failed' | 'timedOut' | 'skipped' | 'interrupted'` - Automation status
- Return Value:
  - `void`
- Example:
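A minimal sketch (the attribute values are illustrative):

```typescript
reportMergingTool.append({
  reportFilePath: agent.reportFile,
  reportAttributes: {
    testId: 'web-login-001',
    testTitle: 'Web login flow',
    testDescription: 'Log in and verify the dashboard',
    testDuration: 12_500,
    testStatus: 'passed',
  },
});
```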
.mergeReports()
Merge all added reports into a single HTML file.
- Type
- Parameters:
  - `reportFileName?: 'AUTO' | string` - Name of the merged report file
    - Defaults to `'AUTO'`, which generates a file name automatically
    - You can also provide a custom name (no `.html` suffix needed)
  - `opts?: object` - Optional configuration object
    - `rmOriginalReports?: boolean` - Whether to delete the original report files after merging (default: `false`)
    - `overwrite?: boolean` - Whether to overwrite when the target file already exists (default: `false`)
- Return Value:
  - Returns the merged report file path when successful
  - Returns `null` if there are fewer than two reports to merge
- Examples:
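A minimal sketch (assuming the call resolves asynchronously; the option values are illustrative):

```typescript
const mergedReportPath = await reportMergingTool.mergeReports('AUTO', {
  rmOriginalReports: false,
  overwrite: true,
});
if (mergedReportPath) {
  console.log(`merged report: ${mergedReportPath}`);
}
```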
.clear()
Clear the list of reports to be merged. Use this if you need to reuse the same instance for multiple merge operations.
- Type
- Return Value:
  - `void`
- Example:
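A minimal sketch, reusing the same instance for a second merge run:

```typescript
reportMergingTool.clear();
```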
Full example
Below is a complete example of using ReportMergingTool in a Vitest suite:
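A rough sketch of wiring this into Vitest; the import path for ReportMergingTool is an assumption, and the page/agent setup is reduced to a placeholder:

```typescript
import { afterAll, test } from 'vitest';
import { PuppeteerAgent } from '@midscene/web/puppeteer';
import { ReportMergingTool } from '@midscene/web'; // hypothetical import path

const reportMergingTool = new ReportMergingTool();

test('web search flow', async () => {
  const start = Date.now();
  const agent = new PuppeteerAgent(page); // `page` comes from your own Puppeteer setup

  await agent.aiAct('type "Headphones" in the search box and press Enter');

  // register this workflow's report for the final merge
  reportMergingTool.append({
    reportFilePath: agent.reportFile,
    reportAttributes: {
      testId: 'search-001',
      testTitle: 'Web search flow',
      testDescription: 'Search and check results',
      testDuration: Date.now() - start,
      testStatus: 'passed',
    },
  });
});

afterAll(async () => {
  // merge every appended report into one HTML file under midscene_run/report
  await reportMergingTool.mergeReports('AUTO');
});
```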
The merged report is saved under the midscene_run/report directory. Open the generated HTML file in your browser to review the workflows.

