In the documentation below, you may see function calls prefixed with `agent.`. If you use destructuring in Playwright (e.g., `async ({ ai, aiQuery }) => { /* ... */ }`), you can call these functions without the `agent.` prefix. This is merely a syntactical difference.
Each Agent in Midscene has its own constructor. These Agents share some common constructor parameters:

- `generateReport: boolean`: If true, a report file will be generated. (Default: true)
- `autoPrintReportMsg: boolean`: If true, report messages will be printed. (Default: true)
- `cacheId: string | undefined`: If provided, this cacheId will be used to save or match the cache. (Default: undefined, which means the cache feature is disabled)

In Puppeteer, there is an additional parameter:

- `forceSameTabNavigation: boolean`: If true, page navigation is restricted to the current tab. (Default: true)
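For illustration, here is a minimal sketch of constructing an Agent with these options, assuming the `PuppeteerAgent` export from `@midscene/web/puppeteer` (check the entry point for your installed version):

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer'; // assumed entry point

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// The common options go in the second constructor argument.
const agent = new PuppeteerAgent(page, {
  generateReport: true,         // write a report file after the run
  autoPrintReportMsg: true,     // print report messages to the console
  cacheId: 'example-run',       // enable caching keyed by this id
  forceSameTabNavigation: true, // Puppeteer-only: keep navigation in this tab
});
```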
Below are the main APIs available for the various Agents in Midscene.

In Midscene, you can choose between auto planning and instant action:

- `agent.ai()` is for Auto Planning: Midscene automatically plans the steps and executes them. It is smarter and closer to the natural style of AI agents, but it may be slower and relies heavily on the quality of the AI model.
- `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, and `agent.aiScroll()` are for Instant Action: Midscene directly performs the specified action, while the AI model handles basic tasks such as locating elements. This is faster and more reliable if you are certain about the action you want to perform.

agent.aiAction() or .ai()
This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.
Parameters:

- `prompt: string` - A natural language description of the UI steps.

Return Value:

- `Promise<void>` - Resolves when all steps are completed; if the actions cannot be performed, an error is thrown.

Examples:
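Illustrative calls (the prompts and page details are assumptions):

```typescript
// Auto planning: Midscene plans the steps and executes them in order.
await agent.aiAction('type "Headphones" in the search box, then press Enter');

// The same behavior via the .ai() shorthand.
await agent.ai('click the first item in the result list');
```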
Under the hood, Midscene uses the AI model to split the instruction into a series of steps (a.k.a. "Planning"). It then executes these steps sequentially. If Midscene determines that the actions cannot be performed, an error will be thrown.

For optimal results, please provide clear and detailed instructions for `agent.aiAction()`. For guidance on writing prompts, you may read this doc: Tips for Writing Prompts.
agent.aiTap()
Tap something.
Parameters:

- `locate: string` - A natural language description of the element to tap.

Return Value:

- `Promise<void>`

Examples:
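An illustrative call (the element description is an assumption):

```typescript
// Tap an element described in natural language.
await agent.aiTap('the login button at the top right of the page');
```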
agent.aiHover()
Move mouse over something.
Parameters:

- `locate: string` - A natural language description of the element to hover over.

Return Value:

- `Promise<void>`

Examples:
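An illustrative call (the element description is an assumption):

```typescript
// Hover over an element, e.g., to reveal a tooltip or dropdown menu.
await agent.aiHover('the user avatar in the navigation bar');
```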
agent.aiInput()
Input text into something.
Parameters:
text: string
- The final text content that should be placed in the input element. Use blank string to clear the input.locate: string
- A natural language description of the element to input text into.Return Value:
Promise<void>
Examples:
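Illustrative calls (the field description is an assumption):

```typescript
// Type into a field located by description; the text replaces the current value.
await agent.aiInput('Headphones', 'the search box');

// Clear the same field by passing a blank string.
await agent.aiInput('', 'the search box');
```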
agent.aiKeyboardPress()
Press a keyboard key.
Parameters:
key: string
- The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key Combination is not supported.locate?: string
- Optional, a natural language description of the element to press the key on.Return Value:
Promise<void>
Examples:
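Illustrative calls (the element description is an assumption):

```typescript
// Press Enter on the currently focused element.
await agent.aiKeyboardPress('Enter');

// Press Enter on a specific element.
await agent.aiKeyboardPress('Enter', 'the search box');
```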
agent.aiScroll()
Scroll a page or an element.
Parameters:

- `scrollParam: PlanningActionParamScroll` - The scroll parameter:
  - `direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll.
  - `scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform.
  - `distance: number` - Optional, the distance to scroll in px.
- `locate?: string` - Optional, a natural language description of the element to scroll on. If not provided, Midscene will scroll from the current mouse position.

Return Value:

- `Promise<void>`

Examples:
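Illustrative calls (the element description and distances are assumptions):

```typescript
// Scroll down once by roughly 300px at the current mouse position.
await agent.aiScroll({ direction: 'down', scrollType: 'once', distance: 300 });

// Scroll a specific element until its bottom is reached.
await agent.aiScroll(
  { direction: 'down', scrollType: 'untilBottom' },
  'the message list in the sidebar',
);
```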
agent.aiQuery()
This method allows you to extract data directly from the UI using multimodal AI reasoning capabilities. Simply define the expected format (e.g., string, number, JSON, or an array) in the `dataDemand`, and Midscene will return a result that matches the format.

Parameters:

- `dataShape: T` - A description of the expected return format.

Return Value:

- Any valid basic type, e.g., string, number, JSON, or an array. Describe the expected format in the `dataDemand`, and Midscene will return a matching result.

Examples:
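Illustrative sketches (the page contents described in the prompts are assumptions):

```typescript
// Describe the expected shape in plain language plus a type hint.
const items = await agent.aiQuery(
  '{itemTitle: string, price: number}[], find the item titles and prices in the list',
);
console.log('items:', items);

// Simple formats work too.
const count = await agent.aiQuery('number, the count of unread messages in the badge');
```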
agent.aiAssert()
This method lets you specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional `errorMsg` and a detailed reason generated by the AI.

Parameters:

- `assertion: string` - The assertion described in natural language.
- `errorMsg?: string` - An optional error message to append if the assertion fails.

Return Value:

- `Promise<void>`. If the assertion fails, an error is thrown containing the `errorMsg` and additional AI-provided information.

Example:
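An illustrative example (the product names are assumptions):

```typescript
// The AI evaluates the natural-language condition and throws on failure.
await agent.aiAssert('"Sauce Labs Onesie" is cheaper than "Sauce Labs Bike Light"');
```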
Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine `.aiQuery` with standard JavaScript assertions instead of using `.aiAssert`.

For example, you might replace the above code with:
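A sketch of that pattern under the same assumptions, using Node's built-in assert:

```typescript
import assert from 'node:assert';

// Extract structured data first, then assert with plain JavaScript.
const items: Array<{ name: string; price: number }> = await agent.aiQuery(
  '{name: string, price: number}[], return the name and price of each item',
);

const onesie = items.find((item) => item.name === 'Sauce Labs Onesie');
const bikeLight = items.find((item) => item.name === 'Sauce Labs Bike Light');
assert(onesie && bikeLight, 'both items should be present');
assert(onesie.price < bikeLight.price, 'Onesie should be cheaper than Bike Light');
```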
agent.aiWaitFor()
This method allows you to wait until a specified condition, described in natural language, becomes true. Considering the cost of AI calls, the check interval will not exceed the specified `checkIntervalMs`.
Parameters:

- `assertion: string` - The condition described in natural language.
- `options?: object` - An optional configuration object containing:
  - `timeoutMs?: number` - Timeout in milliseconds (default: 15000).
  - `checkIntervalMs?: number` - Interval for checking in milliseconds (default: 3000).

Return Value:

- `Promise<void>`. Resolves when the condition becomes true; throws an error if the timeout is reached first.

Examples:
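An illustrative call (the condition and timings are assumptions):

```typescript
// Wait until the described condition holds, checking at most every 5 seconds.
await agent.aiWaitFor('there is at least one headphone item in the result list', {
  timeoutMs: 30000,
  checkIntervalMs: 5000,
});
```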
Given the time consumption of AI services, `.aiWaitFor` might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.
agent.runYaml()
This method executes an automation script written in YAML. Only the `tasks` part of the script is executed, and it returns the results of all `.aiQuery` calls within the script.
Parameters:

- `yamlScriptContent: string` - The YAML-formatted script content.

Return Value:

- An object with a `result` property that includes the results of all `.aiQuery` calls in the script.

Example:
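A sketch of calling `runYaml` with an inline script (the task contents are illustrative):

```typescript
const { result } = await agent.runYaml(`
tasks:
  - name: search weather
    flow:
      - ai: input 'weather today' in the search box, click the search button
      - sleep: 3000

  - name: query weather
    flow:
      - aiQuery: "the result shows the weather info, {description: string}"
`);
console.log(result); // results of all .aiQuery calls in the script
```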
For more information about YAML scripts, please refer to Automate with Scripts in YAML.
.reportFile
The path to the report file.
You can override environment variables at runtime by calling the `overrideAIConfig` method.
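A minimal sketch; the import path and the specific config keys here are assumptions, so check the configuration docs for your installed version:

```typescript
import { overrideAIConfig } from '@midscene/web/puppeteer'; // assumed entry point

// Takes effect for subsequent AI calls in this process.
overrideAIConfig({
  MIDSCENE_MODEL_NAME: 'gpt-4o-2024-08-06', // assumed key/value
});
```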
Set the `MIDSCENE_DEBUG_AI_PROFILE` variable to view the execution time and usage for each AI call.
LangSmith is a platform for debugging large language models. To integrate LangSmith, follow these steps:
After starting Midscene, you should see logs similar to: