--- url: /API.md ---

# API Reference

> In the documentation below, you might see function calls prefixed with `agent.`. If you utilize destructuring in Playwright (e.g., `async ({ ai, aiQuery }) => { /* ... */ }`), you can call these functions without the `agent.` prefix. This is merely a syntactical difference.

## Constructors

Each Agent in Midscene has its own constructor.

* In Puppeteer, use [PuppeteerAgent](/en/integrate-with-puppeteer.md)
* In Bridge Mode, use [AgentOverChromeBridge](/en/bridge-mode-by-chrome-extension.md#constructor)

These Agents share some common constructor parameters:

* `generateReport: boolean`: If true, a report file will be generated. (Default: true)
* `autoPrintReportMsg: boolean`: If true, report messages will be printed. (Default: true)
* `cacheId: string | undefined`: If provided, this cacheId will be used to save or match the cache. (Default: undefined, which means the cache feature is disabled)
* `actionContext: string`: Background knowledge to send to the AI model when calling `agent.aiAction()`, such as 'close the cookie consent dialog first if it exists'. (Default: undefined)

In Puppeteer, there are 3 additional parameters:

* `forceSameTabNavigation: boolean`: If true, page navigation is restricted to the current tab. (Default: true)
* `waitForNetworkIdleTimeout: number`: The timeout for waiting for network idle between each action. (Default: 2000ms, set to 0 to disable the timeout)
* `waitForNavigationTimeout: number`: The timeout for waiting for navigation to finish. (Default: 5000ms, set to 0 to disable the timeout)

## Interaction Methods

Below are the main APIs available for the various Agents in Midscene.

:::info Auto Planning vs. Instant Action
In Midscene, you can choose to use either auto planning or instant action.

* `agent.ai()` is for Auto Planning: Midscene automatically plans the steps and executes them. It is smarter and closer to the typical style of AI agents, but it may be slower and relies heavily on the quality of the AI model.
* `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, and `agent.aiScroll()` are for Instant Action: Midscene directly performs the specified action, while the AI model is only responsible for basic tasks such as locating elements. This is faster and more reliable when you are certain about the action you want to perform.
:::

### `agent.aiAction()` or `.ai()`

This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.

* Type

```typescript
function aiAction(prompt: string): Promise<void>;
function ai(prompt: string): Promise<void>; // shorthand form
```

* Parameters:
  * `prompt: string` - A natural language description of the UI steps.
* Return Value:
  * Returns a Promise that resolves to void when all steps are completed; if execution fails, an error is thrown.
* Examples:

```typescript
// Basic usage
await agent.aiAction('Type "JavaScript" into the search box, then click the search button');

// Using the shorthand .ai form
await agent.ai('Click the login button at the top of the page, then enter "test@example.com" in the username field');

// When using UI Agent models like ui-tars, you can try a more high-level prompt
await agent.aiAction('Post a Tweet "Hello World"');
```

:::tip
Under the hood, Midscene uses an AI model to split the instruction into a series of steps (a.k.a. "Planning"). It then executes these steps sequentially.
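For intuition, the plan for the first example above might break down into steps like the hedged sketch below (illustrative only; this is not Midscene's actual internal planning format):

```typescript
// Illustrative sketch only — not the real internal planning schema used by Midscene.
const hypotheticalPlan = [
  { action: 'Input', locate: 'the search box', value: 'JavaScript' },
  { action: 'Tap', locate: 'the search button' },
];
```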
If Midscene determines that the actions cannot be performed, an error will be thrown. For optimal results, please provide clear and detailed instructions for `agent.aiAction()`. For guides about writing prompts, you may read this doc: [Tips for Writing Prompts](/en/prompting-tips.md). Related Documentation: * [Choose a model](/en/choose-a-model.md) ::: ### `agent.aiTap()` Tap something. * Type ```typescript function aiTap(locate: string, options?: Object): Promise; ``` * Parameters: * `locate: string` - A natural language description of the element to tap. * `options?: Object` - Optional, a configuration object containing: * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. * Return Value: * Returns a `Promise` * Examples: ```typescript await agent.aiTap('The login button at the top of the page'); // Use deepThink feature to precisely locate the element await agent.aiTap('The login button at the top of the page', { deepThink: true }); ``` ### `agent.aiHover()` Move mouse over something. * Type ```typescript function aiHover(locate: string, options?: Object): Promise; ``` * Parameters: * `locate: string` - A natural language description of the element to hover over. * `options?: Object` - Optional, a configuration object containing: * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. * Return Value: * Returns a `Promise` * Examples: ```typescript await agent.aiHover('The version number of the current page'); ``` ### `agent.aiInput()` Input text into something. * Type ```typescript function aiInput(text: string, locate: string, options?: Object): Promise; ``` * Parameters: * `text: string` - The final text content that should be placed in the input element. Use blank string to clear the input. * `locate: string` - A natural language description of the element to input text into. * `options?: Object` - Optional, a configuration object containing: * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. * Return Value: * Returns a `Promise` * Examples: ```typescript await agent.aiInput('Hello World', 'The search input box'); ``` ### `agent.aiKeyboardPress()` Press a keyboard key. * Type ```typescript function aiKeyboardPress(key: string, locate?: string, options?: Object): Promise; ``` * Parameters: * `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key Combination is not supported. * `locate?: string` - Optional, a natural language description of the element to press the key on. * `options?: Object` - Optional, a configuration object containing: * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. * Return Value: * Returns a `Promise` * Examples: ```typescript await agent.aiKeyboardPress('Enter', 'The search input box'); ``` ### `agent.aiScroll()` Scroll a page or an element. * Type ```typescript function aiScroll(scrollParam: PlanningActionParamScroll, locate?: string, options?: Object): Promise; ``` * Parameters: * `scrollParam: PlanningActionParamScroll` - The scroll parameter * `direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll. * `scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform. * `distance: number` - Optional, the distance to scroll in px. * `locate?: string` - Optional, a natural language description of the element to scroll on. 
If not provided, Midscene will perform scroll on the current mouse position. * `options?: Object` - Optional, a configuration object containing: * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. * Return Value: * Returns a `Promise` * Examples: ```typescript await agent.aiScroll({ direction: 'up', distance: 100, scrollType: 'once' }, 'The form panel'); ``` :::tip About the `deepThink` feature The `deepThink` feature is a powerful feature that allows Midscene to call AI model twice to precisely locate the element. It is useful when the AI model find it hard to distinguish the element from its surroundings. ::: ## Data Extraction ### `agent.aiQuery()` This method allows you to extract data directly from the UI using multimodal AI reasoning capabilities. Simply define the expected format (e.g., string, number, JSON, or an array) in the `dataDemand`, and Midscene will return a result that matches the format. * Type ```typescript function aiQuery(dataShape: string | Object): Promise; ``` * Parameters: * `dataShape: T`: A description of the expected return format. * Return Value: * Returns any valid basic type, such as string, number, JSON, array, etc. * Just describe the format in `dataDemand`, and Midscene will return a matching result. * Examples: ```typescript const dataA = await agent.aiQuery({ time: 'The date and time displayed in the top-left corner as a string', userInfo: 'User information in the format {name: string}', tableFields: 'An array of table field names, string[]', tableDataRecord: 'Table records in the format {id: string, [fieldName]: string}[]', }); // You can also describe the expected return format using a string: // dataB will be an array of strings const dataB = await agent.aiQuery('string[], list of task names'); // dataC will be an array of objects const dataC = await agent.aiQuery('{name: string, age: string}[], table data records'); ``` ### `agent.aiBoolean()` Extract a boolean value from the UI. * Type ```typescript function aiBoolean(prompt: string): Promise; ``` * Parameters: * `prompt: string` - A natural language description of the expected value. * Return Value: * Returns a `Promise` when AI returns a boolean value. * Examples: ```typescript const bool = await agent.aiBoolean('Whether there is a login dialog'); ``` ### `agent.aiNumber()` Extract a number value from the UI. * Type ```typescript function aiNumber(prompt: string): Promise; ``` * Parameters: * `prompt: string` - A natural language description of the expected value. * Return Value: * Returns a `Promise` when AI returns a number value. * Examples: ```typescript const number = await agent.aiNumber('The remaining points of the account'); ``` ### `agent.aiString()` Extract a string value from the UI. * Type ```typescript function aiString(prompt: string): Promise; ``` * Parameters: * `prompt: string` - A natural language description of the expected value. * Return Value: * Returns a `Promise` when AI returns a string value. * Examples: ```typescript const string = await agent.aiString('The first item in the list'); ``` ## More APIs ### `agent.aiAssert()` Specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional `errorMsg` and a detailed reason generated by the AI. * Type ```typescript function aiAssert(assertion: string, errorMsg?: string): Promise; ``` * Parameters: * `assertion: string` - The assertion described in natural language. 
* `errorMsg?: string` - An optional error message to append if the assertion fails.
* Return Value:
  * Returns a Promise that resolves to void if the assertion passes; if it fails, an error is thrown with `errorMsg` and additional AI-provided information.
* Example:

```typescript
await agent.aiAssert('The price of "Sauce Labs Onesie" is 7.99');
```

:::tip
Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine `.aiQuery` with standard JavaScript assertions instead of using `.aiAssert`.

For example, you might replace the above code with:

```typescript
const items = await agent.aiQuery(
  '{name: string, price: number}[], return product names and prices'
);
const onesieItem = items.find(item => item.name === 'Sauce Labs Onesie');
expect(onesieItem).toBeTruthy();
expect(onesieItem.price).toBe(7.99);
```
:::

### `agent.aiLocate()`

Locate an element using natural language.

* Type

```typescript
function aiLocate(locate: string, options?: Object): Promise<{
  rect: { left: number; top: number; width: number; height: number; };
  center: [number, number];
}>;
```

* Parameters:
  * `locate: string` - A natural language description of the element to locate.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element.
* Return Value:
  * Returns a Promise that resolves to a locate info object (the element's bounding rect and center point) once the element is located.
* Examples:

```typescript
const locateInfo = await agent.aiLocate('The login button at the top of the page');
console.log(locateInfo);
```

### `agent.aiWaitFor()`

Wait until a specified condition, described in natural language, becomes true. Considering the cost of AI calls, the check interval will not exceed the specified `checkIntervalMs`.

* Type

```typescript
function aiWaitFor(
  assertion: string,
  options?: {
    timeoutMs?: number;
    checkIntervalMs?: number;
  }
): Promise<void>;
```

* Parameters:
  * `assertion: string` - The condition described in natural language.
  * `options?: object` - An optional configuration object containing:
    * `timeoutMs?: number` - Timeout in milliseconds (default: 15000).
    * `checkIntervalMs?: number` - Interval for checking in milliseconds (default: 3000).
* Return Value:
  * Returns a Promise that resolves to void if the condition is met; if not, an error is thrown when the timeout is reached.
* Examples:

```typescript
// Basic usage
await agent.aiWaitFor("There is at least one headphone information displayed on the interface");

// Using custom options
await agent.aiWaitFor("The shopping cart icon shows a quantity of 2", {
  timeoutMs: 30000,      // Wait for 30 seconds
  checkIntervalMs: 5000  // Check every 5 seconds
});
```

:::tip
Given the time consumption of AI services, `.aiWaitFor` might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.
:::

### `agent.runYaml()`

Execute an automation script written in YAML. Only the `tasks` part of the script is executed, and it returns the results of all `.aiQuery` calls within the script.

* Type

```typescript
function runYaml(yamlScriptContent: string): Promise<{ result: any }>;
```

* Parameters:
  * `yamlScriptContent: string` - The YAML-formatted script content.
* Return Value:
  * Returns an object with a `result` property that includes the results of all `.aiQuery` calls.
* Example: ```typescript const { result } = await agent.runYaml(` tasks: - name: search weather flow: - ai: input 'weather today' in input box, click search button - sleep: 3000 - name: query weather flow: - aiQuery: "the result shows the weather info, {description: string}" `); console.log(result); ``` :::tip For more information about YAML scripts, please refer to [Automate with Scripts in YAML](/en/automate-with-scripts-in-yaml.md). ::: ### `agent.setAIActionContext()` Set the background knowledge that should be sent to the AI model when calling `agent.aiAction()`. * Type ```typescript function setAIActionContext(actionContext: string): void; ``` * Parameters: * `actionContext: string` - The background knowledge that should be sent to the AI model. * Example: ```typescript await agent.setAIActionContext('Close the cookie consent dialog first if it exists'); ``` ### `agent.evaluateJavaScript()` Evaluate a JavaScript expression in the web page context. * Type ```typescript function evaluateJavaScript(script: string): Promise; ``` * Parameters: * `script: string` - The JavaScript expression to evaluate. * Return Value: * Returns the result of the JavaScript expression. * Example: ```typescript const result = await agent.evaluateJavaScript('document.title'); console.log(result); ``` ## Properties ### `.reportFile` The path to the report file. ## Additional Configurations ### Setting Environment Variables at Runtime You can override environment variables at runtime by calling the `overrideAIConfig` method. ```typescript import { overrideAIConfig } from '@midscene/web/puppeteer'; // or another Agent overrideAIConfig({ OPENAI_BASE_URL: "...", OPENAI_API_KEY: "...", MIDSCENE_MODEL_NAME: "..." }); ``` ### Print usage information for each AI call Set the `DEBUG=midscene:ai:profile:stats` to view the execution time and usage for each AI call. ```bash export DEBUG=midscene:ai:profile:stats ``` ### Customize the run artifact directory Set the `MIDSCENE_RUN_DIR` variable to customize the run artifact directory. ```bash export MIDSCENE_RUN_DIR=midscene_run # The default value is the midscene_run in the current working directory, you can set it to an absolute path or a relative path ``` ### Using LangSmith LangSmith is a platform for debugging large language models. To integrate LangSmith, follow these steps: ```bash # Set environment variables # Enable debug mode export MIDSCENE_LANGSMITH_DEBUG=1 # LangSmith configuration export LANGSMITH_TRACING_V2=true export LANGSMITH_ENDPOINT="https://api.smith.langchain.com" export LANGSMITH_API_KEY="your_key_here" export LANGSMITH_PROJECT="your_project_name_here" ``` After starting Midscene, you should see logs similar to: ```log DEBUGGING MODE: langsmith wrapper enabled ``` --- url: /automate-with-scripts-in-yaml.md --- # Automate with Scripts in YAML In most cases, developers write automation just to perform some smoke tests, like checking the appearance of some content, or verifying that the key user path is accessible. Maintaining a large test project is unnecessary in this situation. โ Midscene offers a way to do this kind of automation with `.yaml` files, which helps you to focus on the script itself instead of the test infrastructure. Any team member can write an automation script without learning any API. Here is an example of `.yaml` script, you may have already understood how it works by reading its content. 
```yaml
web:
  url: https://www.bing.com

tasks:
  - name: search weather
    flow:
      - ai: search for 'weather today'
      - sleep: 3000

  - name: check result
    flow:
      - aiAssert: the result shows the weather info
```

:::info Demo Project
You can find the demo project with YAML scripts at [https://github.com/web-infra-dev/midscene-example/tree/main/yaml-scripts-demo](https://github.com/web-infra-dev/midscene-example/tree/main/yaml-scripts-demo)

* [Web](https://github.com/web-infra-dev/midscene-example/tree/main/yaml-scripts-demo)
* [Android](https://github.com/web-infra-dev/midscene-example/tree/main/android/yaml-scripts-demo)
:::

## Setup AI model service

Set your model configs into the environment variables. You may refer to [choose a model](/en/choose-a-model.md) for more details.

```bash
# replace with your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```

Alternatively, you can store the configuration in a `.env` file located in the same directory where you run the command; the Midscene command-line tool will load it automatically.

```env filename=.env
OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```

## Install Command Line Tool

Install `@midscene/cli` globally

```bash
npm i -g @midscene/cli
# or if you prefer a project-wide installation
npm i @midscene/cli --save-dev
```

Write a yaml file named `bing-search.yaml` to automate a web browser:

```yaml
web:
  url: https://www.bing.com

tasks:
  - name: search weather
    flow:
      - ai: search for 'weather today'
      - sleep: 3000
      - aiAssert: the result shows the weather info
```

or to automate an Android device connected via adb:

```yaml
android:
  # launch: https://www.bing.com
  deviceId: s4ey59

tasks:
  - name: search weather
    flow:
      - ai: open browser and navigate to bing.com
      - ai: search for 'weather today'
      - sleep: 3000
      - aiAssert: the result shows the weather info
```

Run this script

```bash
midscene ./bing-search.yaml
# or if you installed midscene inside the project
npx midscene ./bing-search.yaml
```

You should see output showing the progress of the run and the path of the report file.

## Command line usage

### Run single `.yaml` file

```bash
midscene /path/to/yaml
```

### Run all `.yaml` files under a folder

```bash
midscene /dir/of/yaml/

# glob is also supported
midscene /dir/**/yaml/
```

## YAML file schema

There are two parts in a `.yaml` file: the `web/android` part and the `tasks` part.

The `web/android` part defines the basic setup of a task. Use the `web` parameter (previously named `target`) for web browser automation, and the `android` parameter for Android device automation. They are mutually exclusive.

### The `web` part

```yaml
web:
  # The URL to visit, required.
If `serve` is provided, provide the path to the file to visit url: # Serve the local path as a static server, optional serve: # The user agent to use, optional userAgent: # number, the viewport width, default is 1280, optional viewportWidth: # number, the viewport height, default is 960, optional viewportHeight: # number, the device scale factor (dpr), default is 1, optional deviceScaleFactor: # string, the path to the json format cookie file, optional cookie: # object, the strategy to wait for network idle, optional waitForNetworkIdle: # number, the timeout in milliseconds, 2000ms for default, optional timeout: # boolean, continue on network idle error, true for default continueOnNetworkIdleError: # string, the path to save the aiQuery result, optional output: # boolean, if limit the popup to the current page, true for default in yaml script forceSameTabNavigation: # string, the bridge mode to use, optional, default is false, can be 'newTabWithUrl' or 'currentTab'. More details see the following section bridgeMode: false | 'newTabWithUrl' | 'currentTab' # boolean, if close the new tabs after the bridge is disconnected, optional, default is false closeNewTabsAfterDisconnect: # boolean, if allow insecure https certs, optional, default is false acceptInsecureCerts: # string, the background knowledge to send to the AI model when calling aiAction, optional aiActionContext: ``` ### The `android` part ```yaml android: # The device id to use, optional, default is the first connected device deviceId: # The url to launch, optional, default is the current page launch: ``` ### The `tasks` part The `tasks` part is an array indicates the tasks to do. Remember to write a `-` before each item which means an array item. The interfaces of the `flow` part are almost the same as the [API](/en/API.md), except for some parameter levels. ```yaml tasks: - name: continueOnError: # optional, default is false flow: # Auto Planning (.ai) # ---------------- # perform an action, this is the shortcut for aiAction - ai: # this is the same as ai - aiAction: # Instant Action(.aiTap, .aiHover, .aiInput, .aiKeyboardPress, .aiScroll) # ---------------- # tap an element located by prompt - aiTap: deepThink: # optional, whether to use deepThink to precisely locate the element # hover an element located by prompt - aiHover: deepThink: # optional, whether to use deepThink to precisely locate the element # input text into an element located by prompt - aiInput: locate: deepThink: # optional, whether to use deepThink to precisely locate the element # press a key (like Enter, Tab, Escape, etc.) 
on an element located by prompt - aiKeyboardPress: locate: deepThink: # optional, whether to use deepThink to precisely locate the element # scroll globally or on an element located by prompt - aiScroll: direction: 'up' # or 'down' | 'left' | 'right' scrollType: 'once' # or 'untilTop' | 'untilBottom' | 'untilLeft' | 'untilRight' distance: # optional, distance to scroll in px locate: # optional, the element to scroll on deepThink: # optional, whether to use deepThink to precisely locate the element # Data Extraction # ---------------- # perform a query, return a json object - aiQuery: # remember to describe the format of the result in the prompt name: # the name of the result, will be used as the key in the output json # More APIs # ---------------- # wait for a condition to be met with a timeout (ms, optional, default 30000) - aiWaitFor: timeout: # perform an assertion - aiAssert: # sleep for a number of milliseconds - sleep: # evaluate a javascript expression in web page context - javascript: name: # assign a name to the return value, will be used as the key in the output json, optional - name: flow: # ... ``` ## More features ### Use environment variables in `.yaml` file You can use environment variables in `.yaml` file by `${variable-name}`. For example, if you have a `.env` file with the following content: ```env filename=.env topic=weather today ``` You can use the environment variable in the `.yaml` file like this: ```yaml #... - ai: type ${topic} in input box #... ``` ### Debug in headed mode > `web` scenario only 'headed mode' means the browser will be visible. The default behavior is to run in headless mode. To turn on headed mode, you can use `--headed` option. Besides, if you want to keep the browser window open after the script finishes, you can use `--keep-window` option. `--keep-window` implies `--headed`. When running in headed mode, it will consume more resources, so we recommend you to use it locally only when needed. ```bash # run in headed mode midscene /path/to/yaml --headed # run in headed mode and keep the browser window open after the script finishes midscene /path/to/yaml --keep-window ``` ### Use bridge mode > `web` scenario only > By using bridge mode, you can utilize YAML scripts to automate the web browser on your desktop. This is particularly useful if you want to reuse cookies, plugins, and page states, or if you want to manually interact with automation scripts. To use bridge mode, you should install the Chrome extension first, and use this configuration in the `target` section: ```diff web: url: https://www.bing.com + bridgeMode: newTabWithUrl ``` See [Bridge Mode by Chrome Extension](/en/bridge-mode-by-chrome-extension.md) for more details. ### Run yaml script with javascript You can also run a yaml script with javascript by using the [`runYaml`](/en/api.md#runyaml) method of the Midscene agent. Only the `tasks` part of the yaml script will be executed. ## FAQ **How to get cookies in JSON format from Chrome?** You can use this [chrome extension](https://chromewebstore.google.com/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc) to export cookies in JSON format. ## More You may also be interested in [Prompting Tips](/en/prompting-tips.md) --- url: /blog-introducing-instant-actions-and-deep-think.md --- # Introducing Instant Actions and Deep Think From Midscene v0.14.0, we have introduced two new features: Instant Actions and Deep Think. 
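As a quick preview before diving into the details, here is roughly how the two styles compare (a hedged sketch; the element descriptions are made up for illustration):

```typescript
// Auto planning: one natural-language instruction, Midscene plans the steps itself.
await agent.ai('type "Headphones" in the search box, hit Enter');

// Instant actions: you specify each action, the model only locates the elements.
await agent.aiInput('Headphones', 'the search box');
await agent.aiKeyboardPress('Enter');

// Deep Think: an extra locating pass for hard-to-distinguish elements.
await agent.aiTap('the filter icon next to the search box', { deepThink: true });
```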
## Instant Actions - A More Predictable Way to Perform Actions

You may already be familiar with our `.ai` interface. It's an auto-planning interface for interacting with web pages. For example, when performing a search, you can do this:

```typescript
await agent.ai('type "Headphones" in search box, hit Enter');
```

Behind the scenes, Midscene calls the LLM to plan the steps and then executes them. You can open the report file to see the process. It's a very common way for AI agents to handle these kinds of tasks.

![](/blog/report-planning.png)

In the meantime, many testing engineers want a faster way to perform actions. When using AI models with complex prompts, some LLMs may find it hard to plan the proper steps, or the coordinates of the elements may not be accurate. Debugging such an unpredictable process can be frustrating.

To solve this problem, we have introduced the `aiTap()`, `aiHover()`, `aiInput()`, `aiKeyboardPress()`, and `aiScroll()` interfaces. We call them the **"instant actions"**. These interfaces directly perform the specified action, as their names suggest, while the AI model is responsible only for the easier tasks such as locating elements. The whole process can be noticeably faster and more reliable after using them.

For example, the search action above can be rewritten as:

```typescript
await agent.aiInput('Headphones', 'search-box');
await agent.aiKeyboardPress('Enter');
```

The typical workflow in the report file looks like this; as you can see, there is no planning process in the report:

![](/blog/report-instant-action.png)

Scripts with instant actions may seem a little redundant (or not 'ai-style'), but we believe these structured interfaces are a good way to save debugging time when the action is already clear.

## Deep Think - A More Accurate Way to Locate Elements

When using Midscene with some complex widgets, the LLM may find it hard to locate the target element. We have introduced a new option named `deepThink` for the instant actions.

The signature of an instant action with `deepThink` looks like this:

```typescript
await agent.aiTap('target', { deepThink: true });
```

`deepThink` is a strategy for locating elements. It first finds an area that contains the target element, then "focuses" on this area to search for the element again. This way, the coordinates of the target element will be more accurate.

Let's take the workflow editor page of Coze.com as an example. There are many customized icons in the sidebar, which usually makes it hard for LLMs to distinguish the target element from its surroundings.

![](/blog/coze-sidebar.png)

After using `deepThink` in instant actions, the yaml scripts will look like this (of course, you can also use the javascript interface):

```yaml
tasks:
  - name: edit input panel
    flow:
      - aiTap: the triangle icon on the left side of the text "Input"
        deepThink: true
      - aiTap: the first checkbox in the Input form
        deepThink: true
      - aiTap: the expand button on the second row of the Input form (on the right of the checkbox)
        deepThink: true
      - aiTap: the delete button on the second last row of the Input form
        deepThink: true
      - aiTap: the add button on the last row of the Input form (second button from the right)
        deepThink: true
```

By viewing the report file, you can see that Midscene has found every target element in the area.

![](/blog/report-coze-deep-think.png)

Just like the example above, the highly-detailed prompt for `deepThink` adheres to [the prompting tips](./prompting-tips).
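For reference, here is a sketch of the first few steps of the same flow written with the JavaScript interface (the element descriptions are copied from the yaml above):

```typescript
// Sketch: the yaml flow above rewritten with instant actions in JavaScript.
await agent.aiTap('the triangle icon on the left side of the text "Input"', { deepThink: true });
await agent.aiTap('the first checkbox in the Input form', { deepThink: true });
await agent.aiTap(
  'the expand button on the second row of the Input form (on the right of the checkbox)',
  { deepThink: true },
);
```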
Writing highly detailed prompts like these is always the key to keeping the results stable.

`deepThink` is only available with models that support visual grounding, like qwen2.5-vl. If you are using LLMs like gpt-4o, it won't work.

--- url: /blog-support-android-automation.md ---

# Support Android Automation

From Midscene v0.15, we are happy to announce support for Android automation. The era of AI-driven Android automation is here!

## Showcases

### Navigation to attraction

Open Maps, search for a destination, and navigate to it.

### Auto-like tweets

Open Twitter, auto-like the first tweet by [@midscene\_ai](https://x.com/midscene_ai).

## Suitable for ALL apps

For our developers, all you need is an adb connection and a visual-language model (vl-model) service. Everything is ready!

Behind the scenes, we utilize the visual grounding capabilities of the vl-model to locate target elements on the screen. So, regardless of whether it's a native app, a [Lynx](https://github.com/lynx-family/lynx) page, or a hybrid app with a webview, it makes no difference. Developers can write automation scripts without the burden of worrying about the technology stack of the app.

## With ALL the power of Midscene

When using Midscene for web automation, our users love tools like the playground and reports. Now, we bring the same power to Android automation!

### Use the playground to run automation without any code

### Use the report to replay the whole process

### Write the automation scripts by yaml file

Connect to the device, open ebay.com, and get some items info.

```yaml
# search headphone on ebay, extract the items info into a json file, and assert the shopping cart icon
android:
  deviceId: s4ey59

tasks:
  - name: search headphones
    flow:
      - aiAction: open browser and navigate to ebay.com
      - aiAction: type 'Headphones' in ebay search box, hit Enter
      - sleep: 5000
      - aiAction: scroll down the page for 800px

  - name: extract headphones info
    flow:
      - aiQuery: >
          {name: string, price: number, subTitle: string}[],
          return item name, price and the subTitle on the lower right corner of each item
        name: headphones

  - name: assert Filter button
    flow:
      - aiAssert: There is a Filter button on the page
```

### Use the javascript SDK

Use the javascript SDK to do the automation by code.

```ts
import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';
import "dotenv/config"; // read environment variables from .env file

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

Promise.resolve(
  (async () => {
    const devices = await getConnectedDevices();
    const page = new AndroidDevice(devices[0].udid);

    // 👀 init Midscene agent
    const agent = new AndroidAgent(page, {
      aiActionContext:
        'If any location, permission, user agreement, etc. popup, click agree. If login page pops up, close it.',
    });
    await page.connect();
    await page.launch('https://www.ebay.com');

    await sleep(5000);

    // 👀 type keywords, perform a search
    await agent.aiAction('type "Headphones" in search box, hit Enter');

    // 👀 wait for the loading
    await agent.aiWaitFor("there is at least one headphone item on page");
    // or you may use a plain sleep:
    // await sleep(5000);

    // 👀 understand the page content, find the items
    const items = await agent.aiQuery(
      "{itemTitle: string, price: Number}[], find item in list and corresponding price"
    );
    console.log("headphones in stock", items);

    // 👀 assert by AI
    await agent.aiAssert("There is a category filter on the left");
  })()
);
```

### Two style APIs to do interaction

The auto-planning style:

```javascript
await agent.ai('input "Headphones" in search box, hit Enter');
```

The instant action style:

```javascript
await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');
```

## Quick start

You can use the playground to experience Android automation without any code. Please refer to [Quick experience with Android](/en/quick-experience-with-android.md) for more details.

After the experience, you can integrate with the Android device by javascript code. Please refer to [Integrate with Android(adb)](/en/integrate-with-android.md) for more details.

If you prefer yaml files for automation scripts, please refer to [Automate with scripts in yaml](/en/automate-with-scripts-in-yaml.md).

### Demo projects

We have prepared a demo project for the javascript SDK: [JavaScript demo project](https://github.com/web-infra-dev/midscene-example/blob/main/android/javascript-sdk-demo)

If you want to use the automation for testing purposes, you can use javascript with vitest. We have set up a demo project for you to see how it works: [Vitest demo project](https://github.com/web-infra-dev/midscene-example/blob/main/android/vitest-demo)

You can also write the automation scripts in a yaml file: [YAML demo project](https://github.com/web-infra-dev/midscene-example/blob/main/android/yaml-scripts-demo)

## Limitations

1. The caching feature for element locating is not supported. Since no view hierarchy is collected, we cannot cache the element identifier and reuse it.
2. LLMs like gpt-4o or deepseek are not supported. Only some known vl models with visual grounding ability are supported for now. If you want to introduce other vl models, please let us know.
3. The performance is not good enough for now. We are still working on it.
4. The vl model may not perform well on `.aiQuery` and `.aiAssert`. We will provide a way to switch models for different kinds of tasks.
5. Due to some security restrictions, you may get a blank screenshot for password input, and Midscene will not be able to work in that case.

## Credits

We would like to thank the following projects:

* [scrcpy](https://github.com/Genymobile/scrcpy) and [yume-chan](https://github.com/yume-chan) allow us to control Android devices from the browser.
* [appium-adb](https://github.com/appium/appium-adb) for the javascript bridge of adb.
* [YADB](https://github.com/ysbing/YADB) for the yadb tool, which improves the performance of text input.

--- url: /bridge-mode-by-chrome-extension.md ---

# Bridge Mode by Chrome Extension

The bridge mode in the Midscene Chrome extension is a tool that allows you to use local scripts to control the desktop version of Chrome. Your scripts can connect to either a new tab or the currently active tab.
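In code, the connection is a single call on the bridge agent before any other action (a minimal sketch; both methods are documented below):

```typescript
import { AgentOverChromeBridge } from "@midscene/web/bridge-mode";

const agent = new AgentOverChromeBridge();
// connect to a fresh tab opened at the given URL...
await agent.connectNewTabWithUrl("https://www.bing.com");
// ...or, alternatively, attach to the tab that is currently active:
// await agent.connectCurrentTab();
```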
Using the desktop version of Chrome allows you to reuse all cookies, plugins, page state, and everything else you want. You can work together with automation scripts to complete your tasks. This mode is commonly referred to as 'man-in-the-loop' in the context of automation.

![bridge mode](/midscene-bridge-mode.jpg)

:::info Demo Project
Check the demo project of bridge mode: [https://github.com/web-infra-dev/midscene-example/blob/main/bridge-mode-demo](https://github.com/web-infra-dev/midscene-example/blob/main/bridge-mode-demo)
:::

## Setup AI model service

Set your model configs into the environment variables. You may refer to [choose a model](/en/choose-a-model.md) for more details.

```bash
# replace with your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```

> In bridge mode, the AI model configs should be set on the Node.js side instead of the browser side.

## Step 1. Install Midscene extension from Chrome web store

Install [Midscene extension from Chrome web store](https://chromewebstore.google.com/detail/midscene/gbldofcpkknbggpkmbdaefngejllnief)

## Step 2. install dependencies

## Step 3. write scripts

Write and save the following code as `./demo-new-tab.ts`.

```typescript
import { AgentOverChromeBridge } from "@midscene/web/bridge-mode";

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

Promise.resolve(
  (async () => {
    const agent = new AgentOverChromeBridge();

    // This will connect to a new tab on your desktop Chrome
    // remember to start your chrome extension and click the 'allow connection' button. Otherwise you will get a timeout error
    await agent.connectNewTabWithUrl("https://www.bing.com");

    // these are the same as normal Midscene agent
    await agent.ai('type "AI 101" and hit Enter');
    await sleep(3000);

    await agent.aiAssert("there are some search results");
    await agent.destroy();
  })()
);
```

## Step 4. start the chrome extension

Start the chrome extension and switch to the 'Bridge Mode' tab. Click "Allow connection".

## Step 5. run the script

Run your script

```bash
tsx demo-new-tab.ts
```

After executing the script, you should see the status of the Chrome extension switch to 'connected', and a new tab opened. Now this tab is controlled by your script.

:::tip
It does not matter whether the script is run before or after clicking 'Allow connection' in the browser.
:::

## Constructor

```typescript
import { AgentOverChromeBridge } from "@midscene/web/bridge-mode";

const agent = new AgentOverChromeBridge();
```

Besides [the normal parameters in the agent constructor](/en/api.md), `AgentOverChromeBridge` accepts one more parameter:

* `closeNewTabsAfterDisconnect?: boolean`: If true, the newly created tab will be closed when the bridge is destroyed. Default is false.

## Methods

Besides [the normal agent interface](/en/api.md), `AgentOverChromeBridge` provides some additional interfaces to control the desktop Chrome.

:::info
You should always call `connectCurrentTab` or `connectNewTabWithUrl` before performing further actions. Each agent instance can only connect to one tab, and it cannot be reconnected after being destroyed.
:::

### `connectCurrentTab()`

Connect to the currently active tab.

* Type

```typescript
function connectCurrentTab(options?: { forceSameTabNavigation?: boolean }): Promise<void>;
```

* Parameters:
  * `options?: object` - Optional configuration object
    * `forceSameTabNavigation?: boolean` - If true (default), restricts pages from opening new tabs, forcing new pages to open in the current tab to prevent AI operation failures due to manual tab switching.
      This configuration usually doesn't need to be changed.
* Returns:
  * Returns a Promise that resolves to void when connected successfully; throws an error if the connection fails
* Example:

```typescript
try {
  await agent.connectCurrentTab();
  console.log('Successfully connected to current tab');
} catch (err) {
  console.error('Connection failed:', err.message);
}
```

### `connectNewTabWithUrl()`

Create a new tab and connect to it immediately.

* Type

```typescript
function connectNewTabWithUrl(
  url: string,
  options?: { forceSameTabNavigation?: boolean }
): Promise<void>;
```

* Parameters:
  * `url: string` - URL to open in the new tab
  * `options?: object` - Optional configuration object (same parameters as connectCurrentTab)
* Returns:
  * Returns a Promise that resolves to void when connected successfully; throws an error if the connection fails
* Example:

```typescript
// Open Bing and wait for connection
await agent.connectNewTabWithUrl(
  "https://www.bing.com",
  { forceSameTabNavigation: false }
);
```

### `destroy()`

Destroy the connection and release resources.

* Type

```typescript
function destroy(closeNewTabsAfterDisconnect?: boolean): Promise<void>;
```

* Parameters:
  * `closeNewTabsAfterDisconnect?: boolean` - If true, the newly created tab will be closed when the bridge is destroyed. Default is false. This overrides the `closeNewTabsAfterDisconnect` parameter in the constructor.
* Returns:
  * Returns a Promise that resolves to void when destruction completes
* Example:

```typescript
// Destroy connection after completing operations
await agent.ai('Perform final operation...');
await agent.destroy();
```

## Use bridge mode in yaml-script

[Yaml scripts](/en/automate-with-scripts-in-yaml.md) are a way for developers to write automation scripts in yaml format, which is easier to read and write compared to javascript.

To use bridge mode in a yaml script, set the `bridgeMode` property in the `target` section. If you want to use the current tab, set it to `currentTab`; otherwise, set it to `newTabWithUrl`.

Set `closeNewTabsAfterDisconnect` to true if you want to close the newly created tabs when the bridge is destroyed. This is optional and the default value is false.

For example, the following script will open a new tab through the Chrome extension bridge:

```diff
target:
  url: https://www.bing.com
+ bridgeMode: newTabWithUrl
+ closeNewTabsAfterDisconnect: true
tasks:
```

Run the script:

```bash
midscene ./bing.yaml
```

Remember to start the chrome extension and click the 'Allow connection' button after the script is running.

### Unsupported options

In bridge mode, these options will be ignored (they will follow your desktop browser's settings):

* userAgent
* viewportWidth
* viewportHeight
* viewportScale
* waitForNetworkIdle
* cookie

## FAQ

* Where should I configure the `OPENAI_API_KEY`, in the browser or in the terminal?

  When using bridge mode, you should configure the `OPENAI_API_KEY` in the terminal.

--- url: /caching.md ---

# Caching

Midscene supports caching the AI's planning steps and matched DOM element information to reduce calls to AI models and improve execution efficiency.

**Effect**

After enabling the cache, the execution time is significantly reduced, for example, from 39s to 13s.

* **before using cache, 39s**

![](/cache/no-cache-time.png)

* **after using cache, 13s**

![](/cache/use-cache-time.png)

## Instructions

There are two key points to using caching:

1. Set `MIDSCENE_CACHE=1` in the environment variables.
2. Set `cacheId` to specify the cache file name.
It's automatically set in Playwright and Yaml mode, but if you are using javascript SDK, you should set it manually. ### Playwright In playwright mode, you can use the `MIDSCENE_CACHE=1` environment variable to enable caching. The `cacheId` will be automatically set to the test file name. ```diff - playwright test --config=playwright.config.ts + MIDSCENE_CACHE=1 playwright test --config=playwright.config.ts ``` ### Javascript agent, like PuppeteerAgent, AgentOverChromeBridge Enable caching by setting the `MIDSCENE_CACHE=1` environment variable. And also, you should set the `cacheId` to specify the cache identifier. ```diff - tsx demo.ts + MIDSCENE_CACHE=1 tsx demo.ts ``` ```javascript const mid = new PuppeteerAgent(originPage, { cacheId: 'puppeteer-swag-sab', // specify cache id }); ``` ### Yaml Enable caching by setting the `MIDSCENE_CACHE=1` environment variable. The `cacheId` will be automatically set to the yaml filename. ```diff - npx midscene ./bing-search.yaml + # Add cache identifier, cacheId is the yaml filename + MIDSCENE_CACHE=1 npx midscene ./bing-search.yaml ``` ## Cache strategy ### Cache content These two types of content will be cached: 1. the results of AI's planning (i.e., the results of ai and aiAction methods) 2. The results of AI's element locating The results of `aiQuery` and `aiAssert` will never be cached. You can always use them to verify whether the AI's task is as expected. ### Cache hit conditions Cache will only be hit when the following conditions are met: 1. The same `cacheId` 2. The same major and minor version of Midscene 3. The same page url and screen size When using cache for locate element tasks, the following conditions are also required: 1. A DOM element with the same position and size can be found on the page according to the cache file. 2. If you are using VL model, there must be a DOM element matched with the coordinates. Otherwise, you will see a "POSITION NODE" in report file which means it cannot be cached. ### If cache is not hit If cache is not hit, Midscene will call AI model again and the result in cache file will be updated. ## Common issues ### Why the cache is missed on CI? You should commit the cache file to the repository (usually in the `./midscene_run/cache` directory). And also, check the cache-hit conditions. ### Does it mean that AI services are no longer needed after using cache? No. Caching is not a tool for ensuring long-term script stability. We have noticed many scenarios where the cache may miss when the page changes, such as when the element position changes slightly or the DOM structure changes. AI services are still needed to reevaluate the task when the cache miss occurs. ### How to manually remove the cache? You can remove the cache file in the `cache` directory, or edit the contents in the cache file. --- url: /choose-a-model.md --- # Choose a Model In this article, we will talk about what kind of models are supported by Midscene.js and the features of each model. ## Quick Start for using Midscene.js Choose one of the following models, obtain the API key, complete the configuration, and you are ready to go. Choose the model that is easiest to obtain if you are a beginner. If you want to see the detailed configuration of model services, see [Config Model and Provider](./model-provider). ### GPT-4o (can't be used in Android automation) ```bash OPENAI_API_KEY="......" OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # optional, if you want an endpoint other than the default one from OpenAI. 
MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # optional. The default is "gpt-4o". ``` ### Qwen-2.5-VL on Openrouter or Aliyun After applying for the API key on [Openrouter](https://openrouter.ai) or [Aliyun](https://aliyun.com), you can use the following config: ```bash # openrouter.ai export OPENAI_BASE_URL="https://openrouter.ai/api/v1" export OPENAI_API_KEY="......" export MIDSCENE_MODEL_NAME="qwen/qwen2.5-vl-72b-instruct" export MIDSCENE_USE_QWEN_VL=1 # or from Aliyun.com OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1" export OPENAI_API_KEY="......" MIDSCENE_MODEL_NAME="qwen-vl-max-latest" MIDSCENE_USE_QWEN_VL=1 ``` ### Gemini-2.5-Pro on Google Gemini After applying for the API key on [Google Gemini](https://gemini.google.com/), you can use the following config: ```bash OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/" OPENAI_API_KEY="......" MIDSCENE_MODEL_NAME="gemini-2.5-pro-preview-03-25" MIDSCENE_USE_GEMINI=1 ``` ### UI-TARS on volcengine.com You can use `doubao-1.5-ui-tars` on [Volcengine](https://www.volcengine.com): ```bash OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3" OPENAI_API_KEY="...." MIDSCENE_MODEL_NAME="ep-2025..." MIDSCENE_USE_VLM_UI_TARS=DOUBAO ``` ## Models in Depth Midscene supports two types of models, which are: 1. **general-purpose multimodal LLMs**: Models that can understand text and image input. *GPT-4o* is this kind of model. 2. **models with visual grounding capabilities (VL models)**: Besides the ability to understand text and image input, these models can also return the coordinates of target elements on the page. We have adapted *Qwen-2.5-VL-72B*, *Gemini-2.5-Pro* and *UI-TARS* as VL models. And we are primarily concerned with two features of the model: 1. The ability to understand the screenshot and *plan* the steps to achieve the goal. 2. The ability to *locate* the target elements on the page. The main difference between different models is the way they handle the *locating* capability. When using LLMs like GPT-4o, locating is accomplished through the model's understanding of the UI hierarchy tree and the markup on the screenshot, which consumes more tokens and does not always yield accurate results. In contrast, when using VL models, locating relies on the model's visual grounding capabilities, providing a more native and reliable solution in complex situations. In the Android automation scenario, we decided to use the VL models since the infrastructure of the App in the real world is so complex that we don't want to do any adaptive work on the App UI stack any more. The VL models can provide us with more reliable results, and it should be a better approach to this type of work. ## The Recommended Models ### GPT-4o GPT-4o is a multimodal LLM by OpenAI, which supports image input. This is the default model for Midscene.js. When using GPT-4o, a step-by-step prompting is preferred. **Features** - **Easy to achieve**: you can get the stable API service from many providers and just pay for the token. - **Performing steadily**: it performs well on interaction, assertion, and query. **Limitations when used in Midscene.js** - **High token cost**: dom tree and screenshot will be sent together to the model. For example, it will use 6k input tokens for ebay homepage under 1280x800 resolution, and 9k for search result page. As a result, the cost will be higher than other models. And it will also take longer time to generate the response. 
- **Content limitation**: it will not work if the target element is inside a cross-origin `<iframe>`.