Changelog

For the complete changelog, please refer to: Midscene Releases

v0.19.0 - Support for Getting Complete Execution Process Data

New API for Getting Midscene Execution Process Data

Add the _unstableLogContent API to the agent. It returns Midscene's execution process data, including the timing of each step, the AI tokens consumed, and screenshots.

The built-in report is generated from this data, which means you can use the same data to build your own custom report.
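A minimal sketch of reading this data (assuming a Puppeteer-based agent and an existing page object; the exact shape of the returned log is unstable by design):

import { PuppeteerAgent } from '@midscene/web/puppeteer';

const agent = new PuppeteerAgent(page); // page is an existing puppeteer Page
await agent.aiTap('the "Sign in" button');

// dump the execution process data: step timings, token usage, screenshots
const logContent = agent._unstableLogContent();
console.log(JSON.stringify(logContent, null, 2));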

Read more: API documentation

CLI Support for Adjusting Midscene Env Variable Priority

By default, dotenv does not let variables defined in the .env file override existing global environment variables. If you want the .env values to take precedence, use the --dotenv-override option.
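For example (a sketch, assuming a YAML script run through the Midscene CLI):

# let values in .env take precedence over existing global environment variables
npx @midscene/cli --dotenv-override ./my-script.yaml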

Read more: Use YAML-based Automation Scripts

Reduce Report File Size

Trim redundant data from the generated report, significantly reducing the file size for complex pages. A typical report for a complex page shrinks from 47.6 MB to 15.6 MB!

v0.18.0 - Enhanced Reporting Features

🚀 Midscene has another update! It makes your testing and automation processes even more powerful:

Custom Node in Report

  • Add the logScreenshot API to the agent. It takes a screenshot of the current page and inserts it into the report as a node, with support for setting a node title and description to make the automated testing process more intuitive. Useful for capturing key steps, recording error states, UI validation, etc.

  • Example:
test('login github', async ({ ai, aiAssert, aiInput, logScreenshot }) => {
  if (CACHE_TIME_OUT) {
    test.setTimeout(200 * 1000);
  }
  await ai('Click the "Sign in" button');
  await aiInput('quanru', 'username');
  await aiInput('123456', 'password');

  // log a custom screenshot node
  await logScreenshot('Login page', {
    content: 'Username is quanru, password is 123456',
  });

  await ai('Click the "Sign in" button');
  await aiAssert('Login success');
});

Support for Downloading Reports as Videos

  • Videos can be downloaded directly from the report player by clicking the download button on the player interface.

  • Applicable scenarios: Share test results, archive reproduction steps, and demonstrate problem reproduction.

More Android Configurations Exposed

  • Optimize input interactions in Android apps and allow connecting to remote Android devices

    • autoDismissKeyboard?: boolean - Optional parameter. Whether to automatically dismiss the keyboard after entering text. The default value is true.

    • androidAdbPath?: string - Optional parameter. Used to specify the path of the adb executable file.

    • remoteAdbHost?: string - Optional parameter. Used to specify the remote adb host.

    • remoteAdbPort?: number - Optional parameter. Used to specify the remote adb port.

  • Examples:

const agent = await agentFromAdbDevice('s4ey59', {
    autoDismissKeyboard: false, // optional; whether to automatically dismiss the keyboard after entering text (default: true)
    androidAdbPath: '/usr/bin/adb', // optional; path of the adb executable file
    remoteAdbHost: '192.168.10.1', // optional; remote adb host
    remoteAdbPort: 5037 // optional; remote adb port (a number, not a string)
})

await agent.aiInput('Test Content', 'Search Box', { autoDismissKeyboard: true })

Upgrade now to experience these powerful new features!

v0.17 - Let AI See the DOM of the Page

Data Query API Enhanced

To support more automation and data extraction scenarios, the following APIs now accept an options parameter, giving more flexible control over DOM information and screenshots:

  • agent.aiQuery(dataDemand, options)
  • agent.aiBoolean(prompt, options)
  • agent.aiNumber(prompt, options)
  • agent.aiString(prompt, options)

New options parameter

  • domIncluded: Whether to pass the simplified DOM information to the AI model; off by default. This is useful for extracting attributes that are not visible on the page, like image links.
  • screenshotIncluded: Whether to pass the screenshot to the AI model; on by default.

Code Example

// Extract all contact information (including hidden avatarUrl attributes)
const contactsData = await agent.aiQuery(
  "{name: string, id: number, company: string, department: string, avatarUrl: string}[], extract all contact information including hidden avatarUrl attributes",
  { domIncluded: true }
);

// Check if the id attribute of the first contact is 1
const isId1 = await agent.aiBoolean(
  "Is the first contact's id is 1?",
  { domIncluded: true }
);

// Get the ID of the first contact (hidden attribute)
const firstContactId = await agent.aiNumber("First contact's id?", { domIncluded: true });

// Get the avatar URL of the first contact (invisible attribute on the page)
const avatarUrl = await agent.aiString(
  "What is the Avatar URL of the first contact?",
  { domIncluded: true }
);

New Right-Click Ability

Have you ever encountered a scenario where you need to automate a right-click operation? Now, Midscene supports a new agent.aiRightClick() method!

Function

Perform a right-click operation on the specified element, suitable for scenarios where right-click events are customized on web pages. Please note that Midscene cannot interact with the browser's native context menu after right-click.

Parameter Description

  • locate: Describe the element you want to operate in natural language
  • options: Optional, supports deepThink (AI fine-grained positioning) and cacheable (result caching)

Example

// Right-click on a contact in the contacts application, triggering a custom context menu
await agent.aiRightClick("Alice Johnson");

// Then you can click on the options in the menu
await agent.aiTap("Copy Info"); // Copy contact information to the clipboard

A Complete Example

In this report file, we show a complete example of using the new aiRightClick API and new query parameters to extract contact data including hidden attributes.

Report file: puppeteer-2025-06-04_20-34-48-zyh4ry4e.html

The corresponding code can be found in our example repository: puppeteer-demo/extract-data.ts

Refactor Cache

The cache now uses XPath instead of coordinates, improving the cache hit rate.

The cache file format has been changed from JSON to YAML to improve readability.

v0.16 - Support MCP

Midscene MCP

  • 🤖 Use Cursor / Trae to help write test cases.

  • 🕹️ Quickly implement browser operations akin to the Manus platform.

  • 🔧 Integrate Midscene capabilities swiftly into your platforms and tools.

Read more: MCP

Support structured APIs for the agent

APIs: aiBoolean, aiNumber, aiString, aiLocate
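For example (a sketch; the prompts and page content are illustrative):

const isLoggedIn = await agent.aiBoolean('Is the user currently logged in?');
const cartCount = await agent.aiNumber('How many items are in the shopping cart?');
const pageTitle = await agent.aiString('What is the main heading of the page?');
const loginButton = await agent.aiLocate('The "Log in" button in the top-right corner'); // returns the element's position info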

Read more: Use JavaScript to Optimize the AI Automation Code

v0.15 - Android automation unlocked!

Android automation unlocked!

  • 🤖 AI Playground: natural‑language debugging

  • 📱 Supports native, Lynx & WebView apps

  • 🔁 Replayable runs

  • 🛠️ YAML or JS SDK

  • ⚡ Auto‑planning & Instant Actions APIs

Read more: Android automation

More features

  • Allow custom midscene_run dir
  • Enhance report filename generation with unique identifiers and support split mode
  • Enhance timeout configurations and logging for network idle and navigation
  • Adapt for gemini-2.5-pro

v0.14 - Instant Actions

"Instant Actions" introduces new atomic APIs, enhancing the accuracy of AI operations.

Read more: Instant Actions

v0.13 - DeepThink Mode

Atomic AI Interaction Methods

  • Supports aiTap, aiInput, aiHover, aiScroll, and aiKeyboardPress for precise AI actions.

DeepThink Mode

  • Enhances click accuracy with deeper contextual understanding.
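For example (a sketch using the deepThink option on an atomic method):

// deepThink makes the model inspect the target area more closely before acting,
// which helps with small or densely packed elements
await agent.aiTap('the gear-shaped settings icon in the toolbar', { deepThink: true });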

v0.12 - Integrate Qwen 2.5 VL

Integrate Qwen 2.5 VL's native capabilities

  • Keeps output accuracy.
  • Supports more element interactions.
  • Cuts operating cost by over 80%.

v0.11.0 - UI-TARS Model Caching

✨ UI-TARS Model Supports Caching

✨ Optimize DOM Tree Extraction Strategy

  • Optimize the information density of the extracted DOM tree, accelerating the inference process for models like GPT-4o

v0.10.0 - UI-TARS Model Released

UI-TARS is a native GUI agent model released by the Seed team. It is named after the TARS robot in the movie Interstellar, which has high intelligence and autonomous thinking capabilities. UI-TARS takes images and human instructions as input, perceives the correct next action, and progressively approaches the goal of the instruction, achieving the best performance on various GUI automation benchmarks compared with both open-source and closed-source commercial models.

UI-TARS: Pioneering Automated GUI Interaction with Native Agents - Figure 1

UI-TARS: Pioneering Automated GUI Interaction with Native Agents - Figure 4

Model Advantage

UI-TARS has the following advantages in GUI tasks:

  • Target-driven

  • Fast inference speed

  • Native GUI agent model

  • Private deployment without data security issues

v0.9.0 - Bridge Mode Released

With the Midscene browser extension, you can now use scripts to link with the desktop browser for automated operations!

We call it "Bridge Mode".

Compared with debugging in a CI environment, the advantages are:

  1. You can reuse the desktop browser, especially its cookies, login state, and front-end interface state, and start automation without worrying about environment setup.

  2. Manual operation and scripts can cooperate, improving the flexibility of automation tools.

  3. For simple business regression tests, just run them locally with Bridge Mode.
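A minimal sketch of scripting against the desktop browser (assuming the AgentOverChromeBridge API described in the Bridge Mode docs):

import { AgentOverChromeBridge } from '@midscene/web/bridge-mode';

const agent = new AgentOverChromeBridge();
// connect to the desktop Chrome through the Midscene extension
await agent.connectNewTabWithUrl('https://www.bing.com');
await agent.ai('search for "midscene"');
await agent.destroy();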

Documentation: Use Chrome Extension to Experience Midscene

v0.8.0 - Chrome Extension

✨ New Chrome Extension, Run Midscene Anywhere

Through the Midscene browser extension, you can run Midscene on any page, without writing any code.

Experience it now 👉: Use Chrome Extension to Experience Midscene

v0.7.0 - Playground Ability

✨ Playground Ability, Debug Anytime

Now you don't have to keep re-running scripts to debug prompts!

On the new test report page, you can debug the AI execution results at any time, including page operations, page information extraction, and page assertions.

v0.6.0 - Doubao Model Support

✨ Doubao Model Support

  • Support for calling Doubao models; configure the environment variables below to try it out.
MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"https://xxx.net/api/v3","apiKey":"xxx"}'
MIDSCENE_MODEL_NAME='ep-20240925111815-mpfz8'
MIDSCENE_MODEL_TEXT_ONLY='true'

To summarize the availability of Doubao models:

  • Doubao currently offers only text-only models, which means "seeing" is not available. It performs well in scenarios where reasoning over pure text is enough.

  • If the use case requires visual UI analysis, it is completely unusable.

Example:

✅ The price of a multi-meat grape (can be guessed from the order of the text on the interface)

✅ The language switch text button (can be guessed from the text content on the interface: Chinese, English text)

❌ The left-bottom play button (requires image understanding, failed)

✨ Support for GPT-4o Structured Output, Cost Reduction

By using the gpt-4o-2024-08-06 model, Midscene now supports structured output, improving stability and reducing costs by more than 40%.
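To use it, select that model via the environment variable shown in the Doubao example above:

MIDSCENE_MODEL_NAME='gpt-4o-2024-08-06'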

Midscene can now hit GPT-4o's prompt caching feature, and the cost of AI calls will continue to decrease as the company's GPT platform deployment proceeds.

✨ Test Report: Support Animation Playback

Now you can view an animated playback of each step in the test report, making it quick to debug your script.

✨ Speed Up: Merge Plan and Locate Operations, Response Speed Increased by 30%

In the new version, we have partially merged the Plan and Locate operations during prompt execution, increasing AI response speed by 30%.


✨ Test Report: The Accuracy of Different Models

  • GPT-4o series models: 100% correct rate

  • doubao-pro-4k pure text model: approaching a usable state

🐞 Problem Fix

  • Optimize the page information extraction to avoid collecting obscured elements, improving success rate, speed, and AI call cost 🚀


v0.5.0 - Support GPT-4o Structured Output

✨ New Features

  • Support for the gpt-4o-2024-08-06 model to enforce a strict JSON output format, reducing hallucination in Midscene task planning

  • Support real-time visualization of Playwright AI behavior, improving troubleshooting efficiency

  • Cache generalization: cache capabilities are no longer limited to Playwright; pagepass and puppeteer can also use the cache
- playwright test --config=playwright.config.ts
# Enable cache
+ MIDSCENE_CACHE=true playwright test --config=playwright.config.ts
  • Support for Azure OpenAI

  • Support for AI to add, delete, and modify the existing input

🐞 Problem Fix

  • Optimize the page information extraction to avoid collecting obscured elements, improving success rate, speed, and AI call cost 🚀

  • During the AI interaction process, unnecessary attribute fields were trimmed, reducing token consumption.

  • Optimize the AI interaction process to reduce the likelihood of hallucination in KeyboardPress and Input events

  • For pagepass, provide a workaround for the interface flickering that occurs while Midscene is running

// pagepass currently depends on an outdated version of puppeteer, which may cause
// the interface to flicker and the cursor to be lost; the following workaround fixes this
const originScreenshot = puppeteerPage.screenshot;
puppeteerPage.screenshot = async (options) => {
  return await originScreenshot.call(puppeteerPage, {
    ...options,
    captureBeyondViewport: false
  });
};

v0.4.0 - Support CLI Usage

✨ New Features

  • Support for CLI usage, lowering the barrier to getting started with Midscene
# headed mode (visible browser) access baidu.com and search "weather"
npx @midscene/cli --headed --url https://www.baidu.com --action "input 'weather', press enter" --sleep 3000

# visit github status page and save the status to ./status.json
npx @midscene/cli --url https://www.githubstatus.com/ \
  --query-output status.json \
  --query '{serviceName: string, status: string}[], github page status, return service name'
  • Support for AI to wait a certain amount of time before continuing with the subsequent tasks, as sketched below
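For example (a sketch using the current aiWaitFor API; the exact form at the time of this release may have differed):

// poll with AI until the assertion holds, up to the timeout
await agent.aiWaitFor('the search results have finished loading', { timeoutMs: 30000 });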

  • Playwright AI task report shows the overall time and aggregates AI tasks by test group

🐞 Problem Fix

  • Optimize the AI interaction process to reduce the likelihood of hallucination in KeyboardPress and Input events

v0.3.0 - Support AI Report HTML

✨ New Features

  • Generate the AI report in HTML format, aggregating AI tasks by test group to facilitate report distribution

🐞 Problem Fix

  • Fix the scrolling preview problem in the AI report

v0.2.0 - Control puppeteer by natural language

✨ New Features

  • Support for using natural language to control puppeteer for page automation 🗣️💻 (see the sketch after this list)

  • Provide AI cache capabilities for the Playwright framework, improving stability and execution efficiency

  • AI report visualization: aggregate AI tasks by test group to facilitate report distribution

  • Support for AI page assertions, letting AI judge whether the page meets certain conditions
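A minimal sketch of driving puppeteer with natural language (assuming the PuppeteerAgent export from @midscene/web/puppeteer):

import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.bing.com');

const agent = new PuppeteerAgent(page);
await agent.ai('type "weather today" in the search box and press Enter'); // natural-language control
await agent.aiAssert('search results are displayed'); // AI judges the page state
await browser.close();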

v0.1.0 - Control playwright by natural language

✨ New Features

  • Support for using natural language to control playwright to implement page automation 🗣️💻

  • Support for using natural language to extract page information 🔍🗂️

  • AI report visualization: visualize AI behavior and AI thinking 🛠️👀

  • Direct use of GPT-4o model, no training required 🤖🔧
