For the complete changelog, please refer to: Midscene Releases
Added the `_unstableLogContent` API to the agent. It returns Midscene's execution process data, including the time of each step, the AI tokens consumed, and screenshots.
The report is generated from this data, which means you can use it to build your own custom reports.
Read more: API documentation
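As a sketch of what consuming this data might look like: the log shape below is an assumption for illustration, not the documented schema, so check the API documentation for the real field names.

```typescript
// Hypothetical shape of one execution step in the _unstableLogContent data.
// These field names are invented for illustration, not the documented schema.
interface StepLog {
  title: string;
  durationMs: number;
  tokensConsumed: number;
  screenshotBase64?: string;
}

// Summarize total time and token usage across steps, e.g. to feed a custom report.
function summarize(steps: StepLog[]): { totalMs: number; totalTokens: number } {
  return steps.reduce(
    (acc, s) => ({
      totalMs: acc.totalMs + s.durationMs,
      totalTokens: acc.totalTokens + s.tokensConsumed,
    }),
    { totalMs: 0, totalTokens: 0 }
  );
}

const demo: StepLog[] = [
  { title: "locate button", durationMs: 1200, tokensConsumed: 350 },
  { title: "click button", durationMs: 300, tokensConsumed: 120 },
];
console.log(summarize(demo)); // { totalMs: 1500, totalTokens: 470 }
```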
By default, `dotenv` does not override existing global environment variables with the values in the `.env` file. If you want to override them, use the `--dotenv-override` option.
Read more: Use YAML-based Automation Scripts
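A sketch of how this looks in practice; the CLI entry point shown is an assumption, so check the YAML automation docs for your setup:

```bash
# .env in the project root:
#   OPENAI_API_KEY=sk-...

# By default, a variable already set in the shell wins over the .env value.
# With --dotenv-override, the .env value takes precedence instead:
npx @midscene/cli ./my-automation.yaml --dotenv-override
```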
Reduced the size of generated reports by trimming redundant data. For complex pages, the typical report file size has dropped from 47.6 MB to 15.6 MB!
🚀 Midscene has another update! It makes your testing and automation processes even more powerful:
Added the `logScreenshot` API to the agent. It takes a screenshot of the current page and records it as a report node, with an optional node title and description, making the automated testing process more intuitive. Useful for capturing key steps, error states, UI validation, and more.
Optimized input interactions in Android apps and added support for connecting to remote Android devices.
- `autoDismissKeyboard?: boolean` - Optional parameter. Whether to automatically dismiss the keyboard after entering text. Defaults to `true`.
- `androidAdbPath?: string` - Optional parameter. Specifies the path of the adb executable.
- `remoteAdbHost?: string` - Optional parameter. Specifies the remote adb host.
- `remoteAdbPort?: number` - Optional parameter. Specifies the remote adb port.
Examples:
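A minimal sketch of these connection options grouped together; the option names come from the list above, while the surrounding types and values are illustrative:

```typescript
// The four option fields described above, grouped into one options object.
interface AndroidDeviceOptions {
  autoDismissKeyboard?: boolean; // dismiss keyboard after typing, default true
  androidAdbPath?: string;       // path to the adb executable
  remoteAdbHost?: string;        // remote adb host
  remoteAdbPort?: number;        // remote adb port
}

// Example: connect to a remote adb server and keep the keyboard open after input.
const options: AndroidDeviceOptions = {
  autoDismissKeyboard: false,
  remoteAdbHost: "192.168.1.50",
  remoteAdbPort: 5037, // adb's default server port
};

console.log(options.remoteAdbHost); // → 192.168.1.50
```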
Upgrade now to experience these powerful new features!
To support more automation and data-extraction scenarios, the following APIs have been enhanced with an `options` parameter, enabling more flexible control over DOM information and screenshots:
- `agent.aiQuery(dataDemand, options)`
- `agent.aiBoolean(prompt, options)`
- `agent.aiNumber(prompt, options)`
- `agent.aiString(prompt, options)`

The `options` parameter supports:
- `domIncluded`: Whether to pass simplified DOM information to the AI model; off by default. Useful for extracting attributes that are not visible on the page, such as image links.
- `screenshotIncluded`: Whether to pass the screenshot to the AI model; on by default.

Have you ever needed to automate a right-click operation? Midscene now supports a new `agent.aiRightClick()` method!
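A sketch of how the `options` parameter might be used. The agent below is a stub standing in for a real Midscene agent (it just echoes its inputs), so only the call shape is meaningful:

```typescript
// The two option fields described above.
interface QueryOptions {
  domIncluded?: boolean;        // pass simplified DOM info to the model (default: off)
  screenshotIncluded?: boolean; // pass the screenshot to the model (default: on)
}

// Stub agent: a real Midscene agent would call the AI model here; this one
// echoes its inputs so the call shape is visible.
const agent = {
  async aiQuery(dataDemand: string, options?: QueryOptions) {
    return { dataDemand, options };
  },
};

(async () => {
  // Include DOM info to extract attributes that are not visible on the page,
  // such as image links.
  const result = await agent.aiQuery(
    "string[], the image links of all products on the page",
    { domIncluded: true }
  );
  console.log(result);
})();
```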
Perform a right-click operation on the specified element, suitable for scenarios where right-click events are customized on web pages. Please note that Midscene cannot interact with the browser's native context menu after right-click.
- `locate`: Describe the element you want to operate on in natural language.
- `options`: Optional; supports `deepThink` (AI fine-grained positioning) and `cacheable` (result caching).

In this report file, we show a complete example of using the new `aiRightClick` API and the new query options to extract contact data, including hidden attributes.
Report file: puppeteer-2025-06-04_20-34-48-zyh4ry4e.html
The corresponding code can be found in our example repository: puppeteer-demo/extract-data.ts
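A hedged sketch of the same flow; the agent here is a stub that only records calls (a real agent drives the browser), and the prompts are invented for illustration:

```typescript
// Stub agent mirroring the two calls used in the example; a real Midscene
// agent would drive the browser. Only the call shapes below are meaningful.
const agent = {
  calls: [] as string[],
  async aiRightClick(locate: string, options?: { deepThink?: boolean; cacheable?: boolean }) {
    this.calls.push(`rightClick:${locate}`);
  },
  async aiQuery(dataDemand: string, options?: { domIncluded?: boolean }) {
    this.calls.push(`query:${dataDemand}`);
    return []; // a real agent returns the extracted data
  },
};

(async () => {
  // Right-click a contact row to trigger the page's custom context menu…
  await agent.aiRightClick("the first contact card in the list", { deepThink: true });
  // …then extract fields, including attributes not visible on screen.
  const contacts = await agent.aiQuery(
    "{name: string, hiddenEmail: string}[], contact data including hidden attributes",
    { domIncluded: true }
  );
  console.log(contacts);
})();
```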
Use XPath caching instead of coordinates to improve the cache hit rate.
Refactor the cache file format from JSON to YAML to improve readability.
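As a hypothetical illustration of the readability gain; the field names below are invented for illustration and are not Midscene's actual cache schema:

```yaml
# one cached locate result (illustrative fields only)
- type: locate
  prompt: "the login button"
  xpath: "//button[@id='login']"  # survives layout shifts better than x/y coordinates
```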
🤖 Use Cursor / Trae to help write test cases. 🕹️ Quickly implement browser operations akin to the Manus platform. 🔧 Integrate Midscene capabilities swiftly into your platforms and tools.
Read more: MCP
APIs: `aiBoolean`, `aiNumber`, `aiString`, `aiLocate`
Read more: Use JavaScript to Optimize the AI Automation Code
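A sketch of the typed return shapes these APIs imply, using a stub agent; the prompts and stubbed answers are invented for illustration:

```typescript
// Stub agent illustrating the typed return values of the four APIs.
// A real Midscene agent would ask the AI model instead of returning fixtures.
const agent = {
  async aiBoolean(prompt: string): Promise<boolean> { return true; },
  async aiNumber(prompt: string): Promise<number> { return 42; },
  async aiString(prompt: string): Promise<string> { return "Sauvignon Blanc"; },
  async aiLocate(prompt: string): Promise<{ rect: { x: number; y: number } }> {
    return { rect: { x: 100, y: 200 } };
  },
};

(async () => {
  // Typed answers let you branch in plain JavaScript instead of prompting again.
  if (await agent.aiBoolean("is there a cookie banner?")) {
    const price = await agent.aiNumber("the price of the first item");
    console.log(price); // → 42
  }
})();
```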
🤖 AI Playground: natural‑language debugging 📱 Supports native, Lynx & WebView apps 🔁 Replayable runs 🛠️ YAML or JS SDK ⚡ Auto‑planning & Instant Actions APIs
Read more: Android automation
"Instant Actions" introduces new atomic APIs, enhancing the accuracy of AI operations.
Read more: Instant Actions
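A sketch contrasting auto-planning with an atomic Instant Action. The method names follow Midscene's Instant Actions documentation, but the agent below is a stub that only records calls:

```typescript
// Stub contrasting auto-planning (ai) with an atomic Instant Action (aiTap).
// Method names follow Midscene's Instant Actions docs; the bodies are stubs.
const agent = {
  log: [] as string[],
  async ai(instruction: string) {   // auto-planning: the model decides the steps
    this.log.push(`plan:${instruction}`);
  },
  async aiTap(locate: string) {     // instant action: only locating is AI-driven
    this.log.push(`tap:${locate}`);
  },
};

(async () => {
  await agent.ai('type "weather" in the search box, then click the search button');
  // The atomic form skips planning, so its behavior is more predictable:
  await agent.aiTap("the search button");
})();
```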
Enable caching by following the documentation 👉: Enable Caching
Effect after enabling:
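As a sketch; confirm the exact variable name in the caching documentation linked above:

```bash
# Enable Midscene's cache for this run (illustrative; see the caching docs)
MIDSCENE_CACHE=true npx playwright test
```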
UI-TARS is a native GUI agent model released by the Seed team. It is named after the TARS robot in the movie Interstellar, which has high intelligence and autonomous thinking capabilities. UI-TARS takes images and human instructions as input, perceives the correct next action, and step by step approaches the goal of the instruction, achieving the best performance in various GUI automation benchmarks compared with both open-source and closed-source commercial models.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents - Figure 1
UI-TARS: Pioneering Automated GUI Interaction with Native Agents - Figure 4
UI-TARS has the following advantages in GUI tasks:
Target-driven
Fast inference speed
Native GUI agent model
Private deployment without data security issues
With the Midscene browser extension, you can now use scripts to link with the desktop browser for automated operations!
We call it "Bridge Mode".
Compared to previous CI environment debugging, the advantages are:
You can reuse the desktop browser, especially Cookie, login state, and front-end interface state, and start automation without worrying about environment setup.
Support manual and script cooperation to improve the flexibility of automation tools.
Simple business regression, just run it locally with Bridge Mode.
Documentation: Use Chrome Extension to Experience Midscene
Through the Midscene browser extension, you can run Midscene on any page, without writing any code.
Experience it now 👉: Use Chrome Extension to Experience Midscene
Now you don't have to keep re-running scripts to debug prompts!
On the new test report page, you can debug the AI execution results at any time, including page operations, page information extraction, and page assertions.
Summarize the availability of Doubao models:
Currently, Doubao offers only pure-text models, which means "seeing" is not available. In scenarios where pure text is enough for reasoning, it performs well.
If a use case requires visual UI analysis, it is completely unusable.
Example:
✅ The price of a multi-meat grape (can be guessed from the order of the text on the interface)
✅ The language switch text button (can be guessed from the text content on the interface: Chinese, English text)
❌ The left-bottom play button (requires image understanding, failed)
By using the gpt-4o-2024-08-06 model, Midscene now supports structured output, improving stability and reducing costs by more than 40%.
Midscene can now hit GPT-4o's prompt caching, and the cost of AI calls will continue to decrease as the company's GPT platform rolls this out.
Now you can view an animated replay of each step in the test report to quickly debug your running script.
In the new version, we have partially merged the Plan and Locate operations during prompt execution, improving AI response speed by 30%.
Before
After
GPT-4o series models, 100% correct rate
doubao-pro-4k pure-text model, approaching a usable state
Before
After
Support for Azure OpenAI
Support for AI to add, delete, and modify the existing input
Optimize the page information extraction to avoid collecting obscured elements, improving success rate, speed, and AI call cost 🚀
During the AI interaction process, unnecessary attribute fields were trimmed, reducing token consumption.
Optimize the AI interaction process to reduce the likelihood of hallucination in KeyboardPress and Input events
For pagepass, provide an optimization for the flickering that occurs while Midscene is executing
Support for AI to wait for a certain time to continue the subsequent task execution
Playwright AI task report shows the overall time and aggregates AI tasks by test group
Support for using natural language to control puppeteer to implement page automation 🗣️💻
Provide AI cache capabilities for playwright framework, improve stability and execution efficiency
AI report visualization, aggregate AI tasks by test group, facilitate test report distribution
Support for AI to assert the page, let AI judge whether the page meets certain conditions
Support for using natural language to control puppeteer to implement page automation 🗣️💻
Support for using natural language to extract page information 🔍🗂️
AI report visualization, AI behavior, AI thinking visualization 🛠️👀
Direct use of GPT-4o model, no training required 🤖🔧