
FAQ

Can Midscene smartly plan the actions according to my one-line goal, like executing "Tweet 'hello world'"?

Goal-oriented prompts like this are recommended only when you are using a GUI agent model such as UI-TARS.

Why does Midscene require developers to provide detailed steps while other AI agents are demonstrating "autonomous planning"? Is this an outdated approach?

Many Midscene users are tool developers, who care most about the stability and performance of UI automation tools. To ensure that the agent runs accurately in complex systems, clear prompts are still the optimal solution.

To further improve stability, we also provide features such as the Instant Action interface, Playback Report, and Playground. They may seem traditional and not AI-like, but after extensive practice, we believe these features are the real key to improving efficiency.

If you are interested in a "smart GUI agent", you can check out UI-TARS, which Midscene also supports.


Limitations

There are some limitations with Midscene. We are still working on them.

The interaction types are limited to tap, hover, drag (UI-TARS model only), type, keyboard press, and scroll.
AI models are not 100% stable. Following the Prompting Tips will help improve stability.
You cannot interact with elements inside a cross-origin iframe or a canvas when using GPT-4o. This is not a problem when using the Qwen or UI-TARS models.
Midscene cannot access the native elements of Chrome, such as the right-click context menu or the file upload dialog.
Do not use Midscene to bypass CAPTCHA. Some LLM services (e.g., OpenAI) are set to decline requests that involve CAPTCHA solving, and the DOM of some CAPTCHA pages is not accessible by regular web scraping methods. Using Midscene to bypass CAPTCHA is therefore not reliable.

Which models are supported?

Please refer to Choose a model.

What data is sent to AI model?

The screenshot will be sent to the AI model. If you are using GPT-4o, some key information extracted from the DOM will also be sent.

If you are worried about data privacy, please refer to Data Privacy.

The automation process is running more slowly than the traditional one

When using a multimodal LLM in Midscene.js, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This extra token and time cost is the price of more stable results.

There are several ways to reduce the running time:

Use the instant action interface, such as agent.aiTap('Login Button'), instead of agent.ai('Click Login Button'). Read more about it in API.
Use a dedicated model, like UI-TARS, and deploy it yourself. This is the recommended way. Read more about it in Choose a model.
Use a lower resolution if possible.
Use caching to accelerate the debugging process. Read more about it in Caching.
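
The first tip can be sketched as follows. This is a minimal illustration, not Midscene's own code: the AgentLike interface is a hypothetical stand-in that declares only the two methods named above (ai and aiTap), and the function names are made up for the example. A real agent object would be passed in its place.

```typescript
// Hypothetical structural type covering only the two calls named in the text.
// A real Midscene agent exposes more methods; this is an assumption for the sketch.
interface AgentLike {
  ai(prompt: string): Promise<unknown>;
  aiTap(locator: string): Promise<unknown>;
}

// General-purpose call: the instruction is free-form text describing the step.
async function loginWithGeneralPrompt(agent: AgentLike): Promise<void> {
  await agent.ai('Click Login Button');
}

// Instant action: the action type (tap) is fixed in code,
// and the prompt only describes the target element.
async function loginWithInstantAction(agent: AgentLike): Promise<void> {
  await agent.aiTap('Login Button');
}
```

Both functions perform the same step; the second is the form the tip above recommends when speed matters.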

The webpage continues to flash when running in headed mode

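This commonly happens when the viewport deviceScaleFactor does not match your system settings. On macOS, setting it to 2 solves the issue.

// Puppeteer requires width and height; the values below are examples only.
await page.setViewport({
  width: 1280,
  height: 800,
  deviceScaleFactor: 2,
});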

Where are the report files saved?

The report files are saved in ./midscene-run/report/ by default.

How can I learn about Midscene's working process?

By reviewing the report file after running the script, you can gain an overview of how Midscene works.

How do I control the report player's default replay style via a link?

You can override the default values of the Focus on cursor and Show element markers toggles by adding query parameters to the report URL. These toggles determine whether the report highlights the cursor position and the element markers. Use focusOnCursor and showElementMarkers with values such as true, false, 1, or 0. For example: ...?focusOnCursor=false&showElementMarkers=true.
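
If you generate such links in a script, a small helper along these lines may be convenient. It is only a sketch: the function name withReplayStyle and the report file path in the usage line are hypothetical, and only the two query parameter names come from the text above.

```typescript
// Append the two replay-style toggles to a report URL.
// Existing query parameters on the URL are preserved.
function withReplayStyle(
  reportUrl: string,
  focusOnCursor: boolean,
  showElementMarkers: boolean,
): string {
  const url = new URL(reportUrl);
  url.searchParams.set('focusOnCursor', String(focusOnCursor));
  url.searchParams.set('showElementMarkers', String(showElementMarkers));
  return url.toString();
}

// Hypothetical report location, used only for illustration.
const link = withReplayStyle(
  'file:///tmp/midscene-run/report/example.html',
  false,
  true,
);
console.log(link);
// prints "file:///tmp/midscene-run/report/example.html?focusOnCursor=false&showElementMarkers=true"
```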

Customize the network timeout

When interacting with or navigating a web page, Midscene automatically waits for the network to be idle. This strategy helps keep the automation stable. If the wait times out, nothing special happens and the automation simply continues.

The default timeout is configured as follows:

If it's a page navigation, the default wait timeout is 5000ms (the waitForNavigationTimeout)
If it's a click, input, etc., the default wait timeout is 2000ms (the waitForNetworkIdleTimeout)

You can also customize or disable the timeout through options:

Use the waitForNetworkIdleTimeout and waitForNavigationTimeout parameters in Agent.
Use the waitForNetworkIdle parameter in YAML or PlaywrightAiFixture.
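
As a rough sketch of the Agent-level options, the snippet below merges custom values over the defaults listed above. The option names and default values come from this page; the NetworkWaitOptions interface and the resolveWaitOptions helper are hypothetical and exist only for illustration. It is assumed that the resulting object is what you would pass as the options when constructing the Agent.

```typescript
// Option names and defaults are taken from the text above.
// The interface and helper are hypothetical, for illustration only.
interface NetworkWaitOptions {
  waitForNavigationTimeout: number; // ms, used after a page navigation
  waitForNetworkIdleTimeout: number; // ms, used after a click, input, etc.
}

const DEFAULT_WAIT_OPTIONS: NetworkWaitOptions = {
  waitForNavigationTimeout: 5000,
  waitForNetworkIdleTimeout: 2000,
};

// Merge caller overrides over the documented defaults.
function resolveWaitOptions(
  overrides: Partial<NetworkWaitOptions> = {},
): NetworkWaitOptions {
  return { ...DEFAULT_WAIT_OPTIONS, ...overrides };
}

// Example: allow slow page navigations up to 10 seconds, keep the other default.
const agentOptions = resolveWaitOptions({ waitForNavigationTimeout: 10000 });
```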
