This kind of goal-oriented prompt is only recommended when you are using a GUI agent model like UI-TARS.
Many Midscene users are tool developers, who care more about the stability and performance of UI automation. To ensure the Agent runs accurately in complex systems, clear prompts are still the best approach.
To further improve stability, we also provide features like the Instant Action interface, Playback Report, and Playground. They may seem traditional and less AI-like, but after extensive practice, we believe these features are the real key to improving efficiency.
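To make the contrast concrete, here is a hedged sketch of the two prompting styles, using Midscene-style agent methods (`aiAction`, `aiInput`, `aiTap`); the exact setup of the `agent` object depends on your integration and is assumed here:

```typescript
// Goal-oriented prompt: the model plans the steps itself.
// Only advisable with GUI agent models like UI-TARS.
await agent.aiAction('log in with the test account and open the settings page');

// Instant actions: each step is explicit, which is easier to stabilize
// and debug when running against a general-purpose LLM.
await agent.aiInput('test@example.com', 'the email input field'); // fill email
await agent.aiInput('secret-password', 'the password input field'); // fill password
await agent.aiTap('the "Sign in" button'); // submit the form
```

The step-by-step version costs a few more lines, but each call has a narrow, verifiable goal, so failures are localized and the playback report maps cleanly onto your script.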
If you are interested in "smart GUI Agents", you can check out UI-TARS, which Midscene also supports.
Related Docs:
There are some limitations with Midscene. We are still working on them.
Please refer to Choose a model.
The screenshot will be sent to the AI model. If you are using GPT-4o, some key information extracted from the DOM will also be sent.
If you are worried about data privacy issues, please refer to Data Privacy.
When using a general-purpose LLM in Midscene.js, the running time may increase by a factor of 3 to 10 compared to a traditional Playwright script, for instance from 5 seconds to 20 seconds. To make the results more stable, this token and time cost is inevitable.
There are two ways to improve the running time:
This usually happens when the viewport deviceScaleFactor does not match your system settings. On macOS, setting it to 2 will solve the issue.
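As one way to apply this, here is a minimal sketch of a Playwright config that pins the viewport scale; the width and height values are illustrative, and only deviceScaleFactor matters for this issue:

```typescript
// playwright.config.ts -- pin deviceScaleFactor so screenshots render at
// the scale your system uses (2 is typical for Retina displays on macOS)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    viewport: { width: 1280, height: 800 }, // illustrative size
    deviceScaleFactor: 2, // match your OS display scaling
  },
});
```

If you use Puppeteer instead, the equivalent is passing deviceScaleFactor to page.setViewport.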
The report files are saved in ./midscene-run/report/ by default.
It mainly relies on UI parsing and multimodal AI. Here is a flowchart describing the core process of the interaction between Midscene and the AI model.