Choose one of the following models, obtain the API key, complete the configuration, and you are ready to go. If you are a beginner, pick whichever model is easiest for you to obtain.
Midscene.js supports two types of models:
Multimodal LLMs: models that can understand text and image input. GPT-4o is this kind of model. These models can only be used for web automation.
VL (visual-language) models: besides understanding text and image input, VL models can also locate the coordinates of target elements on the page.
We recommend using VL models for UI automation because they can natively "see" screenshots and return the coordinates of target elements on the page, making them more reliable and efficient in complex scenarios. VL models can be used for UI automation of any kind of interface.
These are the adapted VL models: Qwen-2.5-VL, Gemini-2.5-Pro, Doubao-1.5-thinking-vision-pro / Doubao-seed-1.6-vision, and UI-TARS. Each is described below.
For the detailed configuration of model services, see Config Model and Provider.
GPT-4o
GPT-4o is a multimodal LLM by OpenAI, which supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting is preferred.
The token cost of using GPT-4o is higher because Midscene needs to send DOM information and a screenshot to the model, and it is not stable in complex scenarios.
Config
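A minimal .env sketch, assuming you call the official OpenAI endpoint; the values are placeholders, and the exact model name available to you may differ:

```bash
OPENAI_API_KEY="......"          # your OpenAI API key
MIDSCENE_MODEL_NAME="gpt-4o"     # optional; GPT-4o is the default model
```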
Qwen-2.5-VL
Starting from version 0.12.0, Midscene.js supports the Qwen-2.5-VL-72B-Instruct model series.
Qwen-2.5-VL is an open-source model series published by Alibaba. It provides visual grounding capability and can accurately return the coordinates of target elements on the page. It performs quite well when used for interaction, assertion, and query. We recommend using the largest version (72B) for reliable output.
Config
After applying for the API key on OpenRouter, you can use the following config:
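For example, a minimal .env sketch (the model slug below is illustrative; check OpenRouter's model list for the exact identifier):

```bash
OPENAI_BASE_URL="https://openrouter.ai/api/v1"
OPENAI_API_KEY="......"                              # your OpenRouter API key
MIDSCENE_MODEL_NAME="qwen/qwen2.5-vl-72b-instruct"   # illustrative slug; verify on OpenRouter
MIDSCENE_USE_QWEN_VL=1                               # enable Qwen-2.5-VL mode
```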
Or using Alibaba Cloud:
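A comparable sketch against Alibaba Cloud's OpenAI-compatible (DashScope) endpoint, assuming the qwen-vl-max-latest model discussed below:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"                    # your Alibaba Cloud API key
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1                     # enable Qwen-2.5-VL mode
```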
Limitations when used in Midscene.js
To locate small elements precisely, you may need to use the deepThink parameter and optimize the description, otherwise the recognition results may not be accurate.
It may not perform as well as GPT-4o or Doubao-1.5-thinking-vision-pro in assertion.
Note about model deployment on Alibaba Cloud
While the open-source version of Qwen-2.5-VL (72B) is named qwen2.5-vl-72b-instruct, there is also an enhanced and more stable version named qwen-vl-max-latest officially hosted on Alibaba Cloud. When using the qwen-vl-max-latest model on Alibaba Cloud, you will get larger context support and a much lower price (possibly only 19% of the open-source version's cost).
In short, if you want to use the Alibaba Cloud service, please use qwen-vl-max-latest.
Links
Gemini-2.5-Pro (Google Gemini)
Starting from version 0.15.1, Midscene.js supports the Gemini-2.5-Pro model. Gemini 2.5 Pro is a closed-source model provided by Google Cloud.
When using Gemini-2.5-Pro, you should use the MIDSCENE_USE_GEMINI=1 configuration to enable Gemini-2.5-Pro mode.
Config
After applying for the API key on Google Gemini, you can use the following config:
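A minimal sketch, assuming Google's OpenAI-compatible endpoint; the model identifier is illustrative and may differ for your account:

```bash
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="......"                 # your Google Gemini API key
MIDSCENE_MODEL_NAME="gemini-2.5-pro"    # illustrative; use the Gemini 2.5 Pro id available to you
MIDSCENE_USE_GEMINI=1                   # enable Gemini-2.5-Pro mode
```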
Links
Doubao-1.5-thinking-vision-pro / Doubao-seed-1.6-vision (Volcano Engine)
Volcano Engine provides multiple visual-language models, including:
Doubao-1.5-thinking-vision-pro
Doubao-seed-1.6-vision
They perform quite well in visual grounding and assertion in complex scenarios. With clear instructions, they can meet most business scenario requirements and are currently the most recommended visual language models for Midscene.
Config
After obtaining an API key from Volcano Engine, you can use the following configuration:
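A minimal sketch, assuming Volcano Engine's OpenAI-compatible Ark endpoint; the base URL, model name, and MIDSCENE_USE_DOUBAO_VISION switch below are assumptions, so verify them in your Volcano Engine console and in Config Model and Provider:

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"   # assumed Ark endpoint
OPENAI_API_KEY="......"                                      # your Volcano Engine API key
MIDSCENE_MODEL_NAME="doubao-1.5-thinking-vision-pro"         # or the inference endpoint ID from your console
MIDSCENE_USE_DOUBAO_VISION=1                                 # assumed flag; see Config Model and Provider
```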
Links
UI-TARS (Volcano Engine)
UI-TARS is an end-to-end GUI agent model based on a VLM architecture. It perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks. UI-TARS is an open-source model and is available in different sizes.
When using UI-TARS, you can use target-driven style prompts, like "Login with user name foo and password bar", and it will plan the steps to achieve the goal.
Config
You can use the doubao-1.5-ui-tars model deployed on Volcano Engine.
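A minimal sketch, again assuming the Ark endpoint; the endpoint ID is a placeholder that comes from your own deployment:

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"   # assumed Ark endpoint
OPENAI_API_KEY="......"                                      # your Volcano Engine API key
MIDSCENE_MODEL_NAME="ep-xxxxxxxx"                            # placeholder: your doubao-1.5-ui-tars endpoint ID
MIDSCENE_USE_VLM_UI_TARS=DOUBAO                              # UI-TARS deployed on Volcano Engine
```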
Limitations
About the MIDSCENE_USE_VLM_UI_TARS configuration
The MIDSCENE_USE_VLM_UI_TARS configuration is used to specify the UI-TARS version, using one of the following values:
1.0 - for model version 1.0
1.5 - for model version 1.5
DOUBAO - for the model deployed on Volcano Engine
Links
Other models are also supported by Midscene.js. Midscene will use the same prompt and strategy as GPT-4o for these models. If you want to use other models, please follow these steps:
Set the OPENAI_BASE_URL, OPENAI_API_KEY, and MIDSCENE_MODEL_NAME variables. These config options are described in Config Model and Provider.
Do not set the MIDSCENE_USE_VLM_UI_TARS or MIDSCENE_USE_QWEN_VL config unless you know what you are doing.
Config
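As a rough sketch for an OpenAI-compatible provider (the endpoint and model name below are placeholders, not real values):

```bash
OPENAI_BASE_URL="https://your-provider.example.com/v1"   # placeholder endpoint
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="your-multimodal-model-name"         # must support image input
```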
For more details and sample config, see Config Model and Provider.
By setting DEBUG=midscene:ai:profile:stats in the environment variables, you can print the model's usage info and response time.
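For example, in a shell before running your script:

```bash
# Print model usage info and response time for each call
export DEBUG=midscene:ai:profile:stats
```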
You can also see your model's usage info in the report file.
If you want to troubleshoot connectivity issues, you can use the 'connectivity-test' folder in our example project: https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test
Put your .env file in the connectivity-test folder, and run the test with npm i && npm run test.