Choose one of the following models, obtain an API key, and complete the configuration to get started. If you are a beginner, pick the model that is easiest to obtain.
Midscene.js supports two types of models:

- General-purpose LLMs: models that can understand text and image input. GPT-4o is this kind of model. LLM models can only be used in web automation.
- VL (visual-language) models: in addition to understanding text and image input, VL models can locate the coordinates of target elements on the page. VL models can be used for UI automation across any kind of interface.

We recommend VL models for UI automation because they can natively "see" screenshots and return the coordinates of target elements, which is more reliable and efficient in complex scenarios. The VL models covered in the sections below are already adapted for Midscene.js.

If you want to learn the detailed configuration for each model provider, see Config Model and Provider.
Qwen-VL

Qwen-VL is an open-source model series released by Alibaba. It offers visual grounding and can accurately return the coordinates of target elements on a page. The models show strong performance for interaction, assertion, and querying tasks. Deployed versions are available on Alibaba Cloud and OpenRouter.
Midscene.js supports the following versions: `qwen3-vl-plus` (commercial) and `qwen3-vl-235b-a22b-instruct` (open source). We recommend the Qwen3-VL series, which clearly outperforms Qwen2.5-VL. Note that Qwen3-VL requires Midscene v0.29.3 or later.
Config for Qwen3-VL

Using the Alibaba Cloud `qwen3-vl-plus` model as an example:
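A minimal `.env` sketch; the base URL below is Alibaba Cloud's OpenAI-compatible DashScope endpoint, and the `MIDSCENE_USE_QWEN3_VL` switch name is an assumption here, so verify both in Config Model and Provider:

```bash
# Sketch of a .env config for Alibaba Cloud qwen3-vl-plus.
# The MIDSCENE_USE_QWEN3_VL flag name is an assumption -- verify it
# in Config Model and Provider.
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="qwen3-vl-plus"
MIDSCENE_USE_QWEN3_VL=1
```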
Config for Qwen2.5-VL

Using the Alibaba Cloud `qwen-vl-max-latest` model as an example:
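A similar `.env` sketch, this time with the `MIDSCENE_USE_QWEN_VL` switch mentioned later on this page:

```bash
# Sketch of a .env config for Alibaba Cloud qwen-vl-max-latest.
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1
```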
Links
Doubao Vision

Volcano Engine provides multiple visual-language models, including:

- Doubao-1.5-thinking-vision-pro
- Doubao-seed-1.6-vision
They perform strongly for visual grounding and assertion in complex scenarios. With clear instructions they can handle most business needs.
Config
After obtaining an API key from Volcano Engine, you can use the following configuration:
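A `.env` sketch, assuming Volcano Engine's OpenAI-compatible Ark endpoint and a `MIDSCENE_USE_DOUBAO_VISION` switch (both are assumptions; copy the exact endpoint and model ID from your Volcano Engine console):

```bash
# Sketch of a .env config for a Doubao vision model on Volcano Engine.
# The endpoint, model ID, and MIDSCENE_USE_DOUBAO_VISION flag are
# assumptions -- use the values from your console and the
# Config Model and Provider page.
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="doubao-1.5-thinking-vision-pro"
MIDSCENE_USE_DOUBAO_VISION=1
```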
Links
Gemini-2.5-Pro
Starting from version 0.15.1, Midscene.js supports the Gemini-2.5-Pro model. Gemini 2.5 Pro is a proprietary model provided by Google Cloud.
When using Gemini-2.5-Pro, set `MIDSCENE_USE_GEMINI=1` to enable Gemini-specific behavior.
Config
After obtaining an API key from Google Gemini, you can use the following config:
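A `.env` sketch using Google's OpenAI-compatible endpoint (the base URL and exact model ID are assumptions; check Google's documentation):

```bash
# Sketch of a .env config for Gemini-2.5-Pro.
# Base URL and model ID are assumptions -- verify against Google's docs.
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="gemini-2.5-pro"
MIDSCENE_USE_GEMINI=1
```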
Links
UI-TARS
UI-TARS is an end-to-end GUI agent model based on a VLM architecture. It takes screenshots as input and performs human-like interactions (keyboard, mouse, etc.), achieving state-of-the-art performance across 10+ GUI benchmarks. UI-TARS is open source and available in multiple sizes.
With UI-TARS you can use goal-driven prompts, such as "Log in with username foo and password bar". The model will plan the steps needed to accomplish the task.
Config
You can use the `doubao-1.5-ui-tars` model deployed on Volcano Engine.
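A `.env` sketch for this deployment; note `MIDSCENE_USE_VLM_UI_TARS=DOUBAO`, which is explained below (the endpoint and versioned model ID are assumptions; copy the exact values from your Volcano Engine console):

```bash
# Sketch of a .env config for doubao-1.5-ui-tars on Volcano Engine.
# Endpoint and model ID are assumptions -- copy the exact values
# from your Volcano Engine console.
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="doubao-1.5-ui-tars"
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
```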
Limitations
About the `MIDSCENE_USE_VLM_UI_TARS` configuration

Use `MIDSCENE_USE_VLM_UI_TARS` to specify the UI-TARS version with one of the following values:

- `1.0` - for model version 1.0
- `1.5` - for model version 1.5
- `DOUBAO` - for the Volcano Engine deployment

Links
GPT-4o
GPT-4o is a multimodal LLM by OpenAI that supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting generally works best.
The token cost of GPT-4o is relatively high because Midscene sends DOM information and screenshots to the model, and it can be unstable in complex scenarios.
Config
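Since GPT-4o is the default model, only the API key is strictly required; a minimal `.env` sketch:

```bash
# Sketch of a .env config for GPT-4o (the Midscene.js default).
OPENAI_API_KEY="your-api-key"
# Optional: override the endpoint or model name if needed.
# OPENAI_BASE_URL="https://api.openai.com/v1"
# MIDSCENE_MODEL_NAME="gpt-4o"
```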
Other general-purpose LLMs

Other models are also supported by Midscene.js. Midscene will use the same prompt and strategy as GPT-4o for these models. If you want to use other models, please follow these steps:

- Set the `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `MIDSCENE_MODEL_NAME` configs. These are described in Config Model and Provider.
- Do not set the `MIDSCENE_USE_VLM_UI_TARS` or `MIDSCENE_USE_QWEN_VL` configs unless you know what you are doing.

Config
For more details and sample config, see Config Model and Provider.
By setting `DEBUG=midscene:ai:profile:stats` in the environment variables, you can print the model's usage info and response time.
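For example, prefix whatever command runs your automation (`npm run test` below is a placeholder for your own entry command):

```bash
# Print token usage and response time for each model call.
DEBUG=midscene:ai:profile:stats npm run test
```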
You can also see your model's usage info in the report file.
Make sure you have configured the VL model correctly, especially that the corresponding `MIDSCENE_USE_...` config is set.
If you want to troubleshoot connectivity issues, you can use the 'connectivity-test' folder in our example project: https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test
Put your `.env` file in the `connectivity-test` folder, and run the test with `npm i && npm run test`.
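End to end, that looks like the following (assumes git and npm are installed):

```bash
# Clone the example project and run the connectivity test.
git clone https://github.com/web-infra-dev/midscene-example.git
cd midscene-example/connectivity-test
# Put your .env file (with the configs above) in this folder, then:
npm i && npm run test
```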