Choose a model

Choose one of the following models, obtain an API key, complete the configuration, and you are ready to go. If you are a beginner, start with the model that is easiest to obtain.

Adapted models for using Midscene.js

Midscene.js supports two types of models, which are:

LLM models

Models that can understand text and image input. GPT-4o is this kind of model.

LLM models can only be used in web automation.

VL models

Besides the ability to understand text and image input, VL (Visual-Language) models can also locate the coordinates of target elements on the page.

We recommend using VL models for UI automation because they can natively "see" screenshots and return the coordinates of target elements on the page, making them more reliable and efficient in complex scenarios.

VL models can be used for UI automation on any kind of interface.

These are the adapted VL models: Qwen-2.5-VL, Doubao-1.5-thinking-vision-pro / Doubao-seed-1.6-vision, Gemini-2.5-Pro, and UI-TARS. Each of them is described in detail below.

For the detailed configuration of model services, see Config Model and Provider.
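
Whichever model you choose, the selection is made entirely through environment variables; the automation code itself stays the same. Below is a minimal sketch using the Puppeteer integration (the URL and natural-language instructions are placeholders to adapt to your own page):

import puppeteer from "puppeteer";
import { PuppeteerAgent } from "@midscene/web/puppeteer";

// The model configured via OPENAI_API_KEY, MIDSCENE_MODEL_NAME, etc. is picked up automatically.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 800 });
await page.goto("https://example.com"); // placeholder URL

const agent = new PuppeteerAgent(page);
await agent.aiAction('type "headphones" in the search box and press Enter'); // natural-language step
await agent.aiAssert("a list of search results is shown"); // natural-language assertion

await browser.close();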

Models in depth

GPT-4o

GPT-4o is a multimodal LLM by OpenAI that supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting is preferred.

The token cost of using GPT-4o is higher, since Midscene needs to send DOM information and a screenshot to the model, and it is not stable in complex scenarios.

Config

OPENAI_API_KEY="......"
OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # optional, if you want an endpoint other than the default one from OpenAI.
MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # optional. The default is "gpt-4o".

Qwen 2.5-VL (OpenRouter or Alibaba Cloud)

Starting from version 0.12.0, Midscene.js supports the Qwen-2.5-VL-72B-Instruct model series.

Qwen-2.5-VL is an open-source model series published by Alibaba. It provides visual grounding capability and can accurately return the coordinates of target elements on the page. It performs quite well when used for interaction, assertion, and query. We recommend using the largest version (72B) for reliable output.

Config

After applying for an API key on OpenRouter, you can use the following config:

OPENAI_BASE_URL="https://openrouter.ai/api/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen/qwen2.5-vl-72b-instruct"
MIDSCENE_USE_QWEN_VL=1

Or using Alibaba Cloud:

OPENAI_BASE_URL="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1

Limitations when used in Midscene.js

  • Not good at recognizing small icons: you may need to enable the deepThink parameter and optimize the element description, otherwise the recognition results may not be accurate (see the sketch after this list).
  • Unstable assertion performance: We observed that it may not perform as well as GPT-4o or Doubao-1.5-thinking-vision-pro in assertion.
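
For small icons, enabling deepThink and describing the icon together with its surroundings usually helps. A sketch, assuming an agent created as in the Puppeteer example above (the icon description is a placeholder):

// deepThink makes Midscene take a second, zoomed-in look at the target area.
await agent.aiTap(
  "the gear-shaped settings icon at the top-right corner of the page header",
  { deepThink: true },
);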

Note about model deployment on Alibaba Cloud

While the open-source version of Qwen-2.5-VL (72B) is named qwen2.5-vl-72b-instruct, there is also an enhanced and more stable version named qwen-vl-max-latest officially hosted on Alibaba Cloud. When using the qwen-vl-max-latest model on Alibaba Cloud, you get larger context support and a much lower price (possibly only 19% of the price of the open-source version).

In short, if you want to use Alibaba Cloud service, please use qwen-vl-max-latest.

Gemini-2.5-Pro (Google Gemini)

Starting from version 0.15.1, Midscene.js supports the Gemini-2.5-Pro model. Gemini 2.5 Pro is a closed-source model provided by Google Cloud.

When using Gemini-2.5-Pro, you should use the MIDSCENE_USE_GEMINI=1 configuration to enable Gemini-2.5-Pro mode.

Config

After applying for an API key on Google Gemini, you can use the following config:

OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="gemini-2.5-pro-preview-05-06"
MIDSCENE_USE_GEMINI=1

Doubao-1.5-thinking-vision-pro / Doubao-seed-1.6-vision (Volcano Engine)

Volcano Engine provides multiple visual language models, including:

  • Doubao-1.5-thinking-vision-pro
  • Doubao-seed-1.6-vision

They perform quite well in visual grounding and assertion in complex scenarios. With clear instructions, they can meet most business scenario requirements and are currently the most recommended visual language models for Midscene.

Config

After obtaining an API key from Volcano Engine, you can use the following configuration:

OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3" 
OPENAI_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_DOUBAO_VISION=1
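
As noted above, these models respond best to clear, specific instructions. A sketch of the difference, assuming an agent created as in the Puppeteer example above (the element descriptions are placeholders):

// Clear and specific: names the element, its state, and its location.
await agent.aiAssert('the "Submit order" button at the bottom of the checkout panel is disabled');

// Vague: leaves too much room for interpretation.
// await agent.aiAssert("the page looks correct");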

UI-TARS (Volcano Engine)

UI-TARS is an end-to-end GUI agent model based on a VLM architecture. It perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks. UI-TARS is an open-source model and is available in different sizes.

When using UI-TARS, you can use target-driven style prompts, like "Login with user name foo and password bar", and it will plan the steps to achieve the goal.
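
In practice this means a single goal-level call is enough and UI-TARS plans the intermediate clicks and keystrokes itself. A sketch, assuming an agent created as in the Puppeteer example above (the credentials are placeholders):

// UI-TARS works out the steps needed to reach the stated goal on its own.
await agent.aiAction("Login with user name foo and password bar");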

Config

You can use the doubao-1.5-ui-tars model deployed on Volcano Engine.

OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3" 
OPENAI_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-2025..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_VLM_UI_TARS=DOUBAO

Limitations

  • Poor assertion performance: It may not perform as well as GPT-4o and Qwen 2.5-VL in assertion and query.
  • Unstable operation path: It may try different paths to achieve the goal, so the operation path is unstable each time you call it.

About the MIDSCENE_USE_VLM_UI_TARS configuration

The MIDSCENE_USE_VLM_UI_TARS configuration is used to specify the UI-TARS version, using one of the following values:

  • 1.0 - for model version 1.0
  • 1.5 - for model version 1.5
  • DOUBAO - for the model deployed on Volcano Engine
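
For example, if you host the open-source UI-TARS model yourself behind an OpenAI-compatible endpoint (e.g. served by vLLM), the config could look like the following sketch; the endpoint URL and model name are placeholders that depend on your deployment:

OPENAI_BASE_URL="http://localhost:8000/v1" # your own OpenAI-compatible endpoint
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="ui-tars" # the model name exposed by your deployment
MIDSCENE_USE_VLM_UI_TARS=1.5 # match the version you deployed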

Choose other multimodal LLMs

Other models are also supported by Midscene.js. Midscene will use the same prompt and strategy as GPT-4o for these models. If you want to use other models, please follow these steps:

  1. A multimodal model is required, which means it must support image input.
  2. The larger the model, the better it works; however, it also requires more GPU resources or money.
  3. Find out how to call it with an OpenAI-SDK-compatible endpoint. Usually you should set OPENAI_BASE_URL, OPENAI_API_KEY, and MIDSCENE_MODEL_NAME. The config is described in Config Model and Provider.
  4. If you find it not working well after changing the model, try using short and clear prompts, or roll back to the previous model. See more details in Prompting Tips.
  5. Remember to follow the terms of use of each model and provider.
  6. Don't set MIDSCENE_USE_VLM_UI_TARS or MIDSCENE_USE_QWEN_VL unless you know what you are doing.

Config

MIDSCENE_MODEL_NAME="....."
OPENAI_BASE_URL="......"
OPENAI_API_KEY="......"

For more details and sample config, see Config Model and Provider.

FAQ

How can I check the model's token usage?

By setting DEBUG=midscene:ai:profile:stats in the environment variables, you can print the model's usage info and response time.
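
For example, prefix the command you normally use to run your automation (the npm script name is a placeholder):

DEBUG=midscene:ai:profile:stats npm run test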

You can also see your model's usage info in the report file.

More

Troubleshooting model service connectivity issues

If you want to troubleshoot connectivity issues, you can use the 'connectivity-test' folder in our example project: https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test

Put your .env file in the connectivity-test folder, and run the test with npm i && npm run test.