In this article, we will talk about the models supported by Midscene.js and the features of each model.
Choose one of the following models, obtain the API key, complete the configuration, and you are ready to go. If you are a beginner, choose the model that is easiest to obtain.
If you want to see the detailed configuration of model services, see Config Model and Provider.
After applying for the API key on Openrouter or Aliyun, you can use the following config:
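The base URL and model ID below are assumptions based on OpenRouter's OpenAI-compatible API, so verify them in your provider's console; the Aliyun-hosted qwen-vl-max-latest variant is covered in the Qwen-2.5-VL section below.

```bash
# Sketch of a .env for Qwen-2.5-VL via OpenRouter (endpoint and model ID are assumptions)
OPENAI_BASE_URL="https://openrouter.ai/api/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen/qwen2.5-vl-72b-instruct"
MIDSCENE_USE_QWEN_VL=1
```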
After applying for the API key on Google Gemini, you can use the following config:
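The endpoint and model name below are assumptions based on Google's OpenAI-compatible API; check Google AI Studio for the current values.

```bash
# Sketch of a .env for Gemini-2.5-Pro (endpoint and model name are assumptions)
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="gemini-2.5-pro"
MIDSCENE_USE_GEMINI=1
```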
You can use doubao-1.5-ui-tars on Volcengine:
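The base URL below is an assumption for Volcengine's OpenAI-compatible API, and the model name must be the inference access point ID you create in the Volcengine console (see the UI-TARS section below for details).

```bash
# Sketch of a .env for doubao-1.5-ui-tars on Volcengine (base URL is an assumption)
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="ep-..."   # your inference access point ID
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
```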
Midscene supports two types of models, which are:
- general-purpose multimodal LLMs (such as GPT-4o)
- visual-language (VL) models with visual grounding capabilities (such as Qwen-2.5-VL and UI-TARS)
And we are primarily concerned with two features of the model:
- planning: understanding the instruction and planning the steps to achieve the goal
- locating: finding the target elements on the page
The main difference between different models is the way they handle the locating capability.
When using LLMs like GPT-4o, locating is accomplished through the model's understanding of the UI hierarchy tree and the markup on the screenshot, which consumes more tokens and does not always yield accurate results. In contrast, when using VL models, locating relies on the model's visual grounding capabilities, providing a more native and reliable solution in complex situations.
In the Android automation scenario, we decided to use VL models, since real-world app UI stacks are so diverse and complex that we did not want to do any more adaptation work on them. VL models give us more reliable results, and they are a better fit for this kind of work.
GPT-4o is a multimodal LLM by OpenAI, which supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting is preferred.
Features
Limitations when used in Midscene.js
Content inside <iframe /> or <canvas /> elements cannot be recognized.

Config
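As a minimal sketch (assuming you call the official OpenAI endpoint), only the API key is strictly required in this mode; the model name shown is an optional override.

```bash
# Sketch of a .env for the default GPT-4o mode
OPENAI_API_KEY="sk-..."
# optional: GPT-4o is already the default model for Midscene.js
MIDSCENE_MODEL_NAME="gpt-4o"
```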
Since version 0.12.0, Midscene.js has supported the Qwen-2.5-VL-72B-Instruct model.
Qwen-2.5-VL is an open-source model published by Alibaba. It provides Visual Grounding ability, which can accurately return the coordinates of target elements on the page. When using it for interaction, assertion and query, it performs quite well. We recommend using the largest version (72B) for reliable output.
Qwen-2.5-VL does have an action planning feature to control the application, but we still recommend using detailed prompts to get more stable and reliable results.
Features
Limitations when used in Midscene.js
When locating small elements, you may need to use the deepThink parameter and optimize the description, otherwise the recognition result may not be accurate.

Config
In addition to the regular config, you need to include the MIDSCENE_USE_QWEN_VL=1 config to turn on Qwen-2.5-VL mode. Otherwise, Midscene will fall back to the default GPT-4o mode (which uses many more tokens).
Note about the model name on Aliyun.com
While the open-source version of Qwen-2.5-VL (72B) is named qwen2.5-vl-72b-instruct, there is also an enhanced and more stable version named qwen-vl-max-latest officially hosted on Aliyun.com. When using the qwen-vl-max-latest model on Aliyun, you get larger context support and a much lower price (likely only 19% of the price of the open-source version).
In short, if you want to use the Aliyun service, use qwen-vl-max-latest.
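A sketch of the corresponding .env; the DashScope compatible-mode base URL is an assumption, so confirm it in the Aliyun console.

```bash
# Sketch of a .env for qwen-vl-max-latest hosted on Aliyun (base URL is an assumption)
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1
```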
Links
Gemini-2.5-Pro is a model provided by Google Cloud. It works somewhat similarly to Qwen-2.5-VL, but it is not open-source.
Since version 0.15.1, Midscene.js has supported the Gemini-2.5-Pro model.
When using Gemini-2.5-Pro, you should use the MIDSCENE_USE_GEMINI=1 config to turn on Gemini-2.5-Pro mode.
Links
UI-TARS is an end-to-end GUI agent model based on VLM architecture. It solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks. UI-TARS is an open-source model and is available in different sizes.
When using UI-TARS, you can use target-driven style prompts, like "Login with user name foo and password bar", and it will plan the steps to achieve the goal.
Features
Take the doubao-1.5-ui-tars deployed on Volcengine as an example: its response time is noticeably faster than that of other models.

Limitations when used in Midscene.js
Config
In addition to the regular config, you need to include the MIDSCENE_USE_VLM_UI_TARS parameter to specify the UI-TARS version. Supported values are 1.0, 1.5, and DOUBAO (the Volcengine version). Otherwise, you will get JSON parsing errors.
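For a self-hosted deployment, a sketch could look like the following; the base URL and model name are placeholders for your own deployment, and only the version flag values come from the list above.

```bash
# Sketch of a .env for a self-hosted UI-TARS deployment (URL and model name are placeholders)
OPENAI_BASE_URL="http://localhost:8000/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="ui-tars-1.5-7b"
MIDSCENE_USE_VLM_UI_TARS=1.5   # or 1.0 / DOUBAO, matching your deployment
```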
Use the version provided by Volcengine
On Volcengine, there is a pre-deployed doubao-1.5-ui-tars model. Developers can access it directly via API calls and pay based on usage. Docs link: https://www.volcengine.com/docs/82379/1536429
When using the Volcengine version of the model, you need to create an inference access point (with an ID like ep-2025...). After collecting the API key and the inference access point ID, the config should look like this:
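The base URL below is an assumption for Volcengine's OpenAI-compatible API; the model name must be your own access point ID.

```bash
# Sketch of a .env for the Volcengine-hosted doubao-1.5-ui-tars
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="ep-2025..."   # your inference access point ID
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
```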
Links:
Other models are also supported by Midscene.js. Midscene will use the same prompt and strategy as GPT-4o for these models. If you want to use other models, please follow these steps:
- Config the OPENAI_BASE_URL, OPENAI_API_KEY and MIDSCENE_MODEL_NAME. These configs are described in Config Model and Provider.
- Do not include the MIDSCENE_USE_VLM_UI_TARS and MIDSCENE_USE_QWEN_VL configs unless you know what you are doing.
For more details and sample config, see Config Model and Provider.
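A generic sketch for any OpenAI-compatible multimodal model; every value below is a placeholder for your own provider's endpoint, key, and model name.

```bash
# Sketch of a .env for another OpenAI-compatible multimodal model (all values are placeholders)
OPENAI_BASE_URL="https://your-provider.example.com/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="your-model-name"
```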
By setting DEBUG=midscene:ai:profile:stats in the environment variables, you can print the model's usage info and response time.
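For example, assuming you run Midscene through a Playwright test script (replace the command with your own runner):

```bash
# print model usage info and response time for this run
DEBUG=midscene:ai:profile:stats npx playwright test
```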