In this article, we will talk about how to choose a model for Midscene.js.
If you want to see the detailed configuration of models and providers, see Config Model and Provider.
Midscene.js uses general-purpose large language models (LLMs, like gpt-4o) as the default model. This is the easiest way to get started.
You can also use open-source models like UI-TARS to improve performance and data privacy.
Midscene uses OpenAI gpt-4o as the default model, since this model currently performs the best among all general-purpose LLMs.
If you want to use other models, configure the following environment variables: OPENAI_BASE_URL, OPENAI_API_KEY, and MIDSCENE_MODEL_NAME.
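For example, you could export these variables before running Midscene (the URL, key, and model name below are placeholders; substitute your own provider's values):

```shell
# Point Midscene at an OpenAI-compatible provider.
# All three values below are placeholders — replace them with your own.
export OPENAI_BASE_URL="https://your-provider.example.com/v1"
export OPENAI_API_KEY="sk-..."
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```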
Configs are described in Config Model and Provider.
Besides gpt-4o, the known supported models are:
claude-3-opus-20240229
gemini-1.5-pro
qwen-vl-max-latest
doubao-vision-pro-32k
UI-TARS (an open-source model dedicated to UI automation)
UI-TARS is an end-to-end GUI agent model based on the VLM architecture. It perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks.
UI-TARS is an open-source model and is available in several sizes. You can deploy it on your own server, which dramatically improves performance and data privacy.
For more details about UI-TARS, see Github - UI-TARS, 🤗 HuggingFace - UI-TARS-7B-SFT.
Each .ai call can be processed in 1-2 seconds.
The output of UI-TARS differs from that of general-purpose LLMs, so some extra work is needed to adapt it. You should append the following config to enable this feature.
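As a sketch, the setup could look like the following. Every value here is an assumption to illustrate the shape of the config — the endpoint and model name are placeholders, and the MIDSCENE_USE_VLM_UI_TARS flag name should be verified against the Config Model and Provider page:

```shell
# Assumed configuration for a self-hosted UI-TARS deployment.
# Endpoint and model name are placeholders; verify the flag name in
# the Config Model and Provider docs before relying on it.
export OPENAI_BASE_URL="http://localhost:8000/v1"   # your UI-TARS server
export OPENAI_API_KEY="empty"                        # many self-hosted servers ignore this
export MIDSCENE_MODEL_NAME="ui-tars-7b-sft"          # placeholder model name
export MIDSCENE_USE_VLM_UI_TARS=1                    # switch output parsing to UI-TARS mode
```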
General LLMs can 'see' the screenshot, but they cannot provide the coordinates of the elements. To do automation tasks, we need to take extra steps to extract the elements' information and send it along with the screenshot to the LLM. When the LLM returns the id of an element, we map it back to the coordinates and control it.
This approach works in most cases, but it results in increased latency and costs. Additionally, we cannot extract contents in <iframe /> or <canvas /> tags.
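The id-to-coordinate mapping described above can be sketched as follows. This is not Midscene's actual internals — the interface and function names are hypothetical, for illustration only:

```typescript
// Hypothetical element info extracted from the page before calling the LLM.
interface ElementInfo {
  id: string;               // stable id we assign and send to the LLM
  text: string;             // visible text, helps the LLM pick the element
  center: [number, number]; // pixel coordinates on the screenshot
}

// Extracted elements are sent to the LLM alongside the screenshot.
const elements: ElementInfo[] = [
  { id: 'el-1', text: 'Sign in', center: [880, 42] },
  { id: 'el-2', text: 'Search', center: [400, 120] },
];

// The LLM replies with an element id; we map it back to coordinates
// so the browser driver knows where to click.
function resolveClickTarget(idFromLLM: string): [number, number] {
  const el = elements.find((e) => e.id === idFromLLM);
  if (!el) throw new Error(`Unknown element id: ${idFromLLM}`);
  return el.center;
}

console.log(resolveClickTarget('el-1')); // [880, 42]
```

The extra round trip through extracted element data is exactly where the added latency and cost come from.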
UI-TARS
UI-TARS is a model dedicated to UI automation. We only need to send the screenshot and the instructions, and it will return the actions and the coordinates to be performed.
This is more straightforward in agent design. Furthermore, the performance of a self-hosted UI-TARS model is truly amazing, so we are very happy to integrate it into Midscene.js as an alternative approach.
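To make the contrast concrete, here is a minimal sketch of consuming a UI-TARS-style action. The action string format shown is illustrative only — check the UI-TARS repository for the exact output schema:

```typescript
// A typed representation of one UI action returned by the model.
interface UIAction {
  type: string; // e.g. 'click'
  x: number;    // screen coordinates — no element id mapping needed
  y: number;
}

// Parse an illustrative action string like "click(start_box='(197,525)')"
// directly into coordinates the driver can execute.
function parseAction(raw: string): UIAction {
  const match = raw.match(/^(\w+)\(start_box='\((\d+),(\d+)\)'\)$/);
  if (!match) throw new Error(`Unrecognized action: ${raw}`);
  return { type: match[1], x: Number(match[2]), y: Number(match[3]) };
}

console.log(parseAction("click(start_box='(197,525)')"));
// { type: 'click', x: 197, y: 525 }
```

Because the model emits coordinates itself, the element-extraction step (and its `<iframe />`/`<canvas />` limitation) disappears entirely.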
Use general-purpose LLMs first; this is the easiest way to get started.
Once you feel uncomfortable with the speed, the cost, the accuracy, or the data privacy, you can try the UI-TARS model. You will surely know when to start (or not to start) after using general-purpose LLMs.