Model Strategy

Quick start

If you want to try Midscene right away, pick one of the recommended models below and follow its configuration guide.

This guide focuses on how we choose models in Midscene. If you need configuration instructions, head to Model configuration.

Background - UI automation approaches

Driving UI automation with AI hinges on two challenges: planning a reasonable set of actions and precisely locating the elements that require interaction. In practice, action planning can usually be refined with better natural language and is rarely the blocker. Element localization, however, relies heavily on the model’s reasoning quality and often becomes the biggest hurdle when projects start.

To solve element localization, UI automation frameworks traditionally follow one of two approaches:

  • DOM + annotated screenshots: Extract the DOM tree beforehand, annotate screenshots with DOM metadata, and ask the model to “pick” the right nodes.
  • Pure vision: Perform all analysis on screenshots alone by using the visual grounding capabilities of the model. The model only receives the image—no DOM, no annotations.

Pure vision for element localization

Earlier Midscene releases supported both approaches so developers could compare them. After dozens of versions and hundreds of projects, the pure-vision route now clearly wins:

  • Stable results: These models pair strong UI planning with dependable element targeting and interface understanding, so automation behaves consistently.
  • Works everywhere: Automation no longer depends on how the UI is rendered. Android, iOS, desktop apps, or a browser <canvas>—if you can capture a screenshot, Midscene can interact with it.
  • Developer-friendly: With no selectors or DOM plumbing to manage, getting started is much easier. Even teammates unfamiliar with the rendering tech can become productive quickly.
  • Far fewer tokens: Dropping DOM extraction cuts token usage by ~80%, reducing cost and speeding up local runs.
  • Open-source options: Open-source vision models keep improving. Teams can self-host options such as Qwen3-VL 8B/30B with solid results.

Given these advantages, Midscene 1.0 and later only support the pure-vision approach—the DOM-extraction compatibility mode has been removed.

Based on extensive real-world usage, we recommend these defaults for Midscene: Doubao Seed, Qwen VL, Gemini-3-Pro, and UI-TARS.

They offer strong element-localization skills and solid performance in planning and screen understanding.

If you are unsure where to start, pick whichever model is easiest to access today, then run side-by-side comparisons later.

| Model family | Deployment | Midscene notes |
| --- | --- | --- |
| Doubao Seed vision models (Quick setup) | Volcano Engine: Doubao-Seed-1.6-Vision, Doubao-1.5-thinking-vision-pro | ⭐⭐⭐⭐ Strong at UI planning and targeting; slightly slower |
| Qwen3-VL (Quick setup) | Alibaba Cloud, OpenRouter, Ollama (open-source) | ⭐⭐⭐⭐ Excellent performance and accuracy; assertions in very complex scenes can fluctuate; open-source builds available (HuggingFace / GitHub) |
| Qwen2.5-VL (Quick setup) | Alibaba Cloud, OpenRouter | ⭐⭐⭐ Overall quality is behind Qwen3-VL |
| Gemini-3-Pro (Quick setup) | Google Cloud | ⭐⭐⭐ Price is higher than Doubao and Qwen |
| UI-TARS (Quick setup) | Volcano Engine | ⭐⭐ Strong exploratory ability, but results vary by scenario; open-source versions available (HuggingFace / GitHub) |
Why not use multimodal models like gpt-5 as the default?

Midscene requires strong UI localization (visual grounding). Models like gpt-5 perform poorly here, so they cannot serve as the default. You can still use them as dedicated “planning models,” which we cover below.

Advanced: combining multiple models

The default model strategy covers most projects at kick-off. As prompts grow more complex and you need stronger generalization, the default models’ planning ability may fall short. Consider this GitHub sign-up prompt:

Fill out the GitHub sign-up form. Ensure no fields are missing and pick “United States” as the region.

Make sure every field passes validation.

Only fill the form; do not submit a real registration request.

Return the actual field values you filled.

This sounds simple, but the model must understand per-field rules, identify each control, operate a multi-step region selector, paginate and trigger validation, and still locate every element. Default models often cannot satisfy all of these at once, so success rates drop.

In these situations, configure separate models for Planning and Insight, and use the default model purely as a “visual localizer” to reduce reasoning burden. Multi-model setups are the most practical and effective way to lift UI automation success, at a small cost in latency and tokens.

For a ready-to-use environment snippet, including an Insight model, see the multi-model example.

Planning intent

For planning tasks driven by aiAct or ai, add settings with the MIDSCENE_PLANNING_MODEL_ prefix to use a dedicated model for the Planning intent.

We recommend gpt-5.1 or other multimodal models that understand UI workflows.
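
For illustration, here is a minimal .env-style sketch, assuming the Planning variables mirror the base MIDSCENE_MODEL_* names. The exact suffixes and the model name below are assumptions; check Model configuration for the supported variables.

```
# Assumed names -- only the MIDSCENE_PLANNING_MODEL_ prefix is documented here
MIDSCENE_PLANNING_MODEL_NAME=gpt-5.1
MIDSCENE_PLANNING_MODEL_API_KEY=your-planning-api-key
MIDSCENE_PLANNING_MODEL_BASE_URL=https://your-provider.example.com/v1
```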

Insight intent

Midscene provides page-understanding APIs such as AI assertions (aiAssert) and data extraction (aiQuery, aiAsk). We group these workloads under the Insight intent, which depends on visual question answering (VQA) skills.

Add settings with the MIDSCENE_INSIGHT_MODEL_ prefix to use a dedicated model for Insight workloads.

We recommend gpt-5.1 or other models with strong VQA capability.
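
The same pattern applies here, again as a hedged sketch with assumed suffixes:

```
# Assumed names -- only the MIDSCENE_INSIGHT_MODEL_ prefix is documented here
MIDSCENE_INSIGHT_MODEL_NAME=gpt-5.1
MIDSCENE_INSIGHT_MODEL_API_KEY=your-insight-api-key
MIDSCENE_INSIGHT_MODEL_BASE_URL=https://your-provider.example.com/v1
```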

How do I tune execution results?

If you see success rates below your target, try these steps.

  1. Review Midscene's replay report to confirm the task timeline is correct and the flow didn't veer into the wrong page or branch.
  2. Use the newest and largest model versions available; they materially improve success rates. For example, Qwen3-VL outperforms Qwen2.5-VL, and a 72B build is more accurate than a 30B build.
  3. Ensure the MIDSCENE_MODEL_FAMILY environment variable is set correctly; otherwise element localization can drift significantly.
  4. Tweak the planningStrategy option in aiAct to increase planning depth.
  5. Try different models, or combine multiple models to compensate for weaker understanding.

More

Limitations of the pure-vision approach

Vision models are a highly general approach: they do not depend on a specific UI rendering stack and can analyze screenshots directly, letting developers ramp up and tune quickly across any app surface.

The flip side is higher demands on the model itself.

For example, on mobile UI automation, if you have a component tree plus full a11y annotations, you can lean on a small text-only model and reason over structure, potentially pushing performance further. Pure vision skips those annotations and is more universal, saving UI labeling effort but consuming more model resources at runtime.

Model configuration doc

See Model configuration.

API style

Midscene expects providers to expose an OpenAI-compatible API (this does not mean you must use OpenAI models).

Most major vendors and deployment tools already support this pattern.
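
As a sketch, pointing Midscene at an OpenAI-compatible endpoint only requires the standard key and base URL variables; the URL below is a placeholder for your vendor's or self-hosted gateway's address:

```
MIDSCENE_MODEL_API_KEY=your-api-key
MIDSCENE_MODEL_BASE_URL=https://your-openai-compatible-endpoint.example.com/v1
```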

How do I inspect token usage?

Set DEBUG=midscene:ai:profile:stats to print cost and latency information. You can also review token usage in the generated report files.
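
For example, when launching an automation script from a shell (the script name here is a placeholder):

```
DEBUG=midscene:ai:profile:stats node ./run-automation.js
```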

"MIDSCENE_MODEL_FAMILY is required" error

If you see "No visual language model (VL model) detected" or "MIDSCENE_MODEL_FAMILY is required," make sure the MIDSCENE_MODEL_FAMILY environment variable for your VL model is set correctly.

Starting with version 1.0, Midscene recommends using MIDSCENE_MODEL_FAMILY to specify the vision model type. Legacy MIDSCENE_USE_... configs remain compatible but are deprecated.
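
A minimal sketch; the family value below is a placeholder, since the accepted values depend on which VL model you use and are listed in Model configuration:

```
MIDSCENE_MODEL_FAMILY=<your-vl-model-family>
MIDSCENE_MODEL_API_KEY=your-api-key
MIDSCENE_MODEL_BASE_URL=https://your-provider.example.com/v1
```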

See Model configuration for setup details.

Can each Agent instance use its own model?

Yes. You can configure per-agent models via the modelConfig parameter. See the API reference for details.

Want to send browser DOM info to the model?

By default, Midscene does not send browser DOM data to models. If you need to include it for UI understanding (for example, to pass details not visible in a screenshot), set domIncluded to true in the options for interfaces like aiAsk or aiQuery. See the API reference for details.
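
A hedged TypeScript sketch, assuming agent is an already-created Midscene agent and that the options object is passed as the second argument (check the API reference for the exact signature):

```ts
// Extract data that may not be visible in the screenshot by also sending DOM info.
// `domIncluded: true` is the documented option; the call shape is an assumption.
const hiddenLabels = await agent.aiQuery(
  'string[], the aria-labels of the icon buttons in the toolbar', // hypothetical demand
  { domIncluded: true },
);
```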

Backward compatibility

From version 1.0 onward, Midscene.js recommends these environment variable names, for example:

  • MIDSCENE_MODEL_API_KEY
  • MIDSCENE_MODEL_BASE_URL

For compatibility, we still support these OpenAI-style names, though they are no longer preferred:

  • OPENAI_API_KEY
  • OPENAI_BASE_URL

When both are present, Midscene prefers the new MIDSCENE_MODEL_* variables.