Model Strategy
If you want to try Midscene right away, pick a model and follow its configuration guide.
This guide focuses on how we choose models in Midscene. If you need configuration instructions, head to Model configuration.
Background - UI automation approaches
Driving UI automation with AI hinges on two challenges: planning a reasonable set of actions and precisely locating the elements that require interaction. The strength of element localization directly determines how successful an automation task will be.
To solve element localization, UI automation frameworks traditionally follow one of two approaches:
- DOM + annotated screenshots: Extract the DOM tree beforehand, annotate screenshots with DOM metadata, and ask the model to “pick” the right nodes.
- Pure vision: Perform all analysis on screenshots alone by using the visual grounding capabilities of the model. The model only receives the image—no DOM, no annotations.
Pure vision for element localization
Earlier Midscene releases supported both DOM localization and pure vision so developers could compare them. After dozens of versions and hundreds of project tests, we have some new findings.
The DOM approach has been less stable than expected. It often misfires with canvas elements, controls rendered via CSS background-image, content inside cross-origin iframes, or elements without strong accessibility annotations. These intermittent failures can send teams into long debugging sessions or even a frustrating prompt-tuning loop.
Meanwhile, the pure-vision route has started to show clear advantages:
- Stable results: These models pair strong UI planning with element targeting and interface understanding, helping developers get productive quickly.
- Works everywhere: Automation no longer depends on how the UI is rendered. Android, iOS, desktop apps, or a browser `<canvas>`: if you can capture a screenshot, Midscene can interact with it.
- Developer-friendly: With no selectors or DOM plumbing to manage, getting a model up and running is easier. Even teammates unfamiliar with rendering tech can become productive quickly.
- Far fewer tokens: Compared to the DOM approach, pure vision can cut token usage by up to ~80%, reducing cost and speeding up local runs.
- Open-source options: Open-source vision models keep improving. Teams can self-host options such as Qwen3-VL 8B/30B with solid results.
Given these advantages, Midscene 1.0 and later only support the pure-vision approach—the DOM-extraction compatibility mode has been removed. This applies to UI actions and element localization; for data extraction or page understanding you can still opt in to include DOM when needed.
Recommended vision models
Based on extensive real-world usage, we recommend these defaults for Midscene: Doubao Seed, Qwen VL, Gemini-3 (Pro/Flash), and UI-TARS.
They offer strong element-localization skills and solid performance in planning and screen understanding.
If you are unsure where to start, pick whichever model is easiest to access today, then run side-by-side comparisons later.
Midscene requires strong UI localization (visual grounding). Models like gpt-5 perform poorly here, so they cannot serve as the default. You can still use them as dedicated “planning models,” which we cover below.
Advanced: combining multiple models
The default model strategy covers most projects at kick-off. As prompts grow more complex and you need stronger generalization, the default models’ planning ability may fall short. Consider a prompt that automates the GitHub sign-up flow end to end.
This sounds simple, but the model must understand per-field rules, identify each control, operate a multi-step region selector, paginate and trigger validation, and still locate every element. Default models often cannot satisfy all of these at once, so success rates drop.
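For illustration only, such a compound instruction might look like the sketch below (the wording is ours, not the exact prompt from our tests; `agent` stands for any already-constructed Midscene agent):

```typescript
// Illustrative only: a compound sign-up instruction of the kind described
// above. The minimal structural type keeps the sketch self-contained;
// use your real Midscene agent class in practice.
async function signUpDemo(agent: { aiAct: (prompt: string) => Promise<void> }) {
  await agent.aiAct(`
    On the GitHub sign-up page:
    1. Fill in the email, password, and username fields, respecting the
       validation hints shown next to each field.
    2. Open the country/region selector and pick "Singapore".
    3. Step through the remaining pages of the form, triggering validation
       as you go, and stop before the final submit button.
  `);
}
```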
In these situations, configure separate models for Planning and Insight while keeping the default model as your baseline. This does not replace the default model: you keep it and add optional Planning and Insight models on top. Multi-model setups are the most practical and effective way to lift UI automation success rates, at a small cost in latency and tokens.
Model responsibility overview
By default, the system follows these rules (unset intents fall back to the default model):
- Default model: handles element localization (Locate), plus anything not explicitly assigned to Planning/Insight.
- Planning model (optional): handles task planning (Planning in `aiAct`/`ai`).
- Insight model (optional): handles data extraction, assertions, and UI understanding (`aiQuery`/`aiAsk`/`aiAssert`).
In short: Planning goes to the Planning model, localization stays on the default model, and extraction/assertion goes to Insight when configured.
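The sketch below maps each API call to the model that answers it, assuming dedicated Planning and Insight models have been configured; the minimal structural type exists only to keep the example self-contained.

```typescript
// Sketch of the responsibility split. Replace MidsceneAgentLike with your
// real agent class; the method names come from this guide.
type MidsceneAgentLike = {
  aiAct: (prompt: string) => Promise<void>;
  aiQuery: <T>(demand: string) => Promise<T>;
  aiAssert: (assertion: string) => Promise<void>;
};

async function intentDemo(agent: MidsceneAgentLike) {
  // Planning model (if configured) breaks this instruction into steps;
  // the default model then locates each element those steps interact with.
  await agent.aiAct('search for "midscene" and open the first result');

  // Insight model (if configured) handles extraction and assertions.
  const title = await agent.aiQuery<string>('string, the title of the opened page');
  await agent.aiAssert('the opened page shows a search result, not an error');
  console.log(title);
}
```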
For a ready-to-use environment snippet, including an Insight model, see the multi-model example.
Planning intent
For aiAct or ai planning tasks, add settings with the MIDSCENE_PLANNING_MODEL_ prefix to use a dedicated model for the Planning intent.
We recommend gpt-5.1 or other multimodal models that understand UI workflows.
Insight intent
Midscene provides page-understanding APIs such as AI assertions (aiAssert) and data extraction (aiQuery, aiAsk). We group these workloads under the Insight intent, which depends on visual question answering (VQA) skills.
Add settings with the MIDSCENE_INSIGHT_MODEL_ prefix to use a dedicated model for Insight workloads.
We recommend gpt-5.1 or other models with strong VQA capability.
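As a rough sketch of how a combined setup can be wired through environment variables: the `MIDSCENE_PLANNING_MODEL_` and `MIDSCENE_INSIGHT_MODEL_` prefixes come from the sections above, but the exact variable suffixes and the family value shown here are assumptions, so treat the Model configuration doc and the multi-model example as the source of truth.

```typescript
// Sketch: default model plus dedicated Planning and Insight models,
// configured before the agent is created. Suffixes beyond the documented
// prefixes are assumptions; check the Model configuration doc.

// Default model: element localization plus anything not assigned elsewhere.
process.env.MIDSCENE_MODEL_FAMILY = 'qwen3-vl';
process.env.MIDSCENE_MODEL_API_KEY = '<your-api-key>';
process.env.MIDSCENE_MODEL_BASE_URL = '<your-provider-base-url>';

// Optional Planning model: variables with the MIDSCENE_PLANNING_MODEL_ prefix.
process.env.MIDSCENE_PLANNING_MODEL_API_KEY = '<planning-model-api-key>';
process.env.MIDSCENE_PLANNING_MODEL_BASE_URL = '<planning-provider-base-url>';

// Optional Insight model: variables with the MIDSCENE_INSIGHT_MODEL_ prefix.
process.env.MIDSCENE_INSIGHT_MODEL_API_KEY = '<insight-model-api-key>';
process.env.MIDSCENE_INSIGHT_MODEL_BASE_URL = '<insight-provider-base-url>';
```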
How do I tune execution results?
If you see success rates below your target, try these steps.
- Review Midscene's replay report to confirm the task timeline is correct and the flow didn't veer into the wrong page or branch.
- Use the best, newest, and largest model versions to materially improve success rates. For example, Qwen3-VL outperforms Qwen2.5-VL, and a 72B build is more accurate than a 30B build.
- Ensure the `MIDSCENE_MODEL_FAMILY` environment variable is set correctly; otherwise element localization can drift significantly.
- Tweak the `planningStrategy` option in `aiAct` to increase planning depth.
- Try different models, or combine multiple models to compensate for weaker understanding.
More
Limitations of the pure-vision approach
Vision models are a highly general approach: they do not depend on a specific UI rendering stack and can analyze screenshots directly, letting developers ramp up and tune quickly across any app surface.
The flip side is higher demands on the model itself.
For example, on mobile UI automation, if you have a component tree plus full a11y annotations, you can lean on a small text-only model and reason over structure, potentially pushing performance further. Pure vision skips those annotations and is more universal, saving UI labeling effort but consuming more model resources at runtime.
Model configuration doc
See Model configuration.
About the deepThink option in aiAct
The deepThink option controls whether models use deep reasoning during planning. Currently supported for qwen3-vl and doubao-seed-1.6 model families.
When enabled, you'll see a Reasoning section in the report. Planning takes longer but produces more accurate results.
Default behavior varies by provider:
- `qwen3-vl` on Alibaba Cloud: disabled by default (faster, less accurate)
- `doubao-seed-1.6` on Volcano Engine: enabled by default (slower, more accurate)
You can explicitly set deepThink: true or deepThink: false to override the provider's default.
deepThink accepts 'unset' | true | false. The default is 'unset' (same as omitting the option, following the model provider's default strategy).
Note: The implementation details behind deepThink may evolve in the future as model providers change.
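For example, a minimal sketch of overriding the provider default for a single call, assuming `agent` is an existing Midscene agent and that `aiAct` accepts an options object as its second argument:

```typescript
// Sketch: forcing deepThink on or off per call. The structural type is only
// here to keep the example self-contained.
async function deepThinkDemo(agent: {
  aiAct: (prompt: string, opts?: { deepThink?: boolean | 'unset' }) => Promise<void>;
}) {
  // Force deep reasoning on, regardless of the provider's default.
  await agent.aiAct('fill in the shipping form and submit it', { deepThink: true });

  // Or force it off for a faster, less thorough plan.
  await agent.aiAct('open the settings page', { deepThink: false });
}
```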
API style for model service provider
Midscene expects providers to expose an OpenAI-compatible API (this does not mean you must use OpenAI models).
Most major vendors and deployment tools already support this pattern.
How do I inspect token usage?
Set DEBUG=midscene:ai:profile:stats to print cost and latency information. You can also review token usage in the generated report files.
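A minimal sketch of setting this from code rather than the shell:

```typescript
// Sketch: enable Midscene's AI profiling output for the current Node process.
// Set this before importing Midscene; exporting DEBUG=midscene:ai:profile:stats
// in your shell before running the script has the same effect.
process.env.DEBUG = 'midscene:ai:profile:stats';
```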
"MIDSCENE_MODEL_FAMILY is required" error
If you see "No visual language model (VL model) detected" or "MIDSCENE_MODEL_FAMILY is required," make sure the MIDSCENE_MODEL_FAMILY environment variable for your VL model is set correctly.
Starting with version 1.0, Midscene recommends using MIDSCENE_MODEL_FAMILY to specify the vision model type. Legacy MIDSCENE_USE_... configs remain compatible but are deprecated.
See Model configuration for setup details.
Can each Agent instance use its own model?
Yes. You can configure per-agent models via the modelConfig parameter. See the API reference for details.
Want to send browser DOM info to the model?
By default, Midscene does not send browser DOM data to models. If you need to include it for UI understanding (for example, to pass details not visible in a screenshot), set domIncluded to true in the options for interfaces like aiAsk or aiQuery. See the API reference for details.
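A minimal sketch, assuming `agent` is an existing Midscene agent and that `aiQuery` accepts an options object as its second argument:

```typescript
// Sketch: include DOM data for a single extraction call. The structural type
// is only here to keep the example self-contained.
async function extractLinks(agent: {
  aiQuery: <T>(demand: string, opts?: { domIncluded?: boolean }) => Promise<T>;
}) {
  // Link hrefs are usually not visible in a screenshot, so include the DOM.
  return agent.aiQuery<string[]>(
    'string[], the href of every link in the page sidebar',
    { domIncluded: true },
  );
}
```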
Backward compatibility
From version 1.0 onward, Midscene.js recommends these environment variable names, for example:
- `MIDSCENE_MODEL_API_KEY`
- `MIDSCENE_MODEL_BASE_URL`
For compatibility, we still support these OpenAI-style names, though they are no longer preferred:
- `OPENAI_API_KEY`
- `OPENAI_BASE_URL`
When both are present, Midscene prefers the new MIDSCENE_MODEL_* variables.
Does the Doubao phone use Midscene under the hood?
No.
Troubleshooting model service connectivity issues
If you need to troubleshoot connectivity issues, use the `connectivity-test` folder in our example project: https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test
Put your `.env` file in the `connectivity-test` folder and run the test with `npm i && npm run test`.

