Midscene.js - Vision-Driven UI Testing & Automation

Midscene is an open-source SDK for vision-driven UI testing and automation. You describe each step in natural language, and Midscene drives a multimodal model to plan and operate the interface for you — across web, mobile, desktop, and even <canvas> surfaces.

Ready to build?

Head to the Quick start to run your first automation in a few minutes, or try it with no code via the Chrome extension.

Why Midscene

Most UI automation — including AI tools that read the DOM or the accessibility tree — depends on page structure. That structure is fragile and incomplete: selectors break on every refactor; elements without semantic markup (icon-only buttons, custom-rendered controls, <canvas>) are invisible to it; native apps and cross-origin iframes are out of reach; and it cannot tell whether something actually looks right.

Midscene takes a different route: it works from the screenshot alone, using a multimodal model, and you describe each step in natural language — the way a human tester would. That changes what UI testing feels like:

  • Tests stop breaking on every refactor. There are no selectors to chase when markup or styles change, so the maintenance cost of your suite drops sharply.
  • Reach every element and every surface. If a human can see it, Midscene can target it — even elements with no semantic annotations, <canvas>, native apps, and cross-origin iframes that structure-based tools cannot reach.
  • Assert on what users actually see. Verify visual results — colors, highlights, layout, rendered state — not just whether a node exists in the DOM.
  • Two ways to test. Add Midscene to your existing Playwright or Vitest suite, or let an AI agent test your app autonomously through Skills and MCP.
  • Failures you can read. Every run produces a visual report you can replay step by step.
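The visual-assertion idea from the list above can be sketched as follows. This is a minimal sketch, assuming an agent object that exposes the `aiAssert` method named in the API reference and that `aiAssert` rejects when the statement does not hold. The `checkCartBadge` helper, the prompt wording, and the stand-in agent are hypothetical and exist only so the snippet runs without a model or a browser.

```javascript
// Hypothetical helper: describes the expected visual state in plain language.
// Assumption: `agent.aiAssert(statement)` rejects when the statement is false.
async function checkCartBadge(agent) {
  await agent.aiAssert('the cart icon shows a red badge with the number 2');
  return 'cart badge looks right';
}

// Stand-in agent for a dry run. It is NOT a Midscene agent: it compares the
// statement with a fixed description instead of looking at a screenshot.
function createStubAgent(visibleState) {
  return {
    async aiAssert(statement) {
      if (statement !== visibleState) {
        throw new Error(`Assertion failed: ${statement}`);
      }
    },
  };
}
```

In a real suite the stub would be replaced by the agent your platform guide sets up. The point of the sketch is that the assertion is written against what is rendered, not against a DOM node.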

Midscene is built for UI testing first, but the same vision-driven engine handles any UI automation task — use it however fits your work.

What you can automate

Midscene works anywhere you can take a screenshot — web browsers, Android, iOS, HarmonyOS, desktop apps, and any custom interface — all through one API. Each platform has its own getting-started guide in the sidebar.

Write your automation with the JavaScript SDK or in YAML, and look up every method — aiAct, aiQuery, aiAssert, and more — in the API reference.
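As a rough illustration of how these methods fit together, the sketch below chains `aiAct`, `aiQuery`, and `aiAssert` on one agent. It assumes each method takes a natural-language prompt and returns a promise. The `searchAndCollect` helper, the prompts, and the in-memory stand-in agent are hypothetical, and the API reference remains the source of truth for the real signatures.

```javascript
// Hypothetical flow: act on the page, extract data, then assert on the result.
async function searchAndCollect(agent, keyword) {
  // Let the model plan and perform the interaction.
  await agent.aiAct(`type "${keyword}" in the search box and press Enter`);
  // Describe the shape of the data first, then what to extract.
  const titles = await agent.aiQuery('string[], the titles of the search results');
  // Check the rendered state in plain language.
  await agent.aiAssert('the result list shows at least one item');
  return titles;
}

// Stand-in agent that only records prompts. It uses no browser and no model,
// and it is not part of Midscene.
function createRecordingAgent(cannedTitles) {
  const calls = [];
  return {
    calls,
    async aiAct(prompt) { calls.push(['aiAct', prompt]); },
    async aiQuery(prompt) { calls.push(['aiQuery', prompt]); return cannedTitles; },
    async aiAssert(prompt) { calls.push(['aiAssert', prompt]); },
  };
}
```

Because the helper only depends on the three method names, the same function could be handed a real agent once one is configured for your platform.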

Two styles of automation

Auto planning

The AI plans and executes the whole flow on its own to complete the task.

await aiAct('click all the records one by one. If one record contains the text "completed", skip it');

Workflow style

Split complex logic into multiple steps to improve the stability of the automation code.

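// Extract the list of records as structured data
const recordList = await agent.aiQuery('string[], the record list');
for (const record of recordList) {
  // Ask a yes/no question about this record
  const hasCompleted = await agent.aiBoolean(`check if the record "${record}" contains the text "completed"`);
  // Only tap records that are not yet completed
  if (!hasCompleted) {
    await agent.aiTap(record);
  }
}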

For more on the workflow style, see Use JavaScript to optimize the AI automation code.

Driven by multimodal models

Midscene supports many popular multimodal models with strong UI localization, so you can pick whichever is easiest to access — including open-source options you can self-host: Qwen3.x, Doubao-Seed-2.0, GLM-4.6V, gemini-3.5-flash, and UI-TARS.

See Model strategy to choose one.

Showcases

Fill in the GitHub sign-up form autonomously in a web browser and pass all field validations.

See more real-world examples across iOS, Android, desktop, and MCP in Showcases.

Resources & community

Credits

Midscene builds on many excellent open-source projects, including UI-TARS, Qwen, Playwright, Puppeteer, scrcpy, Appium, WebDriverAgent, YADB, and libnut-core. See the README for the full list.

License

Midscene.js is MIT licensed.