Midscene.js - joyful automation by AI

Driving UI automation on all platforms with vision-based models

📣 v1.0 Release Announcement

We have released v1.0. It is currently published on npm.
For the latest documentation and code, please visit https://midscenejs.com/ and the main branch.
For historical documentation, please visit https://v0.midscenejs.com/.
v1.0 changelog: https://midscenejs.com/changelog

Features

Write Automation with Natural Language

  • Describe your goals and steps, and Midscene will plan and operate the user interface for you.
  • Use the JavaScript SDK or YAML to write your automation scripts, as sketched below.
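Here is a minimal sketch of the SDK style, assuming a Puppeteer-backed agent (the PuppeteerAgent import follows the Midscene docs, and aiAct follows the snippets later in this README; the URL and instruction are placeholders):

import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// Launch a browser and open the target page
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Attach a Midscene agent to the page, then describe the goal in natural language
const agent = new PuppeteerAgent(page);
await agent.aiAct('type "Midscene" in the search box and press Enter');

await browser.close();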

Web & Mobile App & Any Interface

For Developers

  • Three kinds of APIs: interaction API, data extraction API, and utility API.
  • MCP: Midscene provides an MCP service that exposes atomic Midscene Agent actions as MCP tools, so upper-layer agents can inspect and operate UIs with natural language. Docs
  • Caching for Efficiency: Replay your script from cache to get results faster (see the sketch after this list).
  • Debugging Experience: Midscene.js offers a visualized report with replay, a built-in playground, and a Chrome Extension to simplify the debugging process. These are the tools most developers truly need.
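A minimal sketch of the caching setup, assuming the MIDSCENE_CACHE environment variable and the cacheId agent option from the Midscene caching docs (verify both against your version):

import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// Opt in to caching of planning and locating results
process.env.MIDSCENE_CACHE = 'true';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/records');

// A stable cacheId lets a replay match the cached steps of a previous run
const agent = new PuppeteerAgent(page, { cacheId: 'records-flow' });

// On a cache hit, this step replays without re-querying the model
await agent.aiAct('click the first record in the list');

await browser.close();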

Showcases

Complete the GitHub registration form autonomously in a web browser and pass all field validations:

Plus these real-world showcases:

Zero-code quick experience

Driven by Visual-Language Models

Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-flash, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.

  • Pure-vision localization for UI actions; the DOM extraction mode is removed.
  • Works across web, mobile, desktop, and even <canvas> surfaces.
  • Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
  • DOM can still be included for data extraction and page understanding when needed.
  • Strong open-source options for self-hosting; see the configuration sketch below.
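A minimal sketch of model configuration through environment variables (the variable names follow the Midscene configuration docs; the endpoint, key, and model name are placeholders):

// Any OpenAI-compatible endpoint can serve the model; values below are placeholders
process.env.OPENAI_BASE_URL = 'https://your-endpoint.example.com/v1';
process.env.OPENAI_API_KEY = 'your-api-key';

// The visual-language model to use, e.g. a self-hosted Qwen3-VL deployment
process.env.MIDSCENE_MODEL_NAME = 'qwen3-vl';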

Read more about Model Strategy

Two styles of automation

Auto planning

AI autonomously plans and executes the flow to complete the task.

await agent.aiAct('click all the records one by one. If one record contains the text "completed", skip it');

Workflow style

Split complex logic into multiple steps to improve the stability of the automation code.

// Extract the record list from the page as an array of strings
const recordList = await agent.aiQuery('string[], the record list');
for (const record of recordList) {
  // Ask the model a yes/no question about each record
  const hasCompleted = await agent.aiBoolean(`check if the record "${record}" contains the text "completed"`);
  if (!hasCompleted) {
    await agent.aiTap(record);
  }
}

For more details about the workflow style, please refer to Use JavaScript to Optimize the AI Automation Code

Resources

Community

Credits

We would like to thank the following projects:

  • Rsbuild and Rslib for the build tooling.
  • UI-TARS for the open-source agent model.
  • Qwen2.5-VL for the open-source visual-language model.
  • scrcpy and yume-chan for allowing us to control Android devices from the browser.
  • appium-adb for the JavaScript bridge to adb.
  • appium-webdriveragent for operating XCTest with JavaScript.
  • YADB for the yadb tool, which improves the performance of text input.
  • Puppeteer for browser automation and control.
  • Playwright for browser automation, control, and testing.

License

Midscene.js is MIT licensed.