Support Android Automation

Starting from Midscene v0.15, we are happy to announce support for Android automation. The era of AI-driven Android automation is here!

Showcases

Navigation

Open Maps, search for a destination, and navigate to it.

Auto-like tweets

Open Twitter, auto-like the first tweet by @midscene_ai.

Suitable for ALL apps

For developers, all you need is an adb connection and a visual-language model (vl-model) service. Everything is ready!
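
To configure the model service, Midscene reads environment variables. Here is a minimal sketch of a .env file, assuming a Qwen2.5-VL-compatible endpoint (the values below are placeholders; adjust them to your own provider and key):

# .env (placeholder values, replace with your own provider and key)
OPENAI_API_KEY=sk-your-api-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MIDSCENE_MODEL_NAME=qwen-vl-max-latest
MIDSCENE_USE_QWEN_VL=1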

Behind the scenes, we use the visual grounding capabilities of the vl-model to locate target elements on the screen. Whether it's a native app or a hybrid app with a webview makes no difference, so developers can write automation scripts without worrying about the app's technology stack.
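
For instance, the same natural-language call drives a native app exactly as it drives a web page (the prompt below is a hypothetical example):

// no selectors and no view hierarchy needed: the vl-model grounds the target visually
await agent.aiAction('open the Settings app and search for "battery"');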

With ALL the power of Midscene

When using Midscene for web automation, our users love tools like the playground and reports. Now, we bring the same power to Android automation!

Use the playground to run automation without any code

Use the report to replay the whole process

Write automation scripts in a YAML file

Connect to the device, open ebay.com, and get some item info.

# search for headphones on ebay, extract the item info into a JSON file, and assert that a Filter button exists

android:
  deviceId: s4ey59

tasks:
  - name: search headphones
    flow:
      - aiAction: open browser and navigate to ebay.com
      - aiAction: type 'Headphones' in ebay search box, hit Enter
      - sleep: 5000
      - aiAction: scroll down the page for 800px

  - name: extract headphones info
    flow:
      - aiQuery: >
          {name: string, price: number, subTitle: string}[], return item name, price and the subTitle on the lower right corner of each item
        name: headphones

  - name: assert Filter button
    flow:
      - aiAssert: There is a Filter button on the page
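
To run the script, you can use the Midscene command-line tool. Assuming the file above is saved as ebay-search.yaml (the filename is just an example):

npx --yes @midscene/cli ./ebay-search.yaml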

Use the JavaScript SDK

Use the JavaScript SDK to drive the automation from code.

import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';
import "dotenv/config"; // read environment variables from .env file

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
Promise.resolve(
  (async () => {
    const devices = await getConnectedDevices();
    const page = new AndroidDevice(devices[0].udid);

    // 👀 init Midscene agent
    const agent = new AndroidAgent(page, {
      aiActionContext:
        'If any location, permission, user agreement, etc. popup, click agree. If login page pops up, close it.',
    });
    await page.connect();
    await page.launch('https://www.ebay.com');

    await sleep(5000);

    // 👀 type keywords, perform a search
    await agent.aiAction('type "Headphones" in search box, hit Enter');

    // 👀 wait for the loading
    await agent.aiWaitFor("there is at least one headphone item on page");
    // or you may use a plain sleep:
    // await sleep(5000);

    // 👀 understand the page content, find the items
    const items = await agent.aiQuery(
      "{itemTitle: string, price: Number}[], find item in list and corresponding price"
    );
    console.log("headphones in stock", items);

    // 👀 assert by AI
    await agent.aiAssert("There is a category filter on the left");
  })()
);
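
To run the script, any Node.js runner with ES module support works. For example, assuming the file is saved as demo.ts and tsx is available:

npx tsx demo.ts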

Two API styles for interaction

The auto-planning style, where the agent plans and executes the steps on its own:

await agent.ai('input "Headphones" in search box, hit Enter');

The instant action style, where you specify the exact action to perform:

await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');

Demo projects

We have prepared a demo project for the JavaScript SDK:

JavaScript demo project

If you want to use the automation for testing purposes, you can use the JavaScript SDK with Vitest. We have set up a demo project for you to see how it works:

Vitest demo project

You can also write automation scripts in a YAML file:

YAML demo project

Limitations

  1. The caching feature for element locating is not supported. Since no view hierarchy is collected, we cannot cache element identifiers for reuse.
  2. LLMs like gpt-4o or deepseek are not supported. Only some known vl-models with visual grounding ability are supported for now. If you want to introduce other vl-models, please let us know.
  3. The performance is not good enough yet. We are still working on it.
  4. The vl-model may not perform well on .aiQuery and .aiAssert. We will provide a way to switch models for different kinds of tasks.
  5. Due to security restrictions, you may get a blank screenshot on password input screens, and Midscene will not be able to work in that case.

Credits

We would like to thank the following projects:

  • scrcpy and yume-chan, which allow us to control Android devices from the browser.
  • appium-adb for the JavaScript bridge to adb.
  • YADB for the yadb tool, which improves the performance of text input.