Support Android Automation
From Midscene v0.15, we are happy to announce support for Android automation. The era of AI-driven Android automation is here!
Showcases
Navigation to an attraction
Open Maps, search for a destination, and navigate to it.
Auto-like tweets
Open Twitter and auto-like the first tweet by @midscene_ai.
Suitable for ALL apps
For developers, all you need is an adb connection and a visual-language model (VL model) service. Everything is ready!
Behind the scenes, we utilize the visual grounding capabilities of the VL model to locate target elements on the screen. So whether it's a native app or a hybrid app with a webview makes no difference: developers can write automation scripts without worrying about the app's technology stack.
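Getting the model service ready is mostly a matter of environment variables. Here is a minimal .env sketch, assuming a Qwen2.5-VL model served behind an OpenAI-compatible endpoint (the base URL, key, and model name below are placeholders; adjust them to your provider):

# placeholders — replace with your provider's endpoint, key, and model name
OPENAI_BASE_URL=https://your-model-provider.example.com/v1
OPENAI_API_KEY=your-api-key
MIDSCENE_MODEL_NAME=qwen2.5-vl-72b-instruct
MIDSCENE_USE_QWEN_VL=1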
With ALL the power of Midscene
When using Midscene for web automation, our users love tools like the playground and reports. Now we bring the same power to Android automation!
Use the playground to run automation without any code
Use the report to replay the whole process
Write automation scripts in a YAML file
Connect to the device, open ebay.com, and get some item info.
# search headphones on eBay, extract the item info into a JSON file, and assert the Filter button
android:
  deviceId: s4ey59

tasks:
  - name: search headphones
    flow:
      - aiAction: open browser and navigate to ebay.com
      - aiAction: type 'Headphones' in ebay search box, hit Enter
      - sleep: 5000
      - aiAction: scroll down the page for 800px

  - name: extract headphones info
    flow:
      - aiQuery: >
          {name: string, price: number, subTitle: string}[], return item name, price and the subTitle on the lower right corner of each item
        name: headphones

  - name: assert Filter button
    flow:
      - aiAssert: There is a Filter button on the page
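Assuming you saved the script as ebay-search.yaml (the filename is illustrative), you can run it with the Midscene CLI:

npx --yes @midscene/cli ./ebay-search.yaml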
Use the JavaScript SDK
Use the JavaScript SDK to write the automation in code.
import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';
import 'dotenv/config'; // read environment variables from .env file

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

Promise.resolve(
  (async () => {
    const devices = await getConnectedDevices();
    const page = new AndroidDevice(devices[0].udid);

    // 👀 init Midscene agent
    const agent = new AndroidAgent(page, {
      aiActionContext:
        'If any location, permission, user agreement, etc. popup, click agree. If login page pops up, close it.',
    });
    await page.connect();
    await page.launch('https://www.ebay.com');

    await sleep(5000);

    // 👀 type keywords, perform a search
    await agent.aiAction('type "Headphones" in search box, hit Enter');

    // 👀 wait for the loading
    await agent.aiWaitFor('there is at least one headphone item on page');
    // or you may use a plain sleep:
    // await sleep(5000);

    // 👀 understand the page content, find the items
    const items = await agent.aiQuery(
      '{itemTitle: string, price: Number}[], find item in list and corresponding price',
    );
    console.log('headphones in stock', items);

    // 👀 assert by AI
    await agent.aiAssert('There is a category filter on the left');
  })(),
);
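To try it, connect a device over adb, put your model configuration in .env, and run the script with Node.js (the filename is illustrative; the .mjs extension makes Node treat it as an ES module):

node ./demo.mjs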
Two API styles for interaction
The auto-planning style:
await agent.ai('input "Headphones" in search box, hit Enter');
The instant action style:
await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');
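Other instant-action methods follow the same pattern. For example (a sketch only; exact signatures and availability may vary by Midscene version):

// tap an element described in natural language
await agent.aiTap('the first item in the search results');
// scroll a specific area by a fixed distance
await agent.aiScroll({ direction: 'down', scrollType: 'once', distance: 800 }, 'the item list');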
Demo projects
We have prepared a demo project for the JavaScript SDK:
JavaScript demo project
If you want to use the automation for testing purposes, you can use JavaScript with Vitest. We have set up a demo project to show how it works:
Vitest demo project
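As a rough sketch of what such a test can look like (the file name and assertions are illustrative; the agent setup mirrors the SDK example above):

// ebay.test.mjs — a minimal sketch, not the demo project itself
import { describe, it, beforeAll, vi } from 'vitest';
import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';
import 'dotenv/config';

vi.setConfig({ testTimeout: 240 * 1000 }); // AI-driven steps are slow, raise the timeout

let page;
let agent;

beforeAll(async () => {
  const devices = await getConnectedDevices();
  page = new AndroidDevice(devices[0].udid);
  agent = new AndroidAgent(page);
  await page.connect();
});

describe('search headphones on ebay', () => {
  it('finds the items and the filter', async () => {
    await page.launch('https://www.ebay.com');
    await agent.aiAction('type "Headphones" in search box, hit Enter');
    await agent.aiWaitFor('there is at least one headphone item on page');
    await agent.aiAssert('There is a category filter on the left');
  });
});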
You can also write automation scripts in a YAML file:
YAML demo project
Limitations
- The caching feature for element locators is not supported. Since no view hierarchy is collected, we cannot cache element identifiers for reuse.
- LLMs like GPT-4o or DeepSeek are not supported. Only some known VL models with visual grounding ability are supported for now. If you want to introduce other VL models, please let us know.
- The performance is not good enough for now. We are still working on it.
- The VL model may not perform well on .aiQuery and .aiAssert. We will provide a way to switch models for different kinds of tasks.
- Due to security restrictions, you may get a blank screenshot on password input screens, and Midscene will not be able to work in that case.
Credits
We would like to thank the following projects:
- scrcpy and yume-chan allow us to control Android devices from the browser.
- appium-adb for the JavaScript bridge of adb.
- YADB for the yadb tool, which improves the performance of text input.