Integrate with any interface (preview)

Since Midscene v0.28.0, you can integrate Midscene with any interface. Define your own interface controller class that implements the AbstractInterface class, and you get a fully-featured Midscene Agent.

A typical use of this feature is building a GUI Automation Agent for your own interface, such as an IoT device, an in-house app, or an in-car display.

After implementing the class, you get all of these features:

  • the GUI Automation Agent SDK
  • the playground for debugging
  • control of the interface via YAML scripts
  • the MCP server
  • all other features of the Midscene Agent

Please note that only models with visual grounding capabilities can be used to control the interface. Read this doc to choose a model.

This is a preview feature

This feature is still in the preview stage, and we welcome your feedback and suggestions on GitHub.

Demo and community projects

We have prepared a demo project for you to learn how to define your own interface class. It's highly recommended to check it out.

  • Demo Project - A simple demo project that shows how to define your own interface class

  • Android (adb) Agent - The Android (adb) Agent of Midscene, implemented with this feature

There are also some community projects that use this feature:

  • midscene-ios - A project driving the macOS "iPhone Mirroring" app with Midscene

Set up the AI model service

Set your model configuration in environment variables. Refer to choose a model for more details.

# replace with your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"

# You may need more configs, such as model name and endpoint, please refer to [choose a model](../choose-a-model)
export OPENAI_BASE_URL="..."

Implement your own interface class

Key concepts

  • The AbstractInterface class: a predefined abstract class that connects your interface to the Midscene Agent
  • The action space: the set of actions that can be performed on the interface. It determines how the AI model plans and executes actions (sketched below)
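
In code, the two concepts look roughly like this. This is a minimal sketch only; the full sample in Step 2 covers the details.

import type { DeviceAction, Size } from '@midscene/core';
import { type AbstractInterface, defineActionTap } from '@midscene/core/device';

// the smallest possible shape of an interface controller
class MinimalDevice implements AbstractInterface {
  interfaceType = 'minimal-device';

  async screenshotBase64(): Promise<string> {
    return 'data:image/png;base64,...'; // capture your interface here
  }

  async size(): Promise<Size> {
    return { width: 1280, height: 720, dpr: 1 };
  }

  // the action space: everything the AI model is allowed to do on this interface
  actionSpace(): DeviceAction[] {
    return [
      defineActionTap(async (param) => {
        // tap at param.locate.center on your interface
      }),
    ];
  }
}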

Step 1. clone and set up the demo project

We provide a demo project that exercises all the features described in this document. It's the fastest way to get started.

# prepare the environment
git clone https://github.com/web-infra-dev/midscene-example.git
cd midscene-example/custom-interface
npm install
npm run build

# run the demo
npm run demo

Step 2. implement your interface class

Define a class that implements the AbstractInterface class, and implement the required methods.

You can find the sample implementation in the ./src/sample-device.ts file. Let's take a look at it.

import type { DeviceAction, Size } from '@midscene/core';
import { getMidsceneLocationSchema, z } from '@midscene/core';
import {
  type AbstractInterface,
  defineAction,
  defineActionTap,
  defineActionInput,
  // ... other action imports
} from '@midscene/core/device';

export interface SampleDeviceConfig {
  deviceName?: string;
  width?: number;
  height?: number;
  dpr?: number;
}

/**
 * SampleDevice - A template implementation of AbstractInterface
 */
export class SampleDevice implements AbstractInterface {
  interfaceType = 'sample-device';
  private config: Required<SampleDeviceConfig>;

  constructor(config: SampleDeviceConfig = {}) {
    this.config = {
      deviceName: config.deviceName || 'Sample Device',
      width: config.width || 1920,
      height: config.height || 1080,
      dpr: config.dpr || 1,
    };
  }

  /**
   * Required: Take a screenshot and return base64 string
   */
  async screenshotBase64(): Promise<string> {
    // TODO: Implement actual screenshot capture
    console.log('📸 Taking screenshot...');
    return 'data:image/png;base64,...'; // Your screenshot implementation
  }

  /**
   * Required: Get interface dimensions
   */
  async size(): Promise<Size> {
    return {
      width: this.config.width,
      height: this.config.height,
      dpr: this.config.dpr,
    };
  }

  /**
   * Required: Define available actions for AI model
   */
  actionSpace(): DeviceAction[] {
    return [
      // Basic tap action
      defineActionTap(async (param) => {
        // TODO: Implement tap at param.locate.center coordinates
        await this.performTap(param.locate.center[0], param.locate.center[1]);
      }),

      // Text input action  
      defineActionInput(async (param) => {
        // TODO: Implement text input
        await this.performInput(param.locate.center[0], param.locate.center[1], param.value);
      }),

      // Custom action example
      defineAction({
        name: 'CustomAction',
        description: 'Your custom device-specific action',
        paramSchema: z.object({
          locate: getMidsceneLocationSchema(),
          // ... custom parameters
        }),
        call: async (param) => {
          // TODO: Implement custom action
        },
      }),
    ];
  }

  async destroy(): Promise<void> {
    // TODO: Cleanup resources
  }

  // Private implementation methods
  private async performTap(x: number, y: number): Promise<void> {
    // TODO: Your actual tap implementation
  }

  private async performInput(x: number, y: number, text: string): Promise<void> {
    // TODO: Your actual input implementation  
  }
}

The key methods that you need to implement are:

  • screenshotBase64() and size(): give the AI model the context of the interface
  • actionSpace(): an array of DeviceAction objects defining the actions that can be performed on the interface. The AI model plans with these actions when executing tasks. Midscene provides a set of predefined actions for the most common interfaces and devices, as well as a way to define any custom action.

Use these commands to run the agent:

  • npm run build to rebuild the agent
  • npm run demo to run the agent with JavaScript
  • npm run demo:yaml to run the agent with a YAML script

Step 3. test the agent with the playground

Attach a playground server to the agent, and you can test it in a web browser.

import 'dotenv/config'; // read Midscene environment variables from .env file
import { playgroundForAgent } from '@midscene/playground';
import { Agent } from '@midscene/core/agent';
import { SampleDevice } from './sample-device';

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// instantiate device and agent
const device = new SampleDevice();
const agent = new Agent(device);

// launch playground
const server = await playgroundForAgent(agent).launch();

// close playground
await sleep(10 * 60 * 1000);
await server.close();
console.log('Playground closed!');

Step 4. test the MCP server

(still in progress)

Step 5. release the npm package and let your users use it

The agent and interface class are exported from the ./index.ts file. Now you can publish the package to npm.
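
For reference, a minimal index.ts might look like this. The export names are illustrative; the exact exports in the demo project may differ.

// ./index.ts: a minimal sketch
export { SampleDevice } from './src/sample-device';
export type { SampleDeviceConfig } from './src/sample-device';

// optionally re-export the Agent so users don't need to import @midscene/core directly
export { Agent } from '@midscene/core/agent';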

Fill in the name and version in the package.json file, then run the following command:

npm publish

A typical usage of your npm package is like this:

import 'dotenv/config'; // read Midscene environment variables from .env file
import { Agent } from '@midscene/core/agent';
import { SampleDevice } from 'my-pkg-name'; // replace with your package name

// instantiate device and agent
const device = new SampleDevice();
const agent = new Agent(device);

await agent.aiAction('click the button');
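
Besides aiAction, the agent also exposes the interaction methods backed by the predefined actions in your action space (see the predefined actions section below). A hedged sketch, assuming the tap and input actions from the sample device are registered:

// these calls only work if the corresponding actions exist in your action space
await agent.aiTap('the confirm button');                 // backed by defineActionTap()
await agent.aiInput('hello world', 'the message field'); // backed by defineActionInput()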

Step 6. invoke your class from the Midscene CLI and YAML scripts

Write a yaml script with the interface section to invoke your class.

interface:
  module: 'my-pkg-name'
  # export: 'MyDeviceClass' # use this if this is a named export

config: 
  output: './data.json'

This config works the same as this code:

import MyDeviceClass from 'my-pkg-name';
const device = new MyDeviceClass();
const agent = new Agent(device, {
  output: './data.json',
});

Other fields in the YAML script are the same as in a standard Midscene YAML script.

API Reference

AbstractInterface class

import { AbstractInterface } from '@midscene/core';

AbstractInterface is the key class for the agent to control the interface.

These are the required methods that you need to implement:

  • interfaceType: string: a name for the interface; it will not be provided to the AI model
  • screenshotBase64(): Promise<string>: take a screenshot of the interface and return it as a base64 string with the data:image/ prefix
  • size(): Promise<Size>: return the size of the interface, an object with width, height, and dpr properties
  • actionSpace(): DeviceAction[] | Promise<DeviceAction[]>: the action space of the interface, which is an array of DeviceAction objects. Use predefined actions or define any custom action.

Type signatures:

import type { DeviceAction, Size } from '@midscene/core';

abstract class AbstractInterface {
  // Required
  abstract interfaceType: string;
  abstract screenshotBase64(): Promise<string>;
  abstract size(): Promise<Size>;
  abstract actionSpace(): DeviceAction[] | Promise<DeviceAction[]>;

  // Optional lifecycle/hooks
  abstract destroy?(): Promise<void>;
  abstract describe?(): string;
  abstract beforeInvokeAction?(actionName: string, param: any): Promise<void>;
  abstract afterInvokeAction?(actionName: string, param: any): Promise<void>;
}

These are the optional methods that you can implement (a sketch follows the list):

  • destroy?(): Promise<void>: destroy the interface and release its resources
  • describe?(): string: describe the interface; this may be used in the report and the playground, but it will not be provided to the AI model
  • beforeInvokeAction?(actionName: string, param: any): Promise<void>: a hook invoked before each action in the action space
  • afterInvokeAction?(actionName: string, param: any): Promise<void>: a hook invoked after each action
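
A minimal sketch of the optional methods, written for the SampleDevice class from Step 2. The logging shown here is illustrative.

// add these members to the SampleDevice class shown in Step 2

// shown in the report and the playground, never sent to the AI model
describe(): string {
  return `Sample Device (${this.config.width}x${this.config.height})`;
}

// hooks around every action in the action space, e.g. for logging or tracing
async beforeInvokeAction(actionName: string, param: any): Promise<void> {
  console.log(`before ${actionName}`, JSON.stringify(param));
}

async afterInvokeAction(actionName: string, param: any): Promise<void> {
  console.log(`after ${actionName}`);
}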

The action space

The action space is the set of actions that can be performed on the interface. The AI model uses these actions when planning and executing tasks. The descriptions and parameter schemas of all actions are provided to the AI model.

To help you define the action space easily, Midscene provides a set of predefined actions for the most common interfaces and devices, as well as a method to define any custom action.

This is how you can import the utils to define the action space:

import {
  type ActionTapParam,
  defineAction,
  defineActionTap,
} from '@midscene/core/device';

The predefined actions

These are the predefined actions for the most common interfaces and devices. You can expose them on your custom interface by implementing each action's call method (see the sketch after the list).

You can find the parameters of the actions in the type definition of these functions.

  • defineActionTap(): define the tap action; it backs the agent's aiTap method
  • defineActionDoubleClick(): define the double click action
  • defineActionInput(): define the input action; it backs the agent's aiInput method
  • defineActionKeyboardPress(): define the keyboard press action; it backs the agent's aiKeyboardPress method
  • defineActionScroll(): define the scroll action; it backs the agent's aiScroll method
  • defineActionDragAndDrop(): define the drag and drop action
  • defineActionLongPress(): define the long press action
  • defineActionSwipe(): define the swipe action
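
A hedged sketch of wiring two more predefined actions for a device like SampleDevice. The callback-style usage mirrors defineActionTap() in the Step 2 sample, and param.locate.center is an assumption borrowed from the tap action; check the type definitions of each helper for the exact parameter fields.

import type { DeviceAction } from '@midscene/core';
import {
  defineActionDoubleClick,
  defineActionLongPress,
} from '@midscene/core/device';

// build extra actions on top of your low-level device primitives
function extraActions(
  tapAt: (x: number, y: number) => Promise<void>,        // your low-level tap
  longPressAt: (x: number, y: number) => Promise<void>,  // your low-level long press
): DeviceAction[] {
  return [
    defineActionDoubleClick(async (param) => {
      const [x, y] = param.locate.center; // assumed to match the tap action's shape
      await tapAt(x, y);
      await tapAt(x, y); // two quick taps as a double click
    }),
    defineActionLongPress(async (param) => {
      const [x, y] = param.locate.center; // assumed to match the tap action's shape
      await longPressAt(x, y);
    }),
  ];
}

You can then spread the returned array into the array returned by your actionSpace() implementation.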

Define a custom action

You can define your own action by using the defineAction() function.

API Signature:

import { defineAction } from "@midscene/core/device";

defineAction({
  name: string;
  description: string;
  paramSchema: z.ZodType<T>;
  call: (param: z.infer<z.ZodType<T>>) => Promise<void>;
});

  • name: the name of the action; the AI model uses this name to invoke the action
  • description: the description of the action; the AI model uses it to understand what the action does. For complex actions, you can provide a more detailed example here.
  • paramSchema: the Zod schema of the action's parameters; the AI model fills in the parameters according to this schema
  • call: the function that executes the action; it receives a param argument that conforms to paramSchema

Example:

defineAction({
  name: 'MyAction',
  description: 'My action',
  paramSchema: z.object({
    name: z.string(),
  }),
  call: async (param) => {
    console.log(param.name);
  },
});

If you need a parameter describing the location of an element, use the getMidsceneLocationSchema() function to get the corresponding Zod schema.

A more complex example of defining a custom action:

import { getMidsceneLocationSchema } from "@midscene/core/device";

defineAction({
  name: 'LaunchApp',
  description: 'Launch an app on screen',
  paramSchema: z.object({
    name: z.string().describe('The name of the app to launch'),
    locate: getMidsceneLocationSchema().describe('The app icon to be launched'),
  }),
  call: async (param) => {
    console.log(`launching app: ${param.name}, ui located at: ${JSON.stringify(param.locate.center)}`);
  },
});

playgroundForAgent function

import { playgroundForAgent } from '@midscene/playground';

The playgroundForAgent function creates a playground launcher for a specific Agent, allowing you to test and debug your custom interface Agent in a web browser.

Function Signature

function playgroundForAgent(agent: Agent): {
  launch(options?: LaunchPlaygroundOptions): Promise<LaunchPlaygroundResult>
}

Parameters

  • agent: Agent: The Agent instance to launch the playground for

Return Value

Returns an object containing a launch method.

launch Method Options

interface LaunchPlaygroundOptions {
  /**
   * Port to start the playground server on
   * @default 5800
   */
  port?: number;

  /**
   * Whether to automatically open the playground in browser
   * @default true
   */
  openBrowser?: boolean;

  /**
   * Custom browser command to open playground
   * @default 'open' on macOS, 'start' on Windows, 'xdg-open' on Linux
   */
  browserCommand?: string;

  /**
   * Whether to show server logs
   * @default true
   */
  verbose?: boolean;

  /**
   * Unique identifier for the playground server instance
   * Same ID shares playground chat history
   * @default undefined (generates random UUID)
   */
  id?: string;
}

launch Method Return Value

interface LaunchPlaygroundResult {
  /**
   * The playground server instance
   */
  server: PlaygroundServer;

  /**
   * The server port
   */
  port: number;

  /**
   * The server host
   */
  host: string;

  /**
   * Function to close the playground
   */
  close: () => Promise<void>;
}

Usage Example

import 'dotenv/config';
import { playgroundForAgent } from '@midscene/playground';
import { SampleDevice } from './sample-device';
import { Agent } from '@midscene/core/agent';

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Create device and agent instances
const device = new SampleDevice();
const agent = new Agent(device);

// Launch playground
const result = await playgroundForAgent(agent).launch({
  port: 5800,
  openBrowser: true,
  verbose: true
});

console.log(`Playground started: http://${result.host}:${result.port}`);

// Close playground when needed
await sleep(10 * 60 * 1000); // Wait 10 minutes
await result.close();
console.log('Playground closed!');

FAQ

Can I use general-purpose LLMs like GPT-4o to control the interface?

No, you cannot use general-purpose LLMs like GPT-4o to control the interface. You must use a model with visual grounding capabilities: such models can locate the target elements on the screen and return their coordinates, which dramatically improves the stability of the automation.

Read this doc to choose a model.

Can my interface controller be listed in this document?

Yes, we are happy to gather creative projects and list them in this document.

Raise an issue when it's ready.