Integrate with any interface

You can use Midscene Agent to control any interface—such as IoT devices, in-house apps, and in-vehicle displays—by implementing a UI operation class that conforms to AbstractInterface.

After implementing the UI operation class, you get the full capabilities of Midscene Agent:

  • the TypeScript GUI Automation Agent SDK, supporting integration with any interface
  • the playground for debugging
  • controlling the interface with YAML scripts
  • an MCP service that exposes UI actions

Demo and community projects

We have prepared a demo project for you to learn how to define your own interface class. It's highly recommended to check it out.

  • Demo Project - A simple demo project that shows how to define your own interface class

  • Android (adb) Agent - The Android (adb) Agent for Midscene, built on top of this feature

  • iOS (WebDriverAgent) Agent - The iOS (WebDriverAgent) Agent for Midscene, built on top of this feature

There are also some community projects that use this feature:

  • midscene-ios - A project driving the OSX "iPhone Mirroring" app with Midscene

Set up API keys for model

Set your model configuration in environment variables. Refer to Model strategy for more details.

export MIDSCENE_MODEL_BASE_URL="https://replace-with-your-model-service-url/v1"
export MIDSCENE_MODEL_API_KEY="replace-with-your-api-key"
export MIDSCENE_MODEL_NAME="replace-with-your-model-name"
export MIDSCENE_MODEL_FAMILY="replace-with-your-model-family"

For more configuration details, please refer to Model strategy and Model configuration.

Implement your own interface class

Key concepts

  • The AbstractInterface class: a predefined abstract class that can connect to the Midscene Agent
  • The action space: the set of actions that can be performed on the interface. It determines how the AI model plans and executes actions

Step 1. Clone and setup from the demo project

We provide a demo project that exercises all the features described in this document. It's the fastest way to get started.

# prepare the environment
git clone https://github.com/web-infra-dev/midscene-example.git
cd midscene-example/custom-interface
npm install
npm run build

# run the demo
npm run demo

Step 2. Implement your interface class

Define a class that extends the AbstractInterface class, and implement the required methods.

You can find the sample implementation in the ./src/sample-device.ts file. Let's take a look at it.

import type { DeviceAction, Size } from '@midscene/core';
import { getMidsceneLocationSchema, z } from '@midscene/core';
import {
  type AbstractInterface,
  defineAction,
  defineActionTap,
  defineActionInput,
  // ... other action imports
} from '@midscene/core/device';

export interface SampleDeviceConfig {
  deviceName?: string;
  width?: number;
  height?: number;
  dpr?: number;
}

/**
 * SampleDevice - A template implementation of AbstractInterface
 */
export class SampleDevice implements AbstractInterface {
  interfaceType = 'sample-device';
  private config: Required<SampleDeviceConfig>;

  constructor(config: SampleDeviceConfig = {}) {
    this.config = {
      deviceName: config.deviceName || 'Sample Device',
      width: config.width || 1920,
      height: config.height || 1080,
      dpr: config.dpr || 1,
    };
  }

  /**
   * Required: Take a screenshot and return base64 string
   */
  async screenshotBase64(): Promise<string> {
    // TODO: Implement actual screenshot capture
    console.log('📸 Taking screenshot...');
    return 'data:image/png;base64,...'; // Your screenshot implementation
  }

  /**
   * Required: Get interface dimensions
   */
  async size(): Promise<Size> {
    return {
      width: this.config.width,
      height: this.config.height,
      dpr: this.config.dpr,
    };
  }

  /**
   * Required: Define available actions for AI model
   */
  actionSpace(): DeviceAction[] {
    return [
      // Basic tap action
      defineActionTap(async (param) => {
        // TODO: Implement tap at param.locate.center coordinates
        await this.performTap(param.locate.center[0], param.locate.center[1]);
      }),

      // Text input action  
      defineActionInput(async (param) => {
        // TODO: Implement text input
        await this.performInput(param.locate.center[0], param.locate.center[1], param.value);
      }),

      // Custom action example
      defineAction({
        name: 'CustomAction',
        description: 'Your custom device-specific action',
        paramSchema: z.object({
          locate: getMidsceneLocationSchema(),
          // ... custom parameters
        }),
        call: async (param) => {
          // TODO: Implement custom action
        },
      }),
    ];
  }

  async destroy(): Promise<void> {
    // TODO: Cleanup resources
  }

  // Private implementation methods
  private async performTap(x: number, y: number): Promise<void> {
    // TODO: Your actual tap implementation
  }

  private async performInput(x: number, y: number, text: string): Promise<void> {
    // TODO: Your actual input implementation  
  }
}

The key methods that you need to implement are:

  • screenshotBase64(), size(): give the AI model the visual and dimensional context of the interface (a minimal sketch follows this list)
  • actionSpace(): an array of DeviceAction objects describing the actions that can be performed on the interface; the AI model plans with these actions. Midscene provides predefined actions for the most common interfaces and devices, and you can also define custom actions.
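
For example, a minimal sketch of the two context methods, assuming a hypothetical capturePng() call that returns a PNG Buffer from your device:

import type { Size } from '@midscene/core';

// Hypothetical capture call; replace with your device's real screenshot API.
async function capturePng(): Promise<Buffer> {
  return Buffer.alloc(0); // placeholder PNG bytes
}

async function screenshotBase64(): Promise<string> {
  const png = await capturePng();
  // Midscene expects a base64 string with the 'data:image/' prefix
  return `data:image/png;base64,${png.toString('base64')}`;
}

async function size(): Promise<Size> {
  // report the logical size and device pixel ratio of the interface
  return { width: 1920, height: 1080, dpr: 1 };
}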

Use these commands to run the agent:

  • npm run build to rebuild the agent
  • npm run demo to run the agent with JavaScript
  • npm run demo:yaml to run the agent with a YAML script

Step 3. Test the agent with the playground

Attach a playground server to the agent, and you can test the agent in the web browser.

import 'dotenv/config'; // read Midscene environment variables from .env file
import { playgroundForAgent } from '@midscene/playground';
import { Agent } from '@midscene/core/agent';
import { SampleDevice } from './sample-device';

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// instantiate device and agent
const device = new SampleDevice();
await device.launch();
const agent = new Agent(device);

// launch playground
const server = await playgroundForAgent(agent).launch();

// close playground
await sleep(10 * 60 * 1000);
await server.close();
console.log('Playground closed!');

Step 4. Test the MCP server

(still in progress)

Step 5. Release the npm package, and let your users use it

The agent and interface class are exported from the ./index.ts file, so you can publish the package to npm.

Fill in the name and version fields in the package.json file, then run:

npm publish

A typical usage of your npm package is like this:

import 'dotenv/config'; // read Midscene environment variables from .env file
import { Agent } from '@midscene/core/agent';
import { SampleDevice } from 'my-pkg-name'; // replace with your published package name

// instantiate device and agent
const device = new SampleDevice();
await device.launch();
const agent = new Agent(device);

await agent.aiAct('click the button');

Step 6. Invoke your class in Midscene CLI and YAML script

Write a yaml script with the interface section to invoke your class.

interface:
  module: 'my-pkg-name'
  # export: 'MyDeviceClass' # use this if this is a named export

config: 
  output: './data.json'

This config works the same as:

import MyDeviceClass from 'my-pkg-name';
const device = new MyDeviceClass();
const agent = new Agent(device, {
  output: './data.json',
});

Other fields are the same as in a standard Midscene YAML script.

API reference

Agent constructor

import { Agent } from '@midscene/core/agent';
import type { AbstractInterface } from '@midscene/core';

Create an agent by pairing your custom AbstractInterface implementation with the standard constructor:

const device = new SampleDevice({
  deviceName: 'Demo Panel',
  width: 1920,
  height: 1080,
});

const agent = new Agent(device, {
  actionContext: 'Dismiss pop-ups before every task.',
  generateReport: true,
  customActions: [], // optional extra DeviceAction list
});
  • device: AbstractInterface (required): Any class that fulfills screenshotBase64, size, and actionSpace. This is where you translate Midscene actions into real I/O calls for your hardware or desktop app.
  • options?: PageAgentOpt: Shares the same option bag as the browser and mobile agents described in the API constructors. Commonly used fields include generateReport, reportFileName, actionContext/aiActionContext, cacheId, modelConfig, createOpenAIClient, customActions, and onTaskStartTip (see the sketch after this list).
  • The resulting agent instantly unlocks the regular automation surfaces: aiAct/aiTap APIs, YAML runner (interface block), playground, MCP server, and reporting pipeline.
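
A hedged sketch combining several of the options listed above (the values are illustrative, not defaults):

import { Agent } from '@midscene/core/agent';
import { SampleDevice } from './sample-device';

const device = new SampleDevice({ deviceName: 'Demo Panel' });

// Option names come from the list above; the values are only examples.
const agent = new Agent(device, {
  generateReport: true,
  reportFileName: 'demo-panel-run',
  actionContext: 'Dismiss pop-ups before every task.',
  cacheId: 'demo-panel', // reuse planning cache across runs
});

await agent.aiAct('open the settings page');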

AbstractInterface class

import { AbstractInterface } from '@midscene/core';

AbstractInterface is the key class for the agent to control the interface.

These are the required methods that you need to implement:

  • interfaceType: string: a name for the interface; it is not provided to the AI model
  • screenshotBase64(): Promise<string>: take a screenshot of the interface and return it as a base64 string with the 'data:image/' prefix
  • size(): Promise<Size>: the size of the interface, an object with width, height, and dpr properties
  • actionSpace(): DeviceAction[] | Promise<DeviceAction[]>: the action space of the interface, an array of DeviceAction objects. Use predefined actions or define any custom action.

Type signatures:

import type { DeviceAction, Size, UIContext } from '@midscene/core';
import type { ElementNode } from '@midscene/shared/extractor';

abstract class AbstractInterface {
  // Required
  abstract interfaceType: string;
  abstract screenshotBase64(): Promise<string>;
  abstract size(): Promise<Size>;
  abstract actionSpace(): DeviceAction[] | Promise<DeviceAction[]>;

  // Optional lifecycle/hooks
  abstract destroy?(): Promise<void>;
  abstract describe?(): string;
  abstract beforeInvokeAction?(actionName: string, param: any): Promise<void>;
  abstract afterInvokeAction?(actionName: string, param: any): Promise<void>;
}

These are the optional methods that you can implement:

  • destroy?(): Promise<void>: destroy the interface and release its resources
  • describe?(): string: a human-readable description of the interface; it may be used in the report and the playground, but is not provided to the AI model
  • beforeInvokeAction?(actionName: string, param: any): Promise<void>: a hook invoked before an action from the action space runs
  • afterInvokeAction?(actionName: string, param: any): Promise<void>: a hook invoked after an action runs (see the sketch below)
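
A minimal sketch of a device class that implements these optional members; the required methods are stubbed for brevity:

import type { DeviceAction, Size } from '@midscene/core';
import type { AbstractInterface } from '@midscene/core/device';

class HookedDevice implements AbstractInterface {
  interfaceType = 'hooked-device';

  async screenshotBase64(): Promise<string> {
    return 'data:image/png;base64,...'; // your capture implementation
  }

  async size(): Promise<Size> {
    return { width: 1280, height: 720, dpr: 1 };
  }

  actionSpace(): DeviceAction[] {
    return []; // fill with predefined or custom actions
  }

  describe(): string {
    // shown in reports and the playground, never sent to the AI model
    return 'Hooked demo device (1280x720)';
  }

  async beforeInvokeAction(actionName: string, param: any): Promise<void> {
    console.log(`about to invoke ${actionName}`, param);
  }

  async afterInvokeAction(actionName: string, param: any): Promise<void> {
    console.log(`finished ${actionName}`);
  }

  async destroy(): Promise<void> {
    // release connections, stop capture loops, etc.
  }
}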

The action space

The action space is the set of actions that can be performed on the interface. The AI model uses these actions to act on the interface, and all of their descriptions and parameter schemas are provided to the model.

To help you define the action space, Midscene provides a set of predefined actions for the most common interfaces and devices, plus a helper to define any custom action.

This is how you can import the utils to define the action space:

import {
  type ActionTapParam,
  defineAction,
  defineActionTap,
} from '@midscene/core/device';

The predefined actions

These are the predefined actions for the most common interfaces and devices. You can expose them on your custom interface by implementing each action's call method; a wiring sketch follows the list below.

You can find the parameters of each action in the type definitions of these functions.

  • defineActionTap(): define the tap action. This is also the function to invoke for the aiTap method.
  • defineActionDoubleClick(): define the double click action
  • defineActionInput(): define the input action. This is also the function to invoke for the aiInput method.
  • defineActionKeyboardPress(): define the keyboard press action. This is also the function to invoke for the aiKeyboardPress method.
  • defineActionScroll(): define the scroll action. This is also the function to invoke for the aiScroll method.
  • defineActionDragAndDrop(): define the drag and drop action
  • defineActionLongPress(): define the long press action
  • defineActionSwipe(): define the swipe action
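
For example, a minimal sketch of wiring the predefined tap action to a hypothetical sendTap() transport call, using the ActionTapParam type from the import shown earlier:

import { type ActionTapParam, defineActionTap } from '@midscene/core/device';

// Hypothetical low-level transport for your device: serial port, vendor SDK, HTTP, etc.
async function sendTap(x: number, y: number): Promise<void> {
  // ...
}

// The AI model resolves the target element; param.locate.center gives the coordinates to act on.
const tapAction = defineActionTap(async (param: ActionTapParam) => {
  const [x, y] = param.locate.center;
  await sendTap(x, y);
});

Add tapAction (and any other wired actions) to the array returned by actionSpace().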

Define a custom action

You can define your own action with the defineAction() function. You can also use it to add extra actions to the PuppeteerAgent, AgentOverChromeBridge, and AndroidAgent.

API Signature:

import { defineAction } from '@midscene/core/device';

defineAction({
  name: string,
  description: string,
  paramSchema: z.ZodType<T>,
  call: (param: z.infer<z.ZodType<T>>) => Promise<void>,
})
  • name: the name of the action; the AI model uses it to invoke the action
  • description: what the action does; the AI model relies on it to understand the action. For complex actions, you can provide a more detailed example here.
  • paramSchema: the Zod schema of the action's parameters; the AI model fills in the parameters according to this schema
  • call: the function that performs the action; the param argument conforms to paramSchema

Example:

defineAction({
  name: 'MyAction',
  description: 'My action',
  paramSchema: z.object({
    name: z.string(),
  }),
  call: async (param) => {
    console.log(param.name);
  },
});

If an action needs the location of an element as a parameter, use the getMidsceneLocationSchema() function to get the dedicated Zod schema for it.

A more complex example of defining a custom action:

import { defineAction, getMidsceneLocationSchema } from '@midscene/core/device';
import { z } from '@midscene/core';

defineAction({
  name: 'LaunchApp',
  description: 'Launch an app on screen',
  paramSchema: z.object({
    name: z.string().describe('The name of the app to launch'),
    locate: getMidsceneLocationSchema().describe('The app icon to be launched'),
  }),
  call: async (param) => {
    console.log(`launching app: ${param.name}, ui located at: ${JSON.stringify(param.locate.center)}`);
  },
});

playgroundForAgent function

import { playgroundForAgent } from '@midscene/playground';

The playgroundForAgent function creates a playground launcher for a specific Agent, allowing you to test and debug your custom interface Agent in a web browser.

Function signature

function playgroundForAgent(agent: Agent): {
  launch(options?: LaunchPlaygroundOptions): Promise<LaunchPlaygroundResult>
}

Parameters

  • agent: Agent: The Agent instance to launch the playground for

Return value

Returns an object containing a launch method.

launch method options

interface LaunchPlaygroundOptions {
  /**
   * Port to start the playground server on
   * @default 5800
   */
  port?: number;

  /**
   * Whether to automatically open the playground in browser
   * @default true
   */
  openBrowser?: boolean;

  /**
   * Custom browser command to open playground
   * @default 'open' on macOS, 'start' on Windows, 'xdg-open' on Linux
   */
  browserCommand?: string;

  /**
   * Whether to show server logs
   * @default true
   */
  verbose?: boolean;

  /**
   * Unique identifier for the playground server instance
   * Same ID shares playground chat history
   * @default undefined (generates random UUID)
   */
  id?: string;
}

launch method return value

interface LaunchPlaygroundResult {
  /**
   * The playground server instance
   */
  server: PlaygroundServer;

  /**
   * The server port
   */
  port: number;

  /**
   * The server host
   */
  host: string;

  /**
   * Function to close the playground
   */
  close: () => Promise<void>;
}

Usage example

import 'dotenv/config';
import { playgroundForAgent } from '@midscene/playground';
import { SampleDevice } from './sample-device';
import { Agent } from '@midscene/core/agent';

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Create device and agent instances
const device = new SampleDevice();
const agent = new Agent(device);

// Launch playground
const result = await playgroundForAgent(agent).launch({
  port: 5800,
  openBrowser: true,
  verbose: true
});

console.log(`Playground started: http://${result.host}:${result.port}`);

// Close playground when needed
await sleep(10 * 60 * 1000); // Wait 10 minutes
await result.close();
console.log('Playground closed!');

FAQ

My interface controller is general-purpose; can it be listed in this document?

Yes, we are happy to gather creative projects and list them in this document.

Raise an issue when it's ready.