Integrate with any interface
You can use Midscene Agent to control any interface—such as IoT devices, in-house apps, and in-vehicle displays—by implementing a UI operation class that conforms to AbstractInterface.
After implementing the UI operation class, you get the full capabilities of Midscene Agent:
- the TypeScript GUI Automation Agent SDK, supporting integration with any interface
- the playground for debugging
- controlling the interface with YAML scripts
- an MCP service that exposes UI actions
Demo and community projects
We have prepared a demo project for you to learn how to define your own interface class. It's highly recommended to check it out.
- Demo Project - a simple demo project that shows how to define your own interface class
- Android (adb) Agent - the Android (adb) Agent for Midscene, which is built on this feature
- iOS (WebDriverAgent) Agent - the iOS (WebDriverAgent) Agent for Midscene, which is built on this feature
There are also some community projects that use this feature:
- midscene-ios - A project driving the OSX "iPhone Mirroring" app with Midscene
Set up API keys for the model
Set your model configuration in environment variables. For details, refer to Model strategy and Model configuration.
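For example, with an OpenAI-compatible model provider (the exact variable names depend on the provider and model you choose; see Model strategy):

```bash
export OPENAI_API_KEY="sk-..."                              # API key of your model provider
export OPENAI_BASE_URL="https://your-provider.example/v1"   # optional, for non-default endpoints
export MIDSCENE_MODEL_NAME="gpt-4o"                         # the model Midscene should call
```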
Implement your own interface class
Key concepts
- The `AbstractInterface` class: a predefined abstract class that connects your implementation to the Midscene Agent
- The action space: the set of actions that can be performed on the interface. It determines how the AI model plans and executes actions
Step 1. Clone and set up the demo project
We provide a demo project that exercises every feature described in this document. It's the fastest way to get started.
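For example (the repository URL is a placeholder; use the Demo Project link above):

```bash
git clone <demo-project-repo-url>
cd <demo-project-dir>
npm install
```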
Step 2. Implement your interface class
Define a class that extends the AbstractInterface class, and implement the required methods.
You can find the sample implementation in the ./src/sample-device.ts file. Let's take a look at it.
The key methods that you need to implement are:
- `screenshotBase64()`, `size()`: help the AI model get the context of the interface
- `actionSpace()`: an array of `DeviceAction` objects defining the actions that can be performed on the interface. The AI model plans and executes tasks using these actions. Midscene provides a set of predefined actions for the most common interfaces and devices, plus a helper to define any custom action.
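Below is a trimmed-down sketch of such a class. It is illustrative only: the import path, the signatures of `defineActionTap` / `defineActionInput`, and the `myDriver` object are assumptions, so copy the exact imports and parameter shapes from ./src/sample-device.ts.

```typescript
import {
  AbstractInterface,
  defineActionTap,
  defineActionInput,
  type DeviceAction,
  type Size,
} from '@midscene/core'; // assumed import path; mirror ./src/sample-device.ts

// Hypothetical low-level driver for your device; replace with your own I/O layer.
declare const myDriver: {
  screenshotPngBase64(): Promise<string>;
  tap(x: number, y: number): Promise<void>;
  typeText(text: string): Promise<void>;
};

export class SampleDevice extends AbstractInterface {
  interfaceType = 'sample-device';

  // Let the AI model see the current screen (base64 with the 'data:image/' prefix).
  async screenshotBase64(): Promise<string> {
    return `data:image/png;base64,${await myDriver.screenshotPngBase64()}`;
  }

  // Report the logical size and device pixel ratio of the screen.
  async size(): Promise<Size> {
    return { width: 1280, height: 800, dpr: 1 };
  }

  // Declare which actions the AI model may plan and execute on this interface.
  actionSpace(): DeviceAction[] {
    return [
      defineActionTap(async (param: any) => {
        // The exact parameter shape (resolved element location) is defined by
        // Midscene; verify the real fields in the demo project.
        const [x, y] = param.locate.center;
        await myDriver.tap(x, y);
      }),
      defineActionInput(async (param: any) => {
        await myDriver.typeText(param.value); // assumed field name
      }),
    ];
  }
}
```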
Use these commands to run the agent:
- `npm run build` to rebuild the agent
- `npm run demo` to run the agent with JavaScript
- `npm run demo:yaml` to run the agent with a YAML script
Step 3. Test the agent with the playground
Attach a playground server to the agent with the playgroundForAgent helper (see the API reference below), and you can test the agent in a web browser.
Step 4. Test the MCP server
(still in progress)
Step 5. Release the npm package, and let your users use it
The agent and interface class are exported from the ./index.ts file, so you can publish the package to npm.
Fill in the name and version in the package.json file, then run the following command:
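Typically:

```bash
npm publish
```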
A typical usage of your npm package is like this:
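A sketch, assuming your package is named my-device-agent and exports the interface class MyDevice together with the Midscene Agent (all names here are hypothetical; use whatever ./index.ts actually exports):

```typescript
// Hypothetical package and export names.
import { Agent, MyDevice } from 'my-device-agent';

const device = new MyDevice(/* device-specific options */);
const agent = new Agent(device);

// Drive the interface with natural-language instructions.
await agent.aiAct('open the settings panel and turn on dark mode');
await agent.aiTap('the confirm button');
```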
Step 6. Invoke your class in Midscene CLI and YAML script
Write a yaml script with the interface section to invoke your class.
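For example (the fields under interface are illustrative; check the demo project's YAML files for the exact schema):

```yaml
interface:
  module: 'my-device-agent' # npm package (or local path) exporting your interface class
  # export: 'MyDevice'      # named export, if it is not the default

tasks:
  - name: demo
    flow:
      - aiAction: open the settings panel and turn on dark mode
      - aiAssert: dark mode is enabled
```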
This config works the same as constructing the agent from your npm package in code (as in the typical usage example above) and running the same flow through the SDK.
Other fields work the same as in a standard Midscene YAML script.
API reference
Agent constructor
Create an agent by pairing your custom AbstractInterface implementation with the standard constructor:
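A sketch (the import path for Agent is an assumption; the demo project shows the real one):

```typescript
import { Agent } from '@midscene/core/agent'; // assumed import path
import { SampleDevice } from './src/sample-device';

const agent = new Agent(new SampleDevice(), {
  generateReport: true,          // write an HTML report after the run
  cacheId: 'sample-device-demo', // reuse planning results across runs
});

await agent.aiAct('type "weather today" into the search box and press Enter');
```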
- `device: AbstractInterface` (required): any class that fulfills `screenshotBase64`, `size`, and `actionSpace`. This is where you translate Midscene actions into real I/O calls for your hardware or desktop app.
- `options?: PageAgentOpt`: shares the same option bag as the browser and mobile agents described in the API constructors. Commonly used fields include `generateReport`, `reportFileName`, `actionContext`/`aiActionContext`, `cacheId`, `modelConfig`, `createOpenAIClient`, `customActions`, and `onTaskStartTip`.
- The resulting agent instantly unlocks the regular automation surfaces: the `aiAct`/`aiTap` APIs, the YAML runner (`interface` block), the playground, the MCP server, and the reporting pipeline.
AbstractInterface class
AbstractInterface is the key class for the agent to control the interface.
These are the required methods that you need to implement:
- `interfaceType: string`: a name for the interface; this will not be provided to the AI model
- `screenshotBase64(): Promise<string>`: take a screenshot of the interface and return it as a base64 string with the `'data:image/'` prefix
- `size(): Promise<Size>`: the size and device pixel ratio of the interface, an object with `width`, `height`, and `dpr` properties
- `actionSpace(): DeviceAction[] | Promise<DeviceAction[]>`: the action space of the interface, an array of `DeviceAction` objects. Use predefined actions or define any custom action.
Type signatures:
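Roughly, paraphrased from the descriptions above (the authoritative definitions live in Midscene's type declarations):

```typescript
abstract class AbstractInterface {
  // A name for the interface; not provided to the AI model.
  abstract interfaceType: string;

  // A screenshot as a base64 string with the 'data:image/' prefix.
  abstract screenshotBase64(): Promise<string>;

  // The size and device pixel ratio of the interface.
  abstract size(): Promise<Size>;

  // The actions that can be performed on the interface.
  abstract actionSpace(): DeviceAction[] | Promise<DeviceAction[]>;
}

interface Size {
  width: number;  // logical width
  height: number; // logical height
  dpr: number;    // device pixel ratio
}
```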
These are the optional methods that you can implement:
- `destroy?(): Promise<void>`: destroy the interface
- `describe?(): string`: describe the interface; this may be used for the report and the playground, but it will not be provided to the AI model
- `beforeInvokeAction?(actionName: string, param: any): Promise<void>`: a hook function called before invoking an action in the action space
- `afterInvokeAction?(actionName: string, param: any): Promise<void>`: a hook function called after invoking an action
The action space
The action space is the set of actions that can be performed on the interface. The AI model uses these actions when planning and executing tasks, and the description and parameter schema of every action are provided to the model.
To help you define the action space easily, Midscene provides a set of predefined actions for the most common interfaces and devices, plus a helper to define any custom action.
This is how you can import the utils to define the action space:
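For example (the import path is an assumption; mirror whatever ./src/sample-device.ts imports):

```typescript
import {
  defineAction,
  defineActionTap,
  defineActionInput,
  defineActionScroll,
  getMidsceneLocationSchema,
} from '@midscene/core'; // assumed import path
```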
The predefined actions
These are the predefined actions for the most common interfaces and devices. You expose them on your custom interface by implementing each action's call method.
You can find the parameters of the actions in the type definition of these functions.
- `defineActionTap()`: define the tap action. This is the action invoked by the `aiTap` method.
- `defineActionDoubleClick()`: define the double click action
- `defineActionInput()`: define the input action. This is the action invoked by the `aiInput` method.
- `defineActionKeyboardPress()`: define the keyboard press action. This is the action invoked by the `aiKeyboardPress` method.
- `defineActionScroll()`: define the scroll action. This is the action invoked by the `aiScroll` method.
- `defineActionDragAndDrop()`: define the drag and drop action
- `defineActionLongPress()`: define the long press action
- `defineActionSwipe()`: define the swipe action
Define a custom action
You can define your own action by using the defineAction() function. You can also use this method to define more actions for the PuppeteerAgent, AgentOverChromeBridge, and AndroidAgent.
API Signature:
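In outline (a paraphrase of the parameters described below, not the verbatim declaration):

```typescript
import type { z } from 'zod';
import type { DeviceAction } from '@midscene/core'; // assumed import path

declare function defineAction<Param>(options: {
  name: string;                          // how the AI model refers to the action
  description: string;                   // what the action does; add examples for complex actions
  paramSchema: z.ZodType<Param>;         // Zod schema of the parameters the model fills in
  call: (param: Param) => Promise<void>; // your implementation
}): DeviceAction<Param>;
```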
- `name`: the name of the action; the AI model uses this name to invoke the action
- `description`: the description of the action; the AI model uses it to understand what the action does. For complex actions, you can provide a more detailed example here.
- `paramSchema`: the Zod schema of the action's parameters; the AI model fills in the parameters according to this schema
- `call`: the function that performs the action; it receives a `param` argument that conforms to `paramSchema`
Example:
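A minimal sketch of a custom action (the import path and the myDriver object are hypothetical):

```typescript
import { z } from 'zod';
import { defineAction } from '@midscene/core'; // assumed import path

// Hypothetical device driver.
declare const myDriver: { setVolume(percent: number): Promise<void> };

const setVolumeAction = defineAction({
  name: 'SetVolume',
  description: 'Set the speaker volume of the device to a percentage between 0 and 100',
  paramSchema: z.object({
    volume: z.number().min(0).max(100).describe('Target volume in percent'),
  }),
  call: async (param) => {
    await myDriver.setVolume(param.volume);
  },
});

// Return it from actionSpace() alongside the predefined actions.
```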
If an action needs a parameter that locates an element, use the getMidsceneLocationSchema() function to get the corresponding Zod schema.
A more complex example about defining a custom action:
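A sketch of an action that asks the model to locate an element first. The shape of the resolved location passed to call is an assumption; verify it against the demo project.

```typescript
import { z } from 'zod';
import { defineAction, getMidsceneLocationSchema } from '@midscene/core'; // assumed import path

// Hypothetical device driver.
declare const myDriver: { tap(x: number, y: number): Promise<void> };

const doubleConfirmAction = defineAction({
  name: 'DoubleConfirm',
  description:
    'Tap the same element twice with a short pause, e.g. for controls that need a second confirmation tap',
  paramSchema: z.object({
    target: getMidsceneLocationSchema().describe('The element to tap'),
  }),
  call: async (param: any) => {
    // Midscene resolves the located element before `call` runs;
    // `center` below is an assumed field name.
    const [x, y] = param.target.center;
    await myDriver.tap(x, y);
    await new Promise((resolve) => setTimeout(resolve, 300));
    await myDriver.tap(x, y);
  },
});
```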
playgroundForAgent function
The playgroundForAgent function creates a playground launcher for a specific Agent, allowing you to test and debug your custom interface Agent in a web browser.
Function signature
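In outline (paraphrased from the parameter and return-value descriptions below; the option and return types are intentionally left vague):

```typescript
declare function playgroundForAgent(agent: Agent): {
  // Starts the playground server; see the launch method sections below.
  launch(options?: Record<string, unknown>): Promise<unknown>;
};
```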
Parameters
- `agent: Agent`: the Agent instance to launch the playground for
Return value
Returns an object containing a `launch` method.
launch method options
launch method return value
Usage example
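A sketch (import paths assumed; the agent is the one built from your custom interface):

```typescript
import { playgroundForAgent } from '@midscene/playground'; // assumed import path
import { Agent } from '@midscene/core/agent';              // assumed import path
import { SampleDevice } from './src/sample-device';

const agent = new Agent(new SampleDevice());

// Start a local playground server so the agent can be tested in a web browser.
const playground = playgroundForAgent(agent);
await playground.launch();
```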
FAQ
My interface-controller is general-purpose; can it be included in this document?
Yes, we are happy to gather creative projects and list them in this document.
Open an issue to let us know when it's ready.

