PC Desktop Getting Started

This guide walks you through everything required to automate PC desktop applications with Midscene: install dependencies, configure model credentials, and run your first JavaScript script.

Demo Projects

Control PC desktop with JavaScript: https://github.com/web-infra-dev/midscene-example/tree/main/computer/javascript-sdk-demo

Integrate Vitest for testing: https://github.com/web-infra-dev/midscene-example/tree/main/computer/vitest-demo

Set up API keys for model

Set your model configuration through environment variables:

export MIDSCENE_MODEL_BASE_URL="https://replace-with-your-model-service-url/v1"
export MIDSCENE_MODEL_API_KEY="replace-with-your-api-key"
export MIDSCENE_MODEL_NAME="replace-with-your-model-name"
export MIDSCENE_MODEL_FAMILY="replace-with-your-model-family"

For more configuration details, please refer to Model strategy and Model configuration.
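Before running a script, it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of Midscene: the helper name `missingModelEnv` is hypothetical, and treating all four variables as mandatory is an assumption made here for illustration.

```typescript
// Names taken from the export lines above. Treating all four as mandatory
// is an assumption of this sketch, not a documented rule.
const MODEL_ENV_NAMES = [
  'MIDSCENE_MODEL_BASE_URL',
  'MIDSCENE_MODEL_API_KEY',
  'MIDSCENE_MODEL_NAME',
  'MIDSCENE_MODEL_FAMILY',
] as const;

// Return the names that are unset, blank, or still hold a placeholder value.
function missingModelEnv(env: Record<string, string | undefined>): string[] {
  return MODEL_ENV_NAMES.filter((name) => {
    const value = (env[name] ?? '').trim();
    return value === '' || value.includes('replace-with-your');
  });
}
```

In a script you would likely call `missingModelEnv(process.env)` and stop early with a clear message if the returned list is not empty.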

System Requirements

Node.js

Node.js 18.19.0 or higher is required.
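If you want a script to fail fast on an older runtime, a version comparison along these lines is one option. This is an illustrative sketch; `parseVersion` and `isAtLeast` are hypothetical helpers, not Midscene APIs.

```typescript
// Parse "v18.19.0" or "18.19.0" into [major, minor, patch].
// Non-numeric parts are treated as 0.
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/, '')
    .split('.')
    .slice(0, 3)
    .map((part) => parseInt(part, 10) || 0);
}

// True when `current` is the same as or newer than `minimum`.
function isAtLeast(current: string, minimum: string): boolean {
  const a = parseVersion(current);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return true;
}
```

A typical call would be `isAtLeast(process.version, '18.19.0')`, since `process.version` carries the leading `v` that `parseVersion` strips.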

Platform-Specific Dependencies

macOS: Accessibility permissions are required for keyboard and mouse control. When you run the script for the first time, macOS will prompt you to grant access. Go to System Settings > Privacy & Security > Accessibility and enable permissions for the application running your script (e.g., Terminal, iTerm2, VS Code, WebStorm, or other IDEs). For more details, see nut.js macOS setup.

Linux: ImageMagick is required for screenshot functionality.

Headless Linux (CI): To run desktop automation on a headless Linux server (e.g. GitHub Actions), install Xvfb and its dependencies, then enable headless mode:

# Install dependencies
sudo apt-get install -y xvfb x11-xserver-utils imagemagick

// Option 1: Pass the headless option
import { agentFromComputer } from '@midscene/computer';

const agent = await agentFromComputer({ headless: true });

// Option 2: Set environment variable
// MIDSCENE_COMPUTER_HEADLESS_LINUX=true npx tsx example.ts

Xvfb creates a virtual display so that mouse, keyboard, and screenshot operations work without a physical monitor. See API Reference for details.
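When the same script runs both on a developer machine and in CI, you may prefer to compute the `headless` value instead of hard-coding it. The sketch below uses a heuristic that is an assumption, not documented Midscene behavior: on Linux, a missing `DISPLAY` variable is taken to mean that no X server is available. The helper name `shouldRunHeadless` is hypothetical.

```typescript
// Decide whether to request headless mode.
// Heuristic (an assumption of this sketch): on Linux, a missing DISPLAY
// variable usually means no X server is running.
function shouldRunHeadless(
  platform: string,
  env: Record<string, string | undefined>,
): boolean {
  // Headless mode is described above for Linux only.
  if (platform !== 'linux') return false;
  // Respect the explicit opt-in variable from Option 2.
  if (env['MIDSCENE_COMPUTER_HEADLESS_LINUX'] === 'true') return true;
  return !env['DISPLAY'];
}
```

The result could then be passed as `agentFromComputer({ headless: shouldRunHeadless(process.platform, process.env) })`.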

Example: Testing Electron Apps on Headless Linux CI

A complete demo of testing Obsidian (an Electron app) on headless Linux CI with @midscene/computer: https://github.com/web-infra-dev/midscene-example/tree/main/computer/electron-demo

Try Playground (no code)

Playground is the fastest way to validate the connection and observe AI-driven steps without writing code. It shares the same core as @midscene/computer, so anything that works here will behave the same once scripted.

Launch the Playground CLI:

npx --yes @midscene/computer-playground

Click the gear icon in the Playground window, then paste your API key configuration. Refer back to Model configuration if you still need credentials.

Start experiencing

After configuration, you can start using Midscene right away. It provides several key operation tabs:

Act: interact with the interface. This tab uses Auto Planning and corresponds to aiAct. For example:

Type “Midscene” in the search box, run the search, and open the first result

Fill out the registration form and make sure every field passes validation

Query: extract JSON data from the interface, corresponding to aiQuery.

Similar methods include aiBoolean(), aiNumber(), and aiString() for directly extracting booleans, numbers, and strings.

Extract the user ID from the page and return JSON data in the { id: string } structure

Assert: check a condition against what is currently on screen and throw an error if it is not met. Corresponds to aiAssert.

There is a login button on the page, with a user agreement link below it

Tap: click on an element. This is Instant Action, corresponding to aiTap.

Click the login button

For the difference between Auto Planning and Instant Action, see the API document.
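To make the tab-to-method mapping concrete, here is a rough sketch that replays one prompt per tab against an agent. `TabAgent` is a hypothetical, simplified stand-in declared only so the example is self-contained; the real agent's signatures may accept more options and return richer types.

```typescript
// Minimal structural stand-in for the agent. Only the four methods named in
// the tab list are declared, and the signatures are simplified assumptions.
interface TabAgent {
  aiAct(prompt: string): Promise<unknown>;
  aiQuery(prompt: string): Promise<unknown>;
  aiAssert(assertion: string): Promise<unknown>;
  aiTap(target: string): Promise<unknown>;
}

// Replay one prompt per Playground tab, in the order the tabs are listed.
async function replayTabs(agent: TabAgent): Promise<unknown> {
  // Act tab (Auto Planning)
  await agent.aiAct(
    'Type "Midscene" in the search box, run the search, and open the first result',
  );
  // Query tab: returns the extracted data
  const user = await agent.aiQuery(
    'Extract the user ID from the page and return JSON data in the { id: string } structure',
  );
  // Assert tab: expected to throw when the condition is not met
  await agent.aiAssert(
    'There is a login button on the page, with a user agreement link below it',
  );
  // Tap tab (Instant Action)
  await agent.aiTap('the login button');
  return user;
}
```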

Integration with Midscene Agent

Once Playground works, move to a repeatable script with the JavaScript SDK.

Step 1. Install dependencies

npm install @midscene/computer

Step 2. Write your first script

Create example.ts:

import { agentFromComputer } from '@midscene/computer';

(async () => {
  // Create an agent
  const agent = await agentFromComputer({
    aiActionContext: 'You are controlling a desktop computer.',
  });

  // Take a screenshot and query information
  const screenInfo = await agent.aiQuery(
    '{width: number, height: number}, get screen resolution'
  );
  console.log('Screen resolution:', screenInfo);

  // Move mouse to center
  await agent.aiAct('move mouse to center of screen');

  // Assert screen has content
  await agent.aiAssert('The screen has visible content');

  console.log('Desktop automation completed!');
})();

Step 3. Run the script

npx tsx example.ts

Multi-Display Support

If you have multiple displays, you can control a specific one:

import { ComputerDevice, agentFromComputer } from '@midscene/computer';

// List all displays
const displays = await ComputerDevice.listDisplays();
console.log('Available displays:', displays);

// Connect to a specific display
const agent = await agentFromComputer({
  displayId: displays[0].id,
});
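Indexing `displays[0]` directly fails with an unhelpful error if the list is empty. A small guard like the one below gives a clearer message. This is a sketch: `pickDisplay` is a hypothetical helper, and the `DisplayLike` shape only assumes the `id` field used above (its type is a guess).

```typescript
// Only the `id` field is used in the snippet above; its type is assumed here,
// and any other fields on a display entry are left untouched.
interface DisplayLike {
  id: string | number;
}

// Return the display at `index`, or throw a descriptive error.
function pickDisplay<T extends DisplayLike>(displays: T[], index = 0): T {
  if (displays.length === 0) {
    throw new Error('No displays were reported');
  }
  const chosen = displays[index];
  if (chosen === undefined) {
    throw new Error(
      `Display index ${index} is out of range (found ${displays.length})`,
    );
  }
  return chosen;
}
```

With this helper, the connection step would read `displayId: pickDisplay(displays, 1).id` to target the second display.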

Example Usage

Basic Mouse Operations

// Click at center of screen
await agent.aiAct('click mouse at center of screen');

// Move mouse to a specific location
await agent.aiAct('move mouse to top-left corner');

// Double-click
await agent.aiAct('double-click on the desktop icon');

// Right-click
await agent.aiAct('right-click to open context menu');

Keyboard Operations

// Type text
await agent.aiAct('type "Hello World"');

// Press keyboard shortcuts
if (process.platform === 'darwin') {
  await agent.aiAct('press Cmd+Space to open Spotlight');
  await agent.aiAct('type "Calculator" and press Enter');
} else {
  await agent.aiAct('press Windows key');
  await agent.aiAct('type "Calculator" and press Enter');
}
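If several scripts need the same platform branch, the prompts can be factored into a pure helper and passed to `aiAct` one by one. The function names below are hypothetical. The non-macOS branch mirrors the `else` case above, so Linux receives the same "Windows key" prompt; whether that opens a launcher depends on the desktop environment.

```typescript
// Return the prompt that opens the system launcher on a given platform.
// Anything other than macOS gets the "Windows key" prompt, as in the
// if/else above.
function launcherPrompt(platform: string): string {
  return platform === 'darwin'
    ? 'press Cmd+Space to open Spotlight'
    : 'press Windows key';
}

// Build the two prompts needed to open an application by name.
function openAppPrompts(platform: string, appName: string): string[] {
  return [launcherPrompt(platform), `type "${appName}" and press Enter`];
}
```

A caller would then loop over `openAppPrompts(process.platform, 'Calculator')` and await `agent.aiAct(prompt)` for each entry.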

// Press individual keys
await agent.aiAct('press Escape');
await agent.aiAct('press Enter');

Query Information

// Extract screen information
const info = await agent.aiQuery(
  '{hasDesktop: boolean, visibleApps: string[]}, check if desktop is visible and list visible apps'
);

// Locate elements
const position = await agent.aiLocate('the File menu');
console.log('File menu position:', position);
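Because the query result comes from a model, you may want to validate its shape before relying on it. The type guard below is a sketch for the `{hasDesktop: boolean, visibleApps: string[]}` structure requested above; `DesktopInfo` and `isDesktopInfo` are hypothetical names, not part of the SDK.

```typescript
// The structure requested in the aiQuery prompt above.
interface DesktopInfo {
  hasDesktop: boolean;
  visibleApps: string[];
}

// Runtime check that an unknown value matches DesktopInfo.
function isDesktopInfo(value: unknown): value is DesktopInfo {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const apps = candidate['visibleApps'];
  return (
    typeof candidate['hasDesktop'] === 'boolean' &&
    Array.isArray(apps) &&
    apps.every((app) => typeof app === 'string')
  );
}
```

After `const info = await agent.aiQuery(...)`, a check such as `if (!isDesktopInfo(info)) throw new Error('Unexpected query result')` narrows the type for the rest of the script.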

Complex Workflows

// Open an application and interact with it (macOS example)
await agent.aiAct('open Finder');
await agent.aiWaitFor('Finder window is visible');

await agent.aiAct('click on Documents folder');
// In Finder, Cmd+Shift+N creates a new folder (Cmd+N opens a new window)
await agent.aiAct('press Cmd+Shift+N to create new folder');
await agent.aiAct('type "My Project"');
await agent.aiAct('press Enter');

await agent.aiAssert('A folder named "My Project" exists');
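For longer workflows, it can be useful to know which step failed. The sketch below runs a list of prompts in order and reports the first failure. `ActAgent`, `StepResult`, and `runSteps` are hypothetical names, and the interface only assumes that `aiAct` takes a prompt and returns a promise that rejects on failure.

```typescript
// Minimal structural type: anything with an aiAct(prompt) method fits.
interface ActAgent {
  aiAct(prompt: string): Promise<unknown>;
}

interface StepResult {
  completed: number; // number of steps that finished successfully
  failedStep?: string; // prompt of the first failing step, if any
  error?: string; // message from that failure
}

// Run prompts sequentially and stop at the first one that rejects.
async function runSteps(agent: ActAgent, steps: string[]): Promise<StepResult> {
  let completed = 0;
  for (const step of steps) {
    try {
      await agent.aiAct(step);
      completed += 1;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { completed, failedStep: step, error: message };
    }
  }
  return { completed };
}
```

The folder workflow above could be passed as `runSteps(agent, ['open Finder', 'click on Documents folder', ...])`, with `aiWaitFor` and `aiAssert` calls kept outside the list.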

Environment Check

You can check if your system is properly configured:

import { checkComputerEnvironment } from '@midscene/computer';

const env = await checkComputerEnvironment();
console.log('Platform:', env.platform);
console.log('Available:', env.available);
console.log('Displays:', env.displays);

if (!env.available) {
  console.error('Environment not available:', env.error);
}
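In CI you may want the check to stop the run instead of only logging. The helper below is a sketch of that idea. `EnvironmentReport` lists just the fields read in the snippet above, with assumed types, and `requireAvailable` is a hypothetical name.

```typescript
// Shape inferred from the fields read in the snippet above; the field types
// are assumptions of this sketch.
interface EnvironmentReport {
  platform: string;
  available: boolean;
  displays?: unknown;
  error?: string;
}

// Throw with a readable message when the environment is not usable,
// otherwise return a one-line summary.
function requireAvailable(report: EnvironmentReport): string {
  if (!report.available) {
    const reason = report.error ?? 'no reason given';
    throw new Error(
      `Desktop automation unavailable on ${report.platform}: ${reason}`,
    );
  }
  return `Desktop automation available on ${report.platform}`;
}
```

Calling `requireAvailable(await checkComputerEnvironment())` at the top of a script would abort before any agent is created if the environment check fails.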
