There are multiple agents in Midscene, each with its own constructor.
Here are the common options for all agents:
generateReport: boolean
: If true, the agent will generate a report file. Default is true.
autoPrintReportMsg: boolean
: If true, the agent will print the report message. Default is true.
cacheId: string | undefined
: If set, the agent will use this cacheId to match the cache. Default is undefined.
The Puppeteer agent also has an extra option:
trackingActiveTab
: If true, the agent will track the newly opened tab. Default is false.
These are the main methods on all kinds of agents in Midscene.
In the following documentation, you may see functions called with the mid. prefix. If you use destructuring in Playwright, like async ({ ai, aiQuery }) => { /* ... */ }, you can call the functions without this prefix. It's just a matter of syntax.
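To make the "matter of syntax" point concrete, here is a minimal sketch of the two call styles. The `fixtures` object and its functions are hypothetical stand-ins that only record prompts; in a real Playwright test the functions would be supplied by Midscene.

```typescript
// Hypothetical stand-ins for the Midscene functions; they only record prompts.
const recordedCalls: string[] = [];

const fixtures = {
  ai: async (steps: string): Promise<void> => {
    recordedCalls.push(`ai: ${steps}`);
  },
  aiQuery: async (dataDemand: string): Promise<string[]> => {
    recordedCalls.push(`aiQuery: ${dataDemand}`);
    return [];
  },
};

// Style 1: call through an object, as the examples in this document do.
async function withPrefix(mid: typeof fixtures): Promise<void> {
  await mid.ai('Click the "Login" button');
}

// Style 2: destructure the same functions, as in a Playwright test callback.
async function withDestructuring({ ai }: typeof fixtures): Promise<void> {
  await ai('Click the "Login" button');
}

async function demoCallStyles(): Promise<string[]> {
  await withPrefix(fixtures);
  await withDestructuring(fixtures);
  return recordedCalls;
}
```

Both styles end up invoking the same function with the same prompt; only the way the function is referenced differs.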
.aiAction(steps: string) or .ai(steps: string) - Interact with the page
You can use .aiAction to perform a series of actions. It accepts a steps: string as a parameter, which describes the actions. In the prompt, you should clearly describe the steps. Midscene will take care of the rest.
.ai is the shortcut for .aiAction.
These are some good samples:
Steps should always be clearly and thoroughly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.
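As a rough illustration of the level of detail meant here, the sketch below contrasts the too-brief prompt with a step-by-step version. `RecordingAgent` is a local stand-in that only records prompts (a real agent would drive the page), and the page elements named in the prompts are made up.

```typescript
// Local stand-in for an agent: it records prompts instead of driving a page.
class RecordingAgent {
  readonly prompts: string[] = [];

  async aiAction(steps: string): Promise<void> {
    this.prompts.push(steps);
  }

  // .ai is the shortcut for .aiAction
  async ai(steps: string): Promise<void> {
    return this.aiAction(steps);
  }
}

const examplePrompts = {
  // Too brief: every intermediate step is left for the AI to guess.
  tooBrief: 'Tweet "Hello World"',
  // Thorough: each step names the element to act on and what to do with it.
  thorough: [
    'Click the "Post" button in the left sidebar to open the editor',
    'Type "Hello World" in the text box of the editor',
    'Click the "Post" button at the bottom right of the editor to publish',
  ],
};

async function postTweet(mid: RecordingAgent): Promise<void> {
  for (const steps of examplePrompts.thorough) {
    await mid.ai(steps);
  }
}
```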
Under the hood, Midscene will plan the detailed steps by sending your page context and a screenshot to the AI. After that, Midscene will execute the steps one by one. If Midscene deems it impossible to execute, an error will be thrown.
The main capabilities of Midscene are as follows, and your task will be split into these types. You can see them in the visualized report:
Currently, Midscene can't plan steps that include conditions and loops.
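One possible way to work within this limit is to keep the condition or loop in ordinary code and hand Midscene one plain instruction at a time. The sketch below assumes this approach; `FakeAgent` is a local stand-in with canned data, not a Midscene class, and the task names are invented.

```typescript
// The two documented methods used below.
interface AgentLike {
  aiQuery(dataDemand: string): Promise<unknown>;
  aiAction(steps: string): Promise<void>;
}

// Local stand-in: returns canned data and records actions.
class FakeAgent implements AgentLike {
  readonly actions: string[] = [];
  private readonly taskNames: string[];

  constructor(taskNames: string[]) {
    this.taskNames = taskNames;
  }

  async aiQuery(_dataDemand: string): Promise<unknown> {
    return this.taskNames; // canned answer instead of a real AI call
  }

  async aiAction(steps: string): Promise<void> {
    this.actions.push(steps);
  }
}

// The loop and the condition live in TypeScript; each prompt is one unconditional step.
async function deleteDraftTasks(mid: AgentLike): Promise<number> {
  const names = (await mid.aiQuery('string[], task names in the list')) as string[];
  let deleted = 0;
  for (const taskName of names) {
    if (taskName.startsWith('Draft')) {
      await mid.aiAction(`Hover over the task "${taskName}" and click its Delete button`);
      deleted += 1;
    }
  }
  return deleted;
}
```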
.aiQuery(dataDemand: any) - extract any data from page
You can extract customized data from the UI. Provided that the multi-modal AI can perform inference, it can return both data directly written on the page and any data based on "understanding". The return value is in JSON format, so it should be valid primitive types, like String, Number, JSON, Array, etc. Just describe it in the dataDemand.
For example, to parse detailed information from the page:
You can also describe the expected return value format as a plain string:
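A sketch of both forms follows: an object whose values describe each field, and a plain string that states the return type. The `aiQuery` function here is a local stand-in returning canned data so the snippet is self-contained; the field names and values are illustrative only.

```typescript
// Shape we ask for; the descriptions in the demand mirror these types.
interface PageInfo {
  time: string;
  userInfo: { name: string };
  tableFields: string[];
}

// Stand-in for mid.aiQuery: returns canned data instead of calling an AI model.
async function aiQuery(dataDemand: unknown): Promise<unknown> {
  if (typeof dataDemand === 'string') {
    return ['Learn JS today', 'Learn Rust tomorrow'];
  }
  return {
    time: '2024-01-01 10:00',
    userInfo: { name: 'Alice' },
    tableFields: ['id', 'title'],
  };
}

async function queryExamples(): Promise<{ info: PageInfo; tasks: string[] }> {
  // Object form: each value describes the data and its type in natural language.
  const info = (await aiQuery({
    time: 'date and time, string',
    userInfo: 'user info, {name: string}',
    tableFields: 'field names of table, string[]',
  })) as PageInfo;

  // String form: describe the expected return format directly.
  const tasks = (await aiQuery('string[], task names in the list')) as string[];

  return { info, tasks };
}
```

Because the result comes back as untyped JSON, the casts above are only as reliable as the description in the demand; validating the shape before use is a reasonable precaution.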
.aiAssert(assertion: string, errorMsg?: string) - do an assertion
.aiAssert works just like the normal assert method, except that the condition is a prompt string written in natural language. Midscene will call AI to determine if the assertion is true. If the condition is not met, an error will be thrown containing errorMsg and a detailed reason generated by AI.
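The sketch below shows the call shape and how a failed assertion surfaces to the caller. `aiAssertStub` is a local stand-in whose verdict and reason are hard-coded; the exact layout of the error message thrown by Midscene is an assumption, as are the item name and prices.

```typescript
// Stand-in for mid.aiAssert: the "AI verdict" is passed in for the demo.
async function aiAssertStub(
  assertion: string,
  errorMsg: string | undefined,
  verdict: { pass: boolean; reason: string },
): Promise<void> {
  if (!verdict.pass) {
    // Assumed layout: the caller's errorMsg plus the AI-generated reason.
    throw new Error(
      `${errorMsg ?? 'Assertion failed'}: ${assertion}\nReason: ${verdict.reason}`,
    );
  }
}

async function checkPrice(): Promise<string> {
  try {
    await aiAssertStub(
      'The price of "Sauce Labs Onesie" is 7.99',
      'Unexpected price for the onesie',
      { pass: false, reason: 'The page shows 9.99 for this item' },
    );
    return 'passed';
  } catch (err) {
    // Both the custom errorMsg and the reason are available for logging.
    return (err as Error).message;
  }
}
```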
Assertions are usually a very important part of your script. To guard against AI hallucinations (especially in the false negative situation), you can also replace .aiAssert calls with .aiQuery plus normal JavaScript assertions.
For example, to replace the previous assertion:
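A sketch of that pattern: extract structured data with `.aiQuery`, then check it with ordinary code so the pass/fail decision is deterministic. The agent is a local stand-in with canned data, `assertEqual` is a hypothetical helper used in place of a test framework, and the item names and prices are made up.

```typescript
interface Item {
  name: string;
  price: number;
}

// Stand-in for an agent's aiQuery; a real call would read the items off the page.
const shopAgent = {
  async aiQuery(_dataDemand: string): Promise<unknown> {
    return [
      { name: 'Sauce Labs Onesie', price: 7.99 },
      { name: 'Sauce Labs Backpack', price: 29.99 },
    ];
  },
};

// Plain assertion helper so the example does not depend on a test framework.
function assertEqual<T>(actual: T, expected: T, message: string): void {
  if (actual !== expected) {
    throw new Error(`${message}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

async function assertOnesiePrice(): Promise<number> {
  const items = (await shopAgent.aiQuery(
    '{name: string, price: number}[], name and price of each item',
  )) as Item[];

  const onesie = items.find((item) => item.name === 'Sauce Labs Onesie');
  if (onesie === undefined) {
    throw new Error('Item "Sauce Labs Onesie" not found');
  }
  assertEqual(onesie.price, 7.99, 'Unexpected onesie price');
  return onesie.price;
}
```

The AI is only trusted to read the data; the comparison itself is done in code, which narrows the room for a hallucinated verdict.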
.aiWaitFor(assertion: string, {timeoutMs?: number, checkIntervalMs?: number }) - wait until the assertion is met
.aiWaitFor waits until your assertion is met or a timeout error occurs. To limit the AI service cost, consecutive checks are spaced at least checkIntervalMs milliseconds apart. The default config sets timeoutMs to 15 seconds and checkIntervalMs to 3 seconds, i.e. it checks at most 5 times if all assertions fail and the AI service always responds immediately.
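The "at most 5 times" figure can be checked with a small model of the polling loop. This is a simplified sketch, not Midscene's implementation: it assumes each check returns instantly, that checks start exactly `checkIntervalMs` apart, and that no check starts once `timeoutMs` has elapsed. The function name `countChecks` is hypothetical.

```typescript
// Counts how many checks fit before the timeout when every check fails.
function countChecks(timeoutMs: number, checkIntervalMs: number): number {
  let elapsedMs = 0;
  let checks = 0;
  while (elapsedMs < timeoutMs) {
    checks += 1;                  // one AI call that reports "not met yet"
    elapsedMs += checkIntervalMs; // wait before the next check
  }
  return checks;
}

// Defaults from the text: 15 s timeout, 3 s interval.
// Checks start at 0, 3, 6, 9 and 12 s, so the result is 5.
const defaultChecks = countChecks(15000, 3000);

// A call with explicit options would look like this (mid is an agent instance):
// await mid.aiWaitFor('The result list is visible', { timeoutMs: 15000, checkIntervalMs: 3000 });
```

In practice each AI call takes time, so the real number of checks within the timeout is likely to be lower than this upper bound.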
Given the time each AI service call takes, .aiWaitFor may not be very efficient. A simple sleep may be a useful alternative to .aiWaitFor.
.reportFile
The report file path of the agent.
You can set environment variables at runtime by calling the overrideAIConfig method.
By setting MIDSCENE_DEBUG_AI_PROFILE, you can take a look at the time and token consumption of AI calls.
LangSmith is a platform designed for debugging LLMs. To integrate LangSmith, follow these steps:
After you launch Midscene, you should see logs like this: