# Customize Model and Provider
Midscene uses the OpenAI SDK to call AI services. You can customize the configuration using environment variables. All the configs can also be used in the Chrome Extension.
These are the main configs, among which `OPENAI_API_KEY` is required.

Required:
```bash
# replace with your own key
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```
Optional configs:
```bash
# if you want to use a customized endpoint
export OPENAI_BASE_URL="https://..."

# if you want to use Azure OpenAI Service
export OPENAI_USE_AZURE="true"

# if you want to specify a model name other than gpt-4o
export MIDSCENE_MODEL_NAME='qwen-vl-max-latest'

# if you want to pass customized JSON data to the `init` process of the OpenAI SDK
export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"key": "value"}}'

# if you want to use a proxy. Midscene uses `socks-proxy-agent` under the hood.
export MIDSCENE_OPENAI_SOCKS_PROXY="socks5://127.0.0.1:1080"
```
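To show how the optional pieces combine, here is a hypothetical setup: an internal OpenAI-compatible gateway that expects an extra header and is only reachable through a local SOCKS5 proxy. The hostname and header name below are made up for illustration:

```bash
# hypothetical internal gateway; replace with your own endpoint
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://llm-gateway.internal.example.com/v1"
# extra header forwarded to the OpenAI SDK constructor (hypothetical header name)
export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"defaultHeaders":{"X-Team":"web-qa"}}'
# route all requests through a local SOCKS5 proxy
export MIDSCENE_OPENAI_SOCKS_PROXY="socks5://127.0.0.1:1080"
```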
## Using Azure OpenAI Service
```bash
export MIDSCENE_USE_AZURE_OPENAI=1
export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-preview", "endpoint": "...", "deployment": "..."}'
```
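A filled-in sketch with hypothetical values; `my-resource` and `my-gpt-4o-deployment` stand in for your own Azure resource name and deployment name:

```bash
# hypothetical resource and deployment names; replace with your own
export MIDSCENE_USE_AZURE_OPENAI=1
export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
export MIDSCENE_AZURE_OPENAI_INIT_CONFIG_JSON='{"apiVersion": "2024-11-01-preview", "endpoint": "https://my-resource.openai.azure.com/", "deployment": "my-gpt-4o-deployment"}'
```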
## Choose a model other than `gpt-4o`
We find that `gpt-4o` performs the best for Midscene at this moment. The other known supported models are `claude-3-opus-20240229`, `gemini-1.5-pro`, `qwen-vl-max-latest`, and `doubao-vision-pro-32k`.
If you want to use other models, please follow these steps:
- Choose a model that supports image input (a.k.a. multimodal model).
- Find out how to call it with an OpenAI SDK compatible endpoint. Usually you should set the `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `MIDSCENE_MODEL_NAME`. A quick way to verify the endpoint by hand is sketched after this list.
- If you find it not working well after changing the model, try using short and clear prompts (or roll back to the previous model). See more details in Prompting Tips.
- Remember to follow the terms of use of each model.
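Before wiring a new model into Midscene, it can save time to confirm by hand that the endpoint accepts image input. Here is a minimal sketch assuming the provider follows the standard OpenAI chat completions format for multimodal messages; the image URL is a placeholder:

```bash
# expects OPENAI_BASE_URL, OPENAI_API_KEY and MIDSCENE_MODEL_NAME to be set
curl "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MIDSCENE_MODEL_NAME"'",
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image in one sentence." },
        { "type": "image_url", "image_url": { "url": "https://example.com/screenshot.png" } }
      ]
    }]
  }'
```

If this request fails, Midscene will fail against the same endpoint too, so it is worth fixing at this level first.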
## Example: Using `claude-3-opus-20240229` from Anthropic
When configuring `MIDSCENE_USE_ANTHROPIC_SDK=1`, Midscene will use the Anthropic SDK (`@anthropic-ai/sdk`) to call the model.
Configure the environment variables:
```bash
export MIDSCENE_USE_ANTHROPIC_SDK=1
export ANTHROPIC_API_KEY="....."
export MIDSCENE_MODEL_NAME="claude-3-opus-20240229"
```
## Example: Using `gemini-1.5-pro` from Google
Configure the environment variables:
export OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai"
export OPENAI_API_KEY="....."
export MIDSCENE_MODEL_NAME="gemini-1.5-pro"
## Example: Using `qwen-vl-max-latest` from Aliyun
Configure the environment variables:
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
## Example: Using `doubao-vision-pro-32k` from Volcengine
Create an inference point first: https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint
Configure the environment variables:
export OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
export OPENAI_API_KEY="..."
export MIDSCENE_MODEL_NAME="ep-202....."
## Troubleshooting LLM Service Connectivity Issues
If you want to troubleshoot connectivity issues, you can use the `connectivity-test` folder in our example project: https://github.com/web-infra-dev/midscene-example/tree/main/connectivity-test
Put your `.env` file in the `connectivity-test` folder, and run the test with `npm i && npm run test`.
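If you prefer not to clone the example project, a minimal sketch for an OpenAI-compatible setup is to call the chat completions endpoint directly (Azure and Anthropic configurations use different authentication and paths, so this check does not apply to them):

```bash
# falls back to the official OpenAI endpoint and gpt-4o if the variables are unset
curl "${OPENAI_BASE_URL:-https://api.openai.com/v1}/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "'"${MIDSCENE_MODEL_NAME:-gpt-4o}"'", "messages": [{"role": "user", "content": "ping"}]}'
```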