Midscene AndroidWorld Benchmark Report

This is Midscene's test report for the AndroidWorld benchmark. In this run, Midscene achieved a Pass@1 of 93.10% (108 of 116 tasks), a Pass@2 of 95.69% (111 of 116), and a Pass@3 of 97.41% (113 of 116).
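
The Pass@k figures can be reproduced from the round summaries in the Report Files section. The sketch below assumes that Pass@k counts a task as passed if it passed in any of the first k rounds, an interpretation that matches the reported numbers but is not stated explicitly in this report. The function name and the per-round list are illustrative.

```python
TOTAL_TASKS = 116  # task count stated in this report

# Tasks that passed for the first time in each round,
# read from the round summaries (108, 3, and 2 PASS).
new_passes_per_round = [108, 3, 2]


def cumulative_pass_rates(new_passes, total):
    """Return the cumulative pass rate in percent after each round."""
    rates = []
    passed = 0
    for count in new_passes:
        passed += count
        rates.append(round(100 * passed / total, 2))
    return rates


rates = cumulative_pass_rates(new_passes_per_round, TOTAL_TASKS)
for k, rate in enumerate(rates, start=1):
    print(f"Pass@{k}: {rate:.2f}%")
```

Under that assumption, only tasks that had not yet passed are rerun in later rounds, so each round's PASS count adds directly to the running total.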

About AndroidWorld

AndroidWorld is an Android agent benchmark from Google Research. It runs on a live Android emulator and evaluates agents on 116 programmatic tasks across 20 real-world Android apps, with task initialization and validation handled by the benchmark.

Run Configuration

| Field | Value |
| --- | --- |
| Model Name | Gemini-3.5-Flash |
| Midscene version | 1.9.5 |
| DeepThink | on |
| MIDSCENE_REPLANNING_CYCLE_LIMIT | 120 |
| AndroidWorld setup | Stability fixes were applied to the AndroidWorld project to reduce flaky benchmark runs. See examples below. |
| Validation notes | A small number of AndroidWorld validators were aligned with the task intent. The affected cases are listed below. |

Stability Improvements

The following changes leave the task intent unchanged. They make benchmark execution more stable by reducing browser rendering races, stale accessibility reads, and setup timing issues.

| Change | Affected cases |
| --- | --- |
| Force canvas pixels to flush after drawing, then use thicker rounded strokes so target colors remain stable in the final canvas pixels. | BrowserDraw |
| Retry reading the "Success!" text from the accessibility tree after browser tasks finish. | BrowserMaze, BrowserMultiply, BrowserDraw |
| Before SMS tasks start, the setup prepares the required incoming messages and contacts. It now also verifies that those messages are visible in the inbox and those contacts are visible in Contacts before the agent runs. | SimpleSmsReplyMostRecent, SimpleSmsSendReceivedAddress |
| Wait for Pro Expense to create its database tables before validators write test data. | ExpenseAddMultiple, ExpenseAddMultipleFromGallery, ExpenseAddMultipleFromMarkor, ExpenseAddSingle, ExpenseDeleteDuplicates, ExpenseDeleteDuplicates2, ExpenseDeleteMultiple, ExpenseDeleteMultiple2, ExpenseDeleteSingle |
| Preload OsmAnd offline map files into the app data directory and wait for OsmAnd to extract its built-in basemap before map tasks run. | OsmAndFavorite, OsmAndMarker, OsmAndTrack |
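
The retry on the "Success!" text follows a common polling pattern for stale accessibility reads. The sketch below is a minimal illustration, not the code applied to AndroidWorld: `wait_for_text`, its parameters, and the `read_texts` callable (standing in for whatever returns the visible strings from the accessibility tree) are all hypothetical.

```python
import time
from typing import Callable, Iterable


def wait_for_text(
    read_texts: Callable[[], Iterable[str]],
    expected: str = "Success!",
    attempts: int = 5,
    delay_seconds: float = 0.5,
) -> bool:
    """Poll read_texts() until `expected` appears or attempts run out."""
    for attempt in range(attempts):
        # Take a fresh snapshot on every attempt so a stale read can recover.
        if any(expected in text for text in read_texts()):
            return True
        # Skip the sleep after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


# Usage with a fake reader whose first two snapshots are stale.
snapshots = iter([["Loading"], ["Loading"], ["Success!"]])
found = wait_for_text(lambda: next(snapshots), delay_seconds=0)
print(found)
```

Bounding the number of attempts keeps a genuinely failed task from blocking the run, while still tolerating a tree that lags behind the rendered page.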

Validation Condition Updates

The following AndroidWorld validation checks were changed on the main branch used for this benchmark:

| Change | Affected cases |
| --- | --- |
| Calendar "after start time" now validates against an event one minute after the boundary, avoiding ambiguity across models about whether "after" includes the boundary time. | SimpleCalendarFirstEventAfterStartTime |
| Expense notes imported from Markor accept the extra "Reimbursable." suffix that Markor can include; note comparison ignores that suffix and terminal period differences. | ExpenseAddMultipleFromMarkor |
| Markor merged notes accept either single-newline or blank-line separation, and also accept Markor's default ".md" extension when it is auto-added. | MarkorMergeNotes |
| Recipe quantity fields allow omitted units while still rejecting wrong amounts or incompatible units. | RecipeAddSingleRecipe, RecipeAddMultipleRecipes, RecipeAddMultipleRecipesFromMarkor, RecipeAddMultipleRecipesFromMarkor2, RecipeAddMultipleRecipesFromImage, NotesRecipeIngredientCount |
| Minimum brightness is validated against Android's actual minimum setting value, 0, instead of 1. | SystemBrightnessMin, SystemBrightnessMinVerify |
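
Two of the relaxed checks above amount to tolerant string comparison. The sketch below shows one way such rules could be written. It is an illustration under stated assumptions, not the validator code used in this run: `normalize_note`, `notes_match`, and `merged_text_matches` are hypothetical names.

```python
import re


def normalize_note(note: str, suffix: str = "Reimbursable.") -> str:
    """Drop an optional trailing suffix and any terminal period."""
    text = note.strip()
    if text.endswith(suffix):
        text = text[: -len(suffix)].rstrip()
    return text.rstrip(".").rstrip()


def notes_match(expected: str, actual: str) -> bool:
    """Compare two expense notes, ignoring the suffix and terminal periods."""
    return normalize_note(expected) == normalize_note(actual)


def merged_text_matches(expected_parts: list[str], actual: str) -> bool:
    """Accept parts joined by a single newline or by one blank line."""
    pattern = r"\n{1,2}".join(re.escape(part.strip()) for part in expected_parts)
    return re.fullmatch(pattern, actual.strip()) is not None


print(notes_match("Client lunch", "Client lunch. Reimbursable."))
print(merged_text_matches(["alpha", "beta"], "alpha\n\nbeta"))
```

Both helpers still reject content that differs in substance, so the relaxation covers formatting differences only.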

Report Files

Detailed reports are listed below for reference.

Round 1 (115 reports · 108 PASS · 7 FAIL)

| # | Task | Status | Report |
| --- | --- | --- | --- |
| 1 | AudioRecorderRecordAudio | PASS | report |
| 2 | AudioRecorderRecordAudioWithFileName | PASS | report |
| 3 | BrowserDraw | PASS | report |
| 4 | BrowserMaze | PASS | report |
| 5 | BrowserMultiply | PASS | report |
| 6 | CameraTakePhoto | PASS | report |
| 7 | CameraTakeVideo | PASS | report |
| 8 | ClockStopWatchPausedVerify | PASS | report |
| 9 | ClockStopWatchRunning | PASS | report |
| 10 | ClockTimerEntry | PASS | report |
| 11 | ContactsAddContact | PASS | report |
| 12 | ContactsNewContactDraft | PASS | report |
| 13 | ExpenseAddMultiple | PASS | report |
| 14 | ExpenseAddMultipleFromGallery | PASS | report |
| 15 | ExpenseAddMultipleFromMarkor | FAIL | report |
| 16 | ExpenseAddSingle | PASS | report |
| 17 | ExpenseDeleteDuplicates | PASS | report |
| 18 | ExpenseDeleteDuplicates2 | PASS | report |
| 19 | ExpenseDeleteMultiple | PASS | report |
| 20 | ExpenseDeleteMultiple2 | PASS | report |
| 21 | ExpenseDeleteSingle | PASS | report |
| 22 | FilesDeleteFile | PASS | report |
| 23 | FilesMoveFile | PASS | report |
| 24 | MarkorAddNoteHeader | PASS | report |
| 25 | MarkorChangeNoteContent | PASS | report |
| 26 | MarkorCreateFolder | PASS | report |
| 27 | MarkorCreateNote | PASS | report |
| 28 | MarkorCreateNoteAndSms | PASS | report |
| 29 | MarkorCreateNoteFromClipboard | PASS | report |
| 30 | MarkorDeleteAllNotes | PASS | report |
| 31 | MarkorDeleteNewestNote | PASS | report |
| 32 | MarkorDeleteNote | PASS | report |
| 33 | MarkorEditNote | PASS | report |
| 34 | MarkorMergeNotes | PASS | report |
| 35 | MarkorMoveNote | PASS | report |
| 36 | MarkorTranscribeReceipt | PASS | report |
| 37 | MarkorTranscribeVideo | FAIL | report |
| 38 | OpenAppTaskEval | PASS | report |
| 39 | OsmAndFavorite | PASS | report |
| 40 | OsmAndMarker | FAIL | report |
| 42 | RecipeAddMultipleRecipes | PASS | report |
| 43 | RecipeAddMultipleRecipesFromImage | FAIL | report |
| 44 | RecipeAddMultipleRecipesFromMarkor | PASS | report |
| 45 | RecipeAddMultipleRecipesFromMarkor2 | PASS | report |
| 46 | RecipeAddSingleRecipe | PASS | report |
| 47 | RecipeDeleteDuplicateRecipes | PASS | report |
| 48 | RecipeDeleteDuplicateRecipes2 | FAIL | report |
| 49 | RecipeDeleteDuplicateRecipes3 | FAIL | report |
| 50 | RecipeDeleteMultipleRecipes | PASS | report |
| 51 | RecipeDeleteMultipleRecipesWithConstraint | PASS | report |
| 52 | RecipeDeleteMultipleRecipesWithNoise | PASS | report |
| 53 | RecipeDeleteSingleRecipe | PASS | report |
| 54 | RecipeDeleteSingleWithRecipeWithNoise | PASS | report |
| 55 | RetroCreatePlaylist | PASS | report |
| 56 | RetroPlayingQueue | PASS | report |
| 57 | RetroPlaylistDuration | PASS | report |
| 58 | RetroSavePlaylist | PASS | report |
| 59 | SaveCopyOfReceiptTaskEval | PASS | report |
| 60 | SimpleCalendarAddOneEvent | PASS | report |
| 61 | SimpleCalendarAddOneEventInTwoWeeks | PASS | report |
| 62 | SimpleCalendarAddOneEventRelativeDay | PASS | report |
| 63 | SimpleCalendarAddOneEventTomorrow | PASS | report |
| 64 | SimpleCalendarAddRepeatingEvent | PASS | report |
| 65 | SimpleCalendarDeleteEvents | PASS | report |
| 66 | SimpleCalendarDeleteEventsOnRelativeDay | PASS | report |
| 67 | SimpleCalendarDeleteOneEvent | PASS | report |
| 68 | SimpleDrawProCreateDrawing | PASS | report |
| 69 | SimpleSmsReply | PASS | report |
| 70 | SimpleSmsReplyMostRecent | PASS | report |
| 71 | SimpleSmsResend | PASS | report |
| 72 | SimpleSmsSend | PASS | report |
| 73 | SimpleSmsSendClipboardContent | PASS | report |
| 74 | SimpleSmsSendReceivedAddress | PASS | report |
| 75 | SystemBluetoothTurnOff | PASS | report |
| 76 | SystemBluetoothTurnOffVerify | PASS | report |
| 77 | SystemBluetoothTurnOn | PASS | report |
| 78 | SystemBluetoothTurnOnVerify | PASS | report |
| 79 | SystemBrightnessMax | PASS | report |
| 80 | SystemBrightnessMaxVerify | PASS | report |
| 81 | SystemBrightnessMin | PASS | report |
| 82 | SystemBrightnessMinVerify | PASS | report |
| 83 | SystemCopyToClipboard | FAIL | report |
| 84 | SystemWifiTurnOff | PASS | report |
| 85 | SystemWifiTurnOffVerify | PASS | report |
| 86 | SystemWifiTurnOn | PASS | report |
| 87 | SystemWifiTurnOnVerify | PASS | report |
| 88 | TurnOffWifiAndTurnOnBluetooth | PASS | report |
| 89 | TurnOnWifiAndOpenApp | PASS | report |
| 90 | VlcCreatePlaylist | PASS | report |
| 91 | VlcCreateTwoPlaylists | PASS | report |
| 92 | NotesIsTodo | PASS | report |
| 93 | NotesMeetingAttendeeCount | PASS | report |
| 94 | NotesRecipeIngredientCount | PASS | report |
| 95 | NotesTodoItemCount | PASS | report |
| 96 | SimpleCalendarAnyEventsOnDate | PASS | report |
| 97 | SimpleCalendarEventOnDateAtTime | PASS | report |
| 98 | SimpleCalendarEventsInNextWeek | PASS | report |
| 99 | SimpleCalendarEventsInTimeRange | PASS | report |
| 100 | SimpleCalendarEventsOnDate | PASS | report |
| 101 | SimpleCalendarFirstEventAfterStartTime | PASS | report |
| 102 | SimpleCalendarLocationOfEvent | PASS | report |
| 103 | SimpleCalendarNextEvent | PASS | report |
| 104 | SimpleCalendarNextMeetingWithPerson | PASS | report |
| 105 | SportsTrackerActivitiesCountForWeek | PASS | report |
| 106 | SportsTrackerActivitiesOnDate | PASS | report |
| 107 | SportsTrackerActivityDuration | PASS | report |
| 108 | SportsTrackerLongestDistanceActivity | PASS | report |
| 109 | SportsTrackerTotalDistanceForCategoryOverInterval | PASS | report |
| 110 | SportsTrackerTotalDurationForCategoryThisWeek | PASS | report |
| 111 | TasksCompletedTasksForDate | PASS | report |
| 112 | TasksDueNextWeek | PASS | report |
| 113 | TasksDueOnDate | PASS | report |
| 114 | TasksHighPriorityTasks | PASS | report |
| 115 | TasksHighPriorityTasksDueOnDate | PASS | report |
| 116 | TasksIncompleteTasksOnDate | PASS | report |

Round 2 (7 reports · 3 PASS · 4 FAIL)

| # | Task | Status | Report |
| --- | --- | --- | --- |
| 37 | MarkorTranscribeVideo | FAIL | report |
| 40 | OsmAndMarker | FAIL | report |
| 41 | OsmAndTrack | PASS | report |
| 43 | RecipeAddMultipleRecipesFromImage | PASS | report |
| 48 | RecipeDeleteDuplicateRecipes2 | FAIL | report |
| 49 | RecipeDeleteDuplicateRecipes3 | FAIL | report |
| 83 | SystemCopyToClipboard | PASS | report |

Round 3 (5 reports · 2 PASS · 3 FAIL)

| # | Task | Status | Report |
| --- | --- | --- | --- |
| 15 | ExpenseAddMultipleFromMarkor | PASS | report |
| 37 | MarkorTranscribeVideo | FAIL | report |
| 40 | OsmAndMarker | PASS | report |
| 48 | RecipeDeleteDuplicateRecipes2 | FAIL | report |
| 49 | RecipeDeleteDuplicateRecipes3 | FAIL | report |