# Midscene AndroidWorld Benchmark Report
This is Midscene's test report for the AndroidWorld benchmark. In this run, Midscene achieved Pass@1 93.10%, Pass@2 95.69%, and Pass@3 97.41%.
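
The headline numbers can be cross-checked with simple arithmetic. The sketch below assumes Pass@k means the share of tasks that pass in at least one of the first k rounds; the report does not spell out the definition, and the helper `pass_at_k` and the toy data are illustrative only, not part of Midscene or AndroidWorld.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Share of tasks with at least one PASS in their first k attempts.

    Assumed definition; the report does not state how Pass@k is computed.
    """
    if not results:
        raise ValueError("no task results")
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Toy data (hypothetical): task_a passes on attempt 1, task_b on attempt 2,
# task_c never passes.
toy = {
    "task_a": [True],
    "task_b": [False, True],
    "task_c": [False, False, False],
}
pass_at_1 = pass_at_k(toy, 1)  # 1 of 3 tasks
pass_at_2 = pass_at_k(toy, 2)  # 2 of 3 tasks

# Cross-check of the headline Pass@1: 108 Round 1 passes out of 116 tasks.
round1_rate = round(108 / 116 * 100, 2)  # 93.1
```

Under the same assumption, 95.69% and 97.41% are consistent with 111 and 113 of the 116 tasks passing within two and three rounds respectively.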
# About AndroidWorld
AndroidWorld is an Android agent benchmark from Google Research. It runs on a live Android emulator and evaluates agents on 116 programmatic tasks across 20 real-world Android apps, with task initialization and validation handled by the benchmark.
# Run Configuration
| Field | Value |
|---|---|
| Model Name | Gemini-3.5-Flash |
| Midscene version | 1.9.5 |
| DeepThink | on |
| MIDSCENE_REPLANNING_CYCLE_LIMIT | 120 |
| AndroidWorld setup | Stability fixes were applied to the AndroidWorld project to reduce flaky benchmark runs. See examples below. |
| Validation notes | A small number of AndroidWorld validators were aligned with the task intent. The affected cases are listed below. |
# Stability Improvements
The following changes did not change the task intent. They made benchmark execution more stable by reducing browser rendering races, stale accessibility reads, and setup timing issues.
| Change | Affected cases |
|---|---|
| Force canvas pixels to flush after drawing, then use thicker rounded strokes so target colors remain stable in the final canvas pixels. | BrowserDraw |
| Retry reading the "Success!" text from the accessibility tree after browser tasks finish. | BrowserMaze, BrowserMultiply, BrowserDraw |
| Before SMS tasks start, setup prepares the required incoming messages and contacts; it now also verifies that those messages are visible in the inbox and those contacts are visible in Contacts before the agent runs. | SimpleSmsReplyMostRecent, SimpleSmsSendReceivedAddress |
| Wait for Pro Expense to create its database tables before validators write test data. | ExpenseAddMultiple, ExpenseAddMultipleFromGallery, ExpenseAddMultipleFromMarkor, ExpenseAddSingle, ExpenseDeleteDuplicates, ExpenseDeleteDuplicates2, ExpenseDeleteMultiple, ExpenseDeleteMultiple2, ExpenseDeleteSingle |
| Preload OsmAnd offline map files into the app data directory and wait for OsmAnd to extract its built-in basemap before map tasks run. | OsmAndFavorite, OsmAndMarker, OsmAndTrack |
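
Several of the fixes above share one pattern: poll a condition until it holds or a timeout expires, instead of reading state once. A minimal sketch of that pattern follows. The helper `wait_until`, its parameters, and the simulated accessibility read are hypothetical and do not mirror the actual AndroidWorld patches.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 10.0,
               interval_s: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated stale accessibility reads (hypothetical): the first two reads
# miss the text, the third one sees it.
reads = iter(["", "", "Success!"])


def success_text_visible() -> bool:
    return "Success!" in next(reads, "")


found = wait_until(success_text_visible, timeout_s=2.0, interval_s=0.01)
```

The same helper shape would fit the database-table and basemap waits, with the condition swapped for a check that the tables or map files exist.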
# Validation Condition Updates
The following AndroidWorld validation checks were changed on the main branch used for this benchmark:
| Change | Affected cases |
|---|---|
| Calendar "after start time" now validates against an event one minute after the boundary, avoiding ambiguity across models about whether "after" includes the boundary time. | SimpleCalendarFirstEventAfterStartTime |
| Expense notes imported from Markor accept the extra "Reimbursable." suffix that Markor can include; note comparison ignores that suffix and terminal period differences. | ExpenseAddMultipleFromMarkor |
| Markor merged notes accept either single-newline or blank-line separation, and also accept Markor's default .md extension when it is auto-added. | MarkorMergeNotes |
| Recipe quantity fields allow omitted units while still rejecting wrong amounts or incompatible units. | RecipeAddSingleRecipe, RecipeAddMultipleRecipes, RecipeAddMultipleRecipesFromMarkor, RecipeAddMultipleRecipesFromMarkor2, RecipeAddMultipleRecipesFromImage, NotesRecipeIngredientCount |
| Minimum brightness is validated against Android's actual minimum setting value, 0, instead of 1. | SystemBrightnessMin, SystemBrightnessMinVerify |
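
As an illustration of the relaxed note comparison described for ExpenseAddMultipleFromMarkor, here is a minimal sketch. The function names and the exact normalization rules are assumptions made for this example; the real validator may differ.

```python
REIMBURSABLE_SUFFIX = "Reimbursable."


def normalize_note(note: str) -> str:
    """Drop an optional trailing 'Reimbursable.' suffix and trailing periods."""
    text = note.strip()
    if text.endswith(REIMBURSABLE_SUFFIX):
        text = text[: -len(REIMBURSABLE_SUFFIX)].rstrip()
    return text.rstrip(".").rstrip()


def notes_match(expected: str, actual: str) -> bool:
    """Compare two notes after normalization (hypothetical helper)."""
    return normalize_note(expected) == normalize_note(actual)


same_with_suffix = notes_match("Team lunch", "Team lunch. Reimbursable.")
same_with_period = notes_match("Team lunch.", "Team lunch")
different_text = notes_match("Team lunch", "Team dinner")
```

Only the suffix and the final punctuation are ignored here, so a note with different wording still fails the comparison.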
# Report Files
Detailed reports are listed below for reference.
Round 1 (115 reports · 108 PASS · 7 FAIL)
| # | Task | Status | Report |
|---|---|---|---|
| 1 | AudioRecorderRecordAudio | PASS | report |
| 2 | AudioRecorderRecordAudioWithFileName | PASS | report |
| 3 | BrowserDraw | PASS | report |
| 4 | BrowserMaze | PASS | report |
| 5 | BrowserMultiply | PASS | report |
| 6 | CameraTakePhoto | PASS | report |
| 7 | CameraTakeVideo | PASS | report |
| 8 | ClockStopWatchPausedVerify | PASS | report |
| 9 | ClockStopWatchRunning | PASS | report |
| 10 | ClockTimerEntry | PASS | report |
| 11 | ContactsAddContact | PASS | report |
| 12 | ContactsNewContactDraft | PASS | report |
| 13 | ExpenseAddMultiple | PASS | report |
| 14 | ExpenseAddMultipleFromGallery | PASS | report |
| 15 | ExpenseAddMultipleFromMarkor | FAIL | report |
| 16 | ExpenseAddSingle | PASS | report |
| 17 | ExpenseDeleteDuplicates | PASS | report |
| 18 | ExpenseDeleteDuplicates2 | PASS | report |
| 19 | ExpenseDeleteMultiple | PASS | report |
| 20 | ExpenseDeleteMultiple2 | PASS | report |
| 21 | ExpenseDeleteSingle | PASS | report |
| 22 | FilesDeleteFile | PASS | report |
| 23 | FilesMoveFile | PASS | report |
| 24 | MarkorAddNoteHeader | PASS | report |
| 25 | MarkorChangeNoteContent | PASS | report |
| 26 | MarkorCreateFolder | PASS | report |
| 27 | MarkorCreateNote | PASS | report |
| 28 | MarkorCreateNoteAndSms | PASS | report |
| 29 | MarkorCreateNoteFromClipboard | PASS | report |
| 30 | MarkorDeleteAllNotes | PASS | report |
| 31 | MarkorDeleteNewestNote | PASS | report |
| 32 | MarkorDeleteNote | PASS | report |
| 33 | MarkorEditNote | PASS | report |
| 34 | MarkorMergeNotes | PASS | report |
| 35 | MarkorMoveNote | PASS | report |
| 36 | MarkorTranscribeReceipt | PASS | report |
| 37 | MarkorTranscribeVideo | FAIL | report |
| 38 | OpenAppTaskEval | PASS | report |
| 39 | OsmAndFavorite | PASS | report |
| 40 | OsmAndMarker | FAIL | report |
| 42 | RecipeAddMultipleRecipes | PASS | report |
| 43 | RecipeAddMultipleRecipesFromImage | FAIL | report |
| 44 | RecipeAddMultipleRecipesFromMarkor | PASS | report |
| 45 | RecipeAddMultipleRecipesFromMarkor2 | PASS | report |
| 46 | RecipeAddSingleRecipe | PASS | report |
| 47 | RecipeDeleteDuplicateRecipes | PASS | report |
| 48 | RecipeDeleteDuplicateRecipes2 | FAIL | report |
| 49 | RecipeDeleteDuplicateRecipes3 | FAIL | report |
| 50 | RecipeDeleteMultipleRecipes | PASS | report |
| 51 | RecipeDeleteMultipleRecipesWithConstraint | PASS | report |
| 52 | RecipeDeleteMultipleRecipesWithNoise | PASS | report |
| 53 | RecipeDeleteSingleRecipe | PASS | report |
| 54 | RecipeDeleteSingleWithRecipeWithNoise | PASS | report |
| 55 | RetroCreatePlaylist | PASS | report |
| 56 | RetroPlayingQueue | PASS | report |
| 57 | RetroPlaylistDuration | PASS | report |
| 58 | RetroSavePlaylist | PASS | report |
| 59 | SaveCopyOfReceiptTaskEval | PASS | report |
| 60 | SimpleCalendarAddOneEvent | PASS | report |
| 61 | SimpleCalendarAddOneEventInTwoWeeks | PASS | report |
| 62 | SimpleCalendarAddOneEventRelativeDay | PASS | report |
| 63 | SimpleCalendarAddOneEventTomorrow | PASS | report |
| 64 | SimpleCalendarAddRepeatingEvent | PASS | report |
| 65 | SimpleCalendarDeleteEvents | PASS | report |
| 66 | SimpleCalendarDeleteEventsOnRelativeDay | PASS | report |
| 67 | SimpleCalendarDeleteOneEvent | PASS | report |
| 68 | SimpleDrawProCreateDrawing | PASS | report |
| 69 | SimpleSmsReply | PASS | report |
| 70 | SimpleSmsReplyMostRecent | PASS | report |
| 71 | SimpleSmsResend | PASS | report |
| 72 | SimpleSmsSend | PASS | report |
| 73 | SimpleSmsSendClipboardContent | PASS | report |
| 74 | SimpleSmsSendReceivedAddress | PASS | report |
| 75 | SystemBluetoothTurnOff | PASS | report |
| 76 | SystemBluetoothTurnOffVerify | PASS | report |
| 77 | SystemBluetoothTurnOn | PASS | report |
| 78 | SystemBluetoothTurnOnVerify | PASS | report |
| 79 | SystemBrightnessMax | PASS | report |
| 80 | SystemBrightnessMaxVerify | PASS | report |
| 81 | SystemBrightnessMin | PASS | report |
| 82 | SystemBrightnessMinVerify | PASS | report |
| 83 | SystemCopyToClipboard | FAIL | report |
| 84 | SystemWifiTurnOff | PASS | report |
| 85 | SystemWifiTurnOffVerify | PASS | report |
| 86 | SystemWifiTurnOn | PASS | report |
| 87 | SystemWifiTurnOnVerify | PASS | report |
| 88 | TurnOffWifiAndTurnOnBluetooth | PASS | report |
| 89 | TurnOnWifiAndOpenApp | PASS | report |
| 90 | VlcCreatePlaylist | PASS | report |
| 91 | VlcCreateTwoPlaylists | PASS | report |
| 92 | NotesIsTodo | PASS | report |
| 93 | NotesMeetingAttendeeCount | PASS | report |
| 94 | NotesRecipeIngredientCount | PASS | report |
| 95 | NotesTodoItemCount | PASS | report |
| 96 | SimpleCalendarAnyEventsOnDate | PASS | report |
| 97 | SimpleCalendarEventOnDateAtTime | PASS | report |
| 98 | SimpleCalendarEventsInNextWeek | PASS | report |
| 99 | SimpleCalendarEventsInTimeRange | PASS | report |
| 100 | SimpleCalendarEventsOnDate | PASS | report |
| 101 | SimpleCalendarFirstEventAfterStartTime | PASS | report |
| 102 | SimpleCalendarLocationOfEvent | PASS | report |
| 103 | SimpleCalendarNextEvent | PASS | report |
| 104 | SimpleCalendarNextMeetingWithPerson | PASS | report |
| 105 | SportsTrackerActivitiesCountForWeek | PASS | report |
| 106 | SportsTrackerActivitiesOnDate | PASS | report |
| 107 | SportsTrackerActivityDuration | PASS | report |
| 108 | SportsTrackerLongestDistanceActivity | PASS | report |
| 109 | SportsTrackerTotalDistanceForCategoryOverInterval | PASS | report |
| 110 | SportsTrackerTotalDurationForCategoryThisWeek | PASS | report |
| 111 | TasksCompletedTasksForDate | PASS | report |
| 112 | TasksDueNextWeek | PASS | report |
| 113 | TasksDueOnDate | PASS | report |
| 114 | TasksHighPriorityTasks | PASS | report |
| 115 | TasksHighPriorityTasksDueOnDate | PASS | report |
| 116 | TasksIncompleteTasksOnDate | PASS | report |

