The LearnGUI dataset is a comprehensive collection for studying demonstration-based learning in mobile GUI agents, comprising 2,353 instructions across 73 applications with an average of 13.2 steps per task. Key features:
- Rich Few-shot Learning Support: Provides k-shot combinations (k=1,2,3) for each task
- Multi-dimensional Similarity Metrics: Covers instruction, UI, and action dimensions
- Natural Task Variation: Reflects real-world mobile task diversity within applications
- Systematic Analysis Framework: Enables detailed study of demonstration impact on learning outcomes
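The k-shot support described above can be sketched as follows. This is a minimal illustration, not the dataset's actual selection code: the `Task` fields, the Jaccard word-overlap similarity (a toy stand-in for the dataset's InsSim metric), and the same-app filter are all assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    app: str
    instruction: str

def jaccard(a: Task, b: Task) -> float:
    """Toy instruction similarity: word-overlap (Jaccard) of the two instructions."""
    wa = set(a.instruction.lower().split())
    wb = set(b.instruction.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_k_shot_support(target: Task, candidates: list[Task], k: int) -> list[Task]:
    """Pick the k most instruction-similar demonstrations from the same app."""
    pool = [c for c in candidates
            if c.app == target.app and c.task_id != target.task_id]
    pool.sort(key=lambda c: jaccard(target, c), reverse=True)
    return pool[:k]

target = Task("t0", "Gmail", "archive the email from Alice")
demos = [
    Task("t1", "Gmail", "archive the email from Bob"),
    Task("t2", "Gmail", "compose a new email"),
    Task("t3", "Maps", "archive the email from Alice"),  # different app: excluded
]
print([d.task_id for d in build_k_shot_support(target, demos, k=2)])  # ['t1', 't2']
```

Varying `k` over 1, 2, 3 yields the three k-shot combinations the dataset provides per task.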

## Dataset Comparison
Dataset | # Instructions | # Apps | # Steps | Environment | High-level | Low-level | Ground Truth | Few-shot |
---|---|---|---|---|---|---|---|---|
PixelHelp | 187 | 4 | 4.2 | ❌ | ✅ | ❌ | ✅ | ❌ |
MoTIF | 276 | 125 | 4.5 | ❌ | ✅ | ✅ | ✅ | ❌ |
UIBert | 16,660 | - | 1 | ❌ | ❌ | ✅ | ✅ | ❌ |
UGIF | 523 | 12 | 6.3 | ❌ | ✅ | ✅ | ✅ | ❌ |
AITW | 30,378 | 357 | 6.5 | ❌ | ✅ | ❌ | ✅ | ❌ |
AITZ | 2,504 | 70 | 7.5 | ❌ | ✅ | ✅ | ✅ | ❌ |
AndroidControl | 15,283 | 833 | 4.8 | ❌ | ✅ | ✅ | ✅ | ❌ |
AMEX | 2,946 | 110 | 12.8 | ❌ | ✅ | ❌ | ✅ | ❌ |
MobileAgentBench | 100 | 10 | - | ❌ | ✅ | ❌ | ❌ | ❌ |
AppAgent | 50 | 10 | - | ❌ | ✅ | ❌ | ❌ | ❌ |
LlamaTouch | 496 | 57 | 7.01 | ✅ | ✅ | ❌ | ✅ | ❌ |
AndroidWorld | 116 | 20 | - | ✅ | ✅ | ❌ | ❌ | ❌ |
AndroidLab | 138 | 9 | 8.5 | ✅ | ✅ | ❌ | ❌ | ❌ |
LearnGUI (Ours) | 2,353 | 73 | 13.2 | ✅ | ✅ | ✅ | ✅ | ✅ |
Table: Comparison of different datasets and environments for benchmarking mobile GUI agents. Column definitions: # Instructions (number of instructions), # Apps (number of applications), # Steps (average steps per task), Environment (supports environment interaction), High-level (provides high-level instructions), Low-level (provides low-level instructions), Ground Truth (provides ground-truth trajectories), Few-shot (supports few-shot learning).
## Dataset Statistics
Split | K-shot | Tasks | Apps | Step actions | Avg InsSim | Avg UISim | Avg ActSim | UISHActSH | UISHActSL | UISLActSH | UISLActSL |
---|---|---|---|---|---|---|---|---|---|---|---|
Offline-Train | 1-shot | 2,001 | 44 | 26,184 | 0.845 | 0.901 | 0.858 | 364 | 400 | 403 | 834 |
Offline-Train | 2-shot | 2,001 | 44 | 26,184 | 0.818 | 0.898 | 0.845 | 216 | 360 | 358 | 1,067 |
Offline-Train | 3-shot | 2,001 | 44 | 26,184 | 0.798 | 0.895 | 0.836 | 152 | 346 | 310 | 1,193 |
Offline-Test | 1-shot | 251 | 9 | 3,469 | 0.798 | 0.868 | 0.867 | 37 | 49 | 56 | 109 |
Offline-Test | 2-shot | 251 | 9 | 3,469 | 0.767 | 0.855 | 0.853 | 15 | 42 | 55 | 139 |
Offline-Test | 3-shot | 251 | 9 | 3,469 | 0.745 | 0.847 | 0.847 | 10 | 36 | 49 | 156 |
Online-Test | 1-shot | 101 | 20 | 1,423 | - | - | - | - | - | - | - |
Table: Statistics of LearnGUI dataset splits. Each split is analyzed across multiple dimensions: Tasks (number of tasks), Apps (number of applications covered), Step actions (total action steps), average similarity between target tasks and their demonstrations (Avg InsSim, Avg UISim, Avg ActSim for instruction, UI, and action similarity), and the distribution of tasks across four similarity profiles formed by crossing high (SH) and low (SL) UI similarity with high and low action similarity (e.g., UISHActSL denotes high UI similarity with low action similarity).
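The four similarity profiles can be derived by thresholding the two similarity scores. A minimal sketch, assuming a single hypothetical cut-off of 0.9 between SH (high) and SL (low); the dataset's actual boundary is not stated in this section:

```python
def similarity_profile(ui_sim: float, act_sim: float, threshold: float = 0.9) -> str:
    """Map a (UI similarity, action similarity) pair to one of the four
    profiles: UISHActSH, UISHActSL, UISLActSH, UISLActSL.
    The 0.9 threshold is an assumption, not the dataset's published cut-off."""
    ui = "SH" if ui_sim >= threshold else "SL"
    act = "SH" if act_sim >= threshold else "SL"
    return f"UI{ui}Act{act}"

print(similarity_profile(0.95, 0.92))  # UISHActSH
print(similarity_profile(0.95, 0.80))  # UISHActSL
print(similarity_profile(0.50, 0.50))  # UISLActSL
```

Counting profiles over a split's tasks would reproduce the four rightmost columns of the table above (under whatever threshold the dataset actually uses).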