Benchmarking memory for mobile GUI agents

MemGUI-Bench Leaderboard

A memory-centric benchmark for mobile GUI agents in dynamic environments, covering short-term recall, long-term improvement, cross-app workflows, and MemGUI-Eval-based judgment.

128 Tasks

26 Apps

68 Scenarios

89.8% Memory Tasks

Live results

Leaderboard

p@1 measures first-attempt success, p@3 measures best-of-three success, and IRR/MTPR/FRR isolate memory quality and recovery.
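As a rough illustration of how the two success metrics relate, the sketch below computes p@1 and p@3 from per-task attempt outcomes. It assumes that "best-of-three" means a task counts as solved when at least one of its three attempts succeeds; the function name `pass_at_k`, the task IDs, and the outcomes are hypothetical and are not benchmark data.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved in at least one of the first k attempts.

    Assumption: "best-of-k" is read as "any success within k attempts".
    """
    if not attempts:
        raise ValueError("at least one task is required")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes for four tasks, three attempts each (illustrative only).
results = {
    "task_a": [True, True, True],
    "task_b": [False, True, False],
    "task_c": [False, False, False],
    "task_d": [False, False, True],
}

p_at_1 = pass_at_k(results, 1)  # only task_a succeeds on the first attempt
p_at_3 = pass_at_k(results, 3)  # task_a, task_b, and task_d succeed within three
print(f"p@1 = {p_at_1:.2f}, p@3 = {p_at_3:.2f}")
```

Under this reading, p@3 can never be lower than p@1, and the gap between them indicates how much an agent gains from repeated attempts.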

Legend: UI = Uses UI Tree; LTM = Has Long-Term Memory; Workflow = Multi-agent framework; Model = End-to-end model

Main Results

Average performance across all 128 tasks

Benchmark

MemGUI-Bench contains 128 memory-intensive mobile GUI tasks across 26 apps and 68 scenarios. Tasks stress cross-step retention, cross-app transfer, and cross-session learning.

Evaluation

MemGUI-Eval uses progressive scrutiny: lightweight triage, trajectory description, semantic judgment, and targeted visual verification when needed.
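The escalation idea behind progressive scrutiny can be sketched as a chain of stages, each of which either returns a verdict or defers to the next, more expensive stage. This is a minimal sketch under stated assumptions, not the MemGUI-Eval implementation: the stage functions, the dictionary fields (`goal`, `steps`, `ambiguous`, `final_screen_text`), and the string-matching checks are all hypothetical placeholders for what would in practice be model-based judgments.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = dict
# A stage returns True/False when it can decide, or None to escalate.
Stage = Callable[[Trajectory], Optional[bool]]


def triage(traj: Trajectory) -> Optional[bool]:
    # Cheap check (hypothetical): an empty trajectory cannot have succeeded.
    return False if not traj["steps"] else None


def trajectory_description(traj: Trajectory) -> Optional[bool]:
    # Produces a text summary for later stages; never decides on its own.
    traj["description"] = " -> ".join(traj["steps"])
    return None


def semantic_judgment(traj: Trajectory) -> Optional[bool]:
    # Placeholder for a text-only judge; defers when flagged as ambiguous.
    if traj.get("ambiguous"):
        return None
    return traj["goal"] in traj["description"]


def visual_verification(traj: Trajectory) -> Optional[bool]:
    # Placeholder for a targeted screenshot check, reached only when needed.
    return traj["goal"] in traj.get("final_screen_text", "")


STAGES: List[Tuple[str, Stage]] = [
    ("triage", triage),
    ("trajectory_description", trajectory_description),
    ("semantic_judgment", semantic_judgment),
    ("visual_verification", visual_verification),
]


def judge(traj: Trajectory) -> Tuple[bool, str]:
    """Run stages in order and stop at the first one that decides."""
    traj = dict(traj)  # work on a copy so the caller's data is not mutated
    for name, stage in STAGES:
        decision = stage(traj)
        if decision is not None:
            return decision, name
    return False, "unresolved"


print(judge({"goal": "send note", "steps": ["open app", "send note"]}))
```

The point of the ordering is cost: most trajectories are resolved by the cheaper early stages, and only the ambiguous ones reach the visual check.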

Resources

Evaluation pipeline; Failure analysis; Task dataset; Trajectories

Citation

Use MemGUI-Bench in your work

@article{liu2026memgui,
  title={MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments},
  author={Liu, Guangyi and Zhao, Pengxiang and Liang, Yaozhen and Luo, Qinyi and Tang, Shunye and Chai, Yuxiang and Lin, Weifeng and Xiao, Han and Wang, WenHao and Chen, Siheng and others},
  journal={arXiv preprint arXiv:2602.06075},
  year={2026}
}