MemGUI-Bench

Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Guangyi Liu¹* Pengxiang Zhao¹* Yaozhen Liang¹* Qinyi Luo² Shunye Tang² Yuxiang Chai³ Weifeng Lin³ Han Xiao³ WenHao Wang¹ Siheng Chen⁴ Zhengxi Lu¹ Gao Wu¹ Hao Wang⁵ Liang Liu⁵† Yong Liu¹
¹Zhejiang University  ²Nankai University  ³The Chinese University of Hong Kong  ⁴Shanghai Jiao Tong University  ⁵vivo AI Lab
* Equal contribution  † Project Lead  ✉ Corresponding author
🏆 Live Leaderboard

Overview

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluate cross-session learning.

MemGUI-Bench introduces a comprehensive memory-centric benchmark with pass@k scoring and a staged LLM-as-judge pipeline, designed to rigorously test agents' ability to retain and use information across complex, multi-application workflows.

• 128 Tasks
• 26 Applications
• 68 Scenarios
• 36 Avg. Steps
• 1-4 Cross-App Span
• 89.8% Memory-Intensive Tasks

Key Contributions

• 📊 Memory Taxonomy: Systematic analysis of 11 agents across 5 architectures, identifying key memory mechanisms.
• 🎯 128 Curated Tasks: Cross-temporal and cross-spatial retention challenges across 26 real-world applications.
• ⚖️ MemGUI-Eval: Automated Progressive Scrutiny pipeline with 7 hierarchical metrics for accurate evaluation.
• 🔬 RQ-Driven Analysis: Comprehensive assessment revealing 5 distinct failure modes and design implications.

Evaluation Metrics

Short-Term Memory (pass@1)

  • SR - Success Rate: Baseline performance measurement
  • IRR - Information Retention Rate: Memory fidelity metric
  • MTPR - Memory-Task Proficiency Ratio: Memory-specific capability isolation

Long-Term Memory (pass@k)

  • SR@k - Multi-Attempt Success Rate: Cross-session learning capability
  • FRR - Failure Recovery Rate: Efficiency of learning from failure
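
As a concrete illustration of the pass@k protocol, the minimal sketch below computes SR (pass@1) and SR@k from per-attempt task outcomes. The task names and outcomes are hypothetical, and the exact definitions of IRR, MTPR, and FRR follow the paper and are not reproduced here.

```python
# Minimal sketch of the pass@1 / pass@k success-rate computation.
# Task names and outcomes are hypothetical; IRR, MTPR, and FRR are defined in
# the paper and are intentionally not reimplemented here.
from typing import Dict, List


def success_rate_at_k(results: Dict[str, List[bool]], k: int = 1) -> float:
    """SR@k: fraction of tasks solved in at least one of the first k attempts."""
    if not results:
        return 0.0
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Hypothetical per-task outcomes over 3 sequential attempts (cross-session setting).
outcomes = {
    "calendar_to_email": [False, True, True],
    "note_to_contacts":  [False, False, False],
    "shop_price_recall": [True,  True,  True],
}

print(f"pass@1 SR: {success_rate_at_k(outcomes, k=1):.1%}")  # 33.3%
print(f"pass@3 SR: {success_rate_at_k(outcomes, k=3):.1%}")  # 66.7%
```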

Framework Overview

Figure 1: MemGUI-Bench Overview

First comprehensive benchmark for GUI agent memory evaluation.

Figure 2: Task Suite Statistics

128 tasks (64 mirror pairs), 26 apps, 3 difficulty levels.

Figure 3: Plug-and-Play Framework (Unified Architecture)

Snapshot-based architecture with parallel execution.
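
As a sketch of what the plug-and-play, snapshot-based design implies, the code below shows a hypothetical agent adapter plus a snapshot-reset evaluation loop run in parallel across emulator instances. The `GUIAgent` interface, the `env` object, and its methods (`restore_snapshot`, `screenshot`, `execute`, `check_success`) are assumptions for illustration, not the framework's actual API.

```python
# Illustrative sketch only: a plug-and-play agent adapter plus snapshot-based
# environment reset, so multiple tasks run in parallel from identical device
# states. The `env` interface below is assumed, not the framework's real API.
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class GUIAgent(ABC):
    """Minimal adapter an agent implements to plug into the harness."""

    @abstractmethod
    def step(self, screenshot: bytes, instruction: str) -> str:
        """Return the next UI action (or 'DONE') for the current screen."""


def run_task(agent: GUIAgent, env, task: dict, max_steps: int = 36) -> bool:
    env.restore_snapshot(task["snapshot_id"])   # reset emulator to a known state
    for _ in range(max_steps):
        action = agent.step(env.screenshot(), task["instruction"])
        if action == "DONE":
            break
        env.execute(action)
    return env.check_success(task)              # success judged separately (MemGUI-Eval)


def run_parallel(agent: GUIAgent, env_pool, tasks) -> list:
    # One emulator instance per task shown for simplicity; snapshots keep runs
    # independent and reproducible.
    with ThreadPoolExecutor(max_workers=len(env_pool)) as pool:
        futures = [pool.submit(run_task, agent, env, task)
                   for env, task in zip(env_pool, tasks)]
        return [f.result() for f in futures]
```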

MemGUI-Eval

Progressive Scrutiny Evaluation for Memory-Intensive GUI Tasks

Traditional evaluation methods face critical limitations: rigid rule-based matching cannot handle semantic variations, while "LLM-as-Judge" approaches overwhelm models with complete trajectories.

MemGUI-Eval introduces a novel "Progressive Scrutiny" pipeline that mimics efficient human expert verification, achieving 95.9% F1-score at only $0.031 per trajectory.

1. Cost-Effective Triage: Rapid screening with minimal evidence (last 3 screenshots)
2. Full Semantic Analysis: Enriched textual descriptions for comprehensive judgment
3. Targeted Visual Verification: Precisely requested historical screenshots for the final decision
Figure: Three-Stage Progressive Scrutiny Pipeline

Four specialized agents: Triage Judge, Step Descriptor, Semantic Judge, and Visual Judge.
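
The control flow of the three stages can be sketched as follows. This is a hedged illustration under the assumption that each of the four specialized agents is a callable wrapping a multimodal LLM prompt; the names, the `Verdict` schema, and the escalation logic are illustrative rather than the released MemGUI-Eval implementation.

```python
# Hedged sketch of the three-stage Progressive Scrutiny control flow, assuming
# each of the four specialized agents is a callable wrapping an LLM prompt.
# Schemas and escalation logic are illustrative, not the released implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Verdict:
    success: bool
    confident: bool = True             # commit at this stage, or escalate
    needed_steps: List[int] = field(default_factory=list)


def progressive_scrutiny(
    task: str,
    screenshots: Sequence[bytes],
    triage_judge: Callable[[str, Sequence[bytes]], Verdict],
    step_descriptor: Callable[[str, bytes], str],
    semantic_judge: Callable[[str, List[str]], Verdict],
    visual_judge: Callable[[str, Sequence[bytes]], Verdict],
) -> bool:
    # Stage 1: cost-effective triage on minimal evidence (last 3 screenshots).
    verdict = triage_judge(task, screenshots[-3:])
    if verdict.confident:
        return verdict.success

    # Stage 2: enrich every step with a textual description, then judge the
    # full trajectory semantically.
    descriptions = [step_descriptor(task, s) for s in screenshots]
    verdict = semantic_judge(task, descriptions)
    if verdict.confident:
        return verdict.success

    # Stage 3: targeted visual verification on only the historical screenshots
    # the semantic judge requested.
    evidence = [screenshots[i] for i in verdict.needed_steps]
    return visual_judge(task, evidence).success
```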

Benchmark Comparison

Comprehensive comparison across the evaluation environment (memory tasks, cross-app tasks, total tasks, 3rd-party apps, auto reset), the evaluation pipeline (long-term memory, auto eval, memory metrics), and the number of agents tested.

Benchmark              Memory Tasks   Cross-app Tasks   Total Tasks   3rd-party Apps   Agents Tested
Rule-based Evaluation Pipeline
AndroidArena           22             22                221           1/4              1
AndroidWorld           6              6                 116           1/1              3
AndroidLab             45             0                 138           1/4              4
LlamaTouch             0              0                 495           1/1              4
B-MoCA                 0              0                 60            1/1              3
MobileAgentBench       0              0                 100           1/6              5
LLM-as-a-Judge Evaluation Pipeline
A3                     9              0                 201           1/2              6
SPA-Bench              40             40                340           1/7              11
MemGUI-Bench (Ours)    115            100               128           4/7              11

MemGUI-Bench is the first benchmark to systematically evaluate both short-term and long-term memory capabilities of Mobile GUI agents.

Key Results

Comprehensive evaluation of 11 GUI agents reveals critical insights about memory capabilities.

Research Questions & Key Findings

RQ1 How do current GUI agents perform on memory-intensive tasks?
4-10× capability gaps hidden by standard benchmarks. M3A achieves highest pass@1 SR (32.8%), while Agent-S2 demonstrates exceptional learning potential with highest pass@3 (49.2%). End-to-end models achieve only 0.0-6.2%. Performance drops dramatically from AndroidWorld to MemGUI-Bench (Agent-S2: 54.3%→27.3%, GUI-Owl-7B: 66.4%→6.2%).
RQ2 Are memory mechanisms essential or optional?
Short-term memory is mandatory; long-term memory is beneficial but optional. Removing short-term memory causes catastrophic collapse (M3A: 32.5%→2.5%, IRR: 35.1%→0.0%). Long-term memory provides +21.9 pp improvement but agents remain functional without it.
RQ3 How does cross-application complexity affect memory?
16-40 pp performance degradation, the primary memory bottleneck. M3A drops from 46.4% (1-app) to 30.0% (4-app), and Agent-S2 from 50.0% to 10.0%. Cross-app information transfer is the critical challenge.
RQ4 Can long-context capability improve performance?
+18.8 pp improvement, revealing untapped potential. M3A in the multi-turn setting achieves 51.6% (vs. 32.8% single-turn). UI-TARS-1.5-7B with a 5-turn limit achieves only 3.1-6.2%, confirming that context constraints severely limit performance.
RQ5 Can long-term memory enable cross-session learning?
+21.9 pp improvement, but it remains underutilized. Agent-S2 improves from 27.3% to 49.2% with a 21.5% FRR, while agents without explicit memory show minimal FRR (0.8-4.4%). Only 2 of 11 agents implement cross-session learning.
RQ6 What are the computational trade-offs?
Severe trade-offs under deployment constraints. Agent-S2 (41,760 tokens/step) drops from 49.2% to 0% under token limits. M3A shows the best balance, degrading gracefully (47.7%→21.9%) while using 31% of Agent-S2's tokens per step.

Citation

@misc{liu2026memguibenchbenchmarkingmemorymobile,
  title={MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments},
  author={Guangyi Liu and Pengxiang Zhao and Yaozhen Liang and Qinyi Luo and Shunye Tang and Yuxiang Chai and Weifeng Lin and Han Xiao and WenHao Wang and Siheng Chen and Zhengxi Lu and Gao Wu and Hao Wang and Liang Liu and Yong Liu},
  year={2026},
  eprint={2602.06075},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2602.06075},
}
