MemGUI-Bench

Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Guangyi Liu¹* Pengxiang Zhao¹* Yaozhen Liang¹* Qinyi Luo² Shunye Tang² Yuxiang Chai³ Weifeng Lin³ Han Xiao³ WenHao Wang¹ Siheng Chen⁴ Zhengxi Lu¹ Gao Wu¹ Hao Wang⁵ Liang Liu⁵† Yong Liu¹
¹Zhejiang University  ²Nankai University  ³The Chinese University of Hong Kong  ⁴Shanghai Jiao Tong University  ⁵vivo AI Lab
* Equal contribution  † Project Lead  ✉ Corresponding author
🏆 Live Leaderboard

Overview

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluate cross-session learning.

MemGUI-Bench introduces a comprehensive memory-centric benchmark with pass@k scoring and a staged LLM-as-judge pipeline, designed to rigorously test agents' ability to retain and use information across complex, multi-application workflows.

• 128 Tasks
• 26 Applications
• 68 Scenarios
• 36 Avg. Steps
• 1-4 Cross-App Span
• 89.8% Memory-Intensive Tasks

Key Contributions

• 📊 Memory Taxonomy: Systematic analysis of 11 agents across 5 architectures, identifying key memory mechanisms.
• 🎯 128 Curated Tasks: Cross-temporal and cross-spatial retention challenges across 26 real-world applications.
• ⚖️ MemGUI-Eval: Automated Progressive Scrutiny pipeline with 7 hierarchical metrics for accurate evaluation.
• 🔬 RQ-Driven Analysis: Comprehensive assessment revealing 5 distinct failure modes and design implications.

Evaluation Metrics

Short-Term Memory (pass@1)

  • SR - Success Rate: Baseline performance measurement
  • IRR - Information Retention Rate: Memory fidelity metric
  • MTPR - Memory-Task Proficiency Ratio: Memory-specific capability isolation

Long-Term Memory (pass@k)

  • SR@k - Multi-Attempt Success Rate: Cross-session learning capability
  • FRR - Failure Recovery Rate: Efficiency of learning from failure
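
As a concrete illustration of the pass@k protocol, the minimal sketch below computes SR (pass@1) and SR@k from per-attempt task outcomes. The task names and outcomes are hypothetical, and the exact definitions of IRR, MTPR, and FRR follow the paper and are not reproduced here.

```python
# Minimal sketch of the pass@1 / pass@k success-rate computation.
# Task names and outcomes are hypothetical; IRR, MTPR, and FRR are defined in
# the paper and are intentionally not reimplemented here.
from typing import Dict, List


def success_rate_at_k(results: Dict[str, List[bool]], k: int = 1) -> float:
    """SR@k: fraction of tasks solved in at least one of the first k attempts."""
    if not results:
        return 0.0
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Hypothetical per-task outcomes over 3 sequential attempts (cross-session setting).
outcomes = {
    "calendar_to_email": [False, True, True],
    "note_to_contacts":  [False, False, False],
    "shop_price_recall": [True,  True,  True],
}

print(f"pass@1 SR: {success_rate_at_k(outcomes, k=1):.1%}")  # 33.3%
print(f"pass@3 SR: {success_rate_at_k(outcomes, k=3):.1%}")  # 66.7%
```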

Framework Overview

Figure 1: MemGUI-Bench Overview

First comprehensive benchmark for GUI agent memory evaluation.

Figure 2: Task Suite Statistics

128 tasks (64 mirror pairs), 26 apps, 3 difficulty levels.

Figure 3: Plug-and-Play Framework (Unified Architecture)

Snapshot-based architecture with parallel execution.
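
As a sketch of what the plug-and-play, snapshot-based design implies, the code below shows a hypothetical agent adapter plus a snapshot-reset evaluation loop run in parallel across emulator instances. The `GUIAgent` interface, the `env` object, and its methods (`restore_snapshot`, `screenshot`, `execute`, `check_success`) are assumptions for illustration, not the framework's actual API.

```python
# Illustrative sketch only: a plug-and-play agent adapter plus snapshot-based
# environment reset, so multiple tasks run in parallel from identical device
# states. The `env` interface below is assumed, not the framework's real API.
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class GUIAgent(ABC):
    """Minimal adapter an agent implements to plug into the harness."""

    @abstractmethod
    def step(self, screenshot: bytes, instruction: str) -> str:
        """Return the next UI action (or 'DONE') for the current screen."""


def run_task(agent: GUIAgent, env, task: dict, max_steps: int = 36) -> bool:
    env.restore_snapshot(task["snapshot_id"])   # reset emulator to a known state
    for _ in range(max_steps):
        action = agent.step(env.screenshot(), task["instruction"])
        if action == "DONE":
            break
        env.execute(action)
    return env.check_success(task)              # success judged separately (MemGUI-Eval)


def run_parallel(agent: GUIAgent, env_pool, tasks) -> list:
    # One emulator instance per task shown for simplicity; snapshots keep runs
    # independent and reproducible.
    with ThreadPoolExecutor(max_workers=len(env_pool)) as pool:
        futures = [pool.submit(run_task, agent, env, task)
                   for env, task in zip(env_pool, tasks)]
        return [f.result() for f in futures]
```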

MemGUI-Eval

Progressive Scrutiny Evaluation for Memory-Intensive GUI Tasks

Traditional evaluation methods face critical limitations: rigid rule-based matching cannot handle semantic variations, while "LLM-as-Judge" approaches overwhelm models with complete trajectories.

MemGUI-Eval introduces a novel "Progressive Scrutiny" pipeline that mimics efficient human expert verification, achieving 95.9% F1-score at only $0.031 per trajectory.

1. Cost-Effective Triage: Rapid screening with minimal evidence (last 3 screenshots)
2. Full Semantic Analysis: Enriched textual descriptions for comprehensive judgment
3. Targeted Visual Verification: Precisely requested historical screenshots for the final decision
Figure: Three-Stage Progressive Scrutiny Pipeline

Four specialized agents: Triage Judge, Step Descriptor, Semantic Judge, and Visual Judge.
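
The control flow of the three stages can be sketched as follows. This is a hedged illustration under the assumption that each of the four specialized agents is a callable wrapping a multimodal LLM prompt; the names, the `Verdict` schema, and the escalation logic are illustrative rather than the released MemGUI-Eval implementation.

```python
# Hedged sketch of the three-stage Progressive Scrutiny control flow, assuming
# each of the four specialized agents is a callable wrapping an LLM prompt.
# Schemas and escalation logic are illustrative, not the released implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Verdict:
    success: bool
    confident: bool = True             # commit at this stage, or escalate
    needed_steps: List[int] = field(default_factory=list)


def progressive_scrutiny(
    task: str,
    screenshots: Sequence[bytes],
    triage_judge: Callable[[str, Sequence[bytes]], Verdict],
    step_descriptor: Callable[[str, bytes], str],
    semantic_judge: Callable[[str, List[str]], Verdict],
    visual_judge: Callable[[str, Sequence[bytes]], Verdict],
) -> bool:
    # Stage 1: cost-effective triage on minimal evidence (last 3 screenshots).
    verdict = triage_judge(task, screenshots[-3:])
    if verdict.confident:
        return verdict.success

    # Stage 2: enrich every step with a textual description, then judge the
    # full trajectory semantically.
    descriptions = [step_descriptor(task, s) for s in screenshots]
    verdict = semantic_judge(task, descriptions)
    if verdict.confident:
        return verdict.success

    # Stage 3: targeted visual verification on only the historical screenshots
    # the semantic judge requested.
    evidence = [screenshots[i] for i in verdict.needed_steps]
    return visual_judge(task, evidence).success
```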

Benchmark Comparison

Comprehensive comparison across the evaluation environment (memory tasks, cross-app tasks, total tasks, 3rd-party apps, auto reset), the evaluation pipeline (long-term memory, auto eval, memory metrics), and the number of agents tested.

Benchmark              Memory Tasks   Cross-app Tasks   Total Tasks   3rd-party Apps   Agents Tested
Rule-based Evaluation Pipeline
AndroidArena           22             22                221           1/4              1
AndroidWorld           6              6                 116           1/1              3
AndroidLab             45             0                 138           1/4              4
LlamaTouch             0              0                 495           1/1              4
B-MoCA                 0              0                 60            1/1              3
MobileAgentBench       0              0                 100           1/6              5
LLM-as-a-Judge Evaluation Pipeline
A3                     9              0                 201           1/2              6
SPA-Bench              40             40                340           1/7              11
MemGUI-Bench (Ours)    115            100               128           4/7              11

MemGUI-Bench is the first benchmark to systematically evaluate both short-term and long-term memory capabilities of Mobile GUI agents.

Key Results

Comprehensive evaluation of 11 GUI agents reveals critical insights about memory capabilities.

Research Questions & Key Findings

RQ1 How do current GUI agents perform on memory-intensive tasks?
4-10× capability gaps hidden by standard benchmarks. M3A achieves highest pass@1 SR (32.8%), while Agent-S2 demonstrates exceptional learning potential with highest pass@3 (49.2%). End-to-end models achieve only 0.0-6.2%. Performance drops dramatically from AndroidWorld to MemGUI-Bench (Agent-S2: 54.3%→27.3%, GUI-Owl-7B: 66.4%→6.2%).
RQ2 Are memory mechanisms essential or optional?
Short-term memory is mandatory; long-term memory is beneficial but optional. Removing short-term memory causes catastrophic collapse (M3A: 32.5%→2.5%, IRR: 35.1%→0.0%). Long-term memory provides +21.9 pp improvement but agents remain functional without it.
RQ3 How does cross-application complexity affect memory?
16-40 pp performance degradation, the primary memory bottleneck. M3A drops from 46.4% (1-app) to 30.0% (4-app), and Agent-S2 from 50.0% to 10.0%. Cross-app information transfer is the critical challenge.
RQ4 Can long-context capability improve performance?
+18.8 pp improvement, revealing untapped potential. M3A in the multi-turn setting achieves 51.6% (vs. 32.8% single-turn). UI-TARS-1.5-7B with a 5-turn limit achieves only 3.1-6.2%, confirming that context constraints severely limit performance.
RQ5 Can long-term memory enable cross-session learning?
+21.9 pp improvement, but it remains underutilized. Agent-S2 improves from 27.3% to 49.2% with a 21.5% FRR, while agents without explicit memory show minimal FRR (0.8-4.4%). Only 2 of 11 agents implement cross-session learning.
RQ6 What are the computational trade-offs?
Severe trade-offs under deployment constraints. Agent-S2 (41,760 tokens/step) drops from 49.2% to 0% under token limits. M3A shows the best balance, degrading gracefully (47.7%→21.9%) while using 31% of Agent-S2's tokens per step.

Citation

@misc{liu2026memguibenchbenchmarkingmemorymobile,
  title={MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments},
  author={Guangyi Liu and Pengxiang Zhao and Yaozhen Liang and Qinyi Luo and Shunye Tang and Yuxiang Chai and Weifeng Lin and Han Xiao and WenHao Wang and Siheng Chen and Zhengxi Lu and Gao Wu and Hao Wang and Liang Liu and Yong Liu},
  year={2026},
  eprint={2602.06075},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2602.06075},
}
