MemGUI-Bench

MemGUI-Eval

Progressive Scrutiny Evaluation for Memory-Intensive GUI Tasks

Pipeline Overview

Figure: MemGUI-Eval pipeline overview.

Three-Stage Progressive Scrutiny: cost-effective triage → full semantic analysis → targeted visual verification. Four specialized agents work together, achieving a 95.9% F1-score at only $0.031 per trajectory.
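The escalation logic of the three stages can be sketched as plain control flow. This is a minimal illustration, not the MemGUI-Eval implementation: the names `Verdict`, `StageResult`, and `progressive_scrutiny` are hypothetical, the three judges are passed in as ordinary callables, and the assumption that Stage 1 only ever concludes "success" or escalates is inferred from the stage descriptions on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNCERTAIN = "uncertain"  # hypothetical label meaning "escalate"


@dataclass
class StageResult:
    verdict: Verdict
    # Steps whose screenshots the judge asks to see (used by Stage 3).
    required_steps: List[int] = field(default_factory=list)


def progressive_scrutiny(
    trajectory,
    triage: Callable[[object], StageResult],
    semantic: Callable[[object], StageResult],
    visual: Callable[[object, List[int]], StageResult],
) -> Tuple[int, Verdict]:
    """Return (stage that decided, verdict) for one trajectory."""
    # Stage 1: cheap triage; only an unambiguous success ends here.
    if triage(trajectory).verdict is Verdict.SUCCESS:
        return 1, Verdict.SUCCESS

    # Stage 2: full semantic analysis; may decide or request evidence.
    stage2 = semantic(trajectory)
    if stage2.verdict is not Verdict.UNCERTAIN:
        return 2, stage2.verdict

    # Stage 3: visual check restricted to the requested steps.
    stage3 = visual(trajectory, stage2.required_steps)
    return 3, stage3.verdict
```

Because each stage is only reached when the previous one could not decide, the more expensive calls are made for the harder trajectories only, which is consistent with the low per-trajectory cost reported above.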

Comparison with Existing Approaches

Figure: Evaluator comparison.

Traditional methods range from rigid rule-based matching (limited coverage) to "LLM-as-Judge" approaches that overwhelm models with complete trajectories (high cost, low accuracy).

Stage-by-Stage Examples

1. Cost-Effective Triage

Rapidly resolves straightforward cases using minimal evidence (the last 3 screenshots). It concludes "success" only when that evidence irrefutably demonstrates task completion; inconclusive cases move on to full semantic analysis.
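A minimal sketch of the triage inputs and decision rule, under stated assumptions: screenshots are represented as an ordered list of file names, and the judge's answer arrives as a short string. The function names `select_triage_evidence` and `triage_outcome` are hypothetical and do not come from the MemGUI-Eval code.

```python
from typing import List, Sequence


def select_triage_evidence(screenshots: Sequence[str], k: int = 3) -> List[str]:
    """Return the last k screenshots (all of them if fewer than k exist)."""
    if k <= 0:
        return []
    return list(screenshots[-k:])


def triage_outcome(judge_answer: str) -> str:
    """Accept only an exact "success" answer; anything else escalates."""
    if judge_answer.strip().lower() == "success":
        return "success"
    return "escalate"
```

The strict string match mirrors the conservative rule described above: a hedged or partial answer is treated as inconclusive rather than as a pass.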

Figure: Stage 1 success example.

2. Full Semantic Analysis

When triage is inconclusive, the evaluator conducts a comprehensive semantic analysis: the Step Descriptor generates detailed step descriptions, and the Semantic Judge synthesizes all of the context.
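One way to picture the hand-off from the Step Descriptor to the Semantic Judge is as a text context assembled from per-step descriptions. The sketch below is illustrative only: `build_semantic_context` is a hypothetical helper, and the layout of the context (a task line followed by numbered steps) is an assumption, not the prompt format used by MemGUI-Eval.

```python
from typing import List


def build_semantic_context(task: str, step_descriptions: List[str]) -> str:
    """Join the task and per-step descriptions into one text block."""
    lines = [f"Task: {task}", "Steps:"]
    # Number steps from 1 so a judge can refer back to them by index.
    for index, description in enumerate(step_descriptions, start=1):
        lines.append(f"  {index}. {description}")
    return "\n".join(lines)


context = build_semantic_context(
    "Copy the order number into a note",
    ["Opened the orders page", "Long-pressed the order number", "Pasted into Notes"],
)
```

Numbering the steps explicitly is a design choice that makes it straightforward for a judge to name specific steps later, for example when requesting screenshots.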

Figure: Stage 2 success case.
Figure: Stage 2 failed case + IRR.

3. Targeted Visual Verification

Core innovation: provides precisely the visual evidence the model actively requested via required_steps, rather than overwhelming it with all screenshots.
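The selection step can be sketched as a simple lookup. Only the field name `required_steps` comes from the text above; the function `select_requested_screenshots`, the 1-based step numbering, and the handling of duplicate or out-of-range indices are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def select_requested_screenshots(
    screenshots: Sequence[str], required_steps: Sequence[int]
) -> List[Tuple[int, str]]:
    """Return (step, screenshot) pairs for the requested steps only."""
    seen = set()
    selected = []
    for step in required_steps:
        # Assumed 1-based steps; skip duplicates and invalid indices.
        if 1 <= step <= len(screenshots) and step not in seen:
            seen.add(step)
            selected.append((step, screenshots[step - 1]))
    return selected


shots = ["s1.png", "s2.png", "s3.png", "s4.png", "s5.png"]
evidence = select_requested_screenshots(shots, [2, 5, 2, 9])
```

In this example only two of the five screenshots reach the visual check, in the order they were requested.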

Figure: Stage 3 success case.
Figure: Stage 3 failed case + IRR.

© 2026 MemGUI-Bench Team. All rights reserved.