MemGUI-Eval | MemGUI-Bench

Pipeline Overview

Three-Stage Progressive Scrutiny: cost-effective triage → full semantic analysis → targeted visual verification. Four specialized agents work together, achieving a 95.9% F1-score at only $0.031 per trajectory.

Comparison with Existing Approaches

Traditional methods range from rigid rule-based matching (limited coverage) to "LLM-as-Judge" approaches that overwhelm models with complete trajectories (high cost, low accuracy).

Stage-by-Stage Examples

Cost-Effective Triage

Rapidly processes straightforward cases using minimal evidence (last 3 screenshots). Concludes "success" only when evidence irrefutably demonstrates task completion.
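As a rough illustration of the triage rule described above, the sketch below passes only the last three screenshots to a judge and accepts nothing but an unambiguous "success". The function name `triage`, the `judge` callable, and the convention of returning `None` to escalate are assumptions made for this example, not the actual MemGUI-Eval interface.

```python
from typing import Callable, List, Optional

# The text states that triage uses the last 3 screenshots.
TRIAGE_WINDOW = 3


def triage(
    screenshots: List[str],
    judge: Callable[[List[str]], str],
) -> Optional[str]:
    """Return "success" only on a definite success verdict, else None.

    `judge` is a hypothetical stand-in for a model call that returns a
    verdict string. Any verdict other than "success" is treated as
    inconclusive (an assumption), so the caller escalates to the next stage.
    """
    evidence = screenshots[-TRIAGE_WINDOW:]  # minimal evidence only
    verdict = judge(evidence)
    return "success" if verdict == "success" else None
```

Returning `None` rather than "failure" mirrors the stated asymmetry: the cheap stage may confirm success, while everything else is deferred to the more expensive stages.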

Full Semantic Analysis

When triage is inconclusive, the pipeline conducts a comprehensive semantic analysis: the Step Descriptor generates detailed descriptions, and the Semantic Judge synthesizes all context.
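The division of labor between the two agents in this stage might look like the following minimal sketch. The `Step` fields, the `describe` and `judge` callables, and the shape of the context dictionary are all hypothetical; the source names the Step Descriptor and Semantic Judge but does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    index: int        # position of the step in the trajectory
    action: str       # action taken at this step (hypothetical field)
    screenshot: str   # path or ID of the screenshot (hypothetical field)


def semantic_analysis(
    task: str,
    steps: List[Step],
    describe: Callable[[Step], str],
    judge: Callable[[Dict[str, object]], str],
) -> str:
    """Describe every step, then let a judge decide over the full context.

    `describe` stands in for the Step Descriptor and `judge` for the
    Semantic Judge; both are placeholders for model calls.
    """
    descriptions = [describe(step) for step in steps]
    context = {"task": task, "descriptions": descriptions}
    return judge(context)
```

In this sketch the judge sees only text descriptions, not raw screenshots, which is one plausible reading of "semantic" analysis.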

Success Case

Failed Case + IRR

Targeted Visual Verification

Core innovation: provides precisely the visual evidence the model actively requested via required_steps, rather than overwhelming it with all screenshots.
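The selection step described above can be sketched as a simple filter: given the step indices the model asked for in `required_steps`, return only those screenshots. The function name `select_evidence`, the integer step indices, and the dictionary layout are assumptions for illustration; only the `required_steps` name comes from the text.

```python
from typing import Dict, List


def select_evidence(
    screenshots: Dict[int, str],
    required_steps: List[int],
) -> Dict[int, str]:
    """Return only the screenshots whose step index was requested.

    Requested indices that do not exist in the trajectory are skipped,
    and duplicates collapse to a single entry. The result keeps the
    order in which steps were first requested.
    """
    return {
        step: screenshots[step]
        for step in required_steps
        if step in screenshots
    }
```

For a trajectory with dozens of steps, a request such as `required_steps=[3, 7]` would pass two screenshots to the verifier instead of the full set, which is the cost saving the stage is built around.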

Success Case

Failed Case + IRR
