MemGUI-Bench

Submit Your Results

Join the MemGUI-Bench leaderboard and contribute to the research community

Quick Start

1. Run Benchmark

Launch the MemGUI containers and run your agent with mg eval. Logs are written to the local traj_logs/ directory.

2. Fill Metadata

Create or update docs/data/agents/{agent-id}.json with your metadata and MemGUI-Eval metrics.

3. Submit via GitHub

Submit the results JSON via a GitHub PR. Optionally upload the generated .json.gz and .mp4 trajectory preview files to HuggingFace.

Submission Methods

Results JSON → GitHub Pull Request

Submit your leaderboard results via GitHub PR:

  1. Fork lgy0404/MemGUI-Bench
  2. Run the official benchmark with mg eval and review the outputs under traj_logs/{run-name}/_memgui_eval/
  3. Add your results file as docs/data/agents/{agent-id}.json
  4. Add {agent-id} to docs/data/index.json
  5. Create a Pull Request with title: [Leaderboard] Add {Your Agent Name}
  6. Include the command, log root, pass@K setting, and trajectory link in the PR description
  7. Our team will review and merge within 3-5 business days
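
As a rough aid for steps 5 and 6, the PR title and description can be assembled from the run parameters. The sketch below is only an illustration: the function name build_pr_text and the layout of the description are assumptions, not a template required by the maintainers. Only the title format comes from the steps above.

```python
def build_pr_text(agent_name, command, log_root, pass_at_k, traj_link=None):
    """Return (title, body) for a leaderboard submission PR.

    The body layout is a hypothetical convention; the steps above only ask
    that these four pieces of information appear in the PR description.
    """
    title = f"[Leaderboard] Add {agent_name}"
    lines = [
        f"Command: {command}",
        f"Log root: {log_root}",
        f"pass@K setting: {pass_at_k}",
        f"Trajectory link: {traj_link or 'not provided'}",
    ]
    return title, "\n".join(lines)


title, body = build_pr_text(
    agent_name="YourAgent",
    command=(
        "sudo uv run mg eval --agent-type qwen3vl --model-name your-model-name "
        "--task ALL --pass-at-k 3 --log-file-root traj_logs/your-run-name "
        "--max-concurrency 2"
    ),
    log_root="traj_logs/your-run-name",
    pass_at_k=3,
)
print(title)
print(body)
```

Paste the printed title and body into the GitHub PR form, replacing the placeholder values with those from your own run.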

Trajectories → HuggingFace Dataset

Upload the static trajectory preview files via HuggingFace Pull Request:

  1. Generate docs/trajs/{agent-id}.json.gz and docs/trajs/{agent-id}.mp4 from your run
  2. Go to lgy0404/memgui-bench-trajs
  3. Click Community tab → New Pull Request → Upload files
  4. Upload both files under site/trajs/ and submit the PR

Results Format

Run the benchmark with the current MobileWorld-style CLI. Official leaderboard submissions should use the full task set and pass@3 unless otherwise noted by the benchmark maintainers:

sudo uv run mg eval \
  --agent-type qwen3vl \
  --model-name your-model-name \
  --task ALL \
  --pass-at-k 3 \
  --log-file-root traj_logs/your-run-name \
  --max-concurrency 2

After the run, MemGUI-Eval artifacts are saved under traj_logs/your-run-name/_memgui_eval/. Create or update docs/data/agents/{agent-id}.json using the schema below. Fill metadata manually and copy benchmark metrics from the MemGUI-Eval results.

{
  "name": "YourAgent",
  "backbone": "-",                    // Model backbone (e.g., "Gemini-2.5-Pro", or "-" for fine-tuned models)
  "type": "",                         // "Agentic Workflow" or "Agent-as-a-Model"
  "institution": "",                  // Your institution name
  "date": "",                         // Submission date (YYYY-MM-DD)
  "paperLink": "",                    // Paper URL (arxiv, etc.)
  "codeLink": "",                     // Code repository URL
  "trajFile": "trajs/your-agent-name.json.gz", // Optional; include when trajectory preview files are submitted
  "hasUITree": false,                 // true if agent uses UI tree/accessibility tree
  "hasLongTermMemory": false,         // true if agent has long-term memory capability
  
  // ↓↓↓ Auto-generated metrics (do not modify) ↓↓↓
  "crossApp": {
    "app1": { "p1": 21.4, "p3": 35.7, "irr": 11.7 },
    "app2": { "p1": 1.8, "p3": 1.8, "irr": 3.6 },
    "app3": { "p1": 2.9, "p3": 7.1, "irr": 7.1 },
    "app4": { "p1": 0.0, "p3": 4.0, "irr": 4.0 }
  },
  "difficulty": {
    "easy": { "p1": 14.6, "p3": 22.9, "irr": 9.89 },
    "medium": { "p1": 0.0, "p3": 2.4, "irr": 4.98 },
    "hard": { "p1": 2.6, "p3": 2.6, "irr": 2.63 }
  },
  "avg": { "p1": 6.2, "p3": 10.2 },
  "metrics": {
    "shortTerm": {
      "irr": 5.7,
      "mtpr": 0.07,
      "stepRatio": 0.92,
      "timePerStep": 9.6,
      "costPerStep": null
    },
    "longTerm": {
      "frr": 3.3,
      "stepRatio": 0.93,
      "timePerStep": 9.6,
      "costPerStep": null
    }
  }
}

Metadata Fields to Fill

| Field | Description | Example |
|---|---|---|
| backbone | Base model used | "Gemini-2.5-Pro", "GPT-4o", "-" (for fine-tuned) |
| type | Agent architecture type | "Agentic Workflow" or "Agent-as-a-Model" |
| institution | Affiliation | "MIT", "Google DeepMind" |
| date | Submission date | "2026-02-03" |
| paperLink | Paper URL | "https://arxiv.org/abs/..." |
| codeLink | Code repository | "https://github.com/..." |
| trajFile | Logical trajectory bundle path, only when preview files are submitted | "trajs/your-agent-name.json.gz" |
| hasUITree | Uses accessibility tree? | true / false |
| hasLongTermMemory | Has long-term memory? | true / false |
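
Before opening a PR, it can help to check the metadata fields locally. The following is a minimal sketch, not the maintainers' validator: the function name check_metadata and the exact rules (string fields may be empty, trajFile is optional) are assumptions based on the schema and table above. It expects a plain JSON object, so the // comments shown in the schema must be removed first, because standard JSON does not allow comments.

```python
import re
from datetime import date

ALLOWED_TYPES = {"Agentic Workflow", "Agent-as-a-Model"}
TEXT_FIELDS = ("name", "backbone", "type", "institution", "date", "paperLink", "codeLink")
FLAG_FIELDS = ("hasUITree", "hasLongTermMemory")


def check_metadata(entry):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in TEXT_FIELDS:
        if not isinstance(entry.get(field), str):
            problems.append(f"{field}: expected a string")
    for field in FLAG_FIELDS:
        if not isinstance(entry.get(field), bool):
            problems.append(f"{field}: expected true or false")
    if entry.get("type") not in ALLOWED_TYPES:
        problems.append("type: expected 'Agentic Workflow' or 'Agent-as-a-Model'")
    day = entry.get("date")
    if isinstance(day, str):
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", day):
            problems.append("date: expected YYYY-MM-DD")
        else:
            try:
                date.fromisoformat(day)  # rejects impossible dates such as month 13
            except ValueError:
                problems.append("date: not a valid calendar date")
    # trajFile is optional; check it only when present (assumed rule).
    traj = entry.get("trajFile")
    if traj is not None and not (isinstance(traj, str) and traj.endswith(".json.gz")):
        problems.append("trajFile: expected a path ending in .json.gz")
    return problems


example = {
    "name": "YourAgent",
    "backbone": "-",
    "type": "Agent-as-a-Model",
    "institution": "MIT",
    "date": "2026-02-03",
    "paperLink": "",
    "codeLink": "",
    "trajFile": "trajs/your-agent-name.json.gz",
    "hasUITree": False,
    "hasLongTermMemory": False,
}
print(check_metadata(example))
```

To check a real file, load it with json.load and pass the resulting dict to check_metadata.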

Trajectory Data

Important: Trajectory Submission

Trajectory submission is optional but highly recommended for the following reasons:

  • Enables verification of your results by our team
  • Allows reproduction and analysis by other researchers
  • Contributes to the research community's understanding of agent behavior

Session Output Structure

When you run mg eval, the following structure is generated under the local log root:

traj_logs/your-run-name/
├── metadata.json
├── 001-FindProductAndFilter/
│   ├── traj.json                  # MobileWorld trajectory
│   ├── result.txt                 # Task-level score
│   ├── thread_<id>.log
│   ├── screenshots/
│   └── marked_screenshots/
├── _attempt_trajs/
│   └── 001-FindProductAndFilter/
│       └── attempt_2/
│           ├── traj.json
│           ├── result.txt
│           └── screenshots/
└── _memgui_eval/
    ├── results.csv                # Aggregated MemGUI-Eval rows
    └── 001-FindProductAndFilter/
        └── qwen3vl/
            └── attempt_1/
                ├── log.json
                ├── 0.png, 1.png, ...
                ├── final_decision.json
                └── evaluation_summary.json
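
A quick way to confirm that a run produced the files shown above is to walk the log root and report anything missing. This sketch is a hypothetical helper (check_run_layout is not part of the mg CLI). It assumes that every top-level directory whose name does not start with an underscore is a task directory, and it checks only the files named in the tree above.

```python
from pathlib import Path


def check_run_layout(log_root):
    """Return the expected paths that are missing under log_root."""
    root = Path(log_root)
    if not root.is_dir():
        raise FileNotFoundError(f"log root not found: {root}")
    missing = []
    # Run-level files shown in the tree above.
    for rel in ("metadata.json", "_memgui_eval/results.csv"):
        if not (root / rel).is_file():
            missing.append(rel)
    # Assumption: task directories are the top-level directories without a "_" prefix.
    task_dirs = sorted(
        p for p in root.iterdir() if p.is_dir() and not p.name.startswith("_")
    )
    for task_dir in task_dirs:
        for name in ("traj.json", "result.txt"):
            if not (task_dir / name).is_file():
                missing.append(f"{task_dir.name}/{name}")
    return missing
```

Calling check_run_layout("traj_logs/your-run-name") returns an empty list when all of these files are present, and otherwise lists the relative paths that were not found.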

Upload to HuggingFace

Generate the static trajectory preview bundle and submit the two output files via Pull Request to lgy0404/memgui-bench-trajs:

# Step 1: Generate the static preview bundle
python3 docs/bundle_trajs.py traj_logs/your-run-name \
  -o docs/trajs/your-agent-name.json.gz \
  --with-screenshots

# This creates:
#   docs/trajs/your-agent-name.json.gz
#   docs/trajs/your-agent-name.mp4

# Step 2: Upload via HuggingFace Web UI
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" tab → "New Pull Request" → "Upload files"
# 3. Upload both files to site/trajs/ and submit the PR

File Naming Convention

Name both preview files to match your agent JSON filename (lowercase with hyphens), e.g.:

  • docs/trajs/m3a.json.gz and docs/trajs/m3a.mp4
  • docs/trajs/appagent.json.gz and docs/trajs/appagent.mp4
  • docs/trajs/mobile-agent-v2.json.gz and docs/trajs/mobile-agent-v2.mp4

Do not submit raw traj_logs zip files for leaderboard previews. Maintainers will review the PR and update the public manifest after the files are accepted.
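
The naming rule can be checked mechanically before uploading. The sketch below is illustrative only: preview_paths is a hypothetical helper, and the pattern (lowercase letters and digits, separated by single hyphens) is one reading of "lowercase with hyphens" that accepts the three examples above. It maps each local preview file to the site/trajs/ path used for the HuggingFace upload.

```python
import re

# Assumed reading of "lowercase with hyphens": groups of a-z and 0-9 joined by "-".
AGENT_ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preview_paths(agent_id):
    """Map each local preview file to its upload path under site/trajs/."""
    if not AGENT_ID_PATTERN.fullmatch(agent_id):
        raise ValueError(f"agent id must be lowercase with hyphens: {agent_id!r}")
    return {
        f"docs/trajs/{agent_id}{ext}": f"site/trajs/{agent_id}{ext}"
        for ext in (".json.gz", ".mp4")
    }


for local, remote in preview_paths("mobile-agent-v2").items():
    print(local, "->", remote)
```

An id such as "Mobile_Agent" raises ValueError, which signals that the files should be renamed before submission.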

Review Process

PR Submitted

Create a Pull Request to add your JSON to docs/data/agents/ and update docs/data/index.json.

Format Validation

We verify that all required fields are present and correctly formatted.

Results Verification

If trajectory preview files are provided on HuggingFace, we validate the bundle and inspect random task traces.

Leaderboard Update

Approved results are merged and visible on the leaderboard within 3-5 business days.

Frequently Asked Questions

Is trajectory submission required?

Trajectory submission is optional but recommended. We accept results without trajectories, but providing them enables verification and helps the research community. Submissions with trajectories may receive priority in the review process.

Can I update my results after submission?

Yes! You can submit updated results at any time via a new PR. We maintain version history and will update the leaderboard with your latest results. Please indicate in your PR that it updates existing results.

How are the metrics generated?

All performance metrics are auto-generated:

  • crossApp: Pass@1, Pass@3, IRR by number of apps (1-4)
  • difficulty: Pass@1, Pass@3, IRR by difficulty (easy, medium, hard)
  • avg: Overall Pass@1 and Pass@3
  • metrics.shortTerm: IRR, MTPR, stepRatio, timePerStep, costPerStep
  • metrics.longTerm: FRR, stepRatio, timePerStep, costPerStep

The current runner saves the raw evaluation table in _memgui_eval/results.csv, and the viewer can summarize it. For the leaderboard JSON, keep the computed metric values consistent with the run and fill in the metadata fields (backbone, type, institution, etc.).
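
A simple sanity check on the copied pass rates can catch transcription mistakes. The sketch below rests on two assumptions that this page does not state explicitly: the p1 and p3 values are percentages between 0 and 100, and Pass@3 counts a task as solved if any of up to three attempts succeeds, so it can never be lower than Pass@1. The function name check_pass_rates is hypothetical.

```python
def check_pass_rates(entry):
    """Return a list of buckets whose p1/p3 values look inconsistent."""
    problems = []
    buckets = {"avg": entry.get("avg", {})}
    for group in ("crossApp", "difficulty"):
        for key, value in entry.get(group, {}).items():
            buckets[f"{group}.{key}"] = value
    for label, bucket in buckets.items():
        p1, p3 = bucket.get("p1"), bucket.get("p3")
        if not all(isinstance(v, (int, float)) for v in (p1, p3)):
            problems.append(f"{label}: p1 and p3 must be numbers")
        elif not 0 <= p1 <= p3 <= 100:
            # Assumed invariant: Pass@3 is at least Pass@1, both in percent.
            problems.append(f"{label}: expected 0 <= p1 <= p3 <= 100, got p1={p1}, p3={p3}")
    return problems


# Values taken from the example schema in Results Format.
sample = {
    "crossApp": {
        "app1": {"p1": 21.4, "p3": 35.7, "irr": 11.7},
        "app2": {"p1": 1.8, "p3": 1.8, "irr": 3.6},
    },
    "difficulty": {"easy": {"p1": 14.6, "p3": 22.9, "irr": 9.89}},
    "avg": {"p1": 6.2, "p3": 10.2},
}
print(check_pass_rates(sample))
```

A non-empty result does not prove the run is wrong; it only flags values worth re-checking against _memgui_eval/results.csv.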

How do you verify submitted results?

We perform the following verification steps:
  1. Check format compliance and metric calculations
  2. If trajectory preview files are provided on HuggingFace, validate the .json.gz/.mp4 pair and inspect random task traces
  3. Compare against known baselines for sanity checks
  4. For significant improvements, we may request additional evidence or reproduction instructions

Need Help?

If you have questions about the submission process: