Quick Start
Run Benchmark
Launch the MemGUI containers and run your agent with mg eval. Logs are written to the local traj_logs/ directory.
Fill Metadata
Create or update docs/data/agents/{agent-id}.json with metadata and MemGUI-Eval metrics.
Submit via GitHub
Submit the results JSON via GitHub PR. Optionally upload the generated .json.gz and .mp4 trajectory preview files to HuggingFace.
Submission Methods
Results JSON → GitHub Pull Request
Submit your leaderboard results via GitHub PR:
- Fork lgy0404/MemGUI-Bench
- Run the official benchmark with mg eval and review the outputs under traj_logs/{run-name}/_memgui_eval/
- Add your results file as docs/data/agents/{agent-id}.json
- Add {agent-id} to docs/data/index.json
- Create a Pull Request with title: [Leaderboard] Add {Your Agent Name}
- Include the command, log root, pass@K setting, and trajectory link in the PR description
- Our team will review and merge within 3-5 business days
Trajectories → HuggingFace Dataset
Upload the static trajectory preview files via HuggingFace Pull Request:
- Generate docs/trajs/{agent-id}.json.gz and docs/trajs/{agent-id}.mp4 from your run
- Go to lgy0404/memgui-bench-trajs
- Click Community tab → New Pull Request → Upload files
- Upload both files under site/trajs/ and submit the PR
Results Format
Run the benchmark with the current MobileWorld-style CLI. Official leaderboard submissions should use the full task set and pass@3 unless otherwise noted by the benchmark maintainers:
sudo uv run mg eval \
--agent-type qwen3vl \
--model-name your-model-name \
--task ALL \
--pass-at-k 3 \
--log-file-root traj_logs/your-run-name \
--max-concurrency 2
After the run, MemGUI-Eval artifacts are saved under traj_logs/your-run-name/_memgui_eval/. Create or update docs/data/agents/{agent-id}.json using the schema below. Fill metadata manually and copy benchmark metrics from the MemGUI-Eval results.
{
  "name": "YourAgent",
  "backbone": "-",                  // Model backbone (e.g., "Gemini-2.5-Pro", or "-" for fine-tuned models)
  "type": "",                       // "Agentic Workflow" or "Agent-as-a-Model"
  "institution": "",                // Your institution name
  "date": "",                       // Submission date (YYYY-MM-DD)
  "paperLink": "",                  // Paper URL (arxiv, etc.)
  "codeLink": "",                   // Code repository URL
  "trajFile": "trajs/your-agent-name.json.gz",  // Optional; include when trajectory preview files are submitted
  "hasUITree": false,               // true if agent uses UI tree/accessibility tree
  "hasLongTermMemory": false,       // true if agent has long-term memory capability

  // ↓↓↓ Auto-generated metrics (do not modify) ↓↓↓
  "crossApp": {
    "app1": { "p1": 21.4, "p3": 35.7, "irr": 11.7 },
    "app2": { "p1": 1.8, "p3": 1.8, "irr": 3.6 },
    "app3": { "p1": 2.9, "p3": 7.1, "irr": 7.1 },
    "app4": { "p1": 0.0, "p3": 4.0, "irr": 4.0 }
  },
  "difficulty": {
    "easy": { "p1": 14.6, "p3": 22.9, "irr": 9.89 },
    "medium": { "p1": 0.0, "p3": 2.4, "irr": 4.98 },
    "hard": { "p1": 2.6, "p3": 2.6, "irr": 2.63 }
  },
  "avg": { "p1": 6.2, "p3": 10.2 },
  "metrics": {
    "shortTerm": {
      "irr": 5.7,
      "mtpr": 0.07,
      "stepRatio": 0.92,
      "timePerStep": 9.6,
      "costPerStep": null
    },
    "longTerm": {
      "frr": 3.3,
      "stepRatio": 0.93,
      "timePerStep": 9.6,
      "costPerStep": null
    }
  }
}
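Before opening the PR, it can help to confirm that your file has the same shape as the example above. The following is a minimal sketch and not part of the MemGUI-Bench tooling: the function name missing_metric_keys is hypothetical, the expected keys are simply read off the example, and it assumes the submitted file is plain JSON with the // annotations removed.

```python
# Expected metric layout, read off the example schema above.
# This is an assumption for illustration, not an official specification.
EXPECTED = {
    "crossApp": {f"app{i}": ("p1", "p3", "irr") for i in range(1, 5)},
    "difficulty": {k: ("p1", "p3", "irr") for k in ("easy", "medium", "hard")},
}
SHORT_TERM = ("irr", "mtpr", "stepRatio", "timePerStep", "costPerStep")
LONG_TERM = ("frr", "stepRatio", "timePerStep", "costPerStep")


def missing_metric_keys(agent: dict) -> list[str]:
    """Return dotted paths of metric keys that are absent from `agent`.

    A key whose value is null (None) still counts as present.
    """
    missing = []
    for section, groups in EXPECTED.items():
        for group, keys in groups.items():
            entry = agent.get(section, {}).get(group, {})
            missing += [f"{section}.{group}.{k}" for k in keys if k not in entry]

    avg = agent.get("avg", {})
    missing += [f"avg.{k}" for k in ("p1", "p3") if k not in avg]

    metrics = agent.get("metrics", {})
    for name, keys in (("shortTerm", SHORT_TERM), ("longTerm", LONG_TERM)):
        entry = metrics.get(name, {})
        missing += [f"metrics.{name}.{k}" for k in keys if k not in entry]
    return missing
```

To use it, load your file with json.load and pass the resulting dictionary to missing_metric_keys; an empty list means every metric key from the example is present.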
Metadata Fields to Fill
| Field | Description | Example |
|---|---|---|
| backbone | Base model used | "Gemini-2.5-Pro", "GPT-4o", "-" (for fine-tuned) |
| type | Agent architecture type | "Agentic Workflow" or "Agent-as-a-Model" |
| institution | Affiliation | "MIT", "Google DeepMind" |
| date | Submission date | "2026-02-03" |
| paperLink | Paper URL | "https://arxiv.org/abs/..." |
| codeLink | Code repository | "https://github.com/..." |
| trajFile | Logical trajectory bundle path, only when preview files are submitted | "trajs/your-agent-name.json.gz" |
| hasUITree | Uses accessibility tree? | true / false |
| hasLongTermMemory | Has long-term memory? | true / false |
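The constraints in the table above can be checked mechanically before you submit. This is a minimal sketch under stated assumptions: metadata_problems is a hypothetical helper, not a MemGUI-Bench function, and it only covers the fields whose format the table states explicitly (backbone, type, date, and the two boolean flags). The maintainers' own format validation may be stricter or differ in detail.

```python
import re
from datetime import datetime

# The two architecture labels listed in the table above.
ALLOWED_TYPES = {"Agentic Workflow", "Agent-as-a-Model"}


def _is_iso_date(value) -> bool:
    """True only for a real calendar date written as zero-padded YYYY-MM-DD."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def metadata_problems(agent: dict) -> list[str]:
    """Return human-readable problems found in the metadata fields."""
    problems = []

    backbone = agent.get("backbone")
    if not isinstance(backbone, str) or not backbone.strip():
        problems.append('backbone: expected a model name or "-"')

    if agent.get("type") not in ALLOWED_TYPES:
        problems.append('type: expected "Agentic Workflow" or "Agent-as-a-Model"')

    if not _is_iso_date(agent.get("date")):
        problems.append("date: expected YYYY-MM-DD")

    for field in ("hasUITree", "hasLongTermMemory"):
        if not isinstance(agent.get(field), bool):
            problems.append(f"{field}: expected true or false")

    return problems
```

The regular expression runs before strptime because strptime on its own accepts unpadded values such as "2026-2-3".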
Trajectory Data
Important: Trajectory Submission
Trajectory submission is optional but highly recommended for the following reasons:
- Enables verification of your results by our team
- Allows reproduction and analysis by other researchers
- Contributes to the research community's understanding of agent behavior
Session Output Structure
When you run mg eval, the following structure is generated under the local log root:
traj_logs/your-run-name/
├── metadata.json
├── 001-FindProductAndFilter/
│   ├── traj.json                  # MobileWorld trajectory
│   ├── result.txt                 # Task-level score
│   ├── thread_<id>.log
│   ├── screenshots/
│   └── marked_screenshots/
├── _attempt_trajs/
│   └── 001-FindProductAndFilter/
│       └── attempt_2/
│           ├── traj.json
│           ├── result.txt
│           └── screenshots/
└── _memgui_eval/
    ├── results.csv                # Aggregated MemGUI-Eval rows
    └── 001-FindProductAndFilter/
        └── qwen3vl/
            └── attempt_1/
                ├── log.json
                ├── 0.png, 1.png, ...
                ├── final_decision.json
                └── evaluation_summary.json
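A quick way to spot incomplete runs is to walk the log root and confirm that each task directory contains the files shown in the tree above. The sketch below is illustrative only: incomplete_tasks is a hypothetical helper, and it assumes that every top-level directory whose name does not start with an underscore is a task directory that should hold traj.json and result.txt.

```python
from pathlib import Path

# Files the tree above shows inside each task directory.
EXPECTED_FILES = ("traj.json", "result.txt")


def incomplete_tasks(log_root: str) -> dict[str, list[str]]:
    """Map each task directory name to the expected files it is missing."""
    report = {}
    for task_dir in sorted(Path(log_root).iterdir()):
        # Skip plain files (metadata.json) and the underscore-prefixed
        # directories (_attempt_trajs, _memgui_eval), which are not tasks.
        if not task_dir.is_dir() or task_dir.name.startswith("_"):
            continue
        missing = [n for n in EXPECTED_FILES if not (task_dir / n).is_file()]
        if missing:
            report[task_dir.name] = missing
    return report
```

Calling incomplete_tasks("traj_logs/your-run-name") returns an empty dictionary when every task directory has both files.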
Upload to HuggingFace
Generate the static trajectory preview bundle and submit the two output files via Pull Request to lgy0404/memgui-bench-trajs:
# Step 1: Generate the static preview bundle
python3 docs/bundle_trajs.py traj_logs/your-run-name \
-o docs/trajs/your-agent-name.json.gz \
--with-screenshots
# This creates:
# docs/trajs/your-agent-name.json.gz
# docs/trajs/your-agent-name.mp4
# Step 2: Upload via HuggingFace Web UI
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" tab → "New Pull Request" → "Upload files"
# 3. Upload both files to site/trajs/ and submit the PR
File Naming Convention
Name both preview files to match your agent JSON filename (lowercase with hyphens), e.g.:
- docs/trajs/m3a.json.gz and docs/trajs/m3a.mp4
- docs/trajs/appagent.json.gz and docs/trajs/appagent.mp4
- docs/trajs/mobile-agent-v2.json.gz and docs/trajs/mobile-agent-v2.mp4
Do not submit raw traj_logs zip files for leaderboard previews. Maintainers will review the PR and update the public manifest after the files are accepted.
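The naming convention above can be expressed as a small check that derives both preview paths from an agent ID. This is a sketch, not project code: preview_paths is a hypothetical helper, and the pattern is one reading of "lowercase with hyphens" (lowercase letters and digits, separated by single hyphens).

```python
import re

# Lowercase letters and digits, optionally separated by single hyphens.
# This pattern is an interpretation of the convention, not an official rule.
AGENT_ID = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preview_paths(agent_id: str) -> tuple[str, str]:
    """Return the (.json.gz, .mp4) preview paths for a valid agent ID."""
    if not AGENT_ID.fullmatch(agent_id):
        raise ValueError(f"agent ID {agent_id!r} is not lowercase-with-hyphens")
    return (f"docs/trajs/{agent_id}.json.gz", f"docs/trajs/{agent_id}.mp4")
```

For example, preview_paths("mobile-agent-v2") yields the third pair in the list above, while an ID such as "Mobile_Agent" raises ValueError.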
Review Process
PR Submitted
Create a Pull Request to add your JSON to docs/data/agents/ and update docs/data/index.json.
Format Validation
We verify that all required fields are present and correctly formatted.
Results Verification
If trajectory preview files are provided on HuggingFace, we validate the bundle and inspect random task traces.
Leaderboard Update
Approved results are merged and visible on the leaderboard within 3-5 business days.
Frequently Asked Questions
- crossApp: Pass@1, Pass@3, IRR by number of apps (1-4)
- difficulty: Pass@1, Pass@3, IRR by difficulty (easy, medium, hard)
- avg: Overall Pass@1 and Pass@3
- metrics.shortTerm: IRR, MTPR, stepRatio, timePerStep, costPerStep
- metrics.longTerm: FRR, stepRatio, timePerStep, costPerStep
_memgui_eval/results.csv, and the viewer can summarize it. For the leaderboard JSON, keep those computed metric values consistent with the run and fill in the metadata fields (backbone, type, institution, etc.).
- Check format compliance and metric calculations
- If trajectory preview files are provided on HuggingFace, validate the .json.gz/.mp4 pair and inspect random task traces
- Compare against known baselines for sanity checks
- For significant improvements, we may request additional evidence or reproduction instructions
Need Help?
If you have questions about the submission process: