Quick Start
1. Run Benchmark: Execute your agent on MemGUI-Bench using python run.py. Results are saved automatically.
2. Fill Metadata: Edit the generated {agent}.json file to fill in the metadata fields (backbone, institution, etc.).
3. Submit via GitHub: Submit your results JSON via GitHub PR, and upload your trajectories to HuggingFace.
Submission Methods
Results JSON → GitHub Pull Request
Submit your leaderboard results via GitHub PR:
- Fork lgy0404/MemGUI-Bench
- Copy your {agent}.json into the docs/data/agents/ directory (see the sketch below)
- Create a Pull Request with the title: [Leaderboard] Add {Your Agent Name}
- Our team will review and merge within 3-5 business days
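If you prefer to script the copy step above, a minimal sketch is shown below. It assumes your fork is already cloned locally; the session path, agent filename, and clone path are placeholders to adjust.

```python
# Minimal sketch: copy the generated results JSON into a local clone of your fork.
# All paths below are hypothetical placeholders; adjust them to your setup.
from pathlib import Path
import shutil

results_json = Path("results/session-20260203-120000/your-agent-name.json")  # placeholder session path
fork_clone = Path("~/code/MemGUI-Bench").expanduser()                        # placeholder path to your fork clone

dest_dir = fork_clone / "docs" / "data" / "agents"
dest_dir.mkdir(parents=True, exist_ok=True)
shutil.copy2(results_json, dest_dir / results_json.name)
print(f"Copied {results_json.name} to {dest_dir}")
```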
Trajectories → HuggingFace Dataset
Upload your execution trajectories via Pull Request:
- Compress your session folder: zip -r your-agent-name.zip session-{id}
- Go to lgy0404/memgui-bench-trajs
- Click Community tab → New Pull Request → Upload files
- Upload your zip file and submit the PR
Results Format
After running python run.py, a JSON file {agent}.json is generated in your session folder. Fill in the metadata fields before submission:
{
"name": "YourAgent",
"backbone": "-", // Model backbone (e.g., "Gemini-2.5-Pro", or "-" for fine-tuned models)
"type": "", // "Agentic Workflow" or "Agent-as-a-Model"
"institution": "", // Your institution name
"date": "", // Submission date (YYYY-MM-DD)
"paperLink": "", // Paper URL (arxiv, etc.)
"codeLink": "", // Code repository URL
"hasUITree": false, // true if agent uses UI tree/accessibility tree
"hasLongTermMemory": false, // true if agent has long-term memory capability
// ↓↓↓ Auto-generated metrics (do not modify) ↓↓↓
"crossApp": {
"app1": { "p1": 21.4, "p3": 35.7, "irr": 11.7 },
"app2": { "p1": 1.8, "p3": 1.8, "irr": 3.6 },
"app3": { "p1": 2.9, "p3": 7.1, "irr": 7.1 },
"app4": { "p1": 0.0, "p3": 4.0, "irr": 4.0 }
},
"difficulty": {
"easy": { "p1": 14.6, "p3": 22.9, "irr": 9.89 },
"medium": { "p1": 0.0, "p3": 2.4, "irr": 4.98 },
"hard": { "p1": 2.6, "p3": 2.6, "irr": 2.63 }
},
"avg": { "p1": 6.2, "p3": 10.2 },
"metrics": {
"shortTerm": {
"irr": 5.7,
"mtpr": 0.07,
"stepRatio": 0.92,
"timePerStep": 9.6,
"costPerStep": null
},
"longTerm": {
"frr": 3.3,
"stepRatio": 0.93,
"timePerStep": 9.6,
"costPerStep": null
}
}
}
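If you would rather fill the metadata programmatically than edit the file by hand, a minimal sketch is shown below. It assumes the generated file is plain JSON (without the explanatory comments shown above); the path and field values are placeholders, and the auto-generated metric fields are left untouched.

```python
# Minimal sketch: fill in the metadata fields of the generated {agent}.json.
# The path and values are placeholders; metric fields produced by run.py are not touched.
import json
from pathlib import Path

agent_json = Path("results/session-20260203-120000/your-agent-name.json")  # placeholder path

data = json.loads(agent_json.read_text(encoding="utf-8"))

data.update({
    "backbone": "Gemini-2.5-Pro",        # or "-" for fine-tuned models
    "type": "Agentic Workflow",          # or "Agent-as-a-Model"
    "institution": "Your Institution",
    "date": "2026-02-03",
    "paperLink": "https://arxiv.org/abs/...",
    "codeLink": "https://github.com/...",
    "hasUITree": False,
    "hasLongTermMemory": False,
})

agent_json.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"Updated metadata in {agent_json}")
```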
Metadata Fields to Fill
| Field | Description | Example |
|---|---|---|
| backbone | Base model used | "Gemini-2.5-Pro", "GPT-4o", "-" (for fine-tuned) |
| type | Agent architecture type | "Agentic Workflow" or "Agent-as-a-Model" |
| institution | Affiliation | "MIT", "Google DeepMind" |
| date | Submission date | "2026-02-03" |
| paperLink | Paper URL | "https://arxiv.org/abs/..." |
| codeLink | Code repository | "https://github.com/..." |
| hasUITree | Uses accessibility tree? | true / false |
| hasLongTermMemory | Has long-term memory? | true / false |
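Before submitting, a quick self-check of these fields can catch obvious omissions. The sketch below is illustrative only (it is not the official format validator) and checks only the fields listed in the table above.

```python
# Minimal sketch: check that the metadata fields above are filled in plausibly.
# Illustrative pre-submission check, not the official format validator.
import json
import re
import sys
from pathlib import Path

REQUIRED_TEXT = ["name", "backbone", "type", "institution", "date", "paperLink", "codeLink"]
ALLOWED_TYPES = {"Agentic Workflow", "Agent-as-a-Model"}

def check(path: str) -> list[str]:
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []
    for field in REQUIRED_TEXT:
        if not str(data.get(field, "")).strip():
            problems.append(f"missing or empty field: {field}")
    if data.get("type") and data["type"] not in ALLOWED_TYPES:
        problems.append(f"type should be one of {sorted(ALLOWED_TYPES)}")
    if data.get("date") and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", data["date"]):
        problems.append("date should use the YYYY-MM-DD format")
    for field in ("hasUITree", "hasLongTermMemory"):
        if not isinstance(data.get(field), bool):
            problems.append(f"{field} should be true or false")
    return problems

if __name__ == "__main__":
    issues = check(sys.argv[1])
    print("\n".join(issues) if issues else "Metadata looks complete.")
```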
Trajectory Data
Important: Trajectory Submission
Trajectory submission is optional but highly recommended for the following reasons:
- Enables verification of your results by our team
- Allows reproduction and analysis by other researchers
- Contributes to the research community's understanding of agent behavior
Session Output Structure
When you run python run.py, the following structure is generated:
results/session-{session_id}/
├── results.csv # Aggregated execution & evaluation metrics
├── results.csv.lock # File lock for concurrent access
├── metrics_summary.json # Computed benchmark metrics
├── {agent_name}.json # Leaderboard format (submit this!)
├── config.yaml # Config snapshot for reproducibility
│
└── {task_id}/ # Per-task results
└── {agent_name}/
└── attempt_{n}/
├── log.json # Execution log with actions
├── 0.png, 1.png, ... # Raw screenshots per step
├── stdout.txt, stderr.txt # Process output logs
├── error.json # Error info (if any)
│
├── visualize_actions/ # Action visualization images
│ └── step_1.png, step_2.png, ...
│
├── single_actions/ # Individual action screenshots
│ └── step_1.png, step_2.png, ...
│
├── puzzle/ # Evaluation puzzle images
│ ├── puzzle.png
│ ├── pre_eval_puzzle.png
│ └── supplemental_puzzle.png (if needed)
│
├── evaluation_summary.json # Detailed evaluation results
├── final_decision.json # Final evaluation decision
├── irr_analysis.json # IRR evaluation results
├── badcase_analysis.json # BadCase classification
└── step_*_description.json # Step-by-step analysis
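Before compressing the session folder for upload, you can confirm that the top-level files listed above are present. A minimal sketch follows; the session path and agent name are placeholders.

```python
# Minimal sketch: confirm a session folder has the expected top-level files before zipping.
# The session path and agent name are hypothetical placeholders.
from pathlib import Path

session_dir = Path("results/session-20260203-120000")  # placeholder session folder
agent_name = "your-agent-name"                          # placeholder agent name

expected = [
    "results.csv",
    "metrics_summary.json",
    f"{agent_name}.json",
    "config.yaml",
]

missing = [name for name in expected if not (session_dir / name).exists()]
task_dirs = [p for p in session_dir.iterdir() if p.is_dir()]

print(f"Per-task result directories: {len(task_dirs)}")
print("Missing top-level files:", missing or "none")
```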
Upload to HuggingFace
Compress your session folder and submit via Pull Request to lgy0404/memgui-bench-trajs:
# Step 1: Compress session folder (filename should match your agent JSON)
cd results
zip -r your-agent-name.zip session-{session_id}
# Step 2: Upload via HuggingFace Web UI
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" tab → "New Pull Request" → "Upload files"
# 3. Upload your zip file and submit the PR
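As an alternative to the web UI, the same upload PR can typically be opened from Python with the huggingface_hub library. This is a hedged sketch: it assumes you have a Hugging Face account, a write token (e.g. via huggingface-cli login), and huggingface_hub installed; the local zip path is a placeholder.

```python
# Minimal sketch: open an upload PR to the trajectories dataset with huggingface_hub.
# Assumes `pip install huggingface_hub` and that you are logged in with a write token.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="results/your-agent-name.zip",   # placeholder local zip path
    path_in_repo="your-agent-name.zip",
    repo_id="lgy0404/memgui-bench-trajs",
    repo_type="dataset",
    create_pr=True,                                   # opens a Pull Request instead of pushing to main
    commit_message="Add trajectories for YourAgent",
)
```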
File Naming Convention
Name your zip file to match your agent JSON filename (lowercase with hyphens), e.g.:
m3a.zip, appagent.zip, mobile-agent-v2.zip
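Purely as an illustration of the convention (not an official tool), a small helper can derive the expected filename from an agent's display name:

```python
# Minimal sketch: derive a lowercase, hyphenated zip filename from an agent name.
# Illustrative only; just make sure the final name matches your {agent}.json filename.
import re

def zip_name(agent_name: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", agent_name.lower()).strip("-")
    return f"{slug}.zip"

print(zip_name("Mobile-Agent-v2"))  # -> mobile-agent-v2.zip
```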
Review Process
1. PR Submitted: Create a Pull Request adding your JSON to docs/data/agents/.
2. Format Validation: We verify that all required fields are present and correctly formatted.
3. Results Verification: If trajectories are provided on HuggingFace, we run spot-checks using MemGUI-Eval.
4. Leaderboard Update: Approved results are merged and appear on the leaderboard within 3-5 business days.
Frequently Asked Questions
What do the auto-generated metric fields in {agent}.json mean?
- crossApp: Pass@1, Pass@3, and IRR broken down by the number of apps involved (1-4)
- difficulty: Pass@1, Pass@3, and IRR broken down by difficulty (easy, medium, hard)
- avg: overall Pass@1 and Pass@3
- metrics.shortTerm: IRR, MTPR, stepRatio, timePerStep, costPerStep
- metrics.longTerm: FRR, stepRatio, timePerStep, costPerStep
How are submitted results verified?
- We check format compliance and metric calculations
- If trajectories are provided on HuggingFace, we run MemGUI-Eval on a random sample (10-20% of tasks)
- We compare results against known baselines as a sanity check
- For significant improvements, we may request additional evidence or reproduction instructions
Need Help?
If you have questions about the submission process: