Submit Results | MemGUI-Bench

Quick Start

1. Run Benchmark

Run your agent on MemGUI-Bench with python run.py. Results are saved automatically under results/session-{session_id}/.

2. Fill Metadata

Edit the generated {agent}.json file to fill in metadata fields (backbone, institution, etc.).

3. Submit via GitHub

Submit the results JSON via a GitHub PR, and upload your trajectories to HuggingFace (optional but recommended).

Submission Methods

Results JSON → GitHub Pull Request

Submit your leaderboard results via GitHub PR:

1. Fork lgy0404/MemGUI-Bench.
2. Copy your {agent}.json to the docs/data/agents/ directory.
3. Create a Pull Request with the title: [Leaderboard] Add {Your Agent Name}
4. Our team will review and merge within 3-5 business days.

Trajectories → HuggingFace Dataset

Upload your execution trajectories via Pull Request:

1. Compress your session folder: zip -r your-agent-name.zip session-{session_id}
2. Go to lgy0404/memgui-bench-trajs on HuggingFace.
3. Click the Community tab → New Pull Request → Upload files.
4. Upload your zip file and submit the PR.

Results Format

After running python run.py, a JSON file {agent}.json is generated in your session folder. Fill in the metadata fields before submission:

{
  "name": "YourAgent",
  "backbone": "-",                    // Model backbone (e.g., "Gemini-2.5-Pro", or "-" for fine-tuned models)
  "type": "",                         // "Agentic Workflow" or "Agent-as-a-Model"
  "institution": "",                  // Your institution name
  "date": "",                         // Submission date (YYYY-MM-DD)
  "paperLink": "",                    // Paper URL (arxiv, etc.)
  "codeLink": "",                     // Code repository URL
  "hasUITree": false,                 // true if agent uses UI tree/accessibility tree
  "hasLongTermMemory": false,         // true if agent has long-term memory capability
  
  // ↓↓↓ Auto-generated metrics (do not modify) ↓↓↓
  "crossApp": {
    "app1": { "p1": 21.4, "p3": 35.7, "irr": 11.7 },
    "app2": { "p1": 1.8, "p3": 1.8, "irr": 3.6 },
    "app3": { "p1": 2.9, "p3": 7.1, "irr": 7.1 },
    "app4": { "p1": 0.0, "p3": 4.0, "irr": 4.0 }
  },
  "difficulty": {
    "easy": { "p1": 14.6, "p3": 22.9, "irr": 9.89 },
    "medium": { "p1": 0.0, "p3": 2.4, "irr": 4.98 },
    "hard": { "p1": 2.6, "p3": 2.6, "irr": 2.63 }
  },
  "avg": { "p1": 6.2, "p3": 10.2 },
  "metrics": {
    "shortTerm": {
      "irr": 5.7,
      "mtpr": 0.07,
      "stepRatio": 0.92,
      "timePerStep": 9.6,
      "costPerStep": null
    },
    "longTerm": {
      "frr": 3.3,
      "stepRatio": 0.93,
      "timePerStep": 9.6,
      "costPerStep": null
    }
  }
}

Metadata Fields to Fill

Field	Description	Example
`backbone`	Base model used	"Gemini-2.5-Pro", "GPT-4o", "-" (for fine-tuned)
`type`	Agent architecture type	"Agentic Workflow" or "Agent-as-a-Model"
`institution`	Affiliation	"MIT", "Google DeepMind"
`date`	Submission date	"2026-02-03"
`paperLink`	Paper URL	"https://arxiv.org/abs/..."
`codeLink`	Code repository	"https://github.com/..."
`hasUITree`	Uses accessibility tree?	true / false
`hasLongTermMemory`	Has long-term memory?	true / false
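Before opening the PR, it can help to check the metadata fields locally. The following is a minimal sketch, not the maintainers' validator: the function name `check_metadata` and the choice of which fields count as required (here `name`, `backbone`, `type`, `institution`, `date`) are assumptions made for illustration. The allowed `type` values and the YYYY-MM-DD date format come from the table above.

```python
import re
from datetime import date

# Values taken from the metadata table; which fields are mandatory is an assumption.
ALLOWED_TYPES = {"Agentic Workflow", "Agent-as-a-Model"}
REQUIRED_STRINGS = ("name", "backbone", "type", "institution", "date")
BOOL_FIELDS = ("hasUITree", "hasLongTermMemory")


def check_metadata(entry: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for field in REQUIRED_STRINGS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: missing or empty")
    for field in BOOL_FIELDS:
        if not isinstance(entry.get(field), bool):
            problems.append(f"{field}: must be true or false")

    agent_type = entry.get("type")
    if isinstance(agent_type, str) and agent_type and agent_type not in ALLOWED_TYPES:
        problems.append("type: must be 'Agentic Workflow' or 'Agent-as-a-Model'")

    date_value = entry.get("date")
    if isinstance(date_value, str) and date_value:
        # Require the literal YYYY-MM-DD shape, then check it is a real calendar date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_value):
            problems.append("date: expected YYYY-MM-DD")
        else:
            try:
                date.fromisoformat(date_value)
            except ValueError:
                problems.append("date: not a valid calendar date")
    return problems
```

To use it, load your {agent}.json with `json.load` and pass the resulting dictionary to `check_metadata`; the sketch leaves `paperLink` and `codeLink` unchecked because an agent may not have a paper or public code yet.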

Trajectory Data

Important: Trajectory Submission

Trajectory submission is optional but highly recommended for the following reasons:

Enables verification of your results by our team
Allows reproduction and analysis by other researchers
Contributes to the research community's understanding of agent behavior

Session Output Structure

When you run python run.py, the following structure is generated:

results/session-{session_id}/
├── results.csv                    # Aggregated execution & evaluation metrics
├── results.csv.lock               # File lock for concurrent access
├── metrics_summary.json           # Computed benchmark metrics
├── {agent_name}.json              # Leaderboard format (submit this!)
├── config.yaml                    # Config snapshot for reproducibility
│
└── {task_id}/                     # Per-task results
    └── {agent_name}/
        └── attempt_{n}/
            ├── log.json                    # Execution log with actions
            ├── 0.png, 1.png, ...          # Raw screenshots per step
            ├── stdout.txt, stderr.txt     # Process output logs
            ├── error.json                 # Error info (if any)
            │
            ├── visualize_actions/         # Action visualization images
            │   └── step_1.png, step_2.png, ...
            │
            ├── single_actions/            # Individual action screenshots
            │   └── step_1.png, step_2.png, ...
            │
            ├── puzzle/                    # Evaluation puzzle images
            │   ├── puzzle.png
            │   ├── pre_eval_puzzle.png
            │   └── supplemental_puzzle.png (if needed)
            │
            ├── evaluation_summary.json    # Detailed evaluation results
            ├── final_decision.json        # Final evaluation decision
            ├── irr_analysis.json          # IRR evaluation results
            ├── badcase_analysis.json      # BadCase classification
            └── step_*_description.json    # Step-by-step analysis
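To get a quick overview of a session before uploading it, the layout above can be walked with `pathlib`. This is a rough sketch that assumes only the {task_id}/{agent_name}/attempt_{n}/ nesting and the `log.json` filename shown in the tree; the helper names `list_attempts` and `missing_logs` are hypothetical and not part of MemGUI-Bench.

```python
from pathlib import Path


def list_attempts(session_dir: str) -> dict[str, list[str]]:
    """Map each task_id to the attempt folder names found under it."""
    attempts: dict[str, list[str]] = {}
    # Layout from the tree above: {task_id}/{agent_name}/attempt_{n}/
    for attempt in Path(session_dir).glob("*/*/attempt_*"):
        if attempt.is_dir():
            task_id = attempt.parts[-3]
            attempts.setdefault(task_id, []).append(attempt.name)
    return {task: sorted(names) for task, names in sorted(attempts.items())}


def missing_logs(session_dir: str) -> list[str]:
    """Return attempt folders (paths relative to the session) without a log.json."""
    root = Path(session_dir)
    return sorted(
        attempt.relative_to(root).as_posix()
        for attempt in root.glob("*/*/attempt_*")
        if attempt.is_dir() and not (attempt / "log.json").is_file()
    )
```

Calling `missing_logs("results/session-{session_id}")` with your real session path lists attempts whose execution log is absent, which is worth resolving before sharing trajectories.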

Upload to HuggingFace

Compress your session folder and submit via Pull Request to lgy0404/memgui-bench-trajs:

# Step 1: Compress session folder (filename should match your agent JSON)
cd results
zip -r your-agent-name.zip session-{session_id}

# Step 2: Upload via HuggingFace Web UI
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" tab → "New Pull Request" → "Upload files"
# 3. Upload your zip file and submit the PR
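If the `zip` command is not available (for example on Windows), Step 1 can be approximated with Python's standard library. This is a sketch under the assumption that the archive should contain the session folder as its top-level entry, as `zip -r` run from inside results/ would produce; the function name `zip_session` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_session(results_dir: str, session_name: str, agent_file_stem: str) -> Path:
    """Create {agent_file_stem}.zip inside results_dir, containing the session folder."""
    results = Path(results_dir).resolve()
    if not (results / session_name).is_dir():
        raise FileNotFoundError(f"no such session folder: {results / session_name}")
    # root_dir + base_dir keep "session-.../" as the top-level folder in the archive.
    archive = shutil.make_archive(
        str(results / agent_file_stem),
        "zip",
        root_dir=str(results),
        base_dir=session_name,
    )
    return Path(archive)
```

For example, `zip_session("results", "session-{session_id}", "your-agent-name")` (with your actual session folder name substituted) writes results/your-agent-name.zip, which can then be uploaded through the web UI as described in Step 2.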

File Naming Convention

Name your zip file to match your agent JSON filename (lowercase with hyphens), e.g.:

m3a.zip
appagent.zip
mobile-agent-v2.zip
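One way to derive such a filename from a display name is sketched below. The normalization rule (lowercase, runs of other characters collapsed into a single hyphen) is an assumption inferred from the examples above, and `archive_name` is a hypothetical helper; if your agent JSON already has a filename, match that instead.

```python
import re


def archive_name(agent_name: str) -> str:
    """Lowercase the name, join alphanumeric runs with hyphens, and append .zip."""
    slug = re.sub(r"[^a-z0-9]+", "-", agent_name.lower()).strip("-")
    if not slug:
        raise ValueError("agent name contains no letters or digits")
    return f"{slug}.zip"
```

With this rule, `archive_name("Mobile-Agent-V2")` returns `"mobile-agent-v2.zip"` and `archive_name("M3A")` returns `"m3a.zip"`.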

Review Process

PR Submitted

Create a Pull Request to add your JSON to docs/data/agents/.

Format Validation

We verify that all required fields are present and correctly formatted.

Results Verification

If trajectories are provided on HuggingFace, we run spot-checks using MemGUI-Eval.

Leaderboard Update

Approved results are merged and visible on the leaderboard within 3-5 business days.

Frequently Asked Questions

Trajectory submission is optional but recommended. While we accept results without trajectories, providing them enables verification and helps the research community. Submissions with trajectories may receive priority in the review process.

You can submit updated results at any time via a new PR. We maintain version history and will update the leaderboard with your latest results. Please indicate in your PR that it updates existing results.

All performance metrics are auto-generated:

crossApp: Pass@1, Pass@3, IRR by number of apps (1-4)
difficulty: Pass@1, Pass@3, IRR by difficulty (easy, medium, hard)
avg: Overall Pass@1 and Pass@3
metrics.shortTerm: IRR, MTPR, stepRatio, timePerStep, costPerStep
metrics.longTerm: FRR, stepRatio, timePerStep, costPerStep

You only need to fill in the metadata fields (backbone, type, institution, etc.).
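Since these metrics should not be edited by hand, a quick structural check can confirm that nothing was removed accidentally while filling in metadata. The sketch below only tests that each key listed above is present; it does not recompute or validate values (`costPerStep` may legitimately be null). The name `missing_metric_paths` is hypothetical, and the key list is taken from the sample JSON in Results Format.

```python
# Key paths taken from the sample results JSON; presence only, values are not checked.
TRIPLE = ("p1", "p3", "irr")
EXPECTED_PATHS = (
    [("crossApp", f"app{i}", key) for i in range(1, 5) for key in TRIPLE]
    + [("difficulty", level, key) for level in ("easy", "medium", "hard") for key in TRIPLE]
    + [("avg", key) for key in ("p1", "p3")]
    + [("metrics", "shortTerm", key)
       for key in ("irr", "mtpr", "stepRatio", "timePerStep", "costPerStep")]
    + [("metrics", "longTerm", key)
       for key in ("frr", "stepRatio", "timePerStep", "costPerStep")]
)


def missing_metric_paths(entry: dict) -> list[str]:
    """Return dotted paths (e.g. 'avg.p3') that are absent from the results dict."""
    missing = []
    for path in EXPECTED_PATHS:
        node = entry
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

An empty return value means every auto-generated key is still in place; any listed path points to a field that was deleted or renamed.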

We perform the following verification steps:

Check format compliance and metric calculations
If trajectories are provided on HuggingFace, run MemGUI-Eval on a random sample (10-20% of tasks)
Compare against known baselines for sanity checks
For significant improvements, we may request additional evidence or reproduction instructions
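To give a sense of what a 10-20% spot-check covers, the sketch below draws a reproducible random sample of task IDs. It is purely illustrative: the actual sampling procedure used by the maintainers is not described here, and `spot_check_sample`, its parameters, and the rounding rule are assumptions.

```python
import math
import random


def spot_check_sample(task_ids, fraction=0.1, seed=None):
    """Pick ceil(fraction * N) distinct task IDs, at least one, in sorted order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    unique_ids = sorted(set(task_ids))
    if not unique_ids:
        return []
    sample_size = max(1, math.ceil(fraction * len(unique_ids)))
    # A seeded Random instance makes the same sample reproducible across runs.
    return sorted(random.Random(seed).sample(unique_ids, sample_size))
```

Running it on the task folders of your own session with `fraction=0.2` shows roughly how many trajectories a reviewer might re-evaluate, which is one reason to keep every attempt folder complete.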

Need Help?

If you have questions about the submission process, open an Issue.