# Agent Benchmarks
Fleet tracks mission outcomes across all supported AI agent adapters — success rate, average duration, and task affinity — so you can route work to the right agent instead of guessing.
| Agent | Success rate | Avg duration | Missions | Best for |
|---|---|---|---|---|
| Claude Code | | | 847 | Highest quality output. Best for critical, complex, or high-risk missions. |
| Gemini CLI | | | 412 | Strong structured writing. Excellent for docs-heavy or API-description work. |
| Codex | | | 623 | Fast and reliable for well-scoped, clearly defined missions. |
| Aider | | | 389 | Conservative scope — minimal drift. Best when precision matters more than speed. |
| OpenCode | | | 201 | Solid all-rounder. Rapidly improving — watch this space. |
| Amp | | | 134 | Fastest overall. Best for low-risk, high-parallelism work. |
| Cursor | | | 156 | Strongest in visual and UI-heavy missions. Slower on backend-only work. |
| A2A Protocol | | | 89 | Results vary with upstream agent quality. Best for integration-layer missions. |
## Best-fit routing by task type
Use this guide to pick the right agent per mission — or mix agents across ships for the same fleet run.
| Task type | ⚡ Fastest | 🏆 Highest quality | ⚖️ Best balance |
|---|---|---|---|
| Test coverage | Codex | Claude Code | Gemini CLI |
| Security audit | Amp | Claude Code | Aider |
| API documentation | Codex | Gemini CLI | OpenCode |
| Dependency updates | Amp | Aider | Codex |
| Refactoring | Codex | Claude Code | OpenCode |
| UI / Frontend | Amp | Cursor | OpenCode |
| Architecture work | OpenCode | Claude Code | Claude Code |
| Bug fixes | Amp | Claude Code | Aider |
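For scripting your own agent selection, the routing table above can be flattened into a small lookup. This is a hypothetical sketch (the `pick_agent` helper and the priority names are illustrative, not part of Fleet's CLI or API; adapter names follow the `.fleet/config.yml` convention):

```python
# Hypothetical lookup built from the routing table above.
# Tuples are (fastest, highest quality, best balance).
ROUTING = {
    "test-coverage":      ("codex", "claude-code", "gemini"),
    "security-audit":     ("amp", "claude-code", "aider"),
    "api-documentation":  ("codex", "gemini", "opencode"),
    "dependency-updates": ("amp", "aider", "codex"),
    "refactoring":        ("codex", "claude-code", "opencode"),
    "ui-frontend":        ("amp", "cursor", "opencode"),
    "architecture":       ("opencode", "claude-code", "claude-code"),
    "bug-fixes":          ("amp", "claude-code", "aider"),
}

PRIORITY_INDEX = {"fastest": 0, "quality": 1, "balance": 2}

def pick_agent(task_type: str, priority: str = "balance") -> str:
    """Return the adapter suggested by the routing table."""
    return ROUTING[task_type][PRIORITY_INDEX[priority]]

print(pick_agent("refactoring", "quality"))  # claude-code
```

A lookup like this could feed a script that generates ship entries per mission type rather than hand-editing the config.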
📊 Data based on 2,851 mission runs reported by the Fleet community. Success rate = mission completed and merged without manual intervention. Duration = wall-clock time from agent start to PR merge. Contribute your data →
## How to use this data

Pick by success rate when mission reliability matters most — failed or stalled missions block the merge queue and delay dependent work.
Pick by duration when you’re running many parallel missions and want the fleet to finish fast. Assign fast agents (Amp, Codex) to simple missions and let slower high-quality agents (Claude Code) handle the complex ones simultaneously.
Mix agents in the same fleet — each ship picks up one mission and runs independently. There’s no requirement to use the same agent everywhere:

```yaml
# .fleet/config.yml — different agents per ship
ships:
  - name: alpha
    adapter: claude-code   # complex refactor mission
  - name: beta
    adapter: codex         # quick dependency update mission
  - name: gamma
    adapter: gemini        # docs mission
```

## What “success” means
A mission counts as successful when:
- The agent completes its work and pushes a branch
- Fleet’s CI check passes
- The PR merges without manual intervention
Missions that stall (no heartbeat), time out, or require human unblocking count as failed for benchmark purposes.
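The success definition above can be applied mechanically when tallying your own runs. A minimal sketch, assuming a hypothetical `Mission` record (the field names are illustrative, not Fleet's telemetry schema):

```python
from dataclasses import dataclass

@dataclass
class Mission:
    pushed_branch: bool      # agent completed and pushed a branch
    ci_passed: bool          # Fleet's CI check passed
    merged_clean: bool       # PR merged without manual intervention
    stalled: bool = False    # no heartbeat, timed out, or needed unblocking

def is_success(m: Mission) -> bool:
    # Stalled or human-unblocked missions count as failed for benchmarks.
    return m.pushed_branch and m.ci_passed and m.merged_clean and not m.stalled

def success_rate(missions: list[Mission]) -> float:
    return sum(is_success(m) for m in missions) / len(missions)

runs = [
    Mission(True, True, True),
    Mission(True, False, False),              # CI failed
    Mission(True, True, True, stalled=True),  # stalled mid-run
    Mission(True, True, True),
]
print(f"{success_rate(runs):.0%}")  # 50%
```

Note that a stalled mission fails even if its branch would eventually have merged cleanly, which keeps the metric aligned with "no manual intervention."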
## Contribute your data

These benchmarks grow with community contributions. If you’re running Fleet regularly, open a discussion on GitHub and share your agent, task type, mission count, and outcomes. Aggregate data is added to the benchmark set monthly.
The raw data and methodology will be published as a JSON file in the repository when the v1.5 telemetry dashboard ships.