
Agent Benchmarks

Fleet tracks mission outcomes across all supported AI agent adapters — success rate, average duration, and task affinity — so you can route work to the right agent instead of guessing.

- 2,851 missions tracked
- 88% average success rate
- 8 agents measured
- 11 min fastest average duration (Amp)
| Agent | Success rate | Avg duration | Missions | Best for |
| --- | --- | --- | --- | --- |
| Claude Code | 94% | 18 min | 847 | Complex refactors, architecture |
| Gemini CLI | 91% | 15 min | 412 | Documentation, API specs |
| Codex | 89% | 12 min | 623 | Algorithmic tasks, data processing |
| Aider | 88% | 14 min | 389 | Targeted patches, focused changes |
| OpenCode | 87% | 16 min | 201 | General coding, multi-file changes |
| Amp | 86% | 11 min | 134 | Fast iterations, simple features |
| Cursor | 85% | 19 min | 156 | UI / frontend, React / CSS |
| A2A Protocol | 82% | 22 min | 89 | Cross-service integration, API design |
Claude Code (94% success)

Highest-quality output. Best for critical, complex, or high-risk missions.

Strengths: complex refactors, architecture, test coverage, security audits
⏱ 18 min avg · 📊 847 missions

Gemini CLI (91% success)

Strong structured writing. Excellent for docs-heavy or API-description work.

Strengths: documentation, API specs, multi-file writes, explanations
⏱ 15 min avg · 📊 412 missions

Codex (89% success)

Fast and reliable for well-scoped, clearly defined missions.

Strengths: algorithmic tasks, data processing, quick fixes, dependency updates
⏱ 12 min avg · 📊 623 missions

Aider (88% success)

Conservative scope with minimal drift. Best when precision matters more than speed.

Strengths: targeted patches, focused changes, dependency updates
⏱ 14 min avg · 📊 389 missions

OpenCode (87% success)

Solid all-rounder, and rapidly improving.

Strengths: general coding, multi-file changes, refactoring
⏱ 16 min avg · 📊 201 missions

Amp (86% success)

Fastest overall. Best for low-risk, high-parallelism work.

Strengths: fast iterations, simple features, prototyping
⏱ 11 min avg · 📊 134 missions

Cursor (85% success)

Strongest in visual and UI-heavy missions. Slower on backend-only work.

Strengths: UI / frontend, React / CSS, component work
⏱ 19 min avg · 📊 156 missions

A2A Protocol (82% success)

Results vary with upstream agent quality. Best for integration-layer missions.

Strengths: cross-service integration, API design, orchestration
⏱ 22 min avg · 📊 89 missions

Best-fit routing by task type

Use this guide to pick the right agent per mission — or mix agents across ships for the same fleet run.

| Task type | ⚡ Fastest | 🏆 Highest quality | ⚖️ Best balance |
| --- | --- | --- | --- |
| Test coverage | Codex | Claude Code | Gemini CLI |
| Security audit | Amp | Claude Code | Aider |
| API documentation | Codex | Gemini CLI | OpenCode |
| Dependency updates | Amp | Aider | Codex |
| Refactoring | Codex | Claude Code | OpenCode |
| UI / Frontend | Amp | Cursor | OpenCode |
| Architecture work | OpenCode | Claude Code | Claude Code |
| Bug fixes | Amp | Claude Code | Aider |
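As a sketch, the "best balance" column of the table above can be encoded as a simple lookup for scripted routing. The adapter identifiers below are illustrative assumptions, not official Fleet adapter names:

```python
# Hypothetical routing helper mirroring the "best balance" column.
# Adapter identifiers are assumptions for illustration only.
BEST_BALANCE = {
    "test coverage": "gemini",
    "security audit": "aider",
    "api documentation": "opencode",
    "dependency updates": "codex",
    "refactoring": "opencode",
    "ui / frontend": "opencode",
    "architecture work": "claude-code",
    "bug fixes": "aider",
}

def pick_adapter(task_type: str, default: str = "claude-code") -> str:
    """Return the balanced-choice adapter for a task type,
    falling back to the highest-quality agent for unknown tasks."""
    return BEST_BALANCE.get(task_type.lower(), default)
```

Falling back to the highest-success agent for unlisted task types keeps unknown missions on the safest path.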

📊 Data based on 2,851 mission runs reported by the Fleet community. Success rate = mission completed and merged without manual intervention. Duration = wall-clock time from agent start to PR merge. Contribute your data →

Pick by success rate when mission reliability matters most — failed or stalled missions block the merge queue and delay dependent work.

Pick by duration when you’re running many parallel missions and want the fleet to finish fast. Assign fast agents (Amp, Codex) to simple missions and let slower high-quality agents (Claude Code) handle the complex ones simultaneously.
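That split-by-complexity strategy can be sketched in a few lines. The mission shape and adapter names here are assumptions for illustration, not a Fleet API:

```python
# Hypothetical duration-aware assignment: fast adapters take simple
# missions, while the slower high-quality adapter takes complex ones.
FAST_ADAPTERS = ["amp", "codex"]       # ~11-12 min avg in the table above
QUALITY_ADAPTER = "claude-code"        # ~18 min avg, highest success rate

def assign(missions: list[dict]) -> list[tuple[str, str]]:
    """Pair each mission with an adapter based on declared complexity,
    round-robining simple missions across the fast adapters."""
    assignments = []
    fast_i = 0
    for m in missions:
        if m["complexity"] == "simple":
            adapter = FAST_ADAPTERS[fast_i % len(FAST_ADAPTERS)]
            fast_i += 1
        else:
            adapter = QUALITY_ADAPTER
        assignments.append((m["name"], adapter))
    return assignments
```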

Mix agents in the same fleet — each ship picks up one mission and runs independently. There’s no requirement to use the same agent everywhere:

```yaml
# .fleet/config.yml: a different agent per ship
ships:
  - name: alpha
    adapter: claude-code   # complex refactor mission
  - name: beta
    adapter: codex         # quick dependency update mission
  - name: gamma
    adapter: gemini        # docs mission
```

A mission counts as successful when:

  1. The agent completes its work and pushes a branch
  2. Fleet’s CI check passes
  3. The PR merges without manual intervention

Missions that stall (no heartbeat), time out, or require human unblocking count as failed for benchmark purposes.
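The success criteria above can be expressed as a small predicate. The record fields below are hypothetical, not Fleet's actual telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class MissionRecord:
    # Hypothetical record shape; field names are assumptions.
    branch_pushed: bool
    ci_passed: bool
    merged_without_intervention: bool
    stalled: bool = False     # no heartbeat from the agent
    timed_out: bool = False

def is_success(m: MissionRecord) -> bool:
    """A mission succeeds only if the branch was pushed, CI passed,
    and the PR merged with no human unblocking; stalls and timeouts
    count as failures for benchmark purposes."""
    if m.stalled or m.timed_out:
        return False
    return m.branch_pushed and m.ci_passed and m.merged_without_intervention

def success_rate(missions: list[MissionRecord]) -> float:
    """Fraction of missions meeting all three criteria."""
    return sum(is_success(m) for m in missions) / len(missions)
```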

These benchmarks grow with community contributions. If you’re running Fleet regularly, open a discussion on GitHub and share your agent, task type, mission count, and outcomes. Aggregate data is added to the benchmark set monthly.

The raw data and methodology will be published as a JSON file in the repository when the v1.5 telemetry dashboard ships.