
Agent Benchmarks

Fleet tracks mission outcomes across all supported AI agent adapters — success rate, average duration, and task affinity — so you can route work to the right agent instead of guessing.

- 2,851 missions tracked
- 88% average success rate
- 8 agents measured
- 11 min fastest average duration (Amp)
| Agent | Success rate | Avg duration | Missions | Best for |
| --- | --- | --- | --- | --- |
| Claude Code | 94% | 18 min | 847 | Complex refactors, architecture |
| Gemini CLI | 91% | 15 min | 412 | Documentation, API specs |
| Codex | 89% | 12 min | 623 | Algorithmic tasks, data processing |
| Aider | 88% | 14 min | 389 | Targeted patches, focused changes |
| OpenCode | 87% | 16 min | 201 | General coding, multi-file changes |
| Amp | 86% | 11 min | 134 | Fast iterations, simple features |
| Cursor | 85% | 19 min | 156 | UI / frontend, React / CSS |
| A2A Protocol | 82% | 22 min | 89 | Cross-service integration, API design |
Claude Code (94% success)

Highest-quality output. Best for critical, complex, or high-risk missions.

Strengths: complex refactors, architecture, test coverage, security audits
⏱ 18 min avg · 📊 847 missions

Gemini CLI (91% success)

Strong structured writing. Excellent for docs-heavy or API-description work.

Strengths: documentation, API specs, multi-file writes, explanations
⏱ 15 min avg · 📊 412 missions

Codex (89% success)

Fast and reliable for well-scoped, clearly defined missions.

Strengths: algorithmic tasks, data processing, quick fixes, dependency updates
⏱ 12 min avg · 📊 623 missions

Aider (88% success)

Conservative scope with minimal drift. Best when precision matters more than speed.

Strengths: targeted patches, focused changes, dependency updates
⏱ 14 min avg · 📊 389 missions

OpenCode (87% success)

Solid all-rounder, and rapidly improving.

Strengths: general coding, multi-file changes, refactoring
⏱ 16 min avg · 📊 201 missions

Amp (86% success)

Fastest overall. Best for low-risk, high-parallelism work.

Strengths: fast iterations, simple features, prototyping
⏱ 11 min avg · 📊 134 missions

Cursor (85% success)

Strongest in visual and UI-heavy missions. Slower on backend-only work.

Strengths: UI / frontend, React / CSS, component work
⏱ 19 min avg · 📊 156 missions

A2A Protocol (82% success)

Results vary with upstream agent quality. Best for integration-layer missions.

Strengths: cross-service integration, API design, orchestration
⏱ 22 min avg · 📊 89 missions

Best-fit routing by task type

Use this guide to pick the right agent per mission — or mix agents across ships for the same fleet run.

| Task type | ⚡ Fastest | 🏆 Highest quality | ⚖️ Best balance |
| --- | --- | --- | --- |
| Test coverage | Codex | Claude Code | Gemini CLI |
| Security audit | Amp | Claude Code | Aider |
| API documentation | Codex | Gemini CLI | OpenCode |
| Dependency updates | Amp | Aider | Codex |
| Refactoring | Codex | Claude Code | OpenCode |
| UI / Frontend | Amp | Cursor | OpenCode |
| Architecture work | OpenCode | Claude Code | Claude Code |
| Bug fixes | Amp | Claude Code | Aider |
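As a sketch, the "best balance" column of the table above can be encoded as a simple lookup for scripted routing. The adapter identifiers below are illustrative assumptions, not official Fleet adapter names:

```python
# Hypothetical routing helper mirroring the "best balance" column.
# Adapter identifiers are assumptions for illustration only.
BEST_BALANCE = {
    "test coverage": "gemini",
    "security audit": "aider",
    "api documentation": "opencode",
    "dependency updates": "codex",
    "refactoring": "opencode",
    "ui / frontend": "opencode",
    "architecture work": "claude-code",
    "bug fixes": "aider",
}

def pick_adapter(task_type: str, default: str = "claude-code") -> str:
    """Return the balanced-choice adapter for a task type,
    falling back to the highest-quality agent for unknown tasks."""
    return BEST_BALANCE.get(task_type.lower(), default)
```

Falling back to the highest-success agent for unlisted task types keeps unknown missions on the safest path.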

📊 Data based on 2,851 mission runs reported by the Fleet community. Success rate = mission completed and merged without manual intervention. Duration = wall-clock time from agent start to PR merge. Contribute your data →

Pick by success rate when mission reliability matters most — failed or stalled missions block the merge queue and delay dependent work.

Pick by duration when you’re running many parallel missions and want the fleet to finish fast. Assign fast agents (Amp, Codex) to simple missions and let slower high-quality agents (Claude Code) handle the complex ones simultaneously.
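That split-by-complexity strategy can be sketched in a few lines. The mission shape and adapter names here are assumptions for illustration, not a Fleet API:

```python
# Hypothetical duration-aware assignment: fast adapters take simple
# missions, while the slower high-quality adapter takes complex ones.
FAST_ADAPTERS = ["amp", "codex"]       # ~11-12 min avg in the table above
QUALITY_ADAPTER = "claude-code"        # ~18 min avg, highest success rate

def assign(missions: list[dict]) -> list[tuple[str, str]]:
    """Pair each mission with an adapter based on declared complexity,
    round-robining simple missions across the fast adapters."""
    assignments = []
    fast_i = 0
    for m in missions:
        if m["complexity"] == "simple":
            adapter = FAST_ADAPTERS[fast_i % len(FAST_ADAPTERS)]
            fast_i += 1
        else:
            adapter = QUALITY_ADAPTER
        assignments.append((m["name"], adapter))
    return assignments
```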

Mix agents in the same fleet — each ship picks up one mission and runs independently. There’s no requirement to use the same agent everywhere:

```yaml
# .fleet/config.yml: a different agent per ship
ships:
  - name: alpha
    adapter: claude-code   # complex refactor mission
  - name: beta
    adapter: codex         # quick dependency update mission
  - name: gamma
    adapter: gemini        # docs mission
```

A mission counts as successful when:

  1. The agent completes its work and pushes a branch
  2. Fleet’s CI check passes
  3. The PR merges without manual intervention

Missions that stall (no heartbeat), time out, or require human unblocking count as failed for benchmark purposes.
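The success criteria above can be expressed as a small predicate. The record fields below are hypothetical, not Fleet's actual telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class MissionRecord:
    # Hypothetical record shape; field names are assumptions.
    branch_pushed: bool
    ci_passed: bool
    merged_without_intervention: bool
    stalled: bool = False     # no heartbeat from the agent
    timed_out: bool = False

def is_success(m: MissionRecord) -> bool:
    """A mission succeeds only if the branch was pushed, CI passed,
    and the PR merged with no human unblocking; stalls and timeouts
    count as failures for benchmark purposes."""
    if m.stalled or m.timed_out:
        return False
    return m.branch_pushed and m.ci_passed and m.merged_without_intervention

def success_rate(missions: list[MissionRecord]) -> float:
    """Fraction of missions meeting all three criteria."""
    return sum(is_success(m) for m in missions) / len(missions)
```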

These benchmarks grow with community contributions. If you’re running Fleet regularly, open a discussion on GitHub and share your agent, task type, mission count, and outcomes. Aggregate data is added to the benchmark set monthly.

The raw data and methodology will be published as a JSON file in the repository when the v1.5 telemetry dashboard ships.