SWE-Bench-Pro agent results
CodeAgentBench
A leaderboard for coding agents on SWE-Bench-Pro: 151 tasks, 3 attempts per task, 453 scoreable attempts in total. Rankings include only runs in which all 453 attempts completed and were scoreable, and are sorted by Pass@3 by default.
Best Pass@3: 37.7% (Codex - GPT 5.3 codex (xhigh))
Completed Models: 22 (all rows have 453 scoreable attempts)
Scoreable Attempts: 453 (151 tasks x 3 tries)
Exported: 2026-06-03 (SWEPro 151 zh pass@3)
Benchmark Leaderboard
Pass@3 is the primary ranking signal. For each agent/model pair the table also shows Pass^3, the per-attempt solve rate (Attempt Score), solved tasks, solved attempts, and 453/453 coverage status.
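The ranking rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual export code: the `Row` class and `rank` function are hypothetical names, and the tie-break (solved attempts, then Pass^3) is an assumption inferred from the order of tied rows in the table, not something the page documents. The `hypothetical-partial-run` entry is made up to show the coverage filter.

```python
from dataclasses import dataclass

TOTAL_ATTEMPTS = 453  # 151 tasks x 3 attempts
TOTAL_TASKS = 151


@dataclass(frozen=True)
class Row:
    model_dir: str
    pass_at_3: int        # tasks solved at least once
    pass_pow_3: int       # tasks solved in all 3 attempts
    solved_attempts: int  # solved attempts out of 453
    scoreable: int        # scoreable attempts reported by the export


def rank(rows):
    """Keep fully covered rows, then sort best-first by Pass@3."""
    complete = [r for r in rows if r.scoreable == TOTAL_ATTEMPTS]
    # Assumed tie-break: more solved attempts first, then higher Pass^3.
    return sorted(
        complete,
        key=lambda r: (-r.pass_at_3, -r.solved_attempts, -r.pass_pow_3),
    )


rows = [
    Row("qwen-3.5plus", 51, 22, 105, 453),
    Row("codex-gpt55", 51, 40, 136, 453),
    Row("codex-gpt53-xhigh", 57, 31, 131, 453),
    Row("hypothetical-partial-run", 60, 30, 140, 450),  # made up; filtered out
]

ranked = rank(rows)
for position, row in enumerate(ranked, start=1):
    print(position, row.model_dir, f"{row.pass_at_3 / TOTAL_TASKS:.1%}")
```

With the three real rows above, the sketch reproduces the table's order: `codex-gpt53-xhigh`, then `codex-gpt55`, then `qwen-3.5plus`.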
| # | Agent / Model | Pass@3 | Pass^3 | Attempt Score | Solved Tasks | Solved Attempts | Coverage | Log Archive | Full Tree | Exported | Model Dir |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Codex - GPT 5.3 codex (xhigh) | 57/151 (37.7%) | 31/151 (20.5%) | 131/453 (28.9%) | 57/151 | 131/453 | 453/453 (100%) | 202.4 MB | 160.91 MB | 2026-06-03 | codex-gpt53-xhigh |
| 2 | Codex - GPT 5.4 | 55/151 (36.4%) | 35/151 (23.2%) | 136/453 (30.0%) | 55/151 | 136/453 | 453/453 (100%) | 212.55 MB | 173.26 MB | 2026-06-03 | codex-gpt54 |
| 3 | Codex - GPT 5.5 | 51/151 (33.8%) | 40/151 (26.5%) | 136/453 (30.0%) | 51/151 | 136/453 | 453/453 (100%) | 228.15 MB | 193.18 MB | 2026-06-03 | codex-gpt55 |
| 4 | Qwen - Qwen 3.5 plus | 51/151 (33.8%) | 22/151 (14.6%) | 105/453 (23.2%) | 51/151 | 105/453 | 453/453 (100%) | 28.18 MB | 36.44 MB | 2026-06-03 | qwen-3.5plus |
| 5 | OpenCode - DeepSeek v4 flash (max) | 48/151 (31.8%) | 28/151 (18.5%) | 117/453 (25.8%) | 48/151 | 117/453 | 453/453 (100%) | 183.66 MB | 156.39 MB | 2026-05-22 | opencode-deepseek-v4-flash-max |
| 6 | OpenCode - MiMo v2.5 pro (high) | 46/151 (30.5%) | 22/151 (14.6%) | 102/453 (22.5%) | 46/151 | 102/453 | 453/453 (100%) | 170.12 MB | 151.78 MB | 2026-06-03 | opencode-mimo25pro-high |
| 7 | OpenCode - MiniMax M2.5 highspeed | 46/151 (30.5%) | 22/151 (14.6%) | 102/453 (22.5%) | 46/151 | 102/453 | 453/453 (100%) | 173.6 MB | 167.83 MB | 2026-06-03 | opencode-minimax25-highspeed |
| 8 | OpenCode - GLM 5.1 | 45/151 (29.8%) | 27/151 (17.9%) | 110/453 (24.3%) | 45/151 | 110/453 | 453/453 (100%) | 165.27 MB | 142.29 MB | 2026-06-03 | opencode-glm51 |
| 9 | Claude Code - MiMo v2.5 pro | 44/151 (29.1%) | 25/151 (16.6%) | 103/453 (22.7%) | 44/151 | 103/453 | 453/453 (100%) | 30.16 MB | 37.39 MB | 2026-06-03 | claude-mimo25pro |
| 10 | OpenCode - GLM 5 turbo | 44/151 (29.1%) | 19/151 (12.6%) | 99/453 (21.9%) | 44/151 | 99/453 | 453/453 (100%) | 19.75 MB | 123.7 MB | | opencode-glm5turbo |
| 11 | OpenCode - DeepSeek v4 pro (max) | 44/151 (29.1%) | 20/151 (13.2%) | 95/453 (21.0%) | 44/151 | 95/453 | 453/453 (100%) | 195.44 MB | 166.59 MB | 2026-06-03 | opencode-deepseek-v4-pro-max |
| 12 | Qwen - Qwen 3.6 plus | 44/151 (29.1%) | 18/151 (11.9%) | 95/453 (21.0%) | 44/151 | 95/453 | 453/453 (100%) | 30.57 MB | 36.26 MB | 2026-06-03 | qwen-3.6plus |
| 13 | Claude Code - MiMo v2.5 | 41/151 (27.2%) | 17/151 (11.3%) | 86/453 (19.0%) | 41/151 | 86/453 | 453/453 (100%) | 30.29 MB | 38.14 MB | 2026-06-03 | claude-mimo25 |
| 14 | OpenCode - MiniMax M2.7 highspeed | 38/151 (25.2%) | 15/151 (9.9%) | 76/453 (16.8%) | 38/151 | 76/453 | 453/453 (100%) | 15.21 MB | 108.88 MB | | opencode-minimax27 |
| 15 | OpenCode - GLM 4.7 | 36/151 (23.8%) | 13/151 (8.6%) | 73/453 (16.1%) | 36/151 | 73/453 | 453/453 (100%) | 308.12 MB | 183.96 MB | 2026-06-03 | opencode-glm47 |
| 16 | OpenCode - MiMo v2.5 | 33/151 (21.9%) | 18/151 (11.9%) | 78/453 (17.2%) | 33/151 | 78/453 | 453/453 (100%) | 173.96 MB | 153.59 MB | 2026-06-03 | opencode-mimo25-tokenplan-high-20260527 |
| 17 | OpenCode - LongCat 2.0 Preview | 33/151 (21.9%) | 17/151 (11.3%) | 75/453 (16.6%) | 33/151 | 75/453 | 453/453 (100%) | 189.85 MB | 168.72 MB | 2026-06-03 | opencode-longcat |
| 18 | OpenCode - SenseNova 6.7 flash lite | 33/151 (21.9%) | 13/151 (8.6%) | 68/453 (15.0%) | 33/151 | 68/453 | 453/453 (100%) | 185.26 MB | 174.21 MB | 2026-06-03 | opencode-sensenova67-flash-lite |
| 19 | OpenCode - doubao seed 2.0 code | 27/151 (17.9%) | 9/151 (6.0%) | 52/453 (11.5%) | 27/151 | 52/453 | 453/453 (100%) | 10.73 MB | 74.82 MB | | opencode-doubao-2-code |
| 20 | OpenCode - Step 3.5 flash | 21/151 (13.9%) | 8/151 (5.3%) | 42/453 (9.3%) | 21/151 | 42/453 | 453/453 (100%) | 213.12 MB | 190.71 MB | 2026-06-03 | opencode-stepfun35 |
| 21 | DeepSeek - DeepSeek v4 flash (max) | 15/151 (9.9%) | 11/151 (7.3%) | 38/453 (8.4%) | 15/151 | 38/453 | 453/453 (100%) | 28.82 MB | 36.6 MB | 2026-05-22 | deepseek-tui-v4-flash-max |
| 22 | OpenCode - Step 3.5 flash 2603 | 13/151 (8.6%) | 5/151 (3.3%) | 27/453 (6.0%) | 13/151 | 27/453 | 453/453 (100%) | 15.61 MB | 109.34 MB | | opencode-stepfun35-2603 |

Notes:
Pass@3 counts tasks solved at least once across 3 attempts.
Pass^3 counts tasks solved in all 3 attempts.
Attempt Score is solved scoreable attempts divided by 453.
Rows are included when the exported summary reports 453 completed and 453 scoreable attempts.
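
As a minimal sketch of the three definitions in the notes, the following Python computes them from per-task outcomes. The `summarize` function, the input layout (task id mapped to a list of three booleans), and the toy task ids are all assumptions made for illustration; the real export format is not shown on this page.

```python
def summarize(outcomes):
    """Compute the three leaderboard metrics.

    `outcomes` maps a task id to a list of booleans, one per attempt
    (True means the attempt solved the task). This layout is hypothetical.
    """
    tasks = len(outcomes)
    attempts = sum(len(tries) for tries in outcomes.values())
    return {
        # Pass@3: share of tasks solved at least once.
        "pass_at_3": sum(any(tries) for tries in outcomes.values()) / tasks,
        # Pass^3: share of tasks solved in every attempt.
        "pass_pow_3": sum(all(tries) for tries in outcomes.values()) / tasks,
        # Attempt Score: solved attempts over all scoreable attempts.
        "attempt_score": sum(sum(tries) for tries in outcomes.values()) / attempts,
    }


# Toy data: 4 tasks x 3 attempts (the leaderboard uses 151 x 3 = 453).
toy = {
    "task-a": [True, True, True],
    "task-b": [False, True, False],
    "task-c": [False, False, False],
    "task-d": [True, False, True],
}

summary = summarize(toy)
print(summary)
```

On the toy data, three of four tasks are solved at least once, one is solved every time, and 6 of 12 attempts succeed.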
Visual Leaderboard
The chart compares the same agent/model pairs by reach (Pass@3), consistency (Pass^3), and per-attempt solve rate. Entries, in chart order:

| Model | Agent | Provider |
|---|---|---|
| GPT 5.3 codex (xhigh) | Codex | OpenAI |
| GPT 5.4 | Codex | OpenAI |
| GPT 5.5 | Codex | OpenAI |
| Qwen 3.5 plus | Qwen | Qwen |
| DeepSeek v4 flash (max) | OpenCode | DeepSeek |
| MiMo v2.5 pro (high) | OpenCode | Xiaomi |
| MiniMax M2.5 highspeed | OpenCode | MiniMax |
| GLM 5.1 | OpenCode | Zhipu GLM |
| MiMo v2.5 pro | Claude Code | Xiaomi |
| GLM 5 turbo | OpenCode | Zhipu GLM |
| DeepSeek v4 pro (max) | OpenCode | DeepSeek |
| Qwen 3.6 plus | Qwen | Qwen |
| MiMo v2.5 | Claude Code | Xiaomi |
| MiniMax M2.7 highspeed | OpenCode | MiniMax |
| GLM 4.7 | OpenCode | Zhipu GLM |
| MiMo v2.5 | OpenCode | Xiaomi |
| LongCat 2.0 Preview | OpenCode | LongCat |
| SenseNova 6.7 flash lite | OpenCode | SenseNova |
| doubao seed 2.0 code | OpenCode | Volcengine |
| Step 3.5 flash | OpenCode | StepFun |
| DeepSeek v4 flash (max) | DeepSeek | DeepSeek |
| Step 3.5 flash 2603 | OpenCode | StepFun |

Reach vs Consistency
Pass@3 shows whether an agent can solve a task at least once; Pass^3 shows whether it solves the same task all three times. The gap is useful when comparing stochastic or retry-sensitive agents.
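
Because every listed run has exactly three scoreable attempts per task, the three aggregate counts are enough to recover how many tasks were solved zero, one, two, or three times. The sketch below works through that arithmetic; `solve_histogram` is a hypothetical helper, and the inputs are the counts reported for `codex-gpt53-xhigh` and `codex-gpt55`.

```python
def solve_histogram(pass_at_3, pass_pow_3, solved_attempts, tasks=151):
    """Return {k: number of tasks solved in exactly k of 3 attempts}.

    With one = tasks solved once, two = twice, three = three times:
        pass_at_3       = one + two + three
        solved_attempts = one + 2*two + 3*three
    so two = solved_attempts - pass_at_3 - 2*three.
    Assumes every task has exactly 3 scoreable attempts.
    """
    three = pass_pow_3
    two = solved_attempts - pass_at_3 - 2 * three
    one = pass_at_3 - three - two
    zero = tasks - pass_at_3
    if min(zero, one, two, three) < 0:
        raise ValueError("counts are inconsistent with 3 attempts per task")
    return {0: zero, 1: one, 2: two, 3: three}


gpt53 = solve_histogram(pass_at_3=57, pass_pow_3=31, solved_attempts=131)
gpt55 = solve_histogram(pass_at_3=51, pass_pow_3=40, solved_attempts=136)
print(gpt53)  # {0: 94, 1: 14, 2: 12, 3: 31}
print(gpt55)  # {0: 100, 1: 6, 2: 5, 3: 40}
```

Read this way, `codex-gpt53-xhigh` reaches more tasks (57 vs 51), but 26 of its 57 solved tasks were not solved on every attempt, while `codex-gpt55` solves 40 of its 51 tasks all three times.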
[Chart: Pass@3 and Pass^3 for each of the 22 agent/model pairs, in leaderboard order.]