SWE-Bench-Pro agent results

CodeAgentBench

A leaderboard for coding agents on SWE-Bench-Pro: 151 tasks, 3 attempts per task, 453 scoreable attempts in total. Only runs with all 453 of 453 attempts completed are ranked, and rows sort by Pass@3 by default.

Best Pass@3: 37.7% (Codex - GPT 5.3 codex (xhigh))
Completed Models: 22 (all rows have 453 scoreable attempts)
Scoreable Attempts: 453 (151 tasks x 3 tries)
Exported: 2026-06-03 (SWEPro 151 zh pass@3)

Benchmark Leaderboard

Pass@3 is the primary ranking signal. The table also shows Pass^3, per-attempt solve rate, solved tasks, solved attempts, and 453/453 coverage status for each agent/model pair.

| # | Agent / Model | Provider - Model ID | Pass@3 | Pass^3 | Attempt Score | Solved Tasks | Solved Attempts | Coverage | Log Archive | Full Tree | Exported | Model Dir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Codex - GPT 5.3 codex (xhigh) | OpenAI - gpt-5.3-codex#effort=xhigh | 57/151 - 37.7% | 31/151 - 20.5% | 131/453 - 28.9% | 57/151 | 131/453 | 453/453 100% | 202.4 MB | 160.91 MB | 2026-06-03 | codex-gpt53-xhigh |
| 2 | Codex - GPT 5.4 | OpenAI - gpt-5.4 | 55/151 - 36.4% | 35/151 - 23.2% | 136/453 - 30.0% | 55/151 | 136/453 | 453/453 100% | 212.55 MB | 173.26 MB | 2026-06-03 | codex-gpt54 |
| 3 | Codex - GPT 5.5 | OpenAI - gpt-5.5 | 51/151 - 33.8% | 40/151 - 26.5% | 136/453 - 30.0% | 51/151 | 136/453 | 453/453 100% | 228.15 MB | 193.18 MB | 2026-06-03 | codex-gpt55 |
| 4 | Qwen - Qwen 3.5 plus | Qwen - qwen3.5-plus | 51/151 - 33.8% | 22/151 - 14.6% | 105/453 - 23.2% | 51/151 | 105/453 | 453/453 100% | 28.18 MB | 36.44 MB | 2026-06-03 | qwen-3.5plus |
| 5 | OpenCode - DeepSeek v4 flash (max) | DeepSeek - deepseek/deepseek-v4-flash#variant=max | 48/151 - 31.8% | 28/151 - 18.5% | 117/453 - 25.8% | 48/151 | 117/453 | 453/453 100% | 183.66 MB | 156.39 MB | 2026-05-22 | opencode-deepseek-v4-flash-max |
| 6 | OpenCode - MiMo v2.5 pro (high) | Xiaomi - xiaomi-token-plan-cn/mimo-v2.5-pro#variant=high | 46/151 - 30.5% | 22/151 - 14.6% | 102/453 - 22.5% | 46/151 | 102/453 | 453/453 100% | 170.12 MB | 151.78 MB | 2026-06-03 | opencode-mimo25pro-high |
| 7 | OpenCode - MiniMax M2.5 highspeed | MiniMax - minimax-cn-coding-plan/MiniMax-M2.5-highspeed | 46/151 - 30.5% | 22/151 - 14.6% | 102/453 - 22.5% | 46/151 | 102/453 | 453/453 100% | 173.6 MB | 167.83 MB | 2026-06-03 | opencode-minimax25-highspeed |
| 8 | OpenCode - GLM 5.1 | Zhipu GLM - zai-coding-plan/glm-5.1 | 45/151 - 29.8% | 27/151 - 17.9% | 110/453 - 24.3% | 45/151 | 110/453 | 453/453 100% | 165.27 MB | 142.29 MB | 2026-06-03 | opencode-glm51 |
| 9 | Claude Code - MiMo v2.5 pro | Xiaomi - xiaomi/mimo-v2.5-pro | 44/151 - 29.1% | 25/151 - 16.6% | 103/453 - 22.7% | 44/151 | 103/453 | 453/453 100% | 30.16 MB | 37.39 MB | 2026-06-03 | claude-mimo25pro |
| 10 | OpenCode - GLM 5 turbo | Zhipu GLM - zai-coding-plan/glm-5-turbo | 44/151 - 29.1% | 19/151 - 12.6% | 99/453 - 21.9% | 44/151 | 99/453 | 453/453 100% | 19.75 MB | 123.7 MB | | opencode-glm5turbo |
| 11 | OpenCode - DeepSeek v4 pro (max) | DeepSeek - deepseek/deepseek-v4-pro#variant=max | 44/151 - 29.1% | 20/151 - 13.2% | 95/453 - 21.0% | 44/151 | 95/453 | 453/453 100% | 195.44 MB | 166.59 MB | 2026-06-03 | opencode-deepseek-v4-pro-max |
| 12 | Qwen - Qwen 3.6 plus | Qwen - qwen3.6-plus | 44/151 - 29.1% | 18/151 - 11.9% | 95/453 - 21.0% | 44/151 | 95/453 | 453/453 100% | 30.57 MB | 36.26 MB | 2026-06-03 | qwen-3.6plus |
| 13 | Claude Code - MiMo v2.5 | Xiaomi - xiaomi/mimo-v2.5 | 41/151 - 27.2% | 17/151 - 11.3% | 86/453 - 19.0% | 41/151 | 86/453 | 453/453 100% | 30.29 MB | 38.14 MB | 2026-06-03 | claude-mimo25 |
| 14 | OpenCode - MiniMax M2.7 highspeed | MiniMax - minimax-cn-coding-plan/MiniMax-M2.7-highspeed | 38/151 - 25.2% | 15/151 - 9.9% | 76/453 - 16.8% | 38/151 | 76/453 | 453/453 100% | 15.21 MB | 108.88 MB | | opencode-minimax27 |
| 15 | OpenCode - GLM 4.7 | Zhipu GLM - zai-coding-plan/glm-4.7 | 36/151 - 23.8% | 13/151 - 8.6% | 73/453 - 16.1% | 36/151 | 73/453 | 453/453 100% | 308.12 MB | 183.96 MB | 2026-06-03 | opencode-glm47 |
| 16 | OpenCode - MiMo v2.5 | Xiaomi - xiaomi-token-plan-cn/mimo-v2.5 | 33/151 - 21.9% | 18/151 - 11.9% | 78/453 - 17.2% | 33/151 | 78/453 | 453/453 100% | 173.96 MB | 153.59 MB | 2026-06-03 | opencode-mimo25-tokenplan-high-20260527 |
| 17 | OpenCode - LongCat 2.0 Preview | LongCat - LongCat/LongCat-2.0-Preview | 33/151 - 21.9% | 17/151 - 11.3% | 75/453 - 16.6% | 33/151 | 75/453 | 453/453 100% | 189.85 MB | 168.72 MB | 2026-06-03 | opencode-longcat |
| 18 | OpenCode - SenseNova 6.7 flash lite | SenseNova - sensenova/sensenova-6.7-flash-lite | 33/151 - 21.9% | 13/151 - 8.6% | 68/453 - 15.0% | 33/151 | 68/453 | 453/453 100% | 185.26 MB | 174.21 MB | 2026-06-03 | opencode-sensenova67-flash-lite |
| 19 | OpenCode - doubao seed 2.0 code | Volcengine - volcengine-plan/doubao-seed-2.0-code | 27/151 - 17.9% | 9/151 - 6.0% | 52/453 - 11.5% | 27/151 | 52/453 | 453/453 100% | 10.73 MB | 74.82 MB | | opencode-doubao-2-code |
| 20 | OpenCode - Step 3.5 flash | StepFun - stepfun/step-3.5-flash | 21/151 - 13.9% | 8/151 - 5.3% | 42/453 - 9.3% | 21/151 | 42/453 | 453/453 100% | 213.12 MB | 190.71 MB | 2026-06-03 | opencode-stepfun35 |
| 21 | DeepSeek - DeepSeek v4 flash (max) | DeepSeek - deepseek-v4-flash#effort=max | 15/151 - 9.9% | 11/151 - 7.3% | 38/453 - 8.4% | 15/151 | 38/453 | 453/453 100% | 28.82 MB | 36.6 MB | 2026-05-22 | deepseek-tui-v4-flash-max |
| 22 | OpenCode - Step 3.5 flash 2603 | StepFun - stepfun/step-3.5-flash-2603 | 13/151 - 8.6% | 5/151 - 3.3% | 27/453 - 6.0% | 13/151 | 27/453 | 453/453 100% | 15.61 MB | 109.34 MB | | opencode-stepfun35-2603 |

Notes: Pass@3 counts tasks solved at least once across 3 attempts. Pass^3 counts tasks solved in all 3 attempts. Attempt Score is solved scoreable attempts divided by 453. Rows are included when the exported summary reports 453 completed and 453 scoreable attempts.
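
The definitions in the notes can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual export code: the function name `summarize_run`, the input layout (a dict mapping a task id to one boolean per attempt), and the output keys are all assumptions made for the example.

```python
from typing import Dict, List


def summarize_run(results: Dict[str, List[bool]], attempts_per_task: int = 3) -> Dict[str, float]:
    """Compute leaderboard-style metrics from per-task attempt outcomes.

    `results` maps a hypothetical task id to one boolean per attempt
    (True = solved). The layout is an assumption for illustration only.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    for task_id, outcomes in results.items():
        if len(outcomes) != attempts_per_task:
            raise ValueError(
                f"{task_id}: expected {attempts_per_task} attempts, got {len(outcomes)}"
            )

    total_tasks = len(results)
    total_attempts = total_tasks * attempts_per_task

    # Pass@k: tasks solved at least once across the k attempts.
    solved_tasks = sum(1 for outcomes in results.values() if any(outcomes))
    # Pass^k: tasks solved in every one of the k attempts.
    consistent_tasks = sum(1 for outcomes in results.values() if all(outcomes))
    # Attempt Score: solved attempts divided by all scoreable attempts.
    solved_attempts = sum(sum(outcomes) for outcomes in results.values())

    return {
        "solved_tasks": solved_tasks,
        "consistent_tasks": consistent_tasks,
        "solved_attempts": solved_attempts,
        "pass_at_k": solved_tasks / total_tasks,
        "pass_hat_k": consistent_tasks / total_tasks,
        "attempt_score": solved_attempts / total_attempts,
    }
```

With 151 tasks and 3 attempts each, `total_attempts` comes to 453, which matches the denominator in the Attempt Score column.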

Visual Leaderboard

Switch metrics to compare the same agent/model pairs by reach, consistency, and per-attempt solve rate.

[Bar chart: one bar per agent/model pair for the selected metric, listed in leaderboard order from GPT 5.3 codex (xhigh) (Codex - OpenAI) to Step 3.5 flash 2603 (OpenCode - StepFun).]

Reach vs Consistency

Pass@3 shows whether an agent can solve a task at least once; Pass^3 shows whether it solves the same task all three times. The gap is useful when comparing stochastic or retry-sensitive agents.
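
One way to put a number on that gap is to subtract the Pass^3 task count from the Pass@3 task count. The sketch below is illustrative only: the helper name `reach_consistency_gap` and the derived "consistency ratio" are assumptions for this example, not metrics the leaderboard itself reports.

```python
from typing import Dict


def reach_consistency_gap(
    pass_at_3_tasks: int, pass_hat_3_tasks: int, total_tasks: int = 151
) -> Dict[str, float]:
    """Compare reach (Pass@3) with consistency (Pass^3) for one row.

    Inputs are task counts, e.g. 57 and 31 out of 151.
    """
    if not 0 <= pass_hat_3_tasks <= pass_at_3_tasks <= total_tasks:
        raise ValueError("expected 0 <= Pass^3 tasks <= Pass@3 tasks <= total tasks")

    gap_tasks = pass_at_3_tasks - pass_hat_3_tasks
    return {
        # Tasks solved at least once but not in all three attempts.
        "gap_tasks": gap_tasks,
        # The same gap expressed in percentage points of the task set.
        "gap_points": 100.0 * gap_tasks / total_tasks,
        # Share of solved-at-least-once tasks that were solved every time
        # (a hypothetical derived figure, not a leaderboard column).
        "consistency_ratio": (
            pass_hat_3_tasks / pass_at_3_tasks if pass_at_3_tasks else 0.0
        ),
    }
```

Using the leaderboard's own figures, Codex - GPT 5.3 codex (xhigh) (57 vs 31 tasks) has a gap of 26 tasks, while Codex - GPT 5.5 (51 vs 40 tasks) has a gap of 11. The top-ranked row reaches more tasks, but the third-ranked row repeats its solves more reliably.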

Pass@3
[Chart panel: one bar per agent/model pair, in leaderboard order from Codex - GPT 5.3 codex (xhigh) to OpenCode - Step 3.5 flash 2603.]
Pass^3