Rank 1 · OpenCode · Zhipu GLM

GLM 5.2

A fast-rising coding agent with a very clear personality: excellent on Python and ops-shaped repairs, fragile when the task turns into a cross-package product change.

opencode-cli 1.17.8 · zai-coding-plan/glm-5.2 · Updated 2026-06-18

How to read this result

  • GLM 5.2 is ranked #1 in this export, but not because it dominates every repo. It wins by being very reliable on several high-signal suites.
  • The strongest shape is Python or ops-heavy work with tight tests: Open Library, Ansible, and localized vulnerability-scanner patches.
  • The weak shape is broad product plumbing: Flipt authentication/feature-flag changes, qutebrowser's Qt runtime behavior, and Navidrome persistence/API work.
  • The stricter verifier keeps 101 of 140 headline successes, so the rank is real but the margin should be read with audit context.

GLM 5.2’s result is interesting because it does not look like a model that is equally good everywhere. It looks like a model with a strong center of gravity. When the benchmark asks for a localized repair, a schema change with obvious tests, or a Python service fix where the failing test points at the right layer, it behaves like a first-place agent. When the task asks it to move a product feature through several packages and keep every consumer aligned, the same model can look surprisingly ordinary.

That is why the rank-1 number is useful, but the shape behind it matters more than the number itself.

Selected high and low suites, shown by pass-at-least-once rate (solved in at least one of three attempts).

vuls · release 012 · 4/4 · 100.0%

Localized Go vulnerability-scanner patches; the best clean-room signal in this run.

openlibrary · release 013 · 8/10 · 80.0%

Large Python/Django surface where failing tests usually point to the right subsystem.

openlibrary · release 015 · 7/10 · 70.0%

Bibliographic metadata and import-path fixes: broad codebase, but concrete invariants.

ansible · release 004 · 2/3 · 66.7%
ansible · release 001 · 6/10 · 60.0%
vuls · release 010 · 5/10 · 50.0%
flipt · release 008 · 2/10 · 20.0%

Feature-flag service work where the patch must land across API, storage, and UI-adjacent boundaries.

flipt · release 006 · 1/10 · 10.0%
navidrome · release 017 · 0/5 · 0.0%

User-facing Go service tasks that combine migrations, repositories, and API behavior.

qutebrowser · release 018 · 0/9 · 0.0%

QtWebEngine runtime behavior and browser integration; zero successful tasks in this export.

The best version of GLM 5.2 shows up in two clusters. The first is Open Library, a large Django-era Python codebase with enough tests and naming conventions to reward careful search. GLM 5.2 solved 8/10 tasks in release-zh-013 and 7/10 in release-zh-015; more importantly, 12 of those Open Library tasks were stable enough to pass in all three attempts across the two suites.

The second cluster is ops-shaped code: Ansible modules, inventory behavior, and vuls scanner configuration. These are not trivial tasks, but they tend to expose a narrow invariant. In vuls release 012, the model went 4/4. In one representative task, it had to add Docker image digest support through config structs, validation, and scan-result output. That is exactly the kind of patch where GLM 5.2 can read the tests, find the right seam in the codebase, and finish without inventing a larger product design.

The weak side has a different texture. Flipt asks for product plumbing: authentication methods, feature-flag semantics, storage interfaces, API behavior, and UI-adjacent surfaces have to move together. GLM 5.2 often makes the obvious local edit, then misses a second-order consumer. That is why the Flipt releases sit at 10-40%, with several suites around 1-2 solved tasks out of 10.

qutebrowser and Navidrome are even harsher. qutebrowser release 018 is 0/9, and Navidrome release 017 is 0/5. The qutebrowser failures are not about Python syntax; they are about runtime behavior across QtWebEngine versions, dark-mode settings, URL pattern support, and browser state. Navidrome’s misses are Go service changes that combine schema migration, repository interfaces, and HTTP/Subsonic behavior. In both cases, the patch has to remain coherent after it leaves the first file.

Initial harness verdict vs stricter verifier-backed audit
101 of 140 headline successes survived strict re-verification.

The audit does not knock GLM 5.2 out of the top tier, but it changes the reading: the headline rank is powered by real wins, with a non-trivial band of brittle successes.

101 verifier-backed · 39 strict rejected
Score: 37.59 → 32.87 (-4.72 points)

The verifier audit is the main caveat. The headline run reports 140 harness-ok attempts out of 453. Under the stricter re-verification pass, 101 remain verifier-backed, moving the score from 37.59 to 32.87. That does not make the result disposable; it means the right reading is “top-tier, but with a visible brittle band,” not “dominates the suite.”
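The audit arithmetic is easy to recheck. The sketch below uses only the four figures quoted above; the variable names are mine, and it deliberately does not try to reproduce how the score itself is computed, since the scoring formula is not given here.

```python
# Figures quoted in the audit paragraph; nothing else is assumed.
headline_successes = 140   # harness-ok attempts in the headline run
verifier_backed = 101      # attempts that survived strict re-verification
headline_score = 37.59
strict_score = 32.87

# How many headline successes the stricter verifier rejected.
strict_rejected = headline_successes - verifier_backed

# Share of headline successes that survived the audit.
survival_rate = verifier_backed / headline_successes

# Score movement; rounded to two decimals to avoid floating-point noise.
score_delta = round(strict_score - headline_score, 2)

print(f"strict rejected: {strict_rejected}")
print(f"survival rate:   {survival_rate:.1%}")
print(f"score delta:     {score_delta:+.2f}")
```

Roughly 72% of the headline successes survive, which is the "visible brittle band" in numeric form: most wins hold, but more than a quarter do not.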

So if you are choosing an agent, I would treat GLM 5.2 as a strong default for Python services, ops tooling, Ansible-style automation, and localized backend repairs. I would be more cautious when the task is a Go product feature that touches storage, API, auth, and frontend consumers at once, or a browser/runtime integration where the tests encode behavior that is not obvious from static code search.

Supporting suite table

Suite                                       Repo                         Solved  Pass^3  Rate
release-zh-012-future-architect-vuls        future-architect/vuls        4/4     3       100.0%
release-zh-013-internetarchive-openlibrary  internetarchive/openlibrary  8/10    6       80.0%
release-zh-015-internetarchive-openlibrary  internetarchive/openlibrary  7/10    6       70.0%
release-zh-004-ansible-ansible              ansible/ansible              2/3     2       66.7%
release-zh-001-ansible-ansible              ansible/ansible              6/10    3       60.0%
release-zh-010-future-architect-vuls        future-architect/vuls        5/10    3       50.0%
release-zh-008-flipt-io-flipt               flipt-io/flipt               2/10    1       20.0%
release-zh-006-flipt-io-flipt               flipt-io/flipt               1/10    1       10.0%
release-zh-017-navidrome-navidrome          navidrome/navidrome          0/5     0       0.0%
release-zh-018-qutebrowser-qutebrowser      qutebrowser/qutebrowser      0/9     0       0.0%
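
To see the repo-level shape rather than the suite-level one, the rows above can be rolled up per repository. This is a minimal sketch: the tuples are copied from the table, the helper name `per_repo` is hypothetical, and I am reading Pass^3 as the number of tasks that passed in all three attempts, which is how the Open Library paragraph uses it. It covers only the selected suites listed here, not every suite in the run.

```python
from collections import defaultdict

# (repo, tasks solved at least once, total tasks, tasks passing all three attempts)
# Values copied from the suite table; the Pass^3 reading is an assumption.
ROWS = [
    ("future-architect/vuls", 4, 4, 3),
    ("internetarchive/openlibrary", 8, 10, 6),
    ("internetarchive/openlibrary", 7, 10, 6),
    ("ansible/ansible", 2, 3, 2),
    ("ansible/ansible", 6, 10, 3),
    ("future-architect/vuls", 5, 10, 3),
    ("flipt-io/flipt", 2, 10, 1),
    ("flipt-io/flipt", 1, 10, 1),
    ("navidrome/navidrome", 0, 5, 0),
    ("qutebrowser/qutebrowser", 0, 9, 0),
]


def per_repo(rows):
    """Sum solved, total, and stable counts per repo and derive the solve rate."""
    totals = defaultdict(lambda: [0, 0, 0])
    for repo, solved, tasks, stable in rows:
        totals[repo][0] += solved
        totals[repo][1] += tasks
        totals[repo][2] += stable
    return {
        repo: {"solved": s, "tasks": n, "stable": k, "rate": s / n}
        for repo, (s, n, k) in totals.items()
    }


summary = per_repo(ROWS)

# Print repos from strongest to weakest pass-at-least-once rate.
for repo, row in sorted(summary.items(), key=lambda kv: -kv[1]["rate"]):
    print(
        f"{repo:<30} {row['solved']}/{row['tasks']}"
        f"  rate={row['rate']:.1%}  stable={row['stable']}"
    )
```

On these rows Open Library comes out at 15/20 with 12 stable tasks, while Flipt sits at 3/20, which matches the split between the strong and weak clusters described earlier.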
