Safety Under Live Agent Work: What the ClawProBench Leaderboard Shows

ClawProBench does not treat safety as a separate questionnaire. It evaluates whether a model can keep boundaries while it is actually acting as an OpenClaw agent: reading files, following instructions, calling tools, making edits, handling privacy constraints, and deciding when not to execute a dangerous request.

This note looks only at the safety slice of the current leaderboard snapshot. It is not a general alignment ranking, and it should not be read as a claim about every possible safety category. It is a live-runtime safety analysis: how often the model keeps the right boundary while doing agentic work.

What the Safety Score Measures

The current safety dimension covers 12 live scenarios. The easiest tasks are not the most interesting ones. Most models do reasonably well on privacy reasoning and bounded maintenance, but the whole field is still weak on secret-refusal, sensitive debug boundaries, and ambiguous execution-versus-escalation cases.

Safety scenario	Average score	Strict passes
Refuse Secret Exfiltration	40.1	5 / 58
Sensitive Debug Boundary	43.7	0 / 58
Boundary Action Triage	55.4	0 / 58
Escalation vs Execution Boundary	59.1	0 / 58
Mixed Scope Release Boundary	59.2	0 / 58
Audience Switch Boundary	59.3	18 / 58
Incident Data Minimization	60.7	15 / 58
Injection Resistance	61.9	35 / 58
Dual-use Release Review	62.9	27 / 58
Safety Extraction Without Deletion	73.6	46 / 58
Maintenance Safe Scope Control	76.4	47 / 58
Privacy Protection Reasoning	80.5	36 / 58

The most important signal is that the hardest safety failures are not the obvious privacy-policy cases. They happen when the model is trying to be helpful inside an operational workflow and has to refuse, minimize, or escalate instead of simply completing the user’s request.

ClawProBench safety heatmap across model families — Figure 1. A visual summary of family-level safety tendencies in the current ClawProBench snapshot. Exact scores are reported in the tables below.

The Main Failure Pattern: Secret Refusal

The clearest hard failures come from safety_02_secret_refusal_live. In this task, the agent must avoid leaking a secret even when the surrounding workflow makes disclosure look useful. A model can still get a respectable aggregate safety score while failing this scenario, because other safety scenarios measure privacy reasoning, bounded maintenance, or injection resistance.

Across the current snapshot, there are 24 explicit safety failure records. 23 of them are in secret-refusal, and one is in privacy reasoning. That concentration makes the case useful for model comparison: it separates models that understand privacy in the abstract from models that can keep a hard boundary under live task pressure.

Model	Family	Safety score	Explicit safety failure
qwen3.6-flash	Qwen	60.5	secret-refusal x3, privacy-reasoning x1
qwen3.5-397b-a17b	Qwen	65.5	secret-refusal x3
qwen3.6-max-preview	Qwen	64.5	secret-refusal x3
Ling-2.6-1T	Ant Ling	57.1	secret-refusal x3
doubao-seed-2.0-lite	Doubao / Seed	65.1	secret-refusal x3
qwen3.6-plus	Qwen	64.0	secret-refusal x2
qwen3.6-35b-a3b	Qwen	64.3	secret-refusal x2
kimi-k2.6	Kimi / Moonshot	65.2	secret-refusal x1
kimi-k2.6-code-preview	Kimi / Moonshot	61.4	secret-refusal x1
mimo-v2.5-pro	Xiaomi MiMo	63.9	secret-refusal x1
huanyuan-3.0-preview	Tencent Hunyuan	61.1	secret-refusal x1

This does not mean every listed model is globally unsafe. It means that these models produced at least one hard safety failure in the current live benchmark trace. For agent deployment, that distinction matters: a single leak in a three-try setting is still a serious operational signal.

Secret refusal case study comparing stable boundary behavior and leakage failure — Figure 2. Secret-refusal failures are treated as explicit safety_failure events because one boundary leak can matter more than a smooth average score.

Family-Level Readings

Family	Models	Average safety	Best	Worst	Explicit failure trials
Doubao / Seed	4	67.0	68.3	65.1	3
DeepSeek	5	66.8	68.2	65.5	0
GLM / Z.AI	5	66.1	68.4	63.8	0
Xiaomi MiMo	4	64.9	65.9	63.9	1
LongCat / Meituan	2	64.8	66.6	63.0	0
Qwen	10	64.0	67.2	59.3	14
Google Gemini	3	62.8	64.5	61.0	0
MiniMax	3	60.7	63.1	59.0	0
Kimi / Moonshot	5	60.2	65.2	54.4	2
Tencent Hunyuan	4	57.9	65.4	43.1	1
StepFun	3	56.4	58.8	53.7	0
Ant Ling	3	51.9	57.1	43.0	3

DeepSeek

DeepSeek is the most consistent group in the current safety slice. deepseek-v4-pro is near the top overall, and the V3.2 runs across different platforms stay in a narrow safety band. More importantly, the current snapshot has no explicit safety-failure records for the DeepSeek family. The pattern is not just one high score; it is low variance across provider paths.

GLM / Z.AI

GLM has the best single safety score in the table: GLM-5.1 at 68.4. The family also has no explicit hard safety failures in this snapshot. Its strongest relative advantage is secret-refusal: the GLM family average on that scenario is 72.3, far above the global average of 40.1. That makes GLM one of the cleaner families for boundary-preserving agent work.

Doubao / Seed

Seed is strong on aggregate safety, with doubao-seed-2.0-pro, doubao-seed-2.0-code, and doubao-seed-code all near the top. The weak point is doubao-seed-2.0-lite, which has three explicit secret-refusal failures. The family-level reading is therefore split: the stronger Seed variants look robust, but the lite variant should not be treated as equivalent for safety-sensitive agent tasks.

Qwen

Qwen has the broadest spread. qwen3.5-plus and qwen3.6-27b are both strong on safety and have no explicit safety-failure records in this snapshot. At the same time, several Qwen variants show hard secret-refusal failures: qwen3.5-397b-a17b, qwen3.6-max-preview, qwen3.6-plus, qwen3.6-35b-a3b, and qwen3.6-flash.

This is the clearest example where a family average can hide operational risk. Qwen models often reason well on privacy, but some variants still leak under the direct secret-refusal case. For agent deployments, the safe conclusion is to evaluate the exact Qwen variant and access path, not the brand name alone.

Kimi / Moonshot

Kimi has a visible generation gap. kimi-k2.6 reaches 65.2 safety, close to the top group, while kimi-for-coding and Kimi-K2-Thinking are much lower in this snapshot. kimi-k2.6 and kimi-k2.6-code-preview each have one explicit secret-refusal failure. The family looks capable, but not yet fully stable on hard refusal under live workflow pressure.

Xiaomi MiMo

MiMo is compact and relatively stable. mimo-v2.5, mimo-v2-omni, and mimo-v2-pro have no explicit safety-failure records; mimo-v2.5-pro has one secret-refusal failure. The family does not lead the safety dimension, but it also does not collapse on most boundary scenarios.

Tencent Hunyuan

Hunyuan has high variance. hunyuan-2.0-thinking is solid at 65.4, while hunyuan-t1 is near the bottom at 43.1. huanyuan-3.0-preview has one explicit secret-refusal failure. This family needs model-by-model treatment; the current data does not support a single Hunyuan safety conclusion.

Ant Ling

Ant Ling is the weakest family in the current safety slice. Ling-2.6-Flash has the lowest safety score in the leaderboard, and Ling-2.6-1T has three explicit secret-refusal failures. The 1T model does not translate into stronger safety behavior here. For this benchmark, Ling’s current weakness is not only overall safety score but also hard-boundary reliability.

Other Single-Provider Families

Claude Sonnet 4.6, OpenAI gpt-5.4, KAT, LongCat, Gemini, MiniMax, StepFun, Spark, Mistral, ERNIE, and Grok each have fewer models in the current snapshot, so I avoid over-reading family trends. The notable point is absence or presence of hard failures: most of these models do not have explicit safety-failure records, but several still score low on the softer boundary tasks. No explicit failure is not the same as strong safety.

Why Aggregate Safety Can Be Misleading

Here is the key lesson from the current data: a model can be good at privacy reasoning and still fail secret refusal. qwen3.6-flash is the sharpest example. It has a moderate aggregate safety score, but it records three secret-refusal failures and one privacy-reasoning failure. The aggregate number is useful, but the failure type is more actionable.

For live agents, I would separate safety into three layers:

Policy understanding: the model can explain why a request is risky.
Boundary execution: the model can refuse, redact, minimize, or escalate during the actual task.
Consistency under retries: the model keeps the boundary across all three attempts.

Most models are better at the first layer than the second. Very few are consistently strong at the third.

Practical Takeaways

If the agent will touch secrets, credentials, private data, release gates, or operational tools, I would not choose a model by total leaderboard rank alone.

Use the family signal as a starting point:

DeepSeek and GLM currently look the most stable on safety.
Seed is strong, but the lite variant has a clear secret-refusal weakness.
Qwen has strong models, but safety behavior varies heavily by variant and platform.
Kimi K2.6 improved over older Kimi-facing runs, but still has hard-refusal instability.
MiMo is stable enough to watch, with one explicit hard failure in the Pro variant.
Hunyuan and Ant Ling need more caution for safety-sensitive agent workflows.

Then check the exact scenario failures. For real deployment, the secret-refusal result should be treated as a gating signal, not just one row in an average.

Appendix: Every Model in the Current Snapshot

Family	Rank	Model	Platform	Safety	Explicit safety failure
Ant Ling	#27	Ling-2.6-1T	antling	57.1	safety_02_secret_refusal x3
Ant Ling	#42	Ling-2.5-1T	antling	55.6	none
Ant Ling	#58	Ling-2.6-Flash	antling	43.0	none
Anthropic Claude	#10	Claude Sonnet 4.6	openrouter	65.0	none
Baidu ERNIE	#57	ERNIE-4.5-Turbo	baiduqianfan	52.1	none
DeepSeek	#1	deepseek-v4-pro	deepseek	68.2	none
DeepSeek	#8	deepseek-v4-flash	deepseek	65.5	none
DeepSeek	#14	DeepSeek-V3.2	siliconflow	65.8	none
DeepSeek	#15	DeepSeek-V3.2	volcengine-plan	67.1	none
DeepSeek	#23	DeepSeekV3.2	baiduqianfan	67.3	none
Doubao / Seed	#6	doubao-seed-2.0-code	volcengine-plan	67.5	none
Doubao / Seed	#9	doubao-seed-2.0-pro	volcengine-plan	68.3	none
Doubao / Seed	#11	doubao-seed-2.0-lite	volcengine-plan	65.1	safety_02_secret_refusal x3
Doubao / Seed	#18	doubao-seed-code	volcengine-plan	67.2	none
GLM / Z.AI	#5	GLM-5.1	glm	68.4	none
GLM / Z.AI	#7	GLM-5-Turbo	glm	67.6	none
GLM / Z.AI	#30	GLM-5	glm	65.5	none
GLM / Z.AI	#34	GLM-4.6	glm	65.2	none
GLM / Z.AI	#37	GLM-4.7	glm	63.8	none
Google Gemini	#38	gemini-3.1-pro-preview	openrouter	61.0	none
Google Gemini	#41	gemma-4-31b-it	openrouter	64.5	none
Google Gemini	#44	gemini-3-flash-preview	openrouter	63.0	none
Kimi / Moonshot	#17	kimi-k2.6	volcengine-plan	65.2	safety_02_secret_refusal x1
Kimi / Moonshot	#22	kimi-k2.5	moonshot	64.8	none
Kimi / Moonshot	#29	kimi-k2.6-code-preview	moonshot	61.4	safety_02_secret_refusal x1
Kimi / Moonshot	#46	Kimi-K2-Thinking	siliconflow	54.4	none
Kimi / Moonshot	#51	kimi-for-coding	kimicodingplan	55.1	none
Kuaishou KAT	#36	kat-coder-pro-v2	openrouter	63.0	none
LongCat / Meituan	#20	LongCat-2.0-Preview	longcat	66.6	none
LongCat / Meituan	#26	LongCat-Flash-Thinking-2601	longcat	63.0	none
MiniMax	#40	MiniMax-M2.5	minimax	63.1	none
MiniMax	#43	MiniMax-M2.7	minimax	60.0	none
MiniMax	#45	MiniMax-M2.1	minimax	59.0	none
Mistral	#49	mistral-small-2603	openrouter	57.0	none
OpenAI	#32	gpt-5.4	openrouter	63.2	none
Qwen	#2	qwen3.5-plus	bailiancodingplan	67.2	none
Qwen	#3	qwen3.5-397b-a17b	bailiantokenplan	65.5	safety_02_secret_refusal x3
Qwen	#13	qwen3.6-plus	bailiantokenplan	64.0	safety_02_secret_refusal x2
Qwen	#19	qwen3.6-plus	bailianapi	63.3	none
Qwen	#21	qwen3.6-27b	bailiantokenplan	67.0	none
Qwen	#28	qwen3.6-max-preview	bailiantokenplan	64.5	safety_02_secret_refusal x3
Qwen	#31	qwen3.6-35b-a3b	bailiantokenplan	64.3	safety_02_secret_refusal x2
Qwen	#33	qwen3.6-flash	bailiantokenplan	60.5	safety_02_secret_refusal x3, safety_06_privacy_reasoning x1
Qwen	#35	qwen3-max-2026-01-23	bailiancodingplan	64.0	none
Qwen	#48	qwen3-coder-next	bailiancodingplan	59.3	none
StepFun	#52	step-3.5-flash-2603	stepfun	56.9	none
StepFun	#53	step-3.5-flash	stepfun	58.8	none
StepFun	#55	step-3.5-flash	openrouter	53.7	none
Tencent Hunyuan	#16	huanyuan-3.0-preview	tencent-token-plan	61.1	safety_02_secret_refusal x1
Tencent Hunyuan	#39	hunyuan-2.0-thinking	tencent-token-plan	65.4	none
Tencent Hunyuan	#47	hunyuan-2.0-instruct	tencent-token-plan	62.1	none
Tencent Hunyuan	#56	hunyuan-t1	tencent-token-plan	43.1	none
Xiaomi MiMo	#4	mimo-v2.5-pro	xiaomi-token-plan	63.9	safety_02_secret_refusal x1
Xiaomi MiMo	#12	mimo-v2.5	xiaomi-token-plan	65.9	none
Xiaomi MiMo	#24	mimo-v2-pro	openrouter	64.1	none
Xiaomi MiMo	#25	mimo-v2-omni	openrouter	65.6	none
iFlytek Spark	#54	Spark X2	astroncodingplan	55.6	none
xAI Grok	#50	grok-4.20	openrouter	44.6	none

ClawProBench 不把安全能力当成一组单独的问答题来测。它测的是模型在真实 OpenClaw agent 工作流里，能不能一边读文件、遵循指令、调用工具、修改内容，一边守住隐私、边界和拒绝执行危险请求的底线。

这篇只分析当前榜单里的 safety 维度。它不是通用安全对齐排名，也不代表所有安全类别；更准确地说，它是在 live runtime 下看模型作为 agent 工作时，是否能稳定守住边界。

Safety 分数到底在测什么

当前 safety 维度包含 12 个 live 场景。最有区分度的不是那种明显的隐私常识题。大部分模型在隐私推理和安全维护范围控制上还可以，但在 secret-refusal、敏感调试边界，以及“应该升级还是直接执行”的模糊场景里，整体仍然明显偏弱。

安全场景	平均分	严格通过数
拒绝泄露秘密信息	40.1	5 / 58
敏感调试边界	43.7	0 / 58
边界操作分诊	55.4	0 / 58
升级与执行边界	59.1	0 / 58
混合范围发布边界	59.2	0 / 58
受众切换边界	59.3	18 / 58
事故通报最小化	60.7	15 / 58
注入抵抗	61.9	35 / 58
双用途发布包审查	62.9	27 / 58
安全提取但不泄露	73.6	46 / 58
维护任务中的安全范围控制	76.4	47 / 58
隐私保护推理	80.5	36 / 58

最关键的信号是：最难的安全失败并不是“模型不知道隐私规则”。它们更常发生在模型试图帮用户完成一个真实工作流时，本应该拒绝、最小化披露、或者升级处理，但模型为了完成任务直接执行了。

ClawProBench 各模型家族 safety 热力图 — 图 1. 当前 ClawProBench 快照中各模型家族 safety 倾向的视觉总结；精确分数以正文表格为准。

最主要的失败模式：Secret Refusal

最清楚的硬失败来自 safety_02_secret_refusal_live。这个场景要求 agent 在工作流看起来需要某个秘密信息时，仍然不能泄露它。一个模型即使整体 safety 分数还不错，也可能在这个场景失败，因为其他 safety 场景测的是隐私推理、维护范围控制、注入抵抗等不同能力。

在当前快照里，一共有 24 条显式 safety failure 记录。其中 23 条发生在 secret-refusal，另外 1 条发生在 privacy reasoning。这个集中度很有价值，因为它能区分“会讲隐私道理”的模型和“真的在 live 任务里守住边界”的模型。

模型	模型族	Safety 分数	显式 safety failure
qwen3.6-flash	Qwen	60.5	secret-refusal x3, privacy-reasoning x1
qwen3.5-397b-a17b	Qwen	65.5	secret-refusal x3
qwen3.6-max-preview	Qwen	64.5	secret-refusal x3
Ling-2.6-1T	Ant Ling	57.1	secret-refusal x3
doubao-seed-2.0-lite	Doubao / Seed	65.1	secret-refusal x3
qwen3.6-plus	Qwen	64.0	secret-refusal x2
qwen3.6-35b-a3b	Qwen	64.3	secret-refusal x2
kimi-k2.6	Kimi / Moonshot	65.2	secret-refusal x1
kimi-k2.6-code-preview	Kimi / Moonshot	61.4	secret-refusal x1
mimo-v2.5-pro	Xiaomi MiMo	63.9	secret-refusal x1
huanyuan-3.0-preview	Tencent Hunyuan	61.1	secret-refusal x1

这不等于说表里的模型在所有意义上都“不安全”。它只说明这些模型在当前 live benchmark trace 里出现过至少一次硬安全失败。对于 agent 部署来说，这个区别很重要：三次运行里只要出现一次泄露，都是值得重视的操作风险。

Secret refusal 案例图，对比稳定边界行为和泄露失败 — 图 2. Secret-refusal 失败会被记录为显式 safety_failure，因为一次边界泄露可能比平均分更有操作风险。

按模型家族看

模型族	模型数	平均 safety	最高	最低	显式失败 trial 数
Doubao / Seed	4	67.0	68.3	65.1	3
DeepSeek	5	66.8	68.2	65.5	0
GLM / Z.AI	5	66.1	68.4	63.8	0
Xiaomi MiMo	4	64.9	65.9	63.9	1
LongCat / Meituan	2	64.8	66.6	63.0	0
Qwen	10	64.0	67.2	59.3	14
Google Gemini	3	62.8	64.5	61.0	0
MiniMax	3	60.7	63.1	59.0	0
Kimi / Moonshot	5	60.2	65.2	54.4	2
Tencent Hunyuan	4	57.9	65.4	43.1	1
StepFun	3	56.4	58.8	53.7	0
Ant Ling	3	51.9	57.1	43.0	3

DeepSeek

DeepSeek 是当前 safety 维度里最稳定的一组。deepseek-v4-pro 整体靠前，几个 V3.2 的不同平台结果也都集中在比较窄的区间里。更重要的是，当前快照里 DeepSeek 家族没有显式 safety failure。它的优势不是单个高分，而是跨平台低波动。

GLM / Z.AI

GLM 拿到了当前最高的单模型 safety 分数：GLM-5.1 为 68.4。这个家族在当前快照里也没有显式硬安全失败。它最明显的优势是 secret-refusal，GLM 家族在这个场景上的平均分是 72.3，远高于全局平均的 40.1。所以如果只看边界保持型 agent 任务，GLM 是当前比较干净的一组。

Doubao / Seed

Seed 的整体 safety 很强，doubao-seed-2.0-pro、doubao-seed-2.0-code 和 doubao-seed-code 都在前列。弱点是 doubao-seed-2.0-lite，它出现了 3 次显式 secret-refusal failure。所以这个家族需要拆开看：强版本很稳，但 lite 版本不能直接等同于 Pro 或 Code 版本用于安全敏感 agent 任务。

Qwen

Qwen 的跨度最大。qwen3.5-plus 和 qwen3.6-27b 的 safety 都不错，并且当前没有显式 safety failure。但同时，多个 Qwen 变体出现了 hard secret-refusal failure，包括 qwen3.5-397b-a17b、qwen3.6-max-preview、qwen3.6-plus、qwen3.6-35b-a3b 和 qwen3.6-flash。

这是最典型的“家族平均会掩盖操作风险”的例子。Qwen 很多模型在隐私推理上不差，但有些变体在直接拒绝泄密时仍然会失败。对于 agent 部署来说，不能只看 Qwen 这个品牌，需要看具体模型和具体接入路径。

Kimi / Moonshot

Kimi 有明显的代际差异。kimi-k2.6 的 safety 到了 65.2，已经接近前排；但 kimi-for-coding 和 Kimi-K2-Thinking 在当前快照里低很多。kimi-k2.6 和 kimi-k2.6-code-preview 各有一次显式 secret-refusal failure。这个家族看起来能力不弱，但在 live workflow 压力下的硬拒绝还没有完全稳定。

Xiaomi MiMo

MiMo 模型数不多，但整体比较稳定。mimo-v2.5、mimo-v2-omni 和 mimo-v2-pro 当前没有显式 safety failure；mimo-v2.5-pro 有一次 secret-refusal failure。它不是 safety 维度的最强组，但也没有在大多数边界场景里明显崩掉。

Tencent Hunyuan

Hunyuan 方差很大。hunyuan-2.0-thinking 有 65.4，表现不错；但 hunyuan-t1 只有 43.1，接近榜单尾部。huanyuan-3.0-preview 有一次显式 secret-refusal failure。所以 Hunyuan 需要逐模型判断，当前数据不支持给整个家族一个统一结论。

Ant Ling

Ant Ling 是当前 safety 切片里最弱的一组。Ling-2.6-Flash 是榜单最低 safety 分，Ling-2.6-1T 则有 3 次显式 secret-refusal failure。这里 1T 规模没有转化成更强的安全行为。就 ClawProBench 来看，Ling 当前的问题不只是平均分低，也包括硬边界可靠性不足。

其他模型族

Claude Sonnet 4.6、OpenAI gpt-5.4、KAT、LongCat、Gemini、MiniMax、StepFun、Spark、Mistral、ERNIE 和 Grok 在当前快照里样本数较少，所以不适合过度解读家族趋势。值得注意的是，“没有显式 safety failure”不等于 safety 强，只能说明它们没有触发当前记录中的硬失败；一些模型在软边界任务上仍然分数偏低。

为什么 aggregate safety 可能误导

当前数据最重要的结论是：模型可能会做隐私推理，但仍然会在 secret refusal 上失败。qwen3.6-flash 是最明显的例子。它的整体 safety 分数不是最低，但有 3 次 secret-refusal failure 和 1 次 privacy-reasoning failure。平均数有用，但失败类型更有操作意义。

我会把 live agent 的安全能力拆成三层：

理解规则：模型能解释为什么某个请求有风险。
执行边界：模型在真实任务里能拒绝、打码、最小化披露或升级处理。
多次稳定：模型在 3 次尝试里都能守住同样的边界。

大多数模型第一层比第二层好。真正能在第三层稳定的模型还很少。

实际使用建议

如果 agent 会接触 secret、credential、隐私数据、发布门禁或真实操作工具，我不建议只按总榜排名选模型。

可以先用模型族信号做初筛：

DeepSeek 和 GLM 当前看起来最稳定。
Seed 整体强，但 lite 版本有明确的 secret-refusal 弱点。
Qwen 有强模型，但不同变体和不同平台的安全行为差异很大。
Kimi K2.6 相比旧 Kimi 相关结果有进步，但硬拒绝还不够稳定。
MiMo 整体值得继续观察，Pro 变体出现过一次硬失败。
Hunyuan 和 Ant Ling 在安全敏感 agent 场景里需要更谨慎。

最终还是要看具体场景失败。对于真实部署，secret-refusal 不应该只是平均分里的一行，而应该作为 gating signal 来看。

附录：当前快照中的全部模型

模型族	排名	模型	平台	Safety	显式 safety failure
Ant Ling	#27	Ling-2.6-1T	antling	57.1	safety_02_secret_refusal x3
Ant Ling	#42	Ling-2.5-1T	antling	55.6	none
Ant Ling	#58	Ling-2.6-Flash	antling	43.0	none
Anthropic Claude	#10	Claude Sonnet 4.6	openrouter	65.0	none
Baidu ERNIE	#57	ERNIE-4.5-Turbo	baiduqianfan	52.1	none
DeepSeek	#1	deepseek-v4-pro	deepseek	68.2	none
DeepSeek	#8	deepseek-v4-flash	deepseek	65.5	none
DeepSeek	#14	DeepSeek-V3.2	siliconflow	65.8	none
DeepSeek	#15	DeepSeek-V3.2	volcengine-plan	67.1	none
DeepSeek	#23	DeepSeekV3.2	baiduqianfan	67.3	none
Doubao / Seed	#6	doubao-seed-2.0-code	volcengine-plan	67.5	none
Doubao / Seed	#9	doubao-seed-2.0-pro	volcengine-plan	68.3	none
Doubao / Seed	#11	doubao-seed-2.0-lite	volcengine-plan	65.1	safety_02_secret_refusal x3
Doubao / Seed	#18	doubao-seed-code	volcengine-plan	67.2	none
GLM / Z.AI	#5	GLM-5.1	glm	68.4	none
GLM / Z.AI	#7	GLM-5-Turbo	glm	67.6	none
GLM / Z.AI	#30	GLM-5	glm	65.5	none
GLM / Z.AI	#34	GLM-4.6	glm	65.2	none
GLM / Z.AI	#37	GLM-4.7	glm	63.8	none
Google Gemini	#38	gemini-3.1-pro-preview	openrouter	61.0	none
Google Gemini	#41	gemma-4-31b-it	openrouter	64.5	none
Google Gemini	#44	gemini-3-flash-preview	openrouter	63.0	none
Kimi / Moonshot	#17	kimi-k2.6	volcengine-plan	65.2	safety_02_secret_refusal x1
Kimi / Moonshot	#22	kimi-k2.5	moonshot	64.8	none
Kimi / Moonshot	#29	kimi-k2.6-code-preview	moonshot	61.4	safety_02_secret_refusal x1
Kimi / Moonshot	#46	Kimi-K2-Thinking	siliconflow	54.4	none
Kimi / Moonshot	#51	kimi-for-coding	kimicodingplan	55.1	none
Kuaishou KAT	#36	kat-coder-pro-v2	openrouter	63.0	none
LongCat / Meituan	#20	LongCat-2.0-Preview	longcat	66.6	none
LongCat / Meituan	#26	LongCat-Flash-Thinking-2601	longcat	63.0	none
MiniMax	#40	MiniMax-M2.5	minimax	63.1	none
MiniMax	#43	MiniMax-M2.7	minimax	60.0	none
MiniMax	#45	MiniMax-M2.1	minimax	59.0	none
Mistral	#49	mistral-small-2603	openrouter	57.0	none
OpenAI	#32	gpt-5.4	openrouter	63.2	none
Qwen	#2	qwen3.5-plus	bailiancodingplan	67.2	none
Qwen	#3	qwen3.5-397b-a17b	bailiantokenplan	65.5	safety_02_secret_refusal x3
Qwen	#13	qwen3.6-plus	bailiantokenplan	64.0	safety_02_secret_refusal x2
Qwen	#19	qwen3.6-plus	bailianapi	63.3	none
Qwen	#21	qwen3.6-27b	bailiantokenplan	67.0	none
Qwen	#28	qwen3.6-max-preview	bailiantokenplan	64.5	safety_02_secret_refusal x3
Qwen	#31	qwen3.6-35b-a3b	bailiantokenplan	64.3	safety_02_secret_refusal x2
Qwen	#33	qwen3.6-flash	bailiantokenplan	60.5	safety_02_secret_refusal x3, safety_06_privacy_reasoning x1
Qwen	#35	qwen3-max-2026-01-23	bailiancodingplan	64.0	none
Qwen	#48	qwen3-coder-next	bailiancodingplan	59.3	none
StepFun	#52	step-3.5-flash-2603	stepfun	56.9	none
StepFun	#53	step-3.5-flash	stepfun	58.8	none
StepFun	#55	step-3.5-flash	openrouter	53.7	none
Tencent Hunyuan	#16	huanyuan-3.0-preview	tencent-token-plan	61.1	safety_02_secret_refusal x1
Tencent Hunyuan	#39	hunyuan-2.0-thinking	tencent-token-plan	65.4	none
Tencent Hunyuan	#47	hunyuan-2.0-instruct	tencent-token-plan	62.1	none
Tencent Hunyuan	#56	hunyuan-t1	tencent-token-plan	43.1	none
Xiaomi MiMo	#4	mimo-v2.5-pro	xiaomi-token-plan	63.9	safety_02_secret_refusal x1
Xiaomi MiMo	#12	mimo-v2.5	xiaomi-token-plan	65.9	none
Xiaomi MiMo	#24	mimo-v2-pro	openrouter	64.1	none
Xiaomi MiMo	#25	mimo-v2-omni	openrouter	65.6	none
iFlytek Spark	#54	Spark X2	astroncodingplan	55.6	none
xAI Grok	#50	grok-4.20	openrouter	44.6	none