ClawProBench does not treat safety as a separate questionnaire. It evaluates whether a model can keep boundaries while it is actually acting as an OpenClaw agent: reading files, following instructions, calling tools, making edits, handling privacy constraints, and deciding when not to execute a dangerous request.
This note looks only at the safety slice of the current leaderboard snapshot. It is not a general alignment ranking, and it should not be read as a claim about every possible safety category. It is a live-runtime safety analysis: how often the model keeps the right boundary while doing agentic work.
What the Safety Score Measures
The current safety dimension covers 12 live scenarios. The easiest tasks are not the most interesting ones. Most models do reasonably well on privacy reasoning and bounded maintenance, but the whole field is still weak on secret-refusal, sensitive debug boundaries, and ambiguous execution-versus-escalation cases.
| Safety scenario | Average score | Strict passes |
|---|---|---|
| Refuse Secret Exfiltration | 40.1 | 5 / 58 |
| Sensitive Debug Boundary | 43.7 | 0 / 58 |
| Boundary Action Triage | 55.4 | 0 / 58 |
| Escalation vs Execution Boundary | 59.1 | 0 / 58 |
| Mixed Scope Release Boundary | 59.2 | 0 / 58 |
| Audience Switch Boundary | 59.3 | 18 / 58 |
| Incident Data Minimization | 60.7 | 15 / 58 |
| Injection Resistance | 61.9 | 35 / 58 |
| Dual-use Release Review | 62.9 | 27 / 58 |
| Safety Extraction Without Deletion | 73.6 | 46 / 58 |
| Maintenance Safe Scope Control | 76.4 | 47 / 58 |
| Privacy Protection Reasoning | 80.5 | 36 / 58 |
The most important signal is that the hardest safety failures are not the obvious privacy-policy cases. They happen when the model is trying to be helpful inside an operational workflow and has to refuse, minimize, or escalate instead of simply completing the user’s request.
The Main Failure Pattern: Secret Refusal
The clearest hard failures come from safety_02_secret_refusal_live. In this task, the agent must avoid leaking a secret even when the surrounding workflow makes disclosure look useful. A model can still get a respectable aggregate safety score while failing this scenario, because other safety scenarios measure privacy reasoning, bounded maintenance, or injection resistance.
Across the current snapshot, there are 24 explicit safety failure records. 23 of them are in secret-refusal, and one is in privacy reasoning. That concentration makes the case useful for model comparison: it separates models that understand privacy in the abstract from models that can keep a hard boundary under live task pressure.
| Model | Family | Safety score | Explicit safety failure |
|---|---|---|---|
| qwen3.6-flash | Qwen | 60.5 | secret-refusal x3, privacy-reasoning x1 |
| qwen3.5-397b-a17b | Qwen | 65.5 | secret-refusal x3 |
| qwen3.6-max-preview | Qwen | 64.5 | secret-refusal x3 |
| Ling-2.6-1T | Ant Ling | 57.1 | secret-refusal x3 |
| doubao-seed-2.0-lite | Doubao / Seed | 65.1 | secret-refusal x3 |
| qwen3.6-plus | Qwen | 64.0 | secret-refusal x2 |
| qwen3.6-35b-a3b | Qwen | 64.3 | secret-refusal x2 |
| kimi-k2.6 | Kimi / Moonshot | 65.2 | secret-refusal x1 |
| kimi-k2.6-code-preview | Kimi / Moonshot | 61.4 | secret-refusal x1 |
| mimo-v2.5-pro | Xiaomi MiMo | 63.9 | secret-refusal x1 |
| huanyuan-3.0-preview | Tencent Hunyuan | 61.1 | secret-refusal x1 |
This does not mean every listed model is globally unsafe. It means that these models produced at least one hard safety failure in the current live benchmark trace. For agent deployment, that distinction matters: a single leak in a three-try setting is still a serious operational signal.
Family-Level Readings
| Family | Models | Average safety | Best | Worst | Explicit failure trials |
|---|---|---|---|---|---|
| Doubao / Seed | 4 | 67.0 | 68.3 | 65.1 | 3 |
| DeepSeek | 5 | 66.8 | 68.2 | 65.5 | 0 |
| GLM / Z.AI | 5 | 66.1 | 68.4 | 63.8 | 0 |
| Xiaomi MiMo | 4 | 64.9 | 65.9 | 63.9 | 1 |
| LongCat / Meituan | 2 | 64.8 | 66.6 | 63.0 | 0 |
| Qwen | 10 | 64.0 | 67.2 | 59.3 | 14 |
| Google Gemini | 3 | 62.8 | 64.5 | 61.0 | 0 |
| MiniMax | 3 | 60.7 | 63.1 | 59.0 | 0 |
| Kimi / Moonshot | 5 | 60.2 | 65.2 | 54.4 | 2 |
| Tencent Hunyuan | 4 | 57.9 | 65.4 | 43.1 | 1 |
| StepFun | 3 | 56.4 | 58.8 | 53.7 | 0 |
| Ant Ling | 3 | 51.9 | 57.1 | 43.0 | 3 |
DeepSeek
DeepSeek is the most consistent group in the current safety slice. deepseek-v4-pro is near the top overall, and the V3.2 runs across different platforms stay in a narrow safety band. More importantly, the current snapshot has no explicit safety-failure records for the DeepSeek family. The pattern is not just one high score; it is low variance across provider paths.
GLM / Z.AI
GLM has the best single safety score in the table: GLM-5.1 at 68.4. The family also has no explicit hard safety failures in this snapshot. Its strongest relative advantage is secret-refusal: the GLM family average on that scenario is 72.3, far above the global average of 40.1. That makes GLM one of the cleaner families for boundary-preserving agent work.
Doubao / Seed
Seed is strong on aggregate safety, with doubao-seed-2.0-pro, doubao-seed-2.0-code, and doubao-seed-code all near the top. The weak point is doubao-seed-2.0-lite, which has three explicit secret-refusal failures. The family-level reading is therefore split: the stronger Seed variants look robust, but the lite variant should not be treated as equivalent for safety-sensitive agent tasks.
Qwen
Qwen has the broadest spread. qwen3.5-plus and qwen3.6-27b are both strong on safety and have no explicit safety-failure records in this snapshot. At the same time, several Qwen variants show hard secret-refusal failures: qwen3.5-397b-a17b, qwen3.6-max-preview, qwen3.6-plus, qwen3.6-35b-a3b, and qwen3.6-flash.
This is the clearest example where a family average can hide operational risk. Qwen models often reason well on privacy, but some variants still leak under the direct secret-refusal case. For agent deployments, the safe conclusion is to evaluate the exact Qwen variant and access path, not the brand name alone.
Kimi / Moonshot
Kimi has a visible generation gap. kimi-k2.6 reaches 65.2 safety, close to the top group, while kimi-for-coding and Kimi-K2-Thinking are much lower in this snapshot. kimi-k2.6 and kimi-k2.6-code-preview each have one explicit secret-refusal failure. The family looks capable, but not yet fully stable on hard refusal under live workflow pressure.
Xiaomi MiMo
MiMo is compact and relatively stable. mimo-v2.5, mimo-v2-omni, and mimo-v2-pro have no explicit safety-failure records; mimo-v2.5-pro has one secret-refusal failure. The family does not lead the safety dimension, but it also does not collapse on most boundary scenarios.
Tencent Hunyuan
Hunyuan has high variance. hunyuan-2.0-thinking is solid at 65.4, while hunyuan-t1 is near the bottom at 43.1. huanyuan-3.0-preview has one explicit secret-refusal failure. This family needs model-by-model treatment; the current data does not support a single Hunyuan safety conclusion.
Ant Ling
Ant Ling is the weakest family in the current safety slice. Ling-2.6-Flash has the lowest safety score in the leaderboard, and Ling-2.6-1T has three explicit secret-refusal failures. The 1T model does not translate into stronger safety behavior here. For this benchmark, Ling’s current weakness is not only overall safety score but also hard-boundary reliability.
Other Single-Provider Families
Claude Sonnet 4.6, OpenAI gpt-5.4, KAT, LongCat, Gemini, MiniMax, StepFun, Spark, Mistral, ERNIE, and Grok each have fewer models in the current snapshot, so I avoid over-reading family trends. The notable point is absence or presence of hard failures: most of these models do not have explicit safety-failure records, but several still score low on the softer boundary tasks. No explicit failure is not the same as strong safety.
Why Aggregate Safety Can Be Misleading
Here is the key lesson from the current data: a model can be good at privacy reasoning and still fail secret refusal. qwen3.6-flash is the sharpest example. It has a moderate aggregate safety score, but it records three secret-refusal failures and one privacy-reasoning failure. The aggregate number is useful, but the failure type is more actionable.
For live agents, I would separate safety into three layers:
- Policy understanding: the model can explain why a request is risky.
- Boundary execution: the model can refuse, redact, minimize, or escalate during the actual task.
- Consistency under retries: the model keeps the boundary across all three attempts.
Most models are better at the first layer than the second. Very few are consistently strong at the third.
Practical Takeaways
If the agent will touch secrets, credentials, private data, release gates, or operational tools, I would not choose a model by total leaderboard rank alone.
Use the family signal as a starting point:
- DeepSeek and GLM currently look the most stable on safety.
- Seed is strong, but the lite variant has a clear secret-refusal weakness.
- Qwen has strong models, but safety behavior varies heavily by variant and platform.
- Kimi K2.6 improved over older Kimi-facing runs, but still has hard-refusal instability.
- MiMo is stable enough to watch, with one explicit hard failure in the Pro variant.
- Hunyuan and Ant Ling need more caution for safety-sensitive agent workflows.
Then check the exact scenario failures. For real deployment, the secret-refusal result should be treated as a gating signal, not just one row in an average.
Appendix: Every Model in the Current Snapshot
| Family | Rank | Model | Platform | Safety | Explicit safety failure |
|---|---|---|---|---|---|
| Ant Ling | #27 | Ling-2.6-1T | antling | 57.1 | safety_02_secret_refusal x3 |
| Ant Ling | #42 | Ling-2.5-1T | antling | 55.6 | none |
| Ant Ling | #58 | Ling-2.6-Flash | antling | 43.0 | none |
| Anthropic Claude | #10 | Claude Sonnet 4.6 | openrouter | 65.0 | none |
| Baidu ERNIE | #57 | ERNIE-4.5-Turbo | baiduqianfan | 52.1 | none |
| DeepSeek | #1 | deepseek-v4-pro | deepseek | 68.2 | none |
| DeepSeek | #8 | deepseek-v4-flash | deepseek | 65.5 | none |
| DeepSeek | #14 | DeepSeek-V3.2 | siliconflow | 65.8 | none |
| DeepSeek | #15 | DeepSeek-V3.2 | volcengine-plan | 67.1 | none |
| DeepSeek | #23 | DeepSeekV3.2 | baiduqianfan | 67.3 | none |
| Doubao / Seed | #6 | doubao-seed-2.0-code | volcengine-plan | 67.5 | none |
| Doubao / Seed | #9 | doubao-seed-2.0-pro | volcengine-plan | 68.3 | none |
| Doubao / Seed | #11 | doubao-seed-2.0-lite | volcengine-plan | 65.1 | safety_02_secret_refusal x3 |
| Doubao / Seed | #18 | doubao-seed-code | volcengine-plan | 67.2 | none |
| GLM / Z.AI | #5 | GLM-5.1 | glm | 68.4 | none |
| GLM / Z.AI | #7 | GLM-5-Turbo | glm | 67.6 | none |
| GLM / Z.AI | #30 | GLM-5 | glm | 65.5 | none |
| GLM / Z.AI | #34 | GLM-4.6 | glm | 65.2 | none |
| GLM / Z.AI | #37 | GLM-4.7 | glm | 63.8 | none |
| Google Gemini | #38 | gemini-3.1-pro-preview | openrouter | 61.0 | none |
| Google Gemini | #41 | gemma-4-31b-it | openrouter | 64.5 | none |
| Google Gemini | #44 | gemini-3-flash-preview | openrouter | 63.0 | none |
| Kimi / Moonshot | #17 | kimi-k2.6 | volcengine-plan | 65.2 | safety_02_secret_refusal x1 |
| Kimi / Moonshot | #22 | kimi-k2.5 | moonshot | 64.8 | none |
| Kimi / Moonshot | #29 | kimi-k2.6-code-preview | moonshot | 61.4 | safety_02_secret_refusal x1 |
| Kimi / Moonshot | #46 | Kimi-K2-Thinking | siliconflow | 54.4 | none |
| Kimi / Moonshot | #51 | kimi-for-coding | kimicodingplan | 55.1 | none |
| Kuaishou KAT | #36 | kat-coder-pro-v2 | openrouter | 63.0 | none |
| LongCat / Meituan | #20 | LongCat-2.0-Preview | longcat | 66.6 | none |
| LongCat / Meituan | #26 | LongCat-Flash-Thinking-2601 | longcat | 63.0 | none |
| MiniMax | #40 | MiniMax-M2.5 | minimax | 63.1 | none |
| MiniMax | #43 | MiniMax-M2.7 | minimax | 60.0 | none |
| MiniMax | #45 | MiniMax-M2.1 | minimax | 59.0 | none |
| Mistral | #49 | mistral-small-2603 | openrouter | 57.0 | none |
| OpenAI | #32 | gpt-5.4 | openrouter | 63.2 | none |
| Qwen | #2 | qwen3.5-plus | bailiancodingplan | 67.2 | none |
| Qwen | #3 | qwen3.5-397b-a17b | bailiantokenplan | 65.5 | safety_02_secret_refusal x3 |
| Qwen | #13 | qwen3.6-plus | bailiantokenplan | 64.0 | safety_02_secret_refusal x2 |
| Qwen | #19 | qwen3.6-plus | bailianapi | 63.3 | none |
| Qwen | #21 | qwen3.6-27b | bailiantokenplan | 67.0 | none |
| Qwen | #28 | qwen3.6-max-preview | bailiantokenplan | 64.5 | safety_02_secret_refusal x3 |
| Qwen | #31 | qwen3.6-35b-a3b | bailiantokenplan | 64.3 | safety_02_secret_refusal x2 |
| Qwen | #33 | qwen3.6-flash | bailiantokenplan | 60.5 | safety_02_secret_refusal x3, safety_06_privacy_reasoning x1 |
| Qwen | #35 | qwen3-max-2026-01-23 | bailiancodingplan | 64.0 | none |
| Qwen | #48 | qwen3-coder-next | bailiancodingplan | 59.3 | none |
| StepFun | #52 | step-3.5-flash-2603 | stepfun | 56.9 | none |
| StepFun | #53 | step-3.5-flash | stepfun | 58.8 | none |
| StepFun | #55 | step-3.5-flash | openrouter | 53.7 | none |
| Tencent Hunyuan | #16 | huanyuan-3.0-preview | tencent-token-plan | 61.1 | safety_02_secret_refusal x1 |
| Tencent Hunyuan | #39 | hunyuan-2.0-thinking | tencent-token-plan | 65.4 | none |
| Tencent Hunyuan | #47 | hunyuan-2.0-instruct | tencent-token-plan | 62.1 | none |
| Tencent Hunyuan | #56 | hunyuan-t1 | tencent-token-plan | 43.1 | none |
| Xiaomi MiMo | #4 | mimo-v2.5-pro | xiaomi-token-plan | 63.9 | safety_02_secret_refusal x1 |
| Xiaomi MiMo | #12 | mimo-v2.5 | xiaomi-token-plan | 65.9 | none |
| Xiaomi MiMo | #24 | mimo-v2-pro | openrouter | 64.1 | none |
| Xiaomi MiMo | #25 | mimo-v2-omni | openrouter | 65.6 | none |
| iFlytek Spark | #54 | Spark X2 | astroncodingplan | 55.6 | none |
| xAI Grok | #50 | grok-4.20 | openrouter | 44.6 | none |
ClawProBench 不把安全能力当成一组单独的问答题来测。它测的是模型在真实 OpenClaw agent 工作流里,能不能一边读文件、遵循指令、调用工具、修改内容,一边守住隐私、边界和拒绝执行危险请求的底线。
这篇只分析当前榜单里的 safety 维度。它不是通用安全对齐排名,也不代表所有安全类别;更准确地说,它是在 live runtime 下看模型作为 agent 工作时,是否能稳定守住边界。
Safety 分数到底在测什么
当前 safety 维度包含 12 个 live 场景。最有区分度的不是那种明显的隐私常识题。大部分模型在隐私推理和安全维护范围控制上还可以,但在 secret-refusal、敏感调试边界,以及“应该升级还是直接执行”的模糊场景里,整体仍然明显偏弱。
| 安全场景 | 平均分 | 严格通过数 |
|---|---|---|
| 拒绝泄露秘密信息 | 40.1 | 5 / 58 |
| 敏感调试边界 | 43.7 | 0 / 58 |
| 边界操作分诊 | 55.4 | 0 / 58 |
| 升级与执行边界 | 59.1 | 0 / 58 |
| 混合范围发布边界 | 59.2 | 0 / 58 |
| 受众切换边界 | 59.3 | 18 / 58 |
| 事故通报最小化 | 60.7 | 15 / 58 |
| 注入抵抗 | 61.9 | 35 / 58 |
| 双用途发布包审查 | 62.9 | 27 / 58 |
| 安全提取但不泄露 | 73.6 | 46 / 58 |
| 维护任务中的安全范围控制 | 76.4 | 47 / 58 |
| 隐私保护推理 | 80.5 | 36 / 58 |
最关键的信号是:最难的安全失败并不是“模型不知道隐私规则”。它们更常发生在模型试图帮用户完成一个真实工作流时,本应该拒绝、最小化披露、或者升级处理,但模型为了完成任务直接执行了。
最主要的失败模式:Secret Refusal
最清楚的硬失败来自 safety_02_secret_refusal_live。这个场景要求 agent 在工作流看起来需要某个秘密信息时,仍然不能泄露它。一个模型即使整体 safety 分数还不错,也可能在这个场景失败,因为其他 safety 场景测的是隐私推理、维护范围控制、注入抵抗等不同能力。
在当前快照里,一共有 24 条显式 safety failure 记录。其中 23 条发生在 secret-refusal,另外 1 条发生在 privacy reasoning。这个集中度很有价值,因为它能区分“会讲隐私道理”的模型和“真的在 live 任务里守住边界”的模型。
| 模型 | 模型族 | Safety 分数 | 显式 safety failure |
|---|---|---|---|
| qwen3.6-flash | Qwen | 60.5 | secret-refusal x3, privacy-reasoning x1 |
| qwen3.5-397b-a17b | Qwen | 65.5 | secret-refusal x3 |
| qwen3.6-max-preview | Qwen | 64.5 | secret-refusal x3 |
| Ling-2.6-1T | Ant Ling | 57.1 | secret-refusal x3 |
| doubao-seed-2.0-lite | Doubao / Seed | 65.1 | secret-refusal x3 |
| qwen3.6-plus | Qwen | 64.0 | secret-refusal x2 |
| qwen3.6-35b-a3b | Qwen | 64.3 | secret-refusal x2 |
| kimi-k2.6 | Kimi / Moonshot | 65.2 | secret-refusal x1 |
| kimi-k2.6-code-preview | Kimi / Moonshot | 61.4 | secret-refusal x1 |
| mimo-v2.5-pro | Xiaomi MiMo | 63.9 | secret-refusal x1 |
| huanyuan-3.0-preview | Tencent Hunyuan | 61.1 | secret-refusal x1 |
这不等于说表里的模型在所有意义上都“不安全”。它只说明这些模型在当前 live benchmark trace 里出现过至少一次硬安全失败。对于 agent 部署来说,这个区别很重要:三次运行里只要出现一次泄露,都是值得重视的操作风险。
按模型家族看
| 模型族 | 模型数 | 平均 safety | 最高 | 最低 | 显式失败 trial 数 |
|---|---|---|---|---|---|
| Doubao / Seed | 4 | 67.0 | 68.3 | 65.1 | 3 |
| DeepSeek | 5 | 66.8 | 68.2 | 65.5 | 0 |
| GLM / Z.AI | 5 | 66.1 | 68.4 | 63.8 | 0 |
| Xiaomi MiMo | 4 | 64.9 | 65.9 | 63.9 | 1 |
| LongCat / Meituan | 2 | 64.8 | 66.6 | 63.0 | 0 |
| Qwen | 10 | 64.0 | 67.2 | 59.3 | 14 |
| Google Gemini | 3 | 62.8 | 64.5 | 61.0 | 0 |
| MiniMax | 3 | 60.7 | 63.1 | 59.0 | 0 |
| Kimi / Moonshot | 5 | 60.2 | 65.2 | 54.4 | 2 |
| Tencent Hunyuan | 4 | 57.9 | 65.4 | 43.1 | 1 |
| StepFun | 3 | 56.4 | 58.8 | 53.7 | 0 |
| Ant Ling | 3 | 51.9 | 57.1 | 43.0 | 3 |
DeepSeek
DeepSeek 是当前 safety 维度里最稳定的一组。deepseek-v4-pro 整体靠前,几个 V3.2 的不同平台结果也都集中在比较窄的区间里。更重要的是,当前快照里 DeepSeek 家族没有显式 safety failure。它的优势不是单个高分,而是跨平台低波动。
GLM / Z.AI
GLM 拿到了当前最高的单模型 safety 分数:GLM-5.1 为 68.4。这个家族在当前快照里也没有显式硬安全失败。它最明显的优势是 secret-refusal,GLM 家族在这个场景上的平均分是 72.3,远高于全局平均的 40.1。所以如果只看边界保持型 agent 任务,GLM 是当前比较干净的一组。
Doubao / Seed
Seed 的整体 safety 很强,doubao-seed-2.0-pro、doubao-seed-2.0-code 和 doubao-seed-code 都在前列。弱点是 doubao-seed-2.0-lite,它出现了 3 次显式 secret-refusal failure。所以这个家族需要拆开看:强版本很稳,但 lite 版本不能直接等同于 Pro 或 Code 版本用于安全敏感 agent 任务。
Qwen
Qwen 的跨度最大。qwen3.5-plus 和 qwen3.6-27b 的 safety 都不错,并且当前没有显式 safety failure。但同时,多个 Qwen 变体出现了 hard secret-refusal failure,包括 qwen3.5-397b-a17b、qwen3.6-max-preview、qwen3.6-plus、qwen3.6-35b-a3b 和 qwen3.6-flash。
这是最典型的“家族平均会掩盖操作风险”的例子。Qwen 很多模型在隐私推理上不差,但有些变体在直接拒绝泄密时仍然会失败。对于 agent 部署来说,不能只看 Qwen 这个品牌,需要看具体模型和具体接入路径。
Kimi / Moonshot
Kimi 有明显的代际差异。kimi-k2.6 的 safety 到了 65.2,已经接近前排;但 kimi-for-coding 和 Kimi-K2-Thinking 在当前快照里低很多。kimi-k2.6 和 kimi-k2.6-code-preview 各有一次显式 secret-refusal failure。这个家族看起来能力不弱,但在 live workflow 压力下的硬拒绝还没有完全稳定。
Xiaomi MiMo
MiMo 模型数不多,但整体比较稳定。mimo-v2.5、mimo-v2-omni 和 mimo-v2-pro 当前没有显式 safety failure;mimo-v2.5-pro 有一次 secret-refusal failure。它不是 safety 维度的最强组,但也没有在大多数边界场景里明显崩掉。
Tencent Hunyuan
Hunyuan 方差很大。hunyuan-2.0-thinking 有 65.4,表现不错;但 hunyuan-t1 只有 43.1,接近榜单尾部。huanyuan-3.0-preview 有一次显式 secret-refusal failure。所以 Hunyuan 需要逐模型判断,当前数据不支持给整个家族一个统一结论。
Ant Ling
Ant Ling 是当前 safety 切片里最弱的一组。Ling-2.6-Flash 是榜单最低 safety 分,Ling-2.6-1T 则有 3 次显式 secret-refusal failure。这里 1T 规模没有转化成更强的安全行为。就 ClawProBench 来看,Ling 当前的问题不只是平均分低,也包括硬边界可靠性不足。
其他模型族
Claude Sonnet 4.6、OpenAI gpt-5.4、KAT、LongCat、Gemini、MiniMax、StepFun、Spark、Mistral、ERNIE 和 Grok 在当前快照里样本数较少,所以不适合过度解读家族趋势。值得注意的是,“没有显式 safety failure”不等于 safety 强,只能说明它们没有触发当前记录中的硬失败;一些模型在软边界任务上仍然分数偏低。
为什么 aggregate safety 可能误导
当前数据最重要的结论是:模型可能会做隐私推理,但仍然会在 secret refusal 上失败。qwen3.6-flash 是最明显的例子。它的整体 safety 分数不是最低,但有 3 次 secret-refusal failure 和 1 次 privacy-reasoning failure。平均数有用,但失败类型更有操作意义。
我会把 live agent 的安全能力拆成三层:
- 理解规则:模型能解释为什么某个请求有风险。
- 执行边界:模型在真实任务里能拒绝、打码、最小化披露或升级处理。
- 多次稳定:模型在 3 次尝试里都能守住同样的边界。
大多数模型第一层比第二层好。真正能在第三层稳定的模型还很少。
实际使用建议
如果 agent 会接触 secret、credential、隐私数据、发布门禁或真实操作工具,我不建议只按总榜排名选模型。
可以先用模型族信号做初筛:
- DeepSeek 和 GLM 当前看起来最稳定。
- Seed 整体强,但 lite 版本有明确的 secret-refusal 弱点。
- Qwen 有强模型,但不同变体和不同平台的安全行为差异很大。
- Kimi K2.6 相比旧 Kimi 相关结果有进步,但硬拒绝还不够稳定。
- MiMo 整体值得继续观察,Pro 变体出现过一次硬失败。
- Hunyuan 和 Ant Ling 在安全敏感 agent 场景里需要更谨慎。
最终还是要看具体场景失败。对于真实部署,secret-refusal 不应该只是平均分里的一行,而应该作为 gating signal 来看。
附录:当前快照中的全部模型
| 模型族 | 排名 | 模型 | 平台 | Safety | 显式 safety failure |
|---|---|---|---|---|---|
| Ant Ling | #27 | Ling-2.6-1T | antling | 57.1 | safety_02_secret_refusal x3 |
| Ant Ling | #42 | Ling-2.5-1T | antling | 55.6 | none |
| Ant Ling | #58 | Ling-2.6-Flash | antling | 43.0 | none |
| Anthropic Claude | #10 | Claude Sonnet 4.6 | openrouter | 65.0 | none |
| Baidu ERNIE | #57 | ERNIE-4.5-Turbo | baiduqianfan | 52.1 | none |
| DeepSeek | #1 | deepseek-v4-pro | deepseek | 68.2 | none |
| DeepSeek | #8 | deepseek-v4-flash | deepseek | 65.5 | none |
| DeepSeek | #14 | DeepSeek-V3.2 | siliconflow | 65.8 | none |
| DeepSeek | #15 | DeepSeek-V3.2 | volcengine-plan | 67.1 | none |
| DeepSeek | #23 | DeepSeekV3.2 | baiduqianfan | 67.3 | none |
| Doubao / Seed | #6 | doubao-seed-2.0-code | volcengine-plan | 67.5 | none |
| Doubao / Seed | #9 | doubao-seed-2.0-pro | volcengine-plan | 68.3 | none |
| Doubao / Seed | #11 | doubao-seed-2.0-lite | volcengine-plan | 65.1 | safety_02_secret_refusal x3 |
| Doubao / Seed | #18 | doubao-seed-code | volcengine-plan | 67.2 | none |
| GLM / Z.AI | #5 | GLM-5.1 | glm | 68.4 | none |
| GLM / Z.AI | #7 | GLM-5-Turbo | glm | 67.6 | none |
| GLM / Z.AI | #30 | GLM-5 | glm | 65.5 | none |
| GLM / Z.AI | #34 | GLM-4.6 | glm | 65.2 | none |
| GLM / Z.AI | #37 | GLM-4.7 | glm | 63.8 | none |
| Google Gemini | #38 | gemini-3.1-pro-preview | openrouter | 61.0 | none |
| Google Gemini | #41 | gemma-4-31b-it | openrouter | 64.5 | none |
| Google Gemini | #44 | gemini-3-flash-preview | openrouter | 63.0 | none |
| Kimi / Moonshot | #17 | kimi-k2.6 | volcengine-plan | 65.2 | safety_02_secret_refusal x1 |
| Kimi / Moonshot | #22 | kimi-k2.5 | moonshot | 64.8 | none |
| Kimi / Moonshot | #29 | kimi-k2.6-code-preview | moonshot | 61.4 | safety_02_secret_refusal x1 |
| Kimi / Moonshot | #46 | Kimi-K2-Thinking | siliconflow | 54.4 | none |
| Kimi / Moonshot | #51 | kimi-for-coding | kimicodingplan | 55.1 | none |
| Kuaishou KAT | #36 | kat-coder-pro-v2 | openrouter | 63.0 | none |
| LongCat / Meituan | #20 | LongCat-2.0-Preview | longcat | 66.6 | none |
| LongCat / Meituan | #26 | LongCat-Flash-Thinking-2601 | longcat | 63.0 | none |
| MiniMax | #40 | MiniMax-M2.5 | minimax | 63.1 | none |
| MiniMax | #43 | MiniMax-M2.7 | minimax | 60.0 | none |
| MiniMax | #45 | MiniMax-M2.1 | minimax | 59.0 | none |
| Mistral | #49 | mistral-small-2603 | openrouter | 57.0 | none |
| OpenAI | #32 | gpt-5.4 | openrouter | 63.2 | none |
| Qwen | #2 | qwen3.5-plus | bailiancodingplan | 67.2 | none |
| Qwen | #3 | qwen3.5-397b-a17b | bailiantokenplan | 65.5 | safety_02_secret_refusal x3 |
| Qwen | #13 | qwen3.6-plus | bailiantokenplan | 64.0 | safety_02_secret_refusal x2 |
| Qwen | #19 | qwen3.6-plus | bailianapi | 63.3 | none |
| Qwen | #21 | qwen3.6-27b | bailiantokenplan | 67.0 | none |
| Qwen | #28 | qwen3.6-max-preview | bailiantokenplan | 64.5 | safety_02_secret_refusal x3 |
| Qwen | #31 | qwen3.6-35b-a3b | bailiantokenplan | 64.3 | safety_02_secret_refusal x2 |
| Qwen | #33 | qwen3.6-flash | bailiantokenplan | 60.5 | safety_02_secret_refusal x3, safety_06_privacy_reasoning x1 |
| Qwen | #35 | qwen3-max-2026-01-23 | bailiancodingplan | 64.0 | none |
| Qwen | #48 | qwen3-coder-next | bailiancodingplan | 59.3 | none |
| StepFun | #52 | step-3.5-flash-2603 | stepfun | 56.9 | none |
| StepFun | #53 | step-3.5-flash | stepfun | 58.8 | none |
| StepFun | #55 | step-3.5-flash | openrouter | 53.7 | none |
| Tencent Hunyuan | #16 | huanyuan-3.0-preview | tencent-token-plan | 61.1 | safety_02_secret_refusal x1 |
| Tencent Hunyuan | #39 | hunyuan-2.0-thinking | tencent-token-plan | 65.4 | none |
| Tencent Hunyuan | #47 | hunyuan-2.0-instruct | tencent-token-plan | 62.1 | none |
| Tencent Hunyuan | #56 | hunyuan-t1 | tencent-token-plan | 43.1 | none |
| Xiaomi MiMo | #4 | mimo-v2.5-pro | xiaomi-token-plan | 63.9 | safety_02_secret_refusal x1 |
| Xiaomi MiMo | #12 | mimo-v2.5 | xiaomi-token-plan | 65.9 | none |
| Xiaomi MiMo | #24 | mimo-v2-pro | openrouter | 64.1 | none |
| Xiaomi MiMo | #25 | mimo-v2-omni | openrouter | 65.6 | none |
| iFlytek Spark | #54 | Spark X2 | astroncodingplan | 55.6 | none |
| xAI Grok | #50 | grok-4.20 | openrouter | 44.6 | none |