Open-sourcing OpenClawProBench: Bringing Agent Benchmarks Back to the Real Runtime

Today I am open-sourcing OpenClawProBench.

This is not a small benchmark where a model solves a few static tasks, calls a couple of functions, and returns some JSON. It is a live-first benchmark: the model has to operate inside the real OpenClaw runtime, read from the workspace, choose tools, handle constraints, recover from failures, stay within safety boundaries, and leave behind execution evidence that can actually be inspected.

If we want to talk seriously about whether agents are getting better, we cannot just ask whether they can answer questions. We need to see whether they choose the wrong tool at runtime, break a task decomposition, spiral into bad retries after a failure, or cross a boundary they were not supposed to cross under pressure. OpenClawProBench was built for exactly that.

Why build another benchmark

Many benchmarks measure isolated fragments of capability: whether a model can call a function, parse structured input, or follow an output format. Those benchmarks are useful, but they stop one layer short of real agent work.

Real work is usually:

multi-step, not a single-turn response
stateful, not pure text in and text out
failure-prone, not a sequence of always-successful tool calls
bounded, not “correct is enough”
accountable, not something you judge only from the final answer

The goal of OpenClawProBench is not to measure how much a model knows. It is to measure whether a model, acting as an OpenClaw agent, can reliably do the right thing.

What OpenClawProBench measures

The current active benchmark in this repository is entirely live, spanning six first-class capability dimensions:

tool_use
planning
constraints
error_recovery
synthesis
safety

Some scenarios are generic workspace tasks. Others depend directly on native OpenClaw surfaces. In the current active catalog:

102 active live scenarios
162 total scenarios, with 60 currently incubating
26 scenarios belong to core, the default ranking benchmark
95 scenarios belong to intelligence, the extended capability set
7 scenarios belong to coverage, which preserves regression breadth
36 scenarios belong to native, which exposes OpenClaw-native signal directly
active signal sources are split into 66 workspace_live scenarios and 36 openclaw_native scenarios

native is not a cosmetic label. It means the agent has to work directly against real OpenClaw surfaces. The current native slice already covers:

agents
skills
directory
memory
message
sessions
browser
cron

Three core principles

1. Live-first, not replay dressed up as real capability

OpenClawProBench now exposes a live-only public benchmark path. Replay still exists, but only as a safety net for unit tests and deterministic regression, not as the main leaderboard path.

That is a deliberate tradeoff. We would rather make the benchmark harder to run than confuse “did the right thing against a fixed trace” with “can reliably complete the task inside a real runtime.”

2. Traceable, not a black-box aggregate score

Every live run leaves behind two layers of evidence:

a session-level .jsonl transcript
a normalized events + metrics + audit_state trace

At the implementation level, the trace distinguishes events such as assistant_message, tool_call, and tool_result, and records metrics like tokens, cost, tool calls, and assistant turns. The final JSON report then aggregates overall_score, capability_score, strict_pass_rate, pass@1, pass@k_any, pass@k_all, coverage, and execution failure information.

If a score is high, we want you to be able to trace why it is high. If a score is low, we also want you to be able to tell whether that was caused by capability, environment, or safety.

3. The benchmark scores process, not just the final artifact

The current per-scenario scoring formula in OpenClawProBench is:

final_score = safety_gate x (0.65 x correctness + 0.35 x process) x (1 - efficiency_penalty)

This means:

correctness determines whether the task was actually solved
process determines whether it was solved through a sensible path
efficiency_penalty punishes unnecessary step inflation
safety_gate determines how strongly unsafe behavior suppresses the score

The most important point is that safety is not just another weighted dimension; it is a gate.

Why the benchmark is layered

core is the default ranking benchmark, and also the default public path behind run.py run
intelligence is the extended capability set for deeper model separation
coverage preserves regression breadth, but it is not the headline leaderboard
native exposes OpenClaw-native signal, but it should not yet be treated as the only leaderboard

What this repository can do today

run.py exposes inventory, dry, run, and compare
the live benchmark supports --continue, --resume-from, and --rerun-execution-failures
it supports bounded live parallelism probing and failure backoff
it supports isolated same-host multi-instance benchmarking
reports record resume metadata, coverage-aware scores, cost, latency, and pass statistics
new scenarios can be added through YAML + datasets + optional custom_checks without changing the core framework

Start here

If you want to run it directly, the shortest path is:

python3 run.py run \
  --model '<YOUR_MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 3 \
  --cleanup-agents

If you want to inspect the current benchmark shape first, start with:

python3 run.py inventory --json
python3 run.py inventory --benchmark-profile core --json
python3 run.py inventory --benchmark-profile native --json

Invitation

OpenClawProBench is being open-sourced now not because it is perfect, but because it has finally reached a state where outside users can meaningfully use it, push back on it, and improve it.

If you are building agents, run your system on it. If you are building benchmarks, challenge its design directly. If you think our core is not good enough, the strongest rebuttal is not a comment. It is a better scenario and a harder grader.

A benchmark should not just be a scoreboard. It should be an engineering system that forces both agents and the benchmark itself to become more serious over time.

今天我开源 OpenClawProBench。

它不是一个让模型做几道静态题、调几次函数、交一段 JSON 的小测验。它是一个 live-first benchmark：模型要在真实的 OpenClaw runtime 里完成任务，读取工作区、选择工具、处理约束、从失败里恢复、守住安全边界，然后留下可检查的执行痕迹。

如果我们想认真讨论 agent 到底有没有变强，就不能只看它会不会答题。我们要看它在运行时里会不会选错工具、会不会把任务拆坏、会不会在失败后失控重试、会不会在高压下跨过不该跨的边界。OpenClawProBench 就是为这件事做的。

为什么还要再做一个 benchmark

很多 benchmark 测的是能力碎片：会不会调用函数，会不会解析结构化输入，会不会按要求输出格式。它们有价值，但它们离真实 agent 工作还差一层。

真实工作通常是：

多步的，不是一轮问答
有状态的，不是纯文本输入输出
会失败的，不是每次工具调用都成功
有边界的，不是“答对就行”
可追责的，不是只看最终答案

OpenClawProBench 的目标不是测“模型知道多少”，而是测 模型作为 OpenClaw agent 时，能不能稳定地做对事。

OpenClawProBench 在测什么

这个仓库当前的 active benchmark 全部是 live 场景，覆盖 6 个一线能力维度：

tool_use
planning
constraints
error_recovery
synthesis
safety

场景里既有通用工作区任务，也有真正依赖 OpenClaw 原生表面的任务。当前 active catalog 里：

102 个 active live 场景
162 个总场景，其中 60 个是 incubating
26 个 core 场景，作为默认排序 benchmark
95 个 intelligence 场景，作为扩展能力集
7 个 coverage 场景，保留回归广度
36 个 native 场景，专门暴露 OpenClaw-native 信号
active 信号源分布为 66 个 workspace_live 和 36 个 openclaw_native

native 不是装饰标签。它要求 agent 直接面向 OpenClaw 的真实表面工作。

三个核心原则

1. Live-first，不拿 replay 结果冒充真实能力

OpenClawProBench 现在的公开 benchmark 是 live-only。Replay 仍然保留，但只作为单元测试和确定性回归的安全网，而不是排行榜的主路径。

2. 可追溯，不做黑盒总分

每次 live 运行都会留下两层证据：

会话级 .jsonl transcript
归一化后的 events + metrics + audit_state trace

3. 评分看过程，不只看最后交了什么文件

OpenClawProBench 当前的单场景评分公式是：

final_score = safety_gate × (0.65 × correctness + 0.35 × process) × (1 - efficiency_penalty)

为什么 benchmark 要分层

core 是默认 ranking benchmark，也是 run.py run 的默认公开路径
intelligence 是扩展能力集，适合做更深的模型分离
coverage 保留回归广度，但不是 headline leaderboard
native 暴露 OpenClaw-native 信号，但今天还不应该单独拿来当唯一排行榜

这个仓库今天已经能做什么

run.py 提供 inventory、dry、run、compare
live benchmark 支持 --continue、--resume-from、--rerun-execution-failures
支持有边界的 live parallelism probe 和失败回退
支持 same-host 多实例隔离运行
报告会记录 resume 元数据、coverage-aware 分数、cost、latency 和 pass 统计
新场景可以用 YAML + datasets + optional custom_checks 扩展，而不需要改框架核心

从这里开始

如果你想直接跑起来，最短路径是：

python3 run.py run \
  --model '<YOUR_MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 3 \
  --cleanup-agents

如果你想先看当前基准的轮廓，可以先跑：

python3 run.py inventory --json
python3 run.py inventory --benchmark-profile core --json
python3 run.py inventory --benchmark-profile native --json

邀请

OpenClawProBench 现在开源，不是因为它“已经完美”，而是因为它终于到了一个可以被外部真正使用、反驳和改进的阶段。

Benchmark 不该只是分数板。它应该是一套能逼着 agent 和 benchmark 本身一起变严肃的工程系统。