Benchmarking OpenClaw-native intelligence

A benchmark for how agents actually behave inside the OpenClaw runtime.

OpenClawBench is designed to measure more than final answers. It evaluates whether a model can choose tools well, plan multi-step work, recover from failure, respect constraints, stay safe, and do all of this with visible efficiency, reliability, and coverage-aware reporting.

Why this benchmark exists

The goal is not to reward models for sounding plausible. The goal is to evaluate operational intelligence in an agent system: can the model act coherently, make progress under tool-mediated execution, and stay robust when the task becomes multi-step, stateful, or failure-prone?

6 core dimensions: tool use, planning, constraints, recovery, synthesis, safety
102 active scenarios in the current benchmark dataset
core is the default ranking profile, keeping the main leaderboard tied to high-signal tasks

Architecture that behaves like a benchmark system, not a static score sheet

OpenClawBench is structured as a full evaluation pipeline: benchmark profiles define what belongs in ranking, scenario metadata defines what each task demands, the runner controls trial execution and workspace setup, and the reporter turns every run into a reusable evidence artifact.

OpenClawBench architecture pipeline illustration
End-to-end evaluation flow

Select a benchmark slice

Profiles such as core, intelligence, coverage, full, and native make benchmark scope explicit.

Load scenario definitions

Scenario YAMLs carry prompts, tools, checks, difficulty, execution metadata, workspace seeds, and optional custom grading hooks.

Execute controlled trials

The runner creates per-trial workspaces, supports resume flows, and records live execution behavior through the OpenClaw bridge.

Score and report the run

Reports surface capability, overall score, strict pass, latency, cost, tokens, coverage, reliability, and integrity signals.

102

Active scenarios

The public website reflects the active benchmark dataset rather than every experimental task in the repository.

6

Core dimensions

Tool use, planning, constraints, recovery, synthesis, and safety remain visible as first-class benchmark axes.

Profiles

Curated slices

core, intelligence, coverage, and native expose different benchmark views without collapsing them into one opaque total.

Filters

Explicit scope control

Group, status, signal source, difficulty, execution mode, and other filters make the benchmark slice reproducible and legible.
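A minimal sketch of that kind of slice filtering. The metadata field names here (group, status, difficulty) are assumptions for illustration, not the actual OpenClawBench schema:

```python
def filter_scenarios(scenarios, **criteria):
    """Keep scenarios whose metadata matches every given criterion.

    Passing the same criteria always reproduces the same slice, which is
    what makes a filtered benchmark view legible and repeatable.
    """
    return [
        s for s in scenarios
        if all(s.get(key) == value for key, value in criteria.items())
    ]

# Hypothetical catalog entries (field names are illustrative).
catalog = [
    {"id": "t1", "group": "tool-use",  "status": "active",       "difficulty": "easy"},
    {"id": "t2", "group": "recovery",  "status": "active",       "difficulty": "hard"},
    {"id": "t3", "group": "recovery",  "status": "experimental", "difficulty": "hard"},
]

print([s["id"] for s in filter_scenarios(catalog, group="recovery", status="active")])
# ['t2']
```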

Scoring that rewards correctness, process quality, and robustness together

The benchmark is designed so that a strong model must do more than arrive at a plausible final answer. It must also behave sensibly during execution, avoid unnecessary actions, and preserve safety under realistic task pressure.

Scoring stack

Correctness, process quality, efficiency drag, and safety gating

OpenClawBench scores more than the final answer. Deterministic checks, process scoring, efficiency pressure, and safety gating are composed into the task score, and then aggregated into benchmark-level views.

Deterministic checks

Correctness

Tool calls, outputs, files, recovery signals, clarification behavior, and audit-backed evidence determine whether the task was truly solved.

Process scoring

Process quality

The benchmark looks at tool appropriateness, step ordering, and redundancy, so it can distinguish clean execution from brute-force progress.

Efficiency drag

Efficiency

Extra steps are penalized. Efficiency is benchmark pressure, not a decorative side metric that can be ignored once the answer is right.

Safety gate

Safety

Unsafe behavior can sharply suppress or zero out a result, preventing a run that is capable but unsafe from looking stronger than it is.

final_score = safety_gate × (0.65 × correctness + 0.35 × process) × (1 - efficiency_penalty)
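The published composition can be written directly as a function. This is a sketch, assuming all inputs are normalized to [0, 1] and that safety_gate is typically binary (0 or 1):

```python
def final_score(correctness, process, efficiency_penalty, safety_gate):
    """Compose the benchmark's task score:

    safety_gate * (0.65 * correctness + 0.35 * process) * (1 - efficiency_penalty)

    Correctness dominates, process quality still matters, efficiency drag
    multiplies the whole blend down, and the safety gate can zero it out.
    """
    blended = 0.65 * correctness + 0.35 * process
    return safety_gate * blended * (1 - efficiency_penalty)

# A correct but wasteful run loses score to the efficiency term...
print(round(final_score(1.0, 0.8, 0.2, 1.0), 3))  # 0.744
# ...and an unsafe run is zeroed regardless of correctness.
print(final_score(1.0, 1.0, 0.0, 0.0))  # 0.0
```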
Determinism

18 supported check types

The scoring engine has a broad rule-based vocabulary for agent behavior, covering tool usage, output validation, file-state checks, recovery signals, clarification behavior, and audit-backed evidence rather than relying on a vague impressionistic grade.

Reliability

Repeated trials matter

Reporting includes strict pass, reliability views, and coverage-aware summary metrics. This helps distinguish genuine capability from one-off success and makes the benchmark less sensitive to lucky runs.
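One way to sketch the strict-pass idea: a scenario strict-passes only if every repeated trial clears its threshold, which a mean score alone can hide. The threshold value and field names below are illustrative assumptions, not the actual reporting schema:

```python
def trial_summary(scores, threshold=0.7):
    """Summarize repeated trials of one scenario.

    mean_score can look healthy while a single failed trial breaks
    strict_pass, which is exactly the one-off-success signal the
    benchmark wants to surface.
    """
    passes = [s >= threshold for s in scores]
    return {
        "mean_score": sum(scores) / len(scores),
        "pass_rate": sum(passes) / len(passes),
        "strict_pass": all(passes),
    }

# Similar means, different reliability stories:
print(trial_summary([0.9, 0.9, 0.9])["strict_pass"])  # True
print(trial_summary([1.0, 1.0, 0.5])["strict_pass"])  # False
```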

Interpretability

The report explains failure modes

Instead of collapsing everything into one failure bucket, the reporter separates outcomes, execution failures, and integrity review signals. That makes low scores easier to interpret and benchmark maintenance easier to trust.

Dataset design that reflects benchmark intent, not just task accumulation

A benchmark becomes more useful when its task set is legible. OpenClawBench treats benchmark membership as first-class metadata, keeps difficulty explicit, and exposes multiple official slices so users can inspect both count and influence instead of staring at one opaque total.

OpenClawBench structured dataset design illustration
Difficulty weighting

Harder tasks matter more

Difficulty weights are explicit: easy, medium, hard, and expert scale as 1 / 2 / 4 / 8. This prevents the benchmark from being dominated by easy wins and lets more demanding scenarios carry proportionally more influence.
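The stated 1 / 2 / 4 / 8 weights imply an aggregation along these lines, sketched here so the effect is concrete: a pile of easy wins cannot outweigh failures on a few expert tasks.

```python
DIFFICULTY_WEIGHT = {"easy": 1, "medium": 2, "hard": 4, "expert": 8}

def weighted_score(results):
    """results: list of (difficulty, score) pairs.

    Each scenario contributes its score scaled by its difficulty weight,
    so harder tasks carry proportionally more influence.
    """
    total_weight = sum(DIFFICULTY_WEIGHT[d] for d, _ in results)
    return sum(DIFFICULTY_WEIGHT[d] * s for d, s in results) / total_weight

# Ten perfect easy runs plus two failed expert runs:
results = [("easy", 1.0)] * 10 + [("expert", 0.0)] * 2
print(round(weighted_score(results), 3))  # 0.385
```

Unweighted, the same results would average 0.833; the 1 / 2 / 4 / 8 scaling pulls the figure down to roughly 0.385 because the two expert failures carry 16 of the 26 weight units.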

Benchmark slices

Different questions deserve different views

core asks who is strongest on the main ranking path. intelligence asks about broader capability. coverage tracks regression breadth. native makes OpenClaw-native surfaces visible without forcing them to dominate the main leaderboard prematurely.

Scenario metadata

Tasks are benchmark objects, not loose prompts

Each scenario can declare tools, checks, tags, execution mode, difficulty, pass threshold, workspace material, and custom grading logic. That makes the dataset easier to extend without weakening benchmark semantics.
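As a sketch of what such metadata might look like, here is a hypothetical scenario expressed as a plain Python dict with a minimal structural check. All field names and values are illustrative assumptions, not the actual OpenClawBench YAML schema:

```python
# Hypothetical scenario definition (illustrative fields only).
scenario = {
    "id": "fs-recovery-01",
    "prompt": "Restore the corrupted config file and report what changed.",
    "tools": ["read_file", "write_file", "shell"],
    "difficulty": "hard",            # easy | medium | hard | expert
    "execution_mode": "live",
    "pass_threshold": 0.7,
    "workspace_seed": ["config.yaml.bak"],
    "checks": [
        {"type": "file_exists", "path": "config.yaml"},
        {"type": "tool_called", "tool": "read_file"},
    ],
}

def validate_scenario(s):
    """Minimal structural validation: required keys plus a known difficulty.

    Declarative validation like this is what lets a dataset grow without
    weakening the semantics of what a scenario must declare.
    """
    required = {"id", "prompt", "tools", "difficulty", "checks"}
    return required <= s.keys() and s["difficulty"] in DIFFICULTIES

DIFFICULTIES = {"easy", "medium", "hard", "expert"}

print(validate_scenario(scenario))  # True
```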

Live evidence

Per-trial workspaces keep runs observable

The runner creates controlled workspaces and the reporting layer records cost, latency, token usage, and execution metadata. This keeps the dataset tied to observable agent behavior rather than to abstract answer-only evaluation.

Coverage transparency

Partial runs remain interpretable

Reports expose coverage, covered weight, normalized capability, and normalized score on covered slices. That matters because serious benchmarking should show how much evidence exists, not just a headline number.
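A simplified sketch of the idea, under assumed field names: report how much difficulty-weighted evidence a partial run actually covers, alongside the score normalized over just that covered slice.

```python
def coverage_report(all_scenarios, run_results):
    """all_scenarios: {scenario_id: weight}; run_results: {scenario_id: score}.

    A partial run reports both how much weighted evidence it covers and
    how it scored on that covered portion, so a subset run cannot be
    mistaken for a complete benchmark judgment.
    """
    covered = {sid: w for sid, w in all_scenarios.items() if sid in run_results}
    covered_weight = sum(covered.values())
    total_weight = sum(all_scenarios.values())
    normalized = (
        sum(run_results[sid] * w for sid, w in covered.items()) / covered_weight
        if covered_weight else 0.0
    )
    return {
        "coverage": covered_weight / total_weight,
        "covered_weight": covered_weight,
        "normalized_score_on_covered": normalized,
    }

# Two of three scenarios run; the heavy expert scenario "c" is missing.
report = coverage_report({"a": 1, "b": 4, "c": 8}, {"a": 1.0, "b": 0.5})
print(report["covered_weight"], round(report["coverage"], 3))  # 5 0.385
```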

Extensibility

New tasks can grow without rewriting the harness

Most benchmark growth happens declaratively through scenario YAML and optional custom checks. This keeps the benchmark flexible while preserving a clear boundary between content, execution, scoring, and reporting.

Why OpenClawBench is a credible way to measure agent intelligence

The benchmark does not try to measure every form of intelligence. What it does is operationalize the forms of intelligence that matter in an agent runtime: acting through tools, preserving coherence across steps, adapting under failure, respecting boundaries, and producing inspectable evidence under controlled execution.

OpenClawBench trust and evidence illustration

It measures behavior, not just outputs

Tool choices, step order, recovery behavior, and safety outcomes are all part of the observable signal.

It separates ranking from coverage

The main leaderboard can stay high-signal while broader slices remain available for analysis and regression tracking.

It makes reliability visible

Repeated-trial views help distinguish stable capability from one-off success or environment luck.

It keeps partial evidence honest

Coverage-aware reporting prevents subset runs from being mistaken for complete benchmark judgments.

It keeps benchmark maintenance legible

Scenario metadata, custom checks, and structured reporting make the system easier to extend without hiding changes inside vague evaluation logic.

It fits the OpenClaw system itself

The benchmark is built around the actual OpenClaw runtime and its native surfaces, so the evaluation target is the system that users care about.

deterministic scoring · process-aware evaluation · safety gating · difficulty weighting · coverage-aware reporting · live OpenClaw execution · tokens · cost · latency · strict pass