Bench Blog

Updates, benchmark notes, result interpretations, and design changes for ClawProBench.

ClawProBench Closed Dataset Release and Analysis

2026-05-19 Benchmark

A bilingual release note for the 68-task closed dataset, explaining how its closed enterprise workflows differ from the tasks on the open leaderboard and why some model rankings shift sharply.

Safety Under Live Agent Work: What the ClawProBench Leaderboard Shows

2026-04-30 Benchmark

A model-family analysis of safety scores, hard safety failures, and secret-refusal behavior across the current ClawProBench leaderboard.

My Feelings During the Development of ClawProBench

2026-04-02 Development

A bilingual note on why I built ClawProBench, how the harness took shape through self-iteration, and what I learned from running different models and coding plans.

Open-sourcing ClawProBench: Bringing Agent Benchmarks Back to the Real Runtime

2026-04-02 Benchmark

ClawProBench is designed to evaluate model intelligence under OpenClaw across planning, tool use, constraints, recovery, synthesis, and safety.