ClawProBench Closed Dataset Release and Analysis
A bilingual release note for the 68-task closed dataset, explaining how closed enterprise workflows differ from the open leaderboard and why some model rankings move sharply.
Updates, benchmark notes, result interpretations, and design changes for ClawProBench.
A model-family analysis of safety scores, hard safety failures, and secret-refusal behavior across the current ClawProBench leaderboard.
A bilingual note on why I built ClawProBench, how the harness was shaped through self-iteration, and what I learned from running different models and coding plans.
ClawProBench is designed to evaluate model intelligence under OpenClaw across planning, tool use, constraints, recovery, synthesis, and safety.