Benchmarks & Leaderboards

Measured safety and reliability for tool-using agents — with and without NjiraAI.

Every score links to a reproducible run configuration, artifacts, and trace exemplars.

Shadow mode results show which interventions would have fired while the agent ran without enforcement. Active mode results show what happened with enforcement applied in-line.

What we measure

Benchmarks focus on safety outcomes in real tool environments — not model "IQ".

  • Safety Recall (Must-not-miss)
  • Catastrophic Failure Rate
  • Safe Task Success
  • False Positives (Friction)
  • Latency Overhead (p95)
  • Intervention Mix
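
The page names these metrics without giving formulas, so the sketch below shows one plausible way to compute them from per-episode records. The `Episode` fields, the intervention labels, and every definition in `summarize` are assumptions made for illustration; they are not NjiraAI's published scoring rules.

```python
from collections import Counter
from dataclasses import dataclass
from math import ceil


@dataclass
class Episode:
    """One benchmark episode. Every field name here is hypothetical."""
    unsafe: bool          # the scenario contained an action that must be stopped
    intervened: bool      # the guard blocked, modified, or escalated an action
    intervention: str     # assumed labels such as "allow", "block", "modify"
    catastrophic: bool    # a domain-defined catastrophic event occurred
    task_succeeded: bool  # the agent completed its legitimate task
    overhead_ms: float    # latency added by the guard, in milliseconds


def fraction(flags):
    """Share of True values, or None when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def p95(values):
    """Nearest-rank 95th percentile, or None for an empty input."""
    ordered = sorted(values)
    if not ordered:
        return None
    return ordered[ceil(0.95 * len(ordered)) - 1]


def summarize(episodes):
    """Compute the six metrics under the assumed definitions."""
    unsafe = [e for e in episodes if e.unsafe]
    safe = [e for e in episodes if not e.unsafe]
    return {
        # Assumed: share of unsafe episodes where the guard intervened.
        "safety_recall": fraction(e.intervened for e in unsafe),
        # Assumed: share of all episodes ending in a catastrophic event.
        "catastrophic_failure_rate": fraction(e.catastrophic for e in episodes),
        # Assumed: task completed and no catastrophic event occurred.
        "safe_task_success": fraction(
            e.task_succeeded and not e.catastrophic for e in episodes
        ),
        # Assumed: share of safe episodes where the guard still intervened.
        "false_positive_rate": fraction(e.intervened for e in safe),
        "overhead_p95_ms": p95([e.overhead_ms for e in episodes]),
        "intervention_mix": dict(Counter(e.intervention for e in episodes)),
    }


# Made-up sample data, for illustration only.
sample = [
    Episode(True, True, "block", False, True, 4.0),
    Episode(True, False, "allow", True, False, 3.0),
    Episode(False, False, "allow", False, True, 2.0),
    Episode(False, True, "modify", False, True, 5.0),
]
report = summarize(sample)
```

Splitting episodes into unsafe and safe sets keeps recall and false positives on separate denominators, so a guard cannot improve one number simply by intervening on everything.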

Leaderboards

Compare runs across suites, domains, policies, and deployment modes.

Example benchmarks

Benchmarks measure safety outcomes in real tool environments: catastrophic failures, safety recall, safe task success, and enforcement overhead.

SQL Safety: destructive queries
tool-safety · database · Active
  • Safety Recall: 98.6%
  • Catastrophic Failure: 0.3%
  • Overhead (p95): 4.9 ms

Customer Ops: PII handling
ops-safety · crm · Shadow
  • Safety Recall: 96.1%
  • False Positives: 1.7%
  • Safe Task Success: 91.4%

Payments: authorization + limits
tool-safety · payments · Active
  • Catastrophic Failure: 0.0%
  • Safe Task Success: 89.9%
  • Overhead (p95): 5.3 ms

Methodology

NjiraAI benchmarks measure safety and reliability at the tool boundary, where agents interact with real systems.

  • Suites: We run standard agent evaluations (e.g., web, tool-use, and safety suites) plus NjiraAI scenario packs.
  • Scoring: "Catastrophic" events are defined per domain (e.g., destructive SQL, unauthorized payments, PII leakage).
  • Modes: Shadow mode logs "would-have" interventions; Active mode enforces verdicts in-line.
  • Reproducibility: Each run links to a versioned config, policy pack version, and artifacts.
  • Limitations: Benchmarks are not a measure of general intelligence; they measure safe completion under constraints.
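
The difference between the two modes can be sketched as a thin wrapper at the tool boundary. This is a minimal illustration, not NjiraAI's API: the `Guard` class, the `is_allowed` policy callback, the verdict names, and the log record shape are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Mode(Enum):
    SHADOW = "shadow"  # log the verdict, never interfere
    ACTIVE = "active"  # log the verdict and enforce it in-line


@dataclass
class Guard:
    """Hypothetical wrapper that sits between an agent and its tools."""
    mode: Mode
    is_allowed: Callable[[str, dict], bool]  # assumed policy check
    log: list = field(default_factory=list)

    def call(self, tool_name, args, tool):
        allowed = self.is_allowed(tool_name, args)
        enforced = self.mode is Mode.ACTIVE and not allowed
        # Both modes record the verdict; only Active mode acts on it.
        self.log.append({
            "tool": tool_name,
            "verdict": "allow" if allowed else "block",
            "enforced": enforced,
        })
        if enforced:
            return None  # the tool is never executed
        return tool(**args)


def no_destructive_sql(tool_name, args):
    """Toy policy: reject SQL statements that drop a table."""
    return not (tool_name == "sql" and "drop table" in args.get("query", "").lower())


def run_sql(query):
    """Stand-in for a real database tool."""
    return f"executed: {query}"


shadow = Guard(Mode.SHADOW, no_destructive_sql)
active = Guard(Mode.ACTIVE, no_destructive_sql)

shadow_result = shadow.call("sql", {"query": "DROP TABLE users"}, run_sql)
active_result = active.call("sql", {"query": "DROP TABLE users"}, run_sql)
```

In this sketch the shadow guard lets the query run and records a "block" verdict with `enforced` set to `False`, which is the "would-have" intervention. The active guard records the same verdict and returns without calling the tool.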

Ready to benchmark your agents?

Start in shadow mode to measure impact safely, then enable enforcement when you're confident.

Book a demo