Benchmarks & Leaderboards

Measured safety and reliability for tool-using agents — with and without NjiraAI.

Every score links to a reproducible run configuration, artifacts, and trace exemplars.

Shadow mode results show what enforcement would have done, without actually intervening. Active mode results show what did happen with enforcement in place.
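
As a minimal sketch of the difference, the snippet below applies the same safety verdict under both modes: in shadow mode an unsafe verdict is only logged as a would-have intervention and the tool call proceeds; in active mode it is enforced in-line. The names (`Verdict`, `run_tool_call`) and fields are illustrative assumptions, not NjiraAI's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical types for illustration only; not NjiraAI's actual API.
@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def run_tool_call(
    tool: Callable[..., Any],
    args: dict,
    verdict: Verdict,
    mode: str = "shadow",   # "shadow" or "active"
    log: list | None = None,
) -> Any:
    """Apply a safety verdict to a tool call under the chosen deployment mode."""
    log = log if log is not None else []

    if not verdict.allowed:
        if mode == "shadow":
            # Shadow mode: record the would-have intervention, then let the call run.
            log.append({"event": "would_block", "reason": verdict.reason, "args": args})
        else:
            # Active mode: enforce the verdict in-line and skip the call.
            log.append({"event": "blocked", "reason": verdict.reason, "args": args})
            return None

    return tool(**args)
```

Because the same verdicts are evaluated in both branches, shadow-mode measurements stay comparable to what active-mode enforcement would have produced.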

What we measure

Benchmarks focus on safety outcomes in real tool environments, not model "IQ". A sketch of how these metrics might be computed from run records follows the list below.

  • Safety Recall (Must-not-miss)
  • Catastrophic Failure Rate
  • Safe Task Success
  • False Positives (Friction)
  • Latency Overhead (p95)
  • Intervention Mix
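
For illustration only, here is a short Python sketch of how metrics like these could be aggregated from per-event run records. The record fields (`unsafe`, `caught`, `catastrophic`, `task_ok`, `flagged`, `overhead_ms`, `intervention`) and the metric definitions are assumptions made for the example, not NjiraAI's published formulas.

```python
from collections import Counter

def compute_metrics(events: list[dict]) -> dict:
    """Aggregate illustrative safety metrics from per-event run records.

    Assumed (hypothetical) fields per event: unsafe, caught, catastrophic,
    task_ok, flagged, overhead_ms, intervention.
    """
    unsafe = [e for e in events if e["unsafe"]]
    benign = [e for e in events if not e["unsafe"]]
    overheads = sorted(e["overhead_ms"] for e in events)
    p95_index = max(0, int(round(0.95 * len(overheads))) - 1)

    return {
        # Share of must-not-miss unsafe actions that were caught.
        "safety_recall": sum(e["caught"] for e in unsafe) / max(len(unsafe), 1),
        # Fraction of events in which a catastrophic outcome actually occurred.
        "catastrophic_failure_rate": sum(e["catastrophic"] for e in events) / max(len(events), 1),
        # Success rate on benign (safe) tasks.
        "safe_task_success": sum(e["task_ok"] for e in benign) / max(len(benign), 1),
        # Benign actions flagged or blocked, i.e. friction.
        "false_positive_rate": sum(e["flagged"] for e in benign) / max(len(benign), 1),
        # Added latency at the 95th percentile, in milliseconds.
        "latency_overhead_p95_ms": overheads[p95_index] if overheads else 0.0,
        # Breakdown of intervention types observed during the run.
        "intervention_mix": dict(Counter(e["intervention"] for e in events if e.get("intervention"))),
    }
```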

Leaderboards

Compare runs across suites, domains, policies, and deployment modes.

Methodology

NjiraAI benchmarks measure safety and reliability at the tool boundary, where agents interact with real systems.

  • Suites: We run standard agent evaluations (e.g., web, tool-use, and safety suites) plus NjiraAI scenario packs.
  • Scoring: "Catastrophic" events are defined per domain (e.g., destructive SQL, unauthorized payments, PII leakage).
  • Modes: Shadow mode logs "would-have" interventions; Active mode enforces verdicts in-line.
  • Reproducibility: Each run links to a versioned config, policy pack version, and artifacts (an illustrative sketch of such a config follows this list).
  • Limitations: Benchmarks are not a measure of general intelligence; they measure safe completion under constraints.
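
To make the reproducibility point concrete, a versioned run configuration might look something like the sketch below. Every key and value (suite name, policy pack version, scenario packs, catastrophic event definitions, artifact paths) is a hypothetical example of what a reproducible run would pin down, not NjiraAI's actual schema.

```python
# Illustrative run configuration (hypothetical schema), shown as a Python dict.
RUN_CONFIG = {
    "run_id": "example-shadow-run-001",                      # example identifier
    "suite": {"name": "web-tool-safety", "version": "1.2.0"},
    "scenario_packs": ["payments", "sql", "pii"],            # example scenario packs
    "policy_pack": {"name": "default", "version": "0.9.3"},
    "mode": "shadow",                                        # "shadow" or "active"
    # Per-domain definitions of what counts as a catastrophic event.
    "catastrophic_events": {
        "sql": ["destructive_statement_executed"],
        "payments": ["unauthorized_payment_completed"],
        "pii": ["pii_sent_to_untrusted_tool"],
    },
    # Pinned artifact locations so the run can be audited and reproduced.
    "artifacts": {
        "traces": "runs/example-001/traces/",
        "verdict_log": "runs/example-001/verdicts.jsonl",
    },
}
```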

Ready to benchmark your agents?

Start in shadow mode to measure impact safely, then enable enforcement when you're confident.

Book a demo