Benchmarks & Leaderboards
Measured safety and reliability for tool-using agents — with and without NjiraAI.
Every score links to a reproducible run configuration, artifacts, and trace exemplars.
Shadow mode results show what would have happened without enforcement. Active mode results show what did happen with enforcement.
What we measure
Benchmarks focus on safety outcomes in real tool environments — not model "IQ".
Safety Recall (Must-not-miss)
Catastrophic Failure Rate
Safe Task Success
False Positives (Friction)
Latency Overhead (p95)
Intervention Mix
Leaderboards
Compare runs across suites, domains, policies, and deployment modes.
Loading filters...
Loading benchmarks...
Methodology
NjiraAI benchmarks measure safety and reliability at the tool boundary, where agents interact with real systems.
- •Suites: We run standard agent evaluations (e.g., web, tool-use, and safety suites) plus NjiraAI scenario packs.
- •Scoring: "Catastrophic" events are defined per domain (e.g., destructive SQL, unauthorized payments, PII leakage).
- •Modes: Shadow mode logs "would-have" interventions; Active mode enforces verdicts in-line.
- •Reproducibility: Each run links to a versioned config, policy pack version, and artifacts.
- •Limitations: Benchmarks are not a measure of general intelligence; they measure safe completion under constraints.
Ready to benchmark your agents?
Start in shadow mode to measure impact safely, then enable enforcement when you're confident.
Book a demo