Benchmarks & Leaderboards

Measured safety and reliability for tool-using agents — with and without NjiraAI.

Every score links to a reproducible run configuration, artifacts, and trace exemplars.

Shadow mode results show what happened without enforcement, alongside the interventions NjiraAI would have made. Active mode results show what happened with enforcement applied.

What we measure

Benchmarks focus on safety outcomes in real tool environments — not model "IQ".

Compare runs across suites, domains, policies, and deployment modes.

NjiraAI benchmarks measure safety and reliability at the tool boundary, where agents interact with real systems.

• Suites: We run standard agent evaluations (e.g., web, tool-use, and safety suites) plus NjiraAI scenario packs.
• Scoring: "Catastrophic" events are defined per domain (e.g., destructive SQL, unauthorized payments, PII leakage).
• Modes: Shadow mode logs "would-have" interventions; Active mode enforces verdicts inline.
• Reproducibility: Each run links to a versioned config, policy pack version, and artifacts.
• Limitations: Benchmarks are not a measure of general intelligence; they measure safe completion under constraints.
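
The scoring and mode distinctions above can be illustrated with a minimal sketch. Everything here is hypothetical and chosen for illustration only: the names `Mode`, `ToolCall`, `CATASTROPHIC`, `evaluate`, and `catastrophic_rate` are not part of any NjiraAI API, and the per-domain checks are toy stand-ins for a real policy pack. The sketch assumes that a flagged call still executes in shadow mode (the intervention is only logged) and is blocked in active mode.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"  # log would-have interventions, never block
    ACTIVE = "active"  # enforce verdicts inline


@dataclass
class ToolCall:
    domain: str  # e.g. "sql" or "payments"
    action: str  # the raw action the agent wants to perform


@dataclass
class Outcome:
    flagged: bool   # the policy considers this call catastrophic
    executed: bool  # the call actually reached the tool


# Hypothetical per-domain definitions of "catastrophic" (toy rules only).
CATASTROPHIC = {
    "sql": lambda a: a.strip().upper().startswith(("DROP ", "TRUNCATE ")),
    "payments": lambda a: a.startswith("pay:unauthorized"),
}


def evaluate(call: ToolCall, mode: Mode) -> Outcome:
    """Apply the domain check; block only when flagged AND in active mode."""
    check = CATASTROPHIC.get(call.domain)
    flagged = bool(check and check(call.action))
    executed = not (flagged and mode is Mode.ACTIVE)
    return Outcome(flagged=flagged, executed=executed)


def catastrophic_rate(calls: list[ToolCall], mode: Mode) -> float:
    """Fraction of calls that were catastrophic and still executed."""
    if not calls:
        return 0.0
    outcomes = [evaluate(c, mode) for c in calls]
    return sum(o.flagged and o.executed for o in outcomes) / len(outcomes)


calls = [
    ToolCall("sql", "DROP TABLE users"),
    ToolCall("sql", "SELECT 1"),
    ToolCall("payments", "pay:invoice-42"),
]
shadow_rate = catastrophic_rate(calls, Mode.SHADOW)
active_rate = catastrophic_rate(calls, Mode.ACTIVE)
```

Under these assumptions, running the same calls in both modes gives a nonzero catastrophic rate in shadow mode and zero in active mode, which is the kind of with-and-without comparison the leaderboard describes.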

Start in shadow mode to measure impact safely, then enable enforcement when you're confident.