Measured detection rates.
Not marketing claims.
Every number on this page is produced by running our policy engine against a fixed dataset of 35 known attack patterns and 5 clean control inputs. Results are re-run with every engine update.
Detection rate: attacks caught (block + review) out of 35 total
Hard block rate: attacks stopped outright, never reached users
False positive rate: clean inputs incorrectly blocked (lower is better)
Where each attack type lands
jailbreak: — patterns detected
pii_data_exfil: — patterns detected
professional_advice: — patterns detected
harmful_content: — patterns detected
control: 5 clean inputs — false positive test (FP rate)
How these numbers are produced
Fixed dataset, version-locked
The benchmark dataset is committed to the codebase and never modified retroactively. Attack prompts and simulated outputs are pinned, so when a rule improvement catches a previously missed item, that gain is permanent. Dataset v1.0.0 contains 35 attack patterns across 4 categories and 5 clean control inputs.
Full three-tier engine
Every item is evaluated by all three detection tiers: deterministic rules (regex, PII patterns, injection patterns), the LLM classifier (GPT-4o-mini binary questions), and the LLM judge (a holistic four-category evaluator). runToCompletion mode is used, so every rule's verdict is recorded even after a block is triggered.
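The tier-by-tier flow can be sketched as follows. The tier functions, verdict strings, and the short-circuit contrast are assumptions made for illustration; only the runToCompletion behavior (record every tier even after a block) comes from the description above.

```python
import re

# Illustrative sketch: run every detection tier against one output.
# With run_to_completion=True, all verdicts are recorded even after
# an earlier tier has already blocked; with False, evaluation would
# stop at the first block.
def evaluate(output, tiers, run_to_completion=True):
    results = []
    for name, check in tiers:
        results.append({"tier": name, "verdict": check(output)})
        if not run_to_completion and results[-1]["verdict"] == "block":
            break
    return results

# Stand-in tiers — the real classifier and judge are LLM calls.
tiers = [
    # Deterministic rule: an SSN-like PII pattern.
    ("deterministic_rules",
     lambda o: "block" if re.search(r"\b\d{3}-\d{2}-\d{4}\b", o) else "allow"),
    ("llm_classifier", lambda o: "review"),  # stand-in for a binary question
    ("llm_judge", lambda o: "allow"),        # stand-in for the holistic judge
]

results = evaluate("My SSN is 123-45-6789", tiers)
# All three tiers are recorded even though the first tier blocked.
```

The design point is that runToCompletion trades a little extra compute for a complete per-tier record, which is what makes the per-category breakdowns above possible.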
Out-of-the-box policy, no tuning
Benchmarks run against the published General / Enterprise policy template as-shipped. No rule weights, thresholds, or categories are adjusted to improve the score. What you see is what a new customer gets on day one.
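For intuition, an as-shipped template might look like the sketch below. Every key and value here is a hypothetical stand-in; the published General / Enterprise template's actual schema is not shown in this document.

```python
# Hypothetical shape of a shipped policy template — illustrative only.
# The benchmark runs against this as-is: no weights, thresholds, or
# categories are tuned for the dataset.
GENERAL_ENTERPRISE_TEMPLATE = {
    "name": "General / Enterprise",
    "tiers": ["deterministic_rules", "llm_classifier", "llm_judge"],
    "categories": [
        "jailbreak",
        "pii_data_exfil",
        "professional_advice",
        "harmful_content",
    ],
    # Assumed score cutoffs, untouched during benchmarking.
    "thresholds": {"block": 0.9, "review": 0.6},
}
```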
Detection = block or review
An attack is counted as "detected" if the policy returns block or review — either the output was stopped or a human was alerted. Outputs that receive allow are counted as missed. The false positive rate counts clean inputs that were incorrectly flagged.
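The counting rule above can be expressed in a few lines. The verdict records are invented sample data, and the sketch assumes a control counts as a false positive if it receives either block or review (i.e. anything other than allow).

```python
# Sketch of the scoring rule: "detected" = block or review on an attack;
# allow on an attack is a miss; any non-allow on a control is a false positive.
def score(verdicts):
    attacks = [v for v in verdicts if v["category"] != "control"]
    controls = [v for v in verdicts if v["category"] == "control"]
    detected = sum(v["verdict"] in ("block", "review") for v in attacks)
    hard_blocked = sum(v["verdict"] == "block" for v in attacks)
    false_pos = sum(v["verdict"] != "allow" for v in controls)
    return {
        "detection_rate": detected / len(attacks),
        "hard_block_rate": hard_blocked / len(attacks),
        "false_positive_rate": false_pos / len(controls),
    }

# Invented sample run: 5 attacks, 5 controls.
verdicts = (
    [{"category": "jailbreak", "verdict": "block"}] * 3
    + [{"category": "pii_data_exfil", "verdict": "review"}]
    + [{"category": "harmful_content", "verdict": "allow"}]  # a miss
    + [{"category": "control", "verdict": "allow"}] * 4
    + [{"category": "control", "verdict": "review"}]         # a false positive
)
s = score(verdicts)
# detection_rate = 4/5, hard_block_rate = 3/5, false_positive_rate = 1/5
```

Note that review counts toward detection but not toward the hard block rate, which is why the two headline numbers can differ.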
See it against your AI outputs
These numbers reflect our default policy. Your custom rules, thresholds, and industry templates will typically score higher. Run your own benchmark in the dashboard.
Want the full dataset for independent verification? Contact us →