
OpenPEN Bench

Benchmark suite for OpenPEN Agent. Measures how well the agent identifies real vulnerabilities, filters out false positives, and assigns correct severities — across different models, prompts, and scan configurations.

How it works

                    ┌───────────┐
                    │  WAVSEP   │  Known vulnerable web app
                    │ :8080     │  (zaproxy/wavsep)
                    └─────▲─────┘
                          │
                    ┌─────┴─────┐
                    │   Proxy   │  Rewrites WAVSEP's descriptive URLs
                    │  :9090    │  to opaque paths (/t/a3f1b2c9)
                    │           │  so the LLM can't read the answer
                    │           │  from the URL. Writes mapping.json.
                    └─────▲─────┘
                          │
                    ┌─────┴─────┐
                    │   Agent   │  OpenPEN Agent (image under test)
                    │           │  Scans the proxy, analyzes findings
                    │           │  with LLM, writes analysis.json.
                    └───────────┘

                    After the scan:

  mapping.json ──┐
                 ├──▶  Scorer  ──▶  Scorecard
  analysis.json ─┤
                 │
  oracle.json ───┘

Components

Proxy

Sits between the agent and WAVSEP. On startup it crawls WAVSEP's index pages, assigns each test case page a random opaque ID, and builds a bidirectional path mapping.

  • Requests: translates opaque paths to real WAVSEP paths.

  • Responses: rewrites HTML (links, form actions) so the scanner and LLM only ever see opaque URLs.

  • Output: mapping.json — the full opaque-to-real path mapping.
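The mapping and rewriting steps above can be sketched roughly as follows. This is a minimal Python illustration, not the proxy's actual Gleam implementation: the function names, the regex-based rewriting, and the example paths are assumptions, and the `/t/` plus 8 hex characters format is inferred only from the example in the diagram.

```python
import re
import secrets


def build_mapping(real_paths):
    """Assign each real path a random opaque path and keep both directions."""
    opaque_to_real = {}
    real_to_opaque = {}
    for real in real_paths:
        if real in real_to_opaque:  # ignore duplicate crawl results
            continue
        opaque = "/t/" + secrets.token_hex(4)  # 8 hex chars, e.g. /t/a3f1b2c9
        while opaque in opaque_to_real:  # retry on the rare collision
            opaque = "/t/" + secrets.token_hex(4)
        opaque_to_real[opaque] = real
        real_to_opaque[real] = opaque
    return opaque_to_real, real_to_opaque


# Matches double-quoted href and action attributes (assumed to be sufficient here).
HREF_OR_ACTION = re.compile(r'(href|action)="([^"]*)"')


def rewrite_html(html, real_to_opaque):
    """Swap known real paths in links and form actions for their opaque paths."""
    def swap(match):
        attr, target = match.group(1), match.group(2)
        return f'{attr}="{real_to_opaque.get(target, target)}"'
    return HREF_OR_ACTION.sub(swap, html)
```

Incoming requests would use `opaque_to_real` for the reverse lookup, and serializing that dictionary to JSON would give a file in the spirit of mapping.json. A real proxy would more likely parse the HTML than rely on a regex.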

Oracle

A small, hand-maintained config file. Maps each WAVSEP vulnerability class to its CWE, expected severity, and URL path patterns (both true positive and false positive). About 10 entries.

Known-broken WAVSEP test cases are listed separately and excluded from scoring.
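To make the oracle's role concrete, here is a hedged Python sketch of what one entry and a path lookup might look like. The field names, the glob-style patterns, the severity value, and the `/example/...` paths are all hypothetical; the real oracle.json schema may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical oracle with a single vulnerability class.
ORACLE = {
    "classes": [
        {
            "name": "reflected-xss",
            "cwe": 79,                      # CWE-79: cross-site scripting
            "severity": "high",             # assumed expected severity
            "true_positive": ["/example/xss/tp/*"],
            "false_positive": ["/example/xss/fp/*"],
        },
    ],
    # Known-broken test cases, excluded from scoring.
    "broken": ["/example/xss/tp/case99.jsp"],
}


def classify(path, oracle):
    """Return (class name, kind) for a real path; kind says how to score it."""
    if any(fnmatchcase(path, p) for p in oracle["broken"]):
        return None, "excluded"
    for entry in oracle["classes"]:
        if any(fnmatchcase(path, p) for p in entry["false_positive"]):
            return entry["name"], "false_positive"
        if any(fnmatchcase(path, p) for p in entry["true_positive"]):
            return entry["name"], "true_positive"
    return None, "unknown"
```

Checking the broken list first keeps excluded test cases out of every metric, whichever class pattern they would otherwise match.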

Scorer

Reads analysis.json, resolves opaque URLs back to real WAVSEP paths using mapping.json, matches each finding against the oracle, and reports four metrics:

#  Metric                What it measures
1  Detection rate        How many known vulnerabilities did we find?
2  FP filtering rate     How many scanner false positives did the LLM correctly dismiss?
3  False dismissal rate  How many real vulnerabilities did the LLM incorrectly call false positives?
4  Severity accuracy     When we found a real vulnerability, did we assign the right severity?

Results are broken down per vulnerability class.
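As a rough illustration of the scoring step, the Python sketch below resolves an opaque URL and computes the four metrics for one set of findings. The `Finding` fields, the flat opaque-to-real dictionary, and the exact numerators and denominators are assumptions made for this example; the scorer's real definitions may differ.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


def resolve(url, opaque_to_real):
    """Map an opaque proxy URL back to the real path, or None if unmapped."""
    return opaque_to_real.get(urlsplit(url).path)


@dataclass
class Finding:
    page: str            # real path of the test case page
    kind: str            # "true_positive" or "false_positive", per the oracle
    dismissed: bool      # True if the LLM called the finding a false positive
    severity: str        # severity assigned by the agent
    expected: str = ""   # severity expected by the oracle (real vulns only)


def ratio(numerator, denominator):
    """Return a rate, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def score(findings, known_vulnerable):
    """Compute the four metrics; known_vulnerable counts oracle pages in scope."""
    real = [f for f in findings if f.kind == "true_positive"]
    fake = [f for f in findings if f.kind == "false_positive"]
    kept = [f for f in real if not f.dismissed]
    return {
        # Distinct vulnerable pages reported and not dismissed.
        "detection_rate": ratio(len({f.page for f in kept}), known_vulnerable),
        "fp_filtering_rate": ratio(sum(f.dismissed for f in fake), len(fake)),
        "false_dismissal_rate": ratio(sum(f.dismissed for f in real), len(real)),
        "severity_accuracy": ratio(
            sum(f.severity == f.expected for f in kept), len(kept)
        ),
    }
```

A per-class breakdown could be obtained by grouping the findings by vulnerability class and calling `score` once per group.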

Running

export LLM_API_BASE=http://localhost:11434/v1
export LLM_MODEL=devstral-small-2:24b
export AGENT_REF=latest

podman-compose up --abort-on-container-exit
gleam run -- score results/analysis.json results/mapping.json

Container setup