# OpenPEN Bench
Benchmark suite for OpenPEN Agent. Measures how well the agent identifies real vulnerabilities, filters out false positives, and assigns correct severities — across different models, prompts, and scan configurations.
## How it works
```
┌───────────┐
│  WAVSEP   │  Known vulnerable web app
│   :8080   │  (zaproxy/wavsep)
└─────▲─────┘
      │
┌─────┴─────┐
│   Proxy   │  Rewrites WAVSEP's descriptive URLs
│   :9090   │  to opaque paths (/t/a3f1b2c9)
│           │  so the LLM can't read the answer
│           │  from the URL. Writes mapping.json.
└─────▲─────┘
      │
┌─────┴─────┐
│   Agent   │  OpenPEN Agent (image under test)
│           │  Scans the proxy, analyzes findings
│           │  with LLM, writes analysis.json.
└───────────┘
```
After the scan:

```
mapping.json ──┐
               ├──▶ Scorer ──▶ Scorecard
analysis.json ─┤
               │
oracle.json ───┘
```
## Components
### Proxy
Sits between the agent and WAVSEP. On startup it crawls WAVSEP's index pages, assigns each test case page a random opaque ID, and builds a bidirectional path mapping.
- Requests: translates opaque paths to real WAVSEP paths.
- Responses: rewrites HTML (links, form actions) so the scanner and LLM only ever see opaque URLs.
- Output: `mapping.json` — the full opaque-to-real path mapping.
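The mapping step can be sketched roughly as follows. This is an illustrative Python version only (the actual proxy is written in Gleam, and the function and key names here are made up); it shows the core idea: random opaque IDs and a bidirectional path table.

```python
import secrets

def build_mapping(real_paths):
    """Assign each real WAVSEP path a random opaque ID (illustrative sketch,
    not the proxy's actual code). Returns a bidirectional mapping."""
    mapping = {"to_real": {}, "to_opaque": {}}
    for real in real_paths:
        opaque = "/t/" + secrets.token_hex(4)  # e.g. /t/a3f1b2c9
        mapping["to_real"][opaque] = real
        mapping["to_opaque"][real] = opaque
    return mapping

# Round-trip: opaque path resolves back to the real WAVSEP path.
m = build_mapping(["/wavsep/active/Case01.jsp"])
opaque = m["to_opaque"]["/wavsep/active/Case01.jsp"]
assert m["to_real"][opaque] == "/wavsep/active/Case01.jsp"
```

Because the IDs are random per run, the agent cannot learn path-to-vulnerability associations across runs either.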
### Oracle
A small, hand-maintained config file. Maps each WAVSEP vulnerability class to its CWE, expected severity, and URL path patterns (both true positive and false positive). About 10 entries.
Known-broken WAVSEP test cases are listed separately and excluded from scoring.
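A single entry might look something like the fragment below. The field names and values here are hypothetical placeholders to illustrate the shape described above; the authoritative schema is whatever `oracle.json` actually contains.

```json
{
  "classes": [
    {
      "name": "SQL Injection",
      "cwe": "CWE-89",
      "expected_severity": "high",
      "true_positive_patterns": ["<path pattern>"],
      "false_positive_patterns": ["<path pattern>"]
    }
  ],
  "known_broken": ["<excluded test case paths>"]
}
```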
### Scorer
Reads analysis.json, resolves opaque URLs back to real WAVSEP paths
using mapping.json, matches each finding against the oracle, and
reports four metrics:
| # | Metric | What it measures |
|---|---|---|
| 1 | Detection rate | How many known vulnerabilities did we find? |
| 2 | FP filtering rate | How many scanner false positives did the LLM correctly dismiss? |
| 3 | False dismissal rate | How many real vulnerabilities did the LLM incorrectly call false positives? |
| 4 | Severity accuracy | When we found a real vulnerability, did we assign the right severity? |
Results are broken down per vulnerability class.
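In terms of the counts the scorer collects, the four metrics reduce to simple ratios. The sketch below uses made-up variable names and assumed denominators (e.g. false dismissals over total known vulnerabilities), not the Gleam scorer's actual definitions:

```python
def scorecard(found, total_vulns, fps_dismissed, total_fps,
              falsely_dismissed, severity_correct):
    """Illustrative metric computation; names and denominators are assumptions."""
    return {
        # 1. Known vulnerabilities the agent reported
        "detection_rate": found / total_vulns,
        # 2. Scanner false positives the LLM correctly dismissed
        "fp_filtering_rate": fps_dismissed / total_fps,
        # 3. Real vulnerabilities the LLM wrongly called false positives
        "false_dismissal_rate": falsely_dismissed / total_vulns,
        # 4. Correct severity among the real vulnerabilities found
        "severity_accuracy": severity_correct / found if found else 0.0,
    }

card = scorecard(found=8, total_vulns=10, fps_dismissed=4, total_fps=5,
                 falsely_dismissed=1, severity_correct=6)
# detection_rate 0.8, fp_filtering_rate 0.8,
# false_dismissal_rate 0.1, severity_accuracy 0.75
```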
## Running
```sh
export LLM_API_BASE=http://localhost:11434/v1
export LLM_MODEL=devstral-small-2:24b
export AGENT_REF=latest

podman-compose up --abort-on-container-exit
gleam run -- score results/analysis.json results/mapping.json
```
## Container setup