# OpenPEN Bench
Benchmark suite for OpenPEN Agent. Measures how well the agent identifies real vulnerabilities, filters out false positives, and assigns correct severities — across different models, prompts, and scan configurations.
## How it works
```
┌───────────┐
│  WAVSEP   │  Known vulnerable web app
│   :8080   │  (zaproxy/wavsep)
└─────▲─────┘
      │
┌─────┴─────┐
│   Proxy   │  Rewrites WAVSEP's descriptive URLs
│   :9090   │  to opaque paths (/t/a3f1b2c9)
│           │  so the LLM can't read the answer
│           │  from the URL. Writes mapping.json.
└─────▲─────┘
      │
┌─────┴─────┐
│   Agent   │  OpenPEN Agent (image under test)
│           │  Scans the proxy, analyzes findings
│           │  with LLM, writes analysis.json.
└───────────┘
```
After the scan:

```
mapping.json ──┐
               ├──▶ Scorer ──▶ Scorecard
analysis.json ─┤
               │
oracle.json ───┘
```
## Components
### Proxy
Sits between the agent and WAVSEP. On startup it crawls WAVSEP's index pages, assigns each test case page a random opaque ID, and builds a bidirectional path mapping.
- Requests: translates opaque paths to real WAVSEP paths.
- Responses: rewrites HTML (links, form actions) so the scanner and LLM only ever see opaque URLs.
- Output: `mapping.json` — the full opaque-to-real path mapping.
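The mapping step can be sketched roughly as follows. This is an illustrative Python version only (the actual proxy is written in Gleam, and the function and key names here are made up); it shows the core idea: random opaque IDs and a bidirectional path table.

```python
import secrets

def build_mapping(real_paths):
    """Assign each real WAVSEP path a random opaque ID (illustrative sketch,
    not the proxy's actual code). Returns a bidirectional mapping."""
    mapping = {"to_real": {}, "to_opaque": {}}
    for real in real_paths:
        opaque = "/t/" + secrets.token_hex(4)  # e.g. /t/a3f1b2c9
        mapping["to_real"][opaque] = real
        mapping["to_opaque"][real] = opaque
    return mapping

# Round-trip: opaque path resolves back to the real WAVSEP path.
m = build_mapping(["/wavsep/active/Case01.jsp"])
opaque = m["to_opaque"]["/wavsep/active/Case01.jsp"]
assert m["to_real"][opaque] == "/wavsep/active/Case01.jsp"
```

Because the IDs are random per run, the agent cannot learn path-to-vulnerability associations across runs either.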
### Oracle
A small, hand-maintained config file. Maps each WAVSEP vulnerability class to its CWE, expected severity, and URL path patterns (both true positive and false positive). About 10 entries.
Known-broken WAVSEP test cases are listed separately and excluded from scoring.
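A single entry might look something like the fragment below. The field names and values here are hypothetical placeholders to illustrate the shape described above; the authoritative schema is whatever `oracle.json` actually contains.

```json
{
  "classes": [
    {
      "name": "SQL Injection",
      "cwe": "CWE-89",
      "expected_severity": "high",
      "true_positive_patterns": ["<path pattern>"],
      "false_positive_patterns": ["<path pattern>"]
    }
  ],
  "known_broken": ["<excluded test case paths>"]
}
```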
### Scorer
Reads analysis.json, resolves opaque URLs back to real WAVSEP paths
using mapping.json, matches each finding against the oracle, and
reports four metrics:
| # | Metric | What it measures |
|---|---|---|
| 1 | Detection rate | How many known vulnerabilities did we find? |
| 2 | FP filtering rate | How many scanner false positives did the LLM correctly dismiss? |
| 3 | False dismissal rate | How many real vulnerabilities did the LLM incorrectly call false positives? |
| 4 | Severity accuracy | When we found a real vulnerability, did we assign the right severity? |
Results are broken down per vulnerability class.
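In terms of the counts the scorer collects, the four metrics reduce to simple ratios. The sketch below uses made-up variable names and assumed denominators (e.g. false dismissals over total known vulnerabilities), not the Gleam scorer's actual definitions:

```python
def scorecard(found, total_vulns, fps_dismissed, total_fps,
              falsely_dismissed, severity_correct):
    """Illustrative metric computation; names and denominators are assumptions."""
    return {
        # 1. Known vulnerabilities the agent reported
        "detection_rate": found / total_vulns,
        # 2. Scanner false positives the LLM correctly dismissed
        "fp_filtering_rate": fps_dismissed / total_fps,
        # 3. Real vulnerabilities the LLM wrongly called false positives
        "false_dismissal_rate": falsely_dismissed / total_vulns,
        # 4. Correct severity among the real vulnerabilities found
        "severity_accuracy": severity_correct / found if found else 0.0,
    }

card = scorecard(found=8, total_vulns=10, fps_dismissed=4, total_fps=5,
                 falsely_dismissed=1, severity_correct=6)
# detection_rate 0.8, fp_filtering_rate 0.8,
# false_dismissal_rate 0.1, severity_accuracy 0.75
```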
## Running
```sh
export LLM_API_BASE=http://localhost:11434/v1
export LLM_MODEL=devstral-small-2:24b
export AGENT_REF=latest

podman-compose up --abort-on-container-exit
gleam run -- score results/analysis.json results/mapping.json
```
## Container setup