SysRepair-Bench evaluates autonomous agents on their ability to remediate misconfigurations, vulnerable dependencies, and unsafe permissions on real systems. Each scenario is a reproducible Docker container or virtual machine seeded with a known-vulnerable state drawn from public red-team material (CCDC hardening checklists, the Metasploitable 2 OpenVAS report, VulnHub VM write-ups, the Metasploitable 3 OpenVAS report, Hivestorm, and a newly designed Metasploitable 4).
For each scenario, given only the running container and, optionally, a threat description, an agent must perform system-administration actions (edit configuration, install/remove packages, adjust permissions, manage services, etc.) until both of the following hold:
- PoC check — the original vulnerability is no longer exploitable, AND
- Regression check — the affected service still functions correctly.
Remediation is scored as successful only if both checks pass.
The benchmark comprises 313 scenarios: 297 binary pass/fail scenarios across five VM classes (six suites), namely ccdc/ (50), meta2/ (40), vulnhub/ (30), meta3/ubuntu/ (19), meta3/windows/ (21), and meta4/ (137, comprising 117 Docker container scenarios and 20 Active Directory VM scenarios), plus a 16-scenario hivestorm/ free-roam track that ships alongside them and uses weighted partial-credit scoring.
| VM Class / Suite | Era | Built | Source |
|---|---|---|---|
| `ccdc/` | 2015–2022 | 50 | CCDC blue-team hardening scripts (TAMU linuxmonkeys, LATech/UTSA SWCCDC, team checklists) on Ubuntu 25.10 |
| `meta2/` | 2008–2012 | 40 | OpenVAS scan of Metasploitable 2.0 on Ubuntu 8.04. ⚠ Linux host only (see Host Requirements) |
| `vulnhub/` | 2012–2022 | 30 | Per-VM vulnerability rebuilds (Kioptrix, DC-series, Mr-Robot, SickOs, Symfonos, etc.) on Debian 11 |
| `meta3/ubuntu/` | 2014–2020 | 19 | Port of Rapid7 Metasploitable 3 (Ubuntu 14.04): Drupalgeddon, ProFTPD mod_copy, payroll_app, Docker group escalation, WEBrick, UnrealIRCd, Samba, phpMyAdmin. Vendors the Rapid7 Chef cookbook under BSD-3. |
| `meta3/windows/` | 2016–2020 | 21 | Rapid7 Metasploitable 3 (Windows Server): Struts, Jenkins, ManageEngine, GlassFish, Tomcat, ElasticSearch, IIS WebDAV, SMB. Scoped by the Windows OpenVAS scan. ⚠ Windows host only (see Host Requirements) |
| `meta4/` | 2022–2026 | 137 | Container suite (117 Docker scenarios) covering modern CVEs (Log4Shell family, Spring4Shell, PwnKit, Dirty Pipe, GameOver(lay), regreSSHion, Leaky Vessels, XZ backdoor, Copy Fail CVE-2026-31431, crAPI/DVGA/VAmPI API surfaces, LocalStack/MinIO/ArgoCD/k3s cloud-on-localhost misconfigs, ImageMagick, Memcached, curl SOCKS5, Redis Lua sandbox, Adminer, Apache Solr, Rsync, Cacti, and more) plus an Active Directory VM lab (`meta4/ad-vm/`, 20 scenarios: Zerologon, NoPac, ADCS ESC1–ESC8, Kerberoasting, DCSync, PrintNightmare, PetitPotam, and more). Kernel-coupled scenarios ship a Vagrant VM (`meta4/kernel-vm/`). |
| `hivestorm/` | HS20–HS23 | 16 | Free-roam Hivestorm-style scenarios (Debian/Ubuntu/CentOS/Windows Server-Core/FreeBSD/AD-DC). Identities (backdoor account, trojan path, rogue cron, SUID plant) are randomized per build; the scorer emits weighted partial credit via JSONL checks rather than binary pass/fail. |
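
The weighted partial-credit idea used by the hivestorm track can be sketched in a few lines of shell. This is only an illustration: the JSONL field names (`weight`, `passed`) and the function name `hivestorm_score` are assumptions, since the scorer's actual schema is not described here.

```shell
# Hypothetical sketch: percentage of total weight earned by passed checks.
# Assumes one JSON object per line with a numeric "weight" and a boolean
# "passed" field; these names are illustrative, not the real schema.
hivestorm_score() {
  awk '
    match($0, /"weight"[ \t]*:[ \t]*[0-9.]+/) {
      w = substr($0, RSTART, RLENGTH)   # the matched weight key and value
      sub(/^.*:[ \t]*/, "", w)          # keep only the number
      total += w
      if ($0 ~ /"passed"[ \t]*:[ \t]*true/) earned += w
    }
    END {
      if (total > 0) printf "%.1f\n", 100 * earned / total
      else print "0.0"
    }
  ' "$@"
}
```

Under these assumptions, `hivestorm_score checks.jsonl` (or the same data on stdin) prints a score between 0.0 and 100.0 instead of a binary pass/fail.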
Every scenario's `threat.md` (except in `hivestorm/`) is labeled with one of five operational remediation categories that mirror how security-operations teams classify remediation work:
- Access Control: authentication, authorization, user privileges, file ownership. Typical actions: `chmod`, `chown`, `usermod`, `passwd`, `visudo`, `sshd_config`, PAM.
- Configuration Hardening: insecure defaults, missing security directives, misconfigured service parameters. Typical actions: edits to `nginx.conf`, `sshd_config`, `my.cnf`, `apache2.conf`, `pg_hba.conf`, followed by `systemctl reload/restart`.
- Dependency & Package Management: outdated packages with known CVEs, inherently compromised services, unnecessary high-risk daemons. Typical actions: `apt-get upgrade`, `--only-upgrade`, `remove`, `systemctl disable`.
- Network Security & Firewall Policy: exposed ports, missing firewall rules, unrestricted listener scope. Typical actions: `ufw`, `iptables`, bind-address changes, TCP wrappers, `netstat`/`ss` auditing.
- Compensating Controls: vulnerabilities where direct remediation is not possible or not desirable. Examples: the package cannot be upgraded because a dependent legacy app requires the specific version, the software is end-of-life with no vendor patch, or the service cannot be restarted during business hours. The agent must instead apply network-level restrictions (firewall scoping, bind-to-localhost), application-layer mitigations (WAF rules, `mod_rewrite` guards, ACLs), or safe config-directive removals while keeping the service usable. Scoring adds a third dimension, compensating-control adequacy: whether the applied controls meaningfully reduce the attack surface.
Distribution of base severity scores across all 313 scenarios. Scores follow CVSS v3.1; scenarios without a CVE (CCDC misconfigs, Hivestorm free-roam) are unscored.
| Severity | CVSS v3.1 Range | # Scenarios |
|---|---|---|
| Critical | 9.0–10.0 | 93 |
| High | 7.0–8.9 | 107 |
| Medium | 4.0–6.9 | 44 |
| Low | 0.1–3.9 | 1 |
| Unscored (misconfig / free-roam) | — | 68 |
| Total | | 313 |
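
For reference, the banding in the table can be expressed as a small shell function. The function name is illustrative, and treating an empty or zero score as "Unscored" is an assumption that mirrors the table's bucket for scenarios without a CVE.

```shell
# Map a CVSS v3.1 base score to the severity bands used in the table.
# An empty or zero score is reported as "Unscored" (assumption; see lead-in).
cvss_severity() {
  awk -v s="${1:-0}" 'BEGIN {
    s += 0
    if      (s >= 9.0) print "Critical"
    else if (s >= 7.0) print "High"
    else if (s >= 4.0) print "Medium"
    else if (s >= 0.1) print "Low"
    else               print "Unscored"
  }'
}
```

For example, `cvss_severity 9.8` prints `Critical` and `cvss_severity 6.9` prints `Medium`.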
| Remediation Category | # Scenarios |
|---|---|
| Configuration Hardening | 113 |
| Dependency & Package Management | 71 |
| Access Control | 64 |
| Compensating Controls | 39 |
| Network Security | 10 |
| Free-roam (multiple) | 16 |
| Total | 313 |
| Service / Application Type | # Scenarios |
|---|---|
| Web Server | 59 |
| Enterprise / Infrastructure | 34 |
| System / Auth | 31 |
| Container / Runtime | 18 |
| Database / Cache | 19 |
| SSH / Remote Access | 17 |
| CMS / Web Admin Panel | 17 |
| Legacy / Backdoor Service | 16 |
| DNS / mDNS | 12 |
| Kernel / OS Privilege | 13 |
| Firewall / Network Policy | 11 |
| Application Server / Java | 12 |
| File Sharing | 11 |
| Library / Language Runtime | 11 |
| Free-roam (Hivestorm) | 16 |
| Mail / Messaging | 7 |
| FTP | 6 |
| CI/CD / DevOps | 3 |
| Total | 313 |
sysrepair-bench/
├── ccdc/ # 50 CCDC-derived scenarios (scenario-01..50)
├── meta2/ # 40 Metasploitable 2 / OpenVAS scenarios (scenario-01..40; S34-S40 = Compensating Controls)
├── vulnhub/ # 30 VulnHub-derived scenarios (scenario-01..30)
├── meta3/ubuntu/ # 19 Metasploitable 3 (Ubuntu 14.04) scenarios + vendored Chef cookbook (shared/)
├── meta3/windows/ # 21 Metasploitable 3 (Windows Server) scenarios (harness validation)
├── meta4/ # 137 modern-CVE scenarios (117 Docker + 20 AD-VM)
│ ├── kernel-vm/ # Vagrant VM for kernel-coupled LPE scenarios (S19, S21, S22, S117)
│ └── ad-vm/ # Vagrant AD lab (Win2019 DC+CA + Kali attacker, S01–S20)
├── hivestorm/ # 16 free-roam Hivestorm-style scenarios (weighted partial-credit)
├── openvas-scan-reports/ # OpenVAS scan PDFs scoping meta2 and meta3/windows
├── inspect_eval/ # Inspect AI harness: solvers, task wiring, run presets
└── README.md
Every scenario, across all suites, follows the same layout at minimum:
scenario-NN/
├── Dockerfile # Builds the vulnerable container
├── threat.md # Severity, CVE, affected service, remediation steps
└── verify.sh # exit 0 = remediated + functional, exit 1 = failed
Scenarios are scoped so that fixes are expressible as system-administration primitives:
| Action | Examples |
|---|---|
| `edit_file_parameter` | sshd_config, nginx.conf, my.cnf, php.ini, pg_hba.conf |
| `install_package` / `update_package` | fail2ban, ufw, openssl, samba |
| `remove_package` | telnetd, rsh-server, nmap, backdoors |
| `chmod` / `chown` | /etc/shadow, web roots, SUID binaries |
| `service_stop` / `service_disable` | rlogin, avahi-daemon, cups |
| `iptables_block` | Backdoor ports (1524, 1099, 6200, …) |
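
As an illustration of what a primitive such as `edit_file_parameter` amounts to in practice, the following sketch sets a key in a whitespace-delimited config file. The helper name `set_config_param` is hypothetical and not part of the benchmark; it assumes GNU `sed` and a key and value free of regex metacharacters and `|`.

```shell
# Hypothetical helper illustrating the edit_file_parameter primitive.
# Sets "KEY VALUE" in an sshd_config-style file: rewrites an existing or
# commented-out line for KEY, otherwise appends a new line.
# Assumes GNU sed (-i without a suffix) and plain-text KEY/VALUE.
set_config_param() {
  local file="$1" key="$2" value="$3"
  if grep -Eq "^[#[:space:]]*${key}[[:space:]]" "$file"; then
    sed -i -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}
```

A call such as `set_config_param /etc/ssh/sshd_config PermitRootLogin no` is idempotent: running it twice leaves the file unchanged after the first run. A service reload would still be needed for the change to take effect.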
Every scenario is scored on two mandatory objectives, plus — for Compensating-Controls scenarios — a third:
- Security objective (primary). The specific vulnerability described in `threat.md` is eliminated. Verified by the scenario's `verify.sh` PoC block: a CVE is no longer exploitable, a misconfiguration is corrected, an insecure service is disabled or hardened, permissions are properly restrictive.
- Service availability (regression). Every service that was operational before remediation stays operational afterward. A fix that patches the vulnerability but kills the web server, database, or SSH management path is scored as a failure. Verified by the scenario's `verify.sh` regression block.
- Compensating-control adequacy (Compensating Controls category only). Where direct remediation is forbidden by the scenario constraints, `verify.sh` additionally asserts that the attack-surface reduction is in place (firewall rule present, listener scoped to loopback, WAF/`mod_rewrite` guard active, unsafe config directive removed).
A scenario is scored success only if all applicable objectives pass.
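
The all-objectives-must-pass rule can be pictured as a `verify.sh` skeleton like the one below. It is a sketch only: the function names are hypothetical, and the three check bodies are placeholders that a real scenario would replace with its own PoC, regression, and (where applicable) adequacy logic.

```shell
# Hypothetical verify.sh skeleton. poc_check and regression_check stand in
# for scenario-specific logic; adequacy_check applies only to Compensating
# Controls scenarios and defaults to a no-op here.
poc_check()        { return 1; }   # replace: return 0 once the exploit fails
regression_check() { return 1; }   # replace: return 0 while the service works
adequacy_check()   { return 0; }   # replace for Compensating Controls scenarios

run_verify() {
  if ! poc_check; then
    echo "FAIL: vulnerability still exploitable" >&2
    return 1
  fi
  if ! regression_check; then
    echo "FAIL: service regression" >&2
    return 1
  fi
  if ! adequacy_check; then
    echo "FAIL: compensating control inadequate" >&2
    return 1
  fi
  echo "PASS"
  return 0
}
```

Ending a script built this way with `run_verify; exit $?` would give the exit-code convention used throughout: 0 for remediated and functional, 1 otherwise.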
Scenarios may additionally track command count, wall-clock, safety violations (destructive commands outside remediation scope), hallucination (claimed actions that were not executed), and invariant preservation (prior hardening not undone while fixing the target).
SysRepair-Bench does not cover:
- Source code modification. The agent never edits application source, generates code patches, or runs application test suites. That is the domain of SWE-bench and automated program repair.
- Web-application vulnerabilities requiring code fixes (SQLi/XSS/CSRF). Web-server configuration hardening (directory listing, security headers, disabling unsafe modules) is in scope; changing application logic is not.
- Cloud-native / Kubernetes-specific issues. IAM policy, orchestration misconfig, and cloud-service settings are out of scope.
- Zero-days with no known remediation. Every scenario has at least one valid remediation path; the benchmark tests whether agents find and execute it.
- Hardware / firmware vulnerabilities (Spectre, Meltdown, etc.).
SysRepair-Bench builds every scenario from source. Depending on which suites you intend to run, you will need some or all of Docker, Vagrant + VirtualBox, Python (via uv), and a small set of platform-specific toggles. This section lists everything the repo needs to work correctly.
# Anonymous repository URL (provided by reviewing system)
git clone <anonymous-repo-url>
cd sysrepair-bench

| Tool | Minimum | Purpose |
|---|---|---|
| `git` | 2.30+ | clone + submodule handling |
| Docker Engine / Docker Desktop | 24.x+ | builds and runs every container scenario |
| `uv` | 0.4+ | Python env + lockfile for the Inspect AI harness |
| Python | 3.11+ (installed automatically by `uv sync`) | harness runtime |
| `bash` | 4+ | `prepare.sh`, `verify.sh`, seed scripts (Git Bash / WSL / macOS / Linux) |

Install the Inspect AI harness:
cd inspect_eval
uv sync # creates .venv, installs Inspect AI + providers
cd ..

No other system-level Python packages are needed: everything the scenarios do runs inside containers or VMs.
No extras beyond Docker. These use modern base images (Ubuntu 14.04 / 22.04, Debian 11 / 12, Alpine, vendor images). On first run, the harness will docker build each scenario on demand. Runtime-escape scenarios (Leaky Vessels, docker.sock, --privileged abuse) are auto-elevated by the harness when a .needs-privileged marker is present in the scenario dir — no manual docker run --privileged needed.
The Metasploitable 2 suite uses `lpenz/ubuntu-hardy-amd64` (Ubuntu 8.04). Hardy's glibc requires the legacy `vsyscall` page, which is disabled in the Docker Desktop / WSL2 kernels shipped on Windows and macOS. Every process SIGSEGVs (exit 139) before `apt-get` runs.
Requirements:
- A native Linux host (or a VM) with one of:
  - kernel booted with `vsyscall=emulate` (default on most distro kernels < 6.x; explicit on 6.x)
  - a WSL2 custom kernel rebuilt with `CONFIG_LEGACY_VSYSCALL_EMULATE=y` (advanced, not supported by Docker Desktop)
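
Before building the meta2 base image, it can help to check how the host kernel was booted. The sketch below reads the kernel command line and reports the requested `vsyscall` mode; the function name is illustrative, and a result of `default` only means the parameter was not passed explicitly, so the kernel's compiled-in default applies.

```shell
# Report the vsyscall mode requested on the kernel command line.
# Prints the value of the last vsyscall= token, or "default" when the
# parameter is absent. The function name is illustrative.
vsyscall_mode() {
  local cmdline_file="${1:-/proc/cmdline}" mode
  mode="$(tr ' ' '\n' < "$cmdline_file" | sed -n 's/^vsyscall=//p' | tail -n 1)"
  echo "${mode:-default}"
}
```

On a host booted with `vsyscall=emulate`, `vsyscall_mode` prints `emulate`; a value of `none` indicates that the Hardy containers would be expected to crash as described above.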
Pre-build the shared Hardy base (auto-built on first run, or manually):
docker build -t sysrepair/meta2-hardy:latest meta2/_base

The harness injects `cap_add=["NET_ADMIN"]` for every scenario and `privileged=True` where a `.needs-privileged` marker is present (`inspect_eval/sysrepair_bench/task.py:274-275`), so iptables-manipulating scenarios (`meta2/scenario-37`, `scenario-39`) and runtime-escape scenarios need no manual runtime flags when launched via `uv run python -m sysrepair_bench.run`.
Windows containers (`mcr.microsoft.com/windows/servercore`) share the Windows NT kernel with the host; they cannot run on Linux or macOS. Meta4 has no Windows-container scenarios, only Linux images; its Windows host requirement is limited to `meta4/kernel-vm/` (see 3d).
Applies to:
- All 21 `meta3/windows/` scenarios (Server Core ltsc2019/ltsc2022)
- Hivestorm Windows-container scenarios: `scenario-03-win10`, `scenario-04-win2019`, `scenario-05-win2016` (ltsc2016; see the isolation note below), `scenario-08-win-iis`, `scenario-11-win-dc-dns`
- (Hivestorm `scenario-13-ad-dc-win2019` is VM-based, not container-based; see 3e)
Requirements:
- Windows 10/11 Pro/Enterprise or Windows Server 2019+ (Home editions lack Containers + Hyper-V isolation)
- Docker Desktop switched to Windows Containers mode (tray-icon → "Switch to Windows containers"), or a native Windows `dockerd`
- Hyper-V + Containers features (elevated PowerShell):

      Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All
      Enable-WindowsOptionalFeature -Online -FeatureName Containers -All

- Internet access for `docker build` to pull `mcr.microsoft.com/windows/servercore:ltsc201{6,9}` and the pinned legacy installers on first build (or an offline mirror for air-gapped builds)
Isolation mode. The harness auto-injects isolation: hyperv for every Windows-container scenario (task.py:22-30, task.py:275), so mismatched builds like ltsc2016 (hivestorm scenario-05-win2016) run correctly on any supported Windows host. For manual docker run outside the harness, enable Hyper-V isolation one of two ways:
- Docker Desktop (GUI): right-click the tray icon → Switch to Windows containers, then in Settings → General toggle Use Hyper-V isolation by default (wording varies by version) and click Apply & Restart. The "Use the WSL 2 based engine" option in Settings → General is not the one you want for Windows containers.
- Native `dockerd` / daemon config: add `"exec-opts": ["isolation=hyperv"]` to `%ProgramData%\docker\config\daemon.json` and restart the Docker service.
Per-scenario isolation recommendations for manual runs are in meta3/windows/README.md.
3d. meta4/kernel-vm/ — VirtualBox VM for kernel-coupled LPE scenarios (S21, S22, S117; optionally S19)
These scenarios target kernel vulnerabilities. Containers share the host kernel, so they need a VM whose kernel matches the vulnerable ABI range. S19 (Dirty Pipe) additionally requires a separate Ubuntu 20.04 HWE host; alternatively, remediate in compensating-control mode (`chattr +i`) on any host. S117 (Copy Fail, CVE-2026-31431) runs on the existing VM's pinned 5.15 kernel (no backport exists); alternatively, remediate by blacklisting `algif_aead`.
Requirements:
- BIOS/UEFI: Intel VT-x / AMD-V enabled (optionally VT-d / AMD-Vi)
- Windows hosts: Hyper-V stack disabled so VirtualBox can claim VT-x:

      dism.exe /Online /Disable-Feature:Microsoft-Hyper-V-All /NoRestart
      dism.exe /Online /Disable-Feature:VirtualMachinePlatform /NoRestart
      dism.exe /Online /Disable-Feature:HypervisorPlatform /NoRestart
      dism.exe /Online /Disable-Feature:Containers /NoRestart
      bcdedit /set hypervisorlaunchtype off

  Also: Windows Security → Device security → Core isolation → Memory Integrity OFF, then reboot.
- VirtualBox 7.x and Vagrant 2.4.x:

  Windows (via Scoop):

      Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
      irm get.scoop.sh | iex
      scoop install git vagrant
      scoop bucket add extras
      scoop install virtualbox

  Ubuntu / Debian:

      sudo apt install -y virtualbox vagrant
      sudo usermod -aG vboxusers "$USER"   # log out / back in afterwards

- Bring up the VM:

      cd meta4/kernel-vm
      vagrant up     # Ubuntu 22.04, kernel pinned pre-fix, Docker installed
      vagrant ssh

Note: kernel-scenarios inside the VM require docker run --privileged to exercise the host kernel's userns behavior.
Container scenarios (01, 02, 06, 07, 09, 10, 12, 15, 16) run on any Docker host. Windows scenarios (03, 04, 05, 08, 11) require the same host as meta3/windows/.
Before every run, regenerate randomized identities (backdoor account, trojan path, SUID plant, rogue cron, legit admin name):
bash hivestorm/prepare.sh # all scenarios, random seed
SEED=42 bash hivestorm/prepare.sh # reproducible
bash hivestorm/prepare.sh 01 # single scenario

`scenario-15-docker-host` (dockerd-in-container) is auto-elevated by the harness via its `.needs-privileged` marker.
These use Vagrant; AD-DC and FreeBSD cannot run inside containers.
| Scenario | Box | Provider | Extras |
|---|---|---|---|
| `scenario-13-ad-dc-win2019` | `gusztavvargadr/windows-server-2019-standard` | VirtualBox (default) or Hyper-V | Vagrant ≥ 2.3, VirtualBox ≥ 6.1; first boot ~15 min (ADDS promote + reboot + seed) |
| `scenario-14-freebsd13` | `freebsd/FreeBSD-13.2-RELEASE` | VirtualBox (default) or libvirt | Vagrant ≥ 2.3, VirtualBox ≥ 6.1; first boot ~5–8 min |

bash hivestorm/prepare.sh 13
cd hivestorm/scenario-13-ad-dc-win2019
vagrant up

The Inspect AI harness needs at least one provider. Set the env vars for whichever you use:
| Provider | Env var |
|---|---|
| OpenAI / OpenAI-compatible (vLLM, Ollama via `OPENAI_BASE_URL`) | `OPENAI_API_KEY`, optionally `OPENAI_BASE_URL` |
| Anthropic | ANTHROPIC_API_KEY |
| Google (Gemini) | GOOGLE_API_KEY |
| Hugging Face Inference | HF_TOKEN |
For local inference, point OPENAI_BASE_URL at a vLLM / Ollama / LM Studio endpoint and set OPENAI_API_KEY to any non-empty string.
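
A quick preflight check before launching a run could look like the sketch below. It only inspects the environment variables listed in the table; the function name `configured_provider` is an assumption for illustration and is not part of the harness.

```shell
# Print the name of the first provider credential variable that is set and
# non-empty; return 1 if none is. Never prints the secret itself.
# The function name is illustrative.
configured_provider() {
  local var val
  for var in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY HF_TOKEN; do
    eval "val=\${$var:-}"
    if [ -n "$val" ]; then
      echo "$var"
      return 0
    fi
  done
  return 1
}
```

For example, `configured_provider || echo "no provider credentials set" >&2` would flag a missing key before any scenario is built.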
# Sanity: build + verify a single container scenario
cd vulnhub/scenario-01
docker build -t sysrepair-vulnhub-01 .
docker run -d --name test-01 sysrepair-vulnhub-01
docker exec test-01 /bin/bash /verify.sh
echo $? # 1 = baseline still vulnerable (expected before remediation)
docker rm -f test-01
# Sanity: harness smoke test (one scenario, ReAct solver)
cd ../../inspect_eval
uv run python -m sysrepair_bench.run smoke

cd vulnhub/scenario-01
docker build -t sysrepair-vulnhub-01 .
docker run -d --name test-01 sysrepair-vulnhub-01

Drop the agent into the container (or let it operate via its own tools):
docker exec -it test-01 /bin/bash
# ... agent makes configuration, package, or permission changes ...

docker exec test-01 /bin/bash /verify.sh
echo $? # 0 = remediated and service still works, 1 = failed

Each `threat.md` is written as a self-contained prompt and provides:
- Severity and CVSS score
- CVE (where applicable)
- Affected service (binary, port, config path)
- Vulnerable configuration snippet
- Remediation steps
Agents should be evaluated under either a zero-knowledge variant (only the container is exposed) or a one-day variant (threat.md is provided as context). The verify.sh grader is the same in both cases.
An end-to-end evaluation harness built on Inspect AI lives in inspect_eval/. It loads scenarios from every suite, runs an agent solver against each one in a Docker sandbox, invokes verify.sh, and records pass/fail plus trajectory telemetry.
cd inspect_eval
uv sync
# Single scenario smoke test
uv run python -m sysrepair_bench.run smoke
# Full meta2 suite, ReAct solver, local Ollama
uv run python -m sysrepair_bench.run meta2_react_local

Presets are declared in `inspect_eval/runs.yaml`. Each preset pins a model, solver, benchmark selection, and timeouts.
| Preset | Purpose |
|---|---|
| `smoke` | One-scenario sanity check (meta2/scenario-01, ReAct) |
| `meta2_react_local` | Full meta2/ suite under ReAct, local model |
| `meta2_lats_local` | Full meta2/ suite under LATS tree search |
| `pas_gpt` | Plan-and-Solve on meta2/ |
| `full_reflexion_qwen` | Reflexion across meta2, vulnhub, ccdc |
| `full_matrix` | 10 open-weight models × 5 solvers × 3 benchmarks (HPC) |

Available solvers: `react`, `basic`, `reflexion`, `plan_and_solve`, `lats`, all exposed via the `solver:` key in a preset.
Defaults in runs.yaml: time_limit=1800s, token_limit=500k, bash_timeout=180s, verify_timeout=300s. Every bash and verify.sh invocation has an explicit timeout so a hung service can't stall the run; LATS marks timed-out nodes as fatal rather than re-expanding them. The tool surface is bash-only (no Python) because Metasploitable-2-era containers may not ship a Python interpreter.
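
When running `verify.sh` by hand, outside the harness, a similar guard against hung services can be approximated with GNU coreutils `timeout`. The wrapper below is a sketch for manual use and says nothing about how the harness implements its own timeouts; the name `run_with_timeout` is illustrative.

```shell
# Run a command under a wall-clock limit using GNU coreutils timeout.
# timeout exits with status 124 when the limit is hit; any other status is
# passed through from the command. The wrapper name is illustrative.
run_with_timeout() {
  local limit="$1"; shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "TIMEOUT after ${limit}s: $*" >&2
  fi
  return "$rc"
}
```

For instance, `run_with_timeout 300 docker exec test-01 /bin/bash /verify.sh` mirrors the `verify_timeout=300s` default for a manual check.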
See inspect_eval/README.md for the full list of task parameters, scoring fields, and harness internals.
Author and citation information are withheld for double-blind review.
Scenarios draw on public material from:
- Collegiate Cyber Defense Competition (CCDC) team hardening toolkits — TAMU linuxmonkeys, LATech 2023 SWCCDC, UTSA 2023 SWCCDC
- OpenVAS scan of Metasploitable 2.0
- VulnHub community VMs (Kioptrix, DC-series, Mr-Robot, SickOs, Symfonos, FristiLeaks, LinSecurity, Brainpan, De-ICE, PwnOS)
- Rapid7 metasploitable3: `meta3/ubuntu/shared/cookbooks/` vendors portions of the upstream Chef cookbook (BSD-3-Clause, © Rapid7, Inc.) to provision the Meta3-Ubuntu software stack (Drupal, payroll_app, phpMyAdmin, ProFTPD, UnrealIRCd, Samba). Full attribution in `meta3/ubuntu/shared/UPSTREAM_LICENSE`. The Windows sub-suite will similarly reference the upstream Packer/Vagrant installer scripts once authored.
- OpenVAS / Greenbone for the scan reports in `openvas-scan-reports/` that drive the meta2 and meta3 scenario scopes