AI Safety Benchmark CLI Tool

Reverse engineered prompt

Build me a polished Python command line tool for benchmarking AI safety and moderation models.

I want to be able to point it at a local model, a vLLM server, OpenAI, Anthropic, or a simple custom HTTP endpoint, then run one command against built in benchmark datasets for jailbreaks, toxicity, prompt injection, general safety, and image plus text safety checks. It should work in a fast inline mode for simple runs, but also support YAML config files when I want more control. Please include curated benchmark packs like a core pack and a jailbreak pack.

Each run should save a neat output folder with the exact config used, predictions for each dataset, metrics, a summary file, and a static HTML report. I also want commands to list datasets, model backends, packs, and metrics, plus tools to inspect a run, compare two runs, and export results to CSV.

Please make it feel easy to install, easy to run, and well documented, with examples, tests, and env based API key setup. If you need details, look up current docs online.

Want more depth? Deep Reverse

Virtue-Research/guard-eval-harness — reverse-engineered prompt

Reverse engineered prompt