AI Model Answer Testing and Comparison Tool

Reverse engineered prompt

Build me a Python tool that helps me test and compare AI model answers in a repeatable way. I want to give it a JSON file of prompts, run those prompts against models from OpenAI, Anthropic (Claude), and Azure OpenAI, using whichever providers I have API keys set up for, then get scored results back.
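As a rough illustration of the input side, here is a minimal sketch of loading the prompt file and working out which providers can run. The prompt schema (`id` and `prompt` fields), the provider names, and the function names are assumptions made for this example. The environment variable names follow common SDK conventions but are not confirmed by the project.

```python
import json
import os

# Assumed mapping of provider name -> environment variable holding its API key.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "azure_openai": "AZURE_OPENAI_API_KEY",
}


def load_prompts(text):
    """Parse a JSON array of prompt objects and check the assumed required fields."""
    prompts = json.loads(text)
    if not isinstance(prompts, list):
        raise ValueError("expected a JSON array of prompts")
    for item in prompts:
        if not isinstance(item, dict) or "id" not in item or "prompt" not in item:
            raise ValueError(f"prompt entry needs 'id' and 'prompt': {item!r}")
    return prompts


def available_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


# Hypothetical sample data in the assumed schema.
SAMPLE = '[{"id": "p1", "prompt": "Explain recursion in one sentence."}]'
prompts = load_prompts(SAMPLE)
providers = available_providers({"OPENAI_API_KEY": "sk-example"})
```

Passing the environment as a parameter keeps the provider check easy to test without touching real credentials.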

The scoring should feel like a clear QA rubric, with ratings for accuracy, reasoning, tone, and completeness. It should also flag common problems like bad logic, factual mistakes, missing information, poor tone, or repetitive answers. Please track token usage and estimated cost for each run so I can see what the testing costs.
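One possible shape for the rubric and cost tracking is sketched below. The 1 to 5 rating scale, the flag names, the unweighted average, and the per-1K-token price parameters are all assumptions for illustration. Real prices vary by model and should come from config rather than being hard-coded.

```python
from dataclasses import dataclass, field
from statistics import mean

# Assumed flag vocabulary, mirroring the problem types named in the prompt.
ISSUE_FLAGS = {
    "bad_logic",
    "factual_mistake",
    "missing_information",
    "poor_tone",
    "repetitive",
}

DIMENSIONS = ("accuracy", "reasoning", "tone", "completeness")


@dataclass
class RubricScore:
    """Ratings on an assumed 1-5 scale, plus any issue flags raised."""
    accuracy: int
    reasoning: int
    tone: int
    completeness: int
    flags: list = field(default_factory=list)

    def __post_init__(self):
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")
        unknown = set(self.flags) - ISSUE_FLAGS
        if unknown:
            raise ValueError(f"unknown flags: {sorted(unknown)}")

    def overall(self):
        """Unweighted mean of the four dimensions."""
        return mean(getattr(self, name) for name in DIMENSIONS)


def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Estimate run cost from token counts and caller-supplied prices."""
    return ((prompt_tokens / 1000) * input_price_per_1k
            + (completion_tokens / 1000) * output_price_per_1k)
```

Keeping the score as a small validated record makes it straightforward to serialize into the CSV and HTML reports later on.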

I’d like a simple command-line flow for running evaluations, scoring existing results, and generating reports. The reports should include local HTML dashboards, summaries, trends, and CSV outputs that are easy to share. Please include sensible config files, sample prompt data, tests, logging, and Docker support so it feels production-ready.
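The three-step command line flow could be laid out with `argparse` subcommands, roughly as follows. The program name `evalqa`, the subcommand names, and every flag shown are hypothetical choices for this sketch, not the project's actual interface.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical run, score, and report subcommands."""
    parser = argparse.ArgumentParser(
        prog="evalqa",
        description="Run, score, and report on AI model evaluations.",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run prompts against the selected providers")
    run.add_argument("--prompts", required=True, help="path to the JSON prompt file")
    run.add_argument("--providers", nargs="+", default=["openai"],
                     help="provider names to evaluate")

    score = sub.add_parser("score", help="score an existing results file")
    score.add_argument("--results", required=True, help="path to saved results")

    report = sub.add_parser("report", help="generate a report from scored results")
    report.add_argument("--results", required=True, help="path to scored results")
    report.add_argument("--format", choices=["html", "csv"], default="html",
                        help="output format for the report")

    return parser


# Example: parse a report command without reading sys.argv.
args = build_parser().parse_args(["report", "--results", "out.json", "--format", "csv"])
```

Under these assumptions, a typical session would call `run`, then `score`, then `report`, with each step reading the file the previous one wrote.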


darshil0/AI-Evaluation-QA — reverse-engineered prompt
