Python Framework for Evaluating LLM Applications

Reverse-engineered prompt

Build me a simple Python framework for evaluating LLM apps, kind of like pytest but made for AI systems. I want developers to be able to write evaluation tests for chatbots, RAG pipelines, and agents, then run them locally from the command line and see clear pass or fail results with scores and explanations.

Include ready-made metrics for things like answer relevancy, hallucination, faithfulness, task completion, tool correctness, goal accuracy, conversation quality, and custom LLM-judge criteria. It should work with any model provider as long as the user supplies API keys, and some checks should be able to run locally, without a provider call, where possible.
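
To make the "score plus explanation" idea concrete, here is a minimal sketch of a metric that runs fully locally. Every name in it (`EvalCase`, `MetricResult`, `keyword_relevancy`) is hypothetical and chosen for illustration; it is not the API of any existing library, and a real answer-relevancy metric would likely use an LLM judge or embeddings rather than word overlap.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One evaluation example (hypothetical name, not a real library class)."""
    input: str
    actual_output: str


@dataclass
class MetricResult:
    """What a metric reports back: a score, a pass/fail verdict, and a reason."""
    score: float
    passed: bool
    reason: str


def keyword_relevancy(case: EvalCase, threshold: float = 0.5) -> MetricResult:
    """Toy local metric: fraction of distinct input words that reappear in the output."""
    question_words = {w.strip("?.,!").lower() for w in case.input.split()}
    answer_words = {w.strip("?.,!").lower() for w in case.actual_output.split()}
    question_words.discard("")  # drop tokens that were pure punctuation
    if not question_words:
        return MetricResult(0.0, False, "input has no words to compare")
    overlap = question_words & answer_words
    score = len(overlap) / len(question_words)
    reason = f"{len(overlap)} of {len(question_words)} input words appear in the output"
    return MetricResult(score, score >= threshold, reason)
```

A test would then build an `EvalCase` from the app's real output, call the metric, and assert on `passed`, printing `score` and `reason` when the assertion fails.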

Make it easy to create test cases, group them into datasets, compare prompt or model changes, and generate readable reports. Add examples for OpenAI-style apps, LangChain-style apps, agents, and RAG. Please include good docs, a quickstart, tests for the framework itself, and a sensible project setup so someone can install it and start evaluating their LLM app quickly. Look up current docs online if you need to.
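
As a rough illustration of grouping cases into a dataset and comparing two prompt or model variants, the sketch below runs each variant over the same examples and prints a one-line report per variant. The names `Example`, `pass_rate`, and `compare` are assumptions made up for this sketch, and the substring check stands in for whatever metric the framework would really apply.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """A dataset row: the prompt sent to the app and text the answer should contain."""
    prompt: str
    expected: str


def pass_rate(app: Callable[[str], str], dataset: list[Example]) -> float:
    """Fraction of examples whose output contains the expected text (case-insensitive)."""
    if not dataset:
        return 0.0
    hits = sum(ex.expected.lower() in app(ex.prompt).lower() for ex in dataset)
    return hits / len(dataset)


def compare(variants: dict[str, Callable[[str], str]], dataset: list[Example]) -> str:
    """Render one report line per variant, best pass rate first."""
    rates = {name: pass_rate(app, dataset) for name, app in variants.items()}
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"{name}: {rate:.0%} passed" for name, rate in ranked)
```

Here each variant is any callable from prompt to answer, so an OpenAI-style client call, a LangChain-style chain, or an agent loop could all be wrapped in a small function and passed in under a label.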

confident-ai/deepeval — reverse-engineered prompt