Python SDK for Evaluating LLM Responses

Reverse-engineered prompt

Build me a Python SDK called Athina Evals for evaluating LLM responses. I want users to be able to run ready-made evals on model outputs, plug in their own custom evals, and send the results to an Athina-style dashboard where they can inspect experiments and compare datasets side by side.

Keep it simple to install and use from a notebook or a normal Python script. Include a quick-start example that shows loading a small dataset, running a suite of evaluations, viewing pass/fail results and scores, and logging everything with an API key. Make the SDK feel friendly for AI teams who want observability and experimentation without building their own evaluation system.

Please include a clean package structure, helpful error messages, basic docs, and examples for both preset evals and custom evals. If code-execution-based evals need an extra dependency, make it optional and document it clearly. Look up current docs online if you need to.
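
As a rough illustration of the workflow the prompt asks for (a small dataset, a preset eval, a custom eval, pass/fail results with scores, and a helpful error message), here is a minimal self-contained sketch. Every name in it (EvalResult, response_not_empty, make_keyword_eval, run_suite) is hypothetical and chosen for illustration only; none of it is the actual athina-evals API, and dashboard logging with an API key is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: all names below are hypothetical illustrations,
# not the real athina-evals API.


@dataclass
class EvalResult:
    name: str      # which eval produced this result
    passed: bool   # pass/fail outcome
    score: float   # numeric score in [0, 1]


Row = Dict[str, str]
EvalFn = Callable[[Row], EvalResult]


def response_not_empty(row: Row) -> EvalResult:
    """Preset-style eval: passes when the response has non-whitespace text."""
    ok = bool(row.get("response", "").strip())
    return EvalResult("response_not_empty", ok, 1.0 if ok else 0.0)


def make_keyword_eval(keyword: str) -> EvalFn:
    """Custom eval factory: passes when the response mentions `keyword`."""
    def _eval(row: Row) -> EvalResult:
        ok = keyword.lower() in row.get("response", "").lower()
        return EvalResult(f"contains_{keyword}", ok, 1.0 if ok else 0.0)
    return _eval


def run_suite(dataset: List[Row], evals: List[EvalFn]) -> List[List[EvalResult]]:
    """Run every eval on every row; fail early with a descriptive message."""
    if not dataset:
        raise ValueError(
            "Dataset is empty: provide at least one row with a 'response' field."
        )
    return [[ev(row) for ev in evals] for row in dataset]


dataset = [
    {"query": "What is the capital of France?",
     "response": "Paris is the capital of France."},
    {"query": "What is the capital of Spain?", "response": ""},
]

results = run_suite(dataset, [response_not_empty, make_keyword_eval("capital")])

for row, row_results in zip(dataset, results):
    for r in row_results:
        status = "PASS" if r.passed else "FAIL"
        print(f"{row['query']!r}: {r.name} -> {status} (score={r.score})")
```

Treating the preset and the custom eval as plain callables with the same signature is one possible design; it lets users mix both kinds in a single suite without subclassing anything.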


athina-ai/athina-evals — reverse-engineered prompt
