Open Source AI Observability and Evaluation Platform

Reverse engineered prompt

Build me an open source AI observability and evaluation platform for teams building LLM apps. I want a web dashboard where I can see what happened inside each AI request, troubleshoot slow or bad responses, and replay calls when I’m testing changes.

It should let me collect traces from common AI apps, run evaluations on responses and retrieval results, create versioned datasets of examples, and compare experiments when I change prompts, models, or retrieval settings. Include a prompt playground where I can test prompts, compare models, adjust parameters, and save prompt versions with tags so changes are easy to track.

Make it work with popular providers and frameworks like OpenAI, Anthropic, Google, Bedrock, LangChain, LlamaIndex, Vercel AI SDK, and agent frameworks where possible. Include a simple local setup, Docker support, example apps, and clear docs so a developer can connect their app quickly. Look up the latest docs online if you need to.

arize-ai/phoenix — reverse-engineered prompt