protectai/rebuff — reverse-engineered prompt
Reverse engineered prompt
Build me a self hosted prompt injection protection tool for AI apps, basically like Rebuff. I want something I can run with a small web playground plus SDKs so an app can send in user prompts and get back whether they look unsafe.
The core idea should be layered protection, not just one check. Please include simple heuristics, an LLM based detector, a way to remember past attack patterns using embeddings in a vector database, and canary words that get inserted into prompts so we can catch instruction leakage in the model output. If a canary leaks, store that as a learned attack signal for future checks.
I’d like a TypeScript and Python friendly SDK experience, with examples for detecting suspicious input and checking canary leaks. Make the server work for self hosting with common providers like OpenAI, Supabase, and either Pinecone or Chroma if that makes sense. It’s fine to treat this as a prototype and say it won’t catch everything. Look up current docs online if you need to.
Want more depth? Deep Reverse