a1k7/deception-probe — reverse-engineered prompt



Build me a small Python tool that can run locally on a laptop and act like an early jailbreak detector for a language model.

I want it to load a small local model, use a real jailbreak dataset, train a simple probe on harmful versus benign prompts, then inspect the model’s internal hidden states while it processes a new prompt. If the prompt looks like a jailbreak or harmful request, it should stop before generating any answer and mark the decision as DENY. If it looks safe, it can allow generation.
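The probe-and-gate step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the vectors here are synthetic stand-ins, whereas a real tool would take them from the model's hidden states (for example, by requesting `output_hidden_states=True` from a Hugging Face model) for prompts drawn from a real jailbreak dataset. The function names, the 0.5 threshold, and the choice of logistic regression as the "simple probe" are all hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def gate(x, weights, bias, threshold=0.5):
    """Score one feature vector and decide before any text is generated."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    decision = "DENY" if score >= threshold else "ALLOW"
    return decision, score


# ASSUMPTION: synthetic vectors stand in for real hidden states.
# Harmful examples cluster around +1, benign examples around -1.
rng = random.Random(0)


def fake_hidden_state(harmful: bool, dim: int = 8):
    center = 1.0 if harmful else -1.0
    return [rng.gauss(center, 0.5) for _ in range(dim)]


train_y = [i % 2 for i in range(200)]  # 1 = harmful, 0 = benign
train_x = [fake_hidden_state(bool(y)) for y in train_y]
weights, bias = train_probe(train_x, train_y)

harmful_decision, harmful_score = gate([1.0] * 8, weights, bias)
benign_decision, benign_score = gate([-1.0] * 8, weights, bias)
print(harmful_decision, round(harmful_score, 3))
print(benign_decision, round(benign_score, 3))
```

In a real run, generation would only be invoked when `gate` returns ALLOW; on DENY the tool would skip the generate call entirely, so no answer tokens are produced.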

Please make it easy to run from the terminal with a requirements file and a clear README. The output should include a JSON trace that records the prompt, the deception score, whether it stopped, the reason, and the final decision, so someone can audit or replay what happened later.
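One possible shape for the audit trace is a single JSON object per prompt, written as one line so that a log can be replayed record by record. The sketch below assumes hypothetical field names (`prompt`, `deception_score`, `stopped`, `reason`, `decision`) chosen to mirror the list above, and a hypothetical threshold of 0.5; none of these are confirmed by the source.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TraceRecord:
    """One auditable decision. Field names are illustrative assumptions."""
    prompt: str
    deception_score: float
    stopped: bool
    reason: str
    decision: str


def make_trace(prompt: str, deception_score: float, threshold: float = 0.5) -> TraceRecord:
    """Build the trace for one prompt from its probe score."""
    stopped = deception_score >= threshold
    if stopped:
        reason = f"probe score {deception_score:.3f} >= threshold {threshold:.3f}"
    else:
        reason = f"probe score {deception_score:.3f} < threshold {threshold:.3f}"
    return TraceRecord(
        prompt=prompt,
        deception_score=deception_score,
        stopped=stopped,
        reason=reason,
        decision="DENY" if stopped else "ALLOW",
    )


def to_json_line(record: TraceRecord) -> str:
    """Serialize a record as one JSON line (JSON Lines style)."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_json_line(line: str) -> TraceRecord:
    """Rebuild a record from a stored line, for audit or replay."""
    return TraceRecord(**json.loads(line))


record = make_trace("Summarize this article for me.", 0.12)
line = to_json_line(record)
replayed = from_json_line(line)
print(line)
```

Appending each line to a file gives a log that can be re-read later with `from_json_line` to check that a stored score and threshold still lead to the recorded decision.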

It should work on a MacBook without cloud services. Use current Hugging Face and JailbreakBench docs online if you need to.
