Indirect Prompt Injection Detection Pipeline

Reverse engineered prompt

Build me a Python project that detects indirect prompt injection attacks in LLM-related content by comparing a user’s request with external text using embeddings, then training a few classifiers to decide whether the content is malicious or safe. I want it to be easy to run from start to finish: pull the dataset from Hugging Face, let me add my OpenAI key for the best embedding option, also support a couple of open-source embedding choices, generate embeddings, train the models, evaluate them, and save the results plus clear visual charts.
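The prompt does not say how the user's request and the external text should be compared once both are embedded. One plausible reading, sketched here purely as an assumption, is to combine the two embedding vectors into a single feature vector (both vectors, their element-wise absolute difference, and their cosine similarity) and feed that to the classifiers. The function names `cosine_similarity` and `pair_features` are hypothetical, and the sketch uses plain Python lists so it runs without any embedding provider.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(request_vec, content_vec):
    """Build one classifier input from a (request, external content) embedding pair.

    Assumed layout: [request | content | |request - content| | cosine similarity].
    """
    if len(request_vec) != len(content_vec):
        raise ValueError("both embeddings must have the same dimension")
    abs_diff = [abs(x - y) for x, y in zip(request_vec, content_vec)]
    similarity = cosine_similarity(request_vec, content_vec)
    return list(request_vec) + list(content_vec) + abs_diff + [similarity]


# Toy 2-dimensional "embeddings"; real ones would come from the chosen model.
features = pair_features([1.0, 0.0], [0.0, 1.0])
```

For embeddings of dimension d this yields 3d + 1 features per example. Whether the extra difference and similarity terms help over simple concatenation would need to be checked against the evaluation metrics.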

Please keep the workflow simple, with one script for generating embeddings and one script for training, evaluation, and plots. I want the outputs to include accuracy, F1, ROC AUC, and inference speed so I can compare which setup performs best. Also create the expected data, results, figures, and embeddings folders automatically and make the whole thing runnable with minimal setup. If you need details, look up the current docs online and wire it up cleanly.
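As a rough illustration of the outputs requested above, the following standard-library sketch creates the four folders and computes accuracy, F1, ROC AUC, and per-sample inference time for a binary classifier. All function names are hypothetical, the 0.5 decision threshold is an assumption, and labels are assumed to be 1 for malicious and 0 for safe. A real project would more likely use the equivalents in `sklearn.metrics`; the pure-Python versions are shown only to make the definitions explicit.

```python
import time
from pathlib import Path

OUTPUT_DIRS = ("data", "results", "figures", "embeddings")


def ensure_dirs(root="."):
    """Create the expected output folders under root if they are missing."""
    paths = [Path(root) / name for name in OUTPUT_DIRS]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 for the positive (malicious = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This is O(n_pos * n_neg), fine for a sketch only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(score_fn, inputs, y_true, threshold=0.5):
    """Time score_fn over all inputs and report the four requested metrics."""
    start = time.perf_counter()
    scores = [score_fn(x) for x in inputs]
    elapsed = time.perf_counter() - start
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return {
        "accuracy": accuracy(y_true, y_pred),
        "f1": f1(y_true, y_pred),
        "roc_auc": roc_auc(y_true, scores),
        "ms_per_sample": 1000.0 * elapsed / len(inputs),
    }


# Toy example: the "model" just returns a precomputed score for each input.
toy_scores = [0.9, 0.2, 0.4, 0.6]
toy_labels = [1, 0, 1, 0]
report = evaluate(lambda s: s, toy_scores, toy_labels)
```

Running `evaluate` once per embedding and classifier combination and collecting the returned dictionaries gives a table that can be saved under `results/` and plotted into `figures/`.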


Abu-Hussain/Embedding-Based-Detection-of-Indirect-Prompt-Injection-Attacks-in-Large-Language-Models — reverse-engineered prompt
