Vision Language Model Annotation Tool

Reverse engineered prompt

Build me a SketchVLM-style tool that lets a vision-language model answer questions about images by drawing on top of them instead of only writing text.

I want to give it a folder of images and prompts, choose Claude, GPT, Gemini, or OpenRouter as the provider, and have it read the API keys from a local env file. For each image it should overlay a coordinate grid, ask the model what to draw, then render clean SVG-style strokes back onto the image without modifying the original file. Save the final annotated images and the model responses in a results folder so I can inspect them later.
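To make the grid and stroke steps concrete, here is a minimal sketch using only the Python standard library. It assumes the model replies with a JSON list of strokes, each with a "points" array and optional "color" and "width" keys. That response format and the function names (grid_elements, stroke_elements, overlay_svg) are illustrative assumptions, not something the prompt specifies. The overlay references the source image by path, so the original file is never rewritten.

```python
import json
from xml.sax.saxutils import quoteattr


def grid_elements(width, height, step=100):
    """SVG lines and numeric labels for a coordinate grid (pixel units)."""
    parts = []
    for x in range(0, width + 1, step):
        parts.append(f'<line x1="{x}" y1="0" x2="{x}" y2="{height}" stroke="gray"/>')
        parts.append(f'<text x="{x + 2}" y="12" font-size="10">{x}</text>')
    for y in range(0, height + 1, step):
        parts.append(f'<line x1="0" y1="{y}" x2="{width}" y2="{y}" stroke="gray"/>')
        parts.append(f'<text x="2" y="{y - 2}" font-size="10">{y}</text>')
    return parts


def stroke_elements(response_text):
    """Turn an (assumed) JSON stroke list from the model into SVG polylines."""
    parts = []
    for stroke in json.loads(response_text):
        points = " ".join(f"{x},{y}" for x, y in stroke["points"])
        # quoteattr escapes and quotes model-supplied text for use as an attribute.
        color = quoteattr(stroke.get("color", "red"))
        width = float(stroke.get("width", 3))
        parts.append(
            f'<polyline points="{points}" fill="none" '
            f'stroke={color} stroke-width="{width:g}"/>'
        )
    return parts


def overlay_svg(image_href, width, height, elements):
    """Wrap a reference to the source image plus drawn elements in one SVG."""
    body = "\n".join(elements)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">\n'
        f'<image href={quoteattr(image_href)} width="{width}" height="{height}"/>\n'
        f'{body}\n</svg>'
    )
```

In this sketch, the gridded view sent to the model would be overlay_svg(path, w, h, grid_elements(w, h)), and the final annotation would be overlay_svg(path, w, h, stroke_elements(reply)) written into the results folder next to the saved reply. Rasterizing the SVG to PNG is left out and would need an extra library.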

Please include a simple way to run common tasks like maze solving, ball-drop prediction, object counting, labeling parts, connecting dots, and drawing shapes around objects. Also add an optional step-by-step mode where the model draws one stroke at a time and sees the updated image before continuing.
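The step-by-step mode can be pictured as a small loop. The sketch below is a hypothetical outline, not the tool's actual code: ask_model stands in for whatever provider call is used, receives the strokes drawn so far (a real tool would re-render the image with them before each call), and returns the next stroke or None when it is finished. The max_steps cap is an added safety assumption.

```python
from typing import Callable, Dict, List, Optional

Stroke = Dict[str, object]


def run_step_mode(
    ask_model: Callable[[List[Stroke]], Optional[Stroke]],
    max_steps: int = 20,
) -> List[Stroke]:
    """Collect strokes one at a time until the model stops or the cap is hit."""
    strokes: List[Stroke] = []
    for _ in range(max_steps):
        # Pass a copy so the callback cannot mutate the running history.
        next_stroke = ask_model(list(strokes))
        if next_stroke is None:
            break  # the model signals that the drawing is complete
        strokes.append(next_stroke)
    return strokes


if __name__ == "__main__":
    # Stand-in model: draws two fixed strokes, then stops.
    plan = [{"points": [[0, 0], [50, 50]]}, {"points": [[50, 50], [100, 0]]}]

    def fake_model(drawn: List[Stroke]) -> Optional[Stroke]:
        return plan[len(drawn)] if len(drawn) < len(plan) else None

    print(len(run_step_mode(fake_model)))
```

Passing the model call in as a function keeps the loop independent of any one provider and makes it easy to test with a stub like fake_model.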

Make the setup easy, document the commands, and look up the current API docs online if you need to.

Brandon-Collins7/sketchvlm — reverse-engineered prompt