Python Package for Document Cleanup and AI Integration

Reverse engineered prompt

Build me a Python package that can take messy documents and turn them into clean content that’s easy to send to an AI app.

I want to give it a file path or a URL, like a PDF, webpage, Word doc, PowerPoint, notebook, GitHub repo, audio file, or video, and get back clean markdown, tables, images, and useful chunks. It should work without heavy setup for basic use, but also support richer scraping when an OpenAI compatible vision model is provided. Please make the model and client configurable so people can use OpenAI, OpenRouter, or a local server.

Include simple ways to split results by whole document, page, length, section, keyword, and optional smarter semantic splitting. Also add a helper that turns the scraped chunks into chat messages for multimodal models, with a text only option.

Please include tests, a clear README, install instructions, and small examples showing how to scrape a PDF and ask a model questions about it.

Want more depth? Deep Reverse

emcf/thepipe — reverse-engineered prompt

Reverse engineered prompt