OpenDCAI/DataFlow — reverse-engineered prompt



Build me a beginner-friendly Python tool for preparing data for AI training. I want it to take messy sources like PDFs, plain text, low-quality question-answer data, and other raw files, then help users generate, clean, evaluate, filter, and export higher-quality datasets for LLM training or RAG.

Build it around reusable operators and pipelines, so someone can chain steps together, save the workflow, rerun it later, and share it with others. Include useful built-in pipelines for common jobs like turning PDFs into QA data, cleaning text, creating synthetic text, math, or code data, and checking data quality.
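
To make the operator-and-pipeline idea concrete, here is a minimal sketch of one way it could look. Everything in it is hypothetical and for illustration only: the `OPERATORS` registry, the `register` decorator, the `strip_whitespace` and `drop_short` operators, and the `Pipeline` class are assumed names, not DataFlow's actual API. It assumes each record is a dict with a `text` field and that a saved workflow is just a JSON list of operator names.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]
Operator = Callable[[List[Record]], List[Record]]

# Hypothetical registry: maps an operator name to the function that implements it.
OPERATORS: Dict[str, Operator] = {}


def register(name: str) -> Callable[[Operator], Operator]:
    """Decorator that adds an operator to the registry under the given name."""
    def wrap(fn: Operator) -> Operator:
        OPERATORS[name] = fn
        return fn
    return wrap


@register("strip_whitespace")
def strip_whitespace(records: List[Record]) -> List[Record]:
    # Collapse runs of whitespace and trim both ends of each text field.
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


@register("drop_short")
def drop_short(records: List[Record], min_chars: int = 10) -> List[Record]:
    # Keep only records whose text has at least min_chars characters.
    return [r for r in records if len(r["text"]) >= min_chars]


@dataclass
class Pipeline:
    """An ordered list of operator names that can be saved and reloaded."""
    steps: List[str] = field(default_factory=list)

    def run(self, records: List[Record]) -> List[Record]:
        # Apply each registered operator in order.
        for name in self.steps:
            records = OPERATORS[name](records)
        return records

    def to_json(self) -> str:
        return json.dumps({"steps": self.steps})

    @classmethod
    def from_json(cls, text: str) -> "Pipeline":
        return cls(steps=json.loads(text)["steps"])


# Chain two steps, round-trip the workflow through JSON, then rerun it.
pipeline = Pipeline(["strip_whitespace", "drop_short"])
restored = Pipeline.from_json(pipeline.to_json())
cleaned = restored.run([
    {"text": "  hello    world, this is data "},
    {"text": "hi"},
])
print(cleaned)
```

Storing only operator names in the saved file keeps the workflow easy to share, at the cost of requiring the same operators to be registered wherever it is rerun; a fuller design would likely also save each operator's parameters.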

I also want a simple visual web interface where a user can build and run pipelines without writing much code, plus a command-line way to launch it. Please make it installable as a Python package, and include examples, tests, basic docs, and Docker support. Look up current docs online if you need to.
