Python Tool for Generating Legal QA Training Data

Reverse-engineered prompt

Build me a Python tool for creating legal question and answer training data with help from an LLM.

I want it to take a small set of seed legal questions from a JSON file and a folder of legal reference documents, then generate new legal QA examples from that knowledge. After that, it should polish the generated data by checking the legal references and improving the reasoning paths, then run a verification step to catch wrong answers, weak logic, or mismatched citations.

Please include a simple sample knowledge base and a sample seed file so I can test it quickly. The final output should include both a normal QA dataset and an enhanced version that includes reasoning paths. Make the workflow easy to run from the command line in three steps: generate, polish, and verify.
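One possible shape for the three-step command line is sketched below using argparse subcommands. The program name, flag names, and default file names (seeds.json, knowledge_base, generated.json, polished.json, verified.json) are illustrative assumptions, not names taken from the prompt or from any existing repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with one subcommand per workflow step (hypothetical names)."""
    parser = argparse.ArgumentParser(
        prog="legal_qa",  # assumed program name
        description="Generate, polish, and verify legal QA training data.",
    )
    steps = parser.add_subparsers(dest="step", required=True)

    # Step 1: generate new QA pairs from the seeds and reference documents.
    gen = steps.add_parser("generate", help="create new QA examples")
    gen.add_argument("--seeds", default="seeds.json")
    gen.add_argument("--kb-dir", default="knowledge_base")
    gen.add_argument("--out", default="generated.json")

    # Step 2: polish references and reasoning paths.
    pol = steps.add_parser("polish", help="refine citations and reasoning")
    pol.add_argument("--in", dest="in_path", default="generated.json")
    pol.add_argument("--out", default="polished.json")

    # Step 3: verify answers, logic, and citations.
    ver = steps.add_parser("verify", help="flag wrong or weakly supported items")
    ver.add_argument("--in", dest="in_path", default="polished.json")
    ver.add_argument("--out", default="verified.json")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # A real tool would dispatch to the matching step here.
    print(f"Selected step: {args.step}")
```

With this layout, each step reads the previous step's output by default, so the three commands can be run in order without repeating file paths.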

Use Python, include a requirements file, and make it clear where I should put my DeepSeek API key. Look up current API docs online if needed.
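One common way to make the key's location obvious is to read it from an environment variable and fail with a clear message when it is missing. The sketch below assumes a variable named DEEPSEEK_API_KEY and a helper named get_api_key; both are hypothetical choices, and the code does not call the DeepSeek API itself.

```python
import os
from typing import Mapping, Optional


def get_api_key(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "DEEPSEEK_API_KEY",  # assumed variable name
) -> str:
    """Return the API key from the environment, or raise a readable error."""
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your DeepSeek API key "
            "before running the tool."
        )
    return key
```

Keeping the key out of the source files also means the sample knowledge base and seed file can be shared without exposing credentials.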


LAMDA-NeSy/Knowledge-Guide-Data-Generation — reverse-engineered prompt
