Python Research Repository for RL Fine-Tuning

Reverse engineered prompt

Build me a clean Python research repo that teaches and runs RL fine-tuning from scratch. I want simple PyTorch implementations of DPO and GRPO that do not use TRL, VERL, or other trainer libraries that hide the important parts.

Please make it easy to read and learn from. The code should show the actual training details: prompt and response masking, KL penalties, reward or preference handling, scheduling, checkpoint saving, and basic logging. Include separate scripts to train and evaluate DPO and GRPO.

Use Llama 3.2 1B-style models as the default example. For GRPO, use GSM8K-style math evaluation. For DPO, use a small chosen-versus-rejected safety preference dataset like Tiny Safe Pair. Support multi-GPU training with PyTorch distributed, but make it clear how to run evaluation on one GPU and how to adapt training for one GPU.

Add a README with the commands, what each script does, and what results a user should roughly expect. Look up current docs online if needed.
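
To make the training details named in the prompt concrete, here is a minimal sketch of three of them: response masking, the DPO loss, and GRPO group-relative advantages. It is plain Python on scalars and lists rather than PyTorch tensors, purely for readability. The function names, the `beta=0.1` default, the `eps` constant, and the choice of population standard deviation are all assumptions made for illustration and are not taken from the repository.

```python
import math


def masked_sequence_logprob(token_logprobs, response_mask):
    """Sum per-token log-probs over response tokens only.

    response_mask holds 1 for response tokens and 0 for prompt tokens,
    so prompt tokens contribute nothing to the sequence log-prob.
    """
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta=0.1 is an illustrative default, not a recommendation.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Uses the population standard deviation (an assumption; some
    implementations use the sample standard deviation instead).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Two prompt tokens (masked out) followed by two response tokens.
    print(masked_sequence_logprob([-0.5, -0.7, -1.0, -2.0], [0, 0, 1, 1]))
    # Policy prefers the chosen response more than the reference does.
    print(dpo_loss(-10.0, -14.0, -11.0, -12.0))
    # Four sampled answers to one math problem, reward 1 if correct.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

In a tensor version the same masking would typically be a multiplication of per-token log-probs by the mask before summing over the sequence dimension, and the GRPO normalization would run per prompt group across the batch.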


mingyin0312/RLFromScratch — reverse-engineered prompt
