Python Library for Pretraining Transformer Language Models

Reverse-engineered prompt

Build me a small Python library for pretraining transformer language models that feels simple to use but can scale to serious GPU setups. I want a clean command-line flow where I can train a tiny Llama model from a config file, save checkpoints, and then run text generation from a saved checkpoint.

Please include example configs and docs for getting started, custom datasets, multi-node training with Slurm, and basic debugging. The library should support data-parallel, tensor-parallel, and pipeline-parallel training, plus examples for things like Mixture of Experts, Mamba, custom dataloaders, DoReMi, and uploading checkpoints to S3.

Make it practical for researchers, with Hugging Face login, support for the model and dataset hubs, optional Weights & Biases logging, tests, and simple scripts for training, evaluation, and generation. Keep the API understandable and easy to debug. Look up current docs online if you need to.


huggingface/nanotron — reverse-engineered prompt
