karpathy/rustbpe — reverse-engineered prompt

Reverse-engineered prompt


Build me a small, fast tokenizer training library in Rust with Python bindings.

I want it to train a GPT-style BPE tokenizer from an iterator of text, then let me encode and decode text with the trained tokenizer. It should default to a GPT-4-like regex pre-tokenization pattern, but also let me pass in my own regex if I want. I also want a batch encode method that runs in parallel automatically, so encoding lots of texts is fast.
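To make the request concrete, here is a minimal pure-Python sketch of the behavior I have in mind: regex pre-tokenization, byte-level BPE training from an iterator, then encode and decode. It is only an illustration, not the library's implementation. The names `train_bpe`, `merge_pair`, `encode`, and `decode` are hypothetical, and `SPLIT_PATTERN` is a simplified stand-in for a GPT-4-like pattern, since the standard-library `re` module lacks the Unicode property classes such a pattern normally uses.

```python
import re
from collections import Counter

# Assumption: simplified stand-in for a GPT-4-like split pattern.
SPLIT_PATTERN = r"\s?\w+|\s?[^\w\s]+|\s+"


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return tuple(out)


def train_bpe(texts, vocab_size, pattern=SPLIT_PATTERN):
    """Learn merges from an iterator of strings. Returns {(left, right): new_id}."""
    if vocab_size < 256:
        raise ValueError("vocab_size must be at least 256")
    splitter = re.compile(pattern)
    # Count each distinct pre-tokenized chunk once, keyed by its UTF-8 bytes.
    chunks = Counter()
    for text in texts:
        for piece in splitter.findall(text):
            chunks[tuple(piece.encode("utf-8"))] += 1

    merges = {}
    for new_id in range(256, vocab_size):
        # Count adjacent pairs, weighted by how often each chunk occurs.
        pairs = Counter()
        for ids, count in chunks.items():
            for pair in zip(ids, ids[1:]):
                pairs[pair] += count
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair; ties go to the smallest pair for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges[best] = new_id
        merged = Counter()
        for ids, count in chunks.items():
            merged[merge_pair(ids, best, new_id)] += count
        chunks = merged
    return merges


def encode(text, merges, pattern=SPLIT_PATTERN):
    """Split text with the regex, then apply learned merges in training order."""
    tokens = []
    for piece in re.findall(pattern, text):
        ids = tuple(piece.encode("utf-8"))
        while len(ids) >= 2:
            candidates = [p for p in zip(ids, ids[1:]) if p in merges]
            if not candidates:
                break
            earliest = min(candidates, key=merges.get)
            ids = merge_pair(ids, earliest, merges[earliest])
        tokens.extend(ids)
    return tokens


def decode(ids, merges):
    """Map token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

For example, `merges = train_bpe(iter(["hello hello world", "hello there"]), vocab_size=260)` followed by `decode(encode("hello world", merges), merges)` should round-trip the input. The real library should do the same work in Rust, and the batch encode method should apply `encode` across many texts in parallel.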

The main point of this project is training, then exporting the learned tokenizer in a format I can plug into tiktoken for inference. So please include a clean way to get the regex pattern and the mergeable ranks needed for that export. It should feel lightweight and simple, not like a huge tokenizer framework.
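As a rough sketch of the export I mean: tiktoken represents a vocabulary as a mapping from token bytes to integer rank, so the learned merges need to be expanded into that form. The helper `build_mergeable_ranks` and the `example_merges` list below are hypothetical and purely illustrative (the merges are hand-written, not output from a real training run), and the pattern string is the same simplified stand-in rather than the real default.

```python
def build_mergeable_ranks(merges):
    """Expand merges into a {token_bytes: rank} mapping.

    `merges` is a list of ((left_id, right_id), new_id) in the order learned.
    """
    token_bytes = {i: bytes([i]) for i in range(256)}  # base byte vocabulary
    for (left, right), new_id in merges:
        token_bytes[new_id] = token_bytes[left] + token_bytes[right]
    return {b: rank for rank, b in token_bytes.items()}


# Illustrative, hand-written merges: "h"+"e", "l"+"l", then "he"+"ll".
example_merges = [((104, 101), 256), ((108, 108), 257), ((256, 257), 258)]
ranks = build_mergeable_ranks(example_merges)

try:
    import tiktoken  # optional; only needed for the inference side
except ImportError:
    tiktoken = None

if tiktoken is not None:
    enc = tiktoken.Encoding(
        name="custom_bpe",                  # hypothetical name
        pat_str=r"\s?\w+|\s?[^\w\s]+|\s+",  # assumption: simplified pattern
        mergeable_ranks=ranks,
        special_tokens={},
    )
    print(enc.encode("hell"))
```

The library only needs to hand back the pattern string and the ranks mapping. Building the `tiktoken.Encoding` object can stay on the caller's side.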

Please make it usable from Python, include basic tests for both the Rust side and Python side, and add a short example showing training, encode and decode, batch encode, and tiktoken export. Look up current docs online if you need to.
