karpathy/rustbpe — reverse-engineered prompt
Reverse-engineered prompt
Build me a small, fast tokenizer training library in Rust with Python bindings.
I want it to train a GPT-style BPE tokenizer from an iterator of text, then let me encode and decode text with the trained tokenizer. It should default to a GPT-4-like regex pre-tokenization pattern, but also let me pass in my own regex if I want. I also want a batch encode method that automatically uses parallel processing, so encoding lots of texts is fast.
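To make the request above concrete, here is a minimal pure-Python sketch of byte-level BPE training with regex pre-tokenization, plus encode and decode. It is an illustration under stated assumptions, not the library's implementation: the names `train_bpe`, `merge_pair`, `encode`, `decode`, and `SIMPLE_PATTERN` are hypothetical, and the pattern is a simplified stand-in for a GPT-4-like pattern (which relies on Unicode classes that the standard `re` module does not support).

```python
import re
from collections import Counter

# Assumption: simplified stand-in for a GPT-4-like split pattern.
# It still covers every character, so encode/decode round-trips.
SIMPLE_PATTERN = r"\s*\w+|\s*[^\w\s]+|\s+"


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return tuple(out)


def train_bpe(texts, vocab_size, pattern=SIMPLE_PATTERN):
    """Learn merges {(left_id, right_id): new_id} from an iterator of text."""
    assert vocab_size >= 256, "ids 0..255 are reserved for raw bytes"
    # Count each distinct pre-token chunk, stored as a tuple of byte values.
    chunks = Counter()
    for text in texts:
        for piece in re.findall(pattern, text):
            chunks[tuple(piece.encode("utf-8"))] += 1
    merges = {}
    for new_id in range(256, vocab_size):
        pair_counts = Counter()
        for ids, n in chunks.items():
            for pair in zip(ids, ids[1:]):
                pair_counts[pair] += n
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges[best] = new_id
        merged = Counter()
        for ids, n in chunks.items():
            merged[merge_pair(ids, best, new_id)] += n
        chunks = merged
    return merges


def encode(text, merges, pattern=SIMPLE_PATTERN):
    """Encode text by applying learned merges, earliest-learned first, per chunk."""
    out = []
    for piece in re.findall(pattern, text):
        ids = tuple(piece.encode("utf-8"))
        while len(ids) >= 2:
            pair = min(set(zip(ids, ids[1:])),
                       key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break
            ids = merge_pair(ids, pair, merges[pair])
        out.extend(ids)
    return out


def decode(ids, merges):
    """Rebuild the byte string for each id and decode it as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train_bpe(["hello hello world", "hello there"], vocab_size=260)
ids = encode("hello world", merges)
assert decode(ids, merges) == "hello world"
```

This sketch recounts all pairs on every merge, which is simple but slow; a fast implementation would update counts incrementally and encode batches in parallel.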
The main point of this project is training, then exporting the learned tokenizer in a format I can plug into tiktoken for inference. So please include a clean way to get the regex pattern and the mergeable ranks needed for that export. It should feel lightweight and simple, not like a huge tokenizer framework.
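As a rough sketch of what the export could look like: tiktoken represents a BPE vocabulary as a dictionary from token bytes to integer rank, alongside the split pattern string. The helper below, `build_mergeable_ranks`, is a hypothetical name, and the `{(left_id, right_id): new_id}` merge table layout is an assumption about how the trained merges are stored.

```python
def build_mergeable_ranks(merges):
    """Turn {(left_id, right_id): new_id} into tiktoken-style {token_bytes: rank}.

    Assumes ids 0..255 are the raw bytes and each merge refers only to
    ids created before it.
    """
    token_bytes = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        token_bytes[new_id] = token_bytes[left] + token_bytes[right]
    ranks = {b: rank for rank, b in token_bytes.items()}
    if len(ranks) != len(token_bytes):
        # Two different merges produced the same bytes; the mapping would be lossy.
        raise ValueError("duplicate token bytes in merge table")
    return ranks


# Toy merge table: "h"+"e" -> 256, then "he"+"l" -> 257.
ranks = build_mergeable_ranks({(104, 101): 256, (256, 108): 257})

# With tiktoken installed, the ranks and pattern can be passed to its
# Encoding constructor (not run here; `pattern` is the trained split regex):
#   import tiktoken
#   enc = tiktoken.Encoding(name="my_bpe", pat_str=pattern,
#                           mergeable_ranks=ranks, special_tokens={})
```

The rank of a token doubles as its merge priority in tiktoken, which is why the ids are assigned in training order.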
Please make it usable from Python, include basic tests for both the Rust and Python sides, and add a short example showing training, encoding and decoding, batch encoding, and tiktoken export. Look up current docs online if you need to.