sixfingerdev/TurboTensors — reverse-engineered prompt
Reverse-engineered prompt
Build me a Python project called TurboTensors that can run small language models on CPU as fast as possible, especially on older or modest laptops with no GPU. I want it to feel like a lightweight inference engine for models around 50M to 300M params, with a focus on Turkish models like Kayra if that fits.
The main goal is low-latency text generation, not fancy training stuff. Make it load safetensors with minimal memory copying, keep generation fast by reusing cached attention state, and separate the prompt-processing step from the one-token-at-a-time decode step. Use JIT-compiled kernels for the hot paths so it avoids as much Python overhead as possible, and make it work well with controlled threading on weaker CPUs.
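To make the cached-attention idea in the paragraph above concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the project's actual design: the names `KVCache`, `prefill`, `decode_step`, and `attend` are hypothetical, there is a single attention head, and the query/key/value projections are replaced by the identity so the example stays short. A real engine would run the same logic inside JIT-compiled kernels over arrays.

```python
import math


def attend(q, keys, values):
    """Single-head scaled dot-product attention of one query over all cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical cache: stores the key and value vector of every position seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def prefill(prompt_vecs, cache):
    """Prompt step: store keys/values for the whole prompt in one pass,
    then compute attention only for the last prompt position."""
    for x in prompt_vecs:
        cache.append(x, x)  # assumption: identity projections, so k = v = x
    return attend(prompt_vecs[-1], cache.keys, cache.values)


def decode_step(x, cache):
    """Decode step: add one new position and attend over the cache,
    without recomputing anything for earlier tokens."""
    cache.append(x, x)
    return attend(x, cache.keys, cache.values)
```

In this sketch each decode step does work proportional to the number of cached positions, instead of reprocessing the whole sequence, which is the saving the cache is meant to provide.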
Please include a simple way to run it from the command line, generate text from a prompt, and also a benchmark mode that compares its CPU speed to a standard Hugging Face-style baseline. If anything is unclear, check current docs online and make sensible choices.
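One possible shape for the command-line interface and benchmark mode described above is sketched below, using only the standard library. Everything specific here is an assumption made for illustration: the program name `turbotensors`, the `generate` and `benchmark` subcommands, and every flag name are hypothetical, and `tokens_per_second` simply times whatever generation callable it is given.

```python
import argparse
import time


def build_parser():
    """Hypothetical CLI layout: one subcommand for generation, one for benchmarking."""
    parser = argparse.ArgumentParser(prog="turbotensors")
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("generate", help="generate text from a prompt")
    gen.add_argument("--model", required=True, help="path to a safetensors file")
    gen.add_argument("--prompt", required=True)
    gen.add_argument("--max-new-tokens", type=int, default=64)
    gen.add_argument("--threads", type=int, default=1, help="worker thread count")

    bench = sub.add_parser("benchmark", help="compare speed against a baseline")
    bench.add_argument("--model", required=True)
    bench.add_argument("--runs", type=int, default=3)
    return parser


def tokens_per_second(generate_fn, n_tokens, runs=3):
    """Call generate_fn(n_tokens) several times and report throughput for the fastest run."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(n_tokens)
        best = min(best, time.perf_counter() - start)
    return n_tokens / max(best, 1e-9)  # guard against a zero-length timing
```

A benchmark mode could call `tokens_per_second` once with the engine's own generation function and once with the baseline's, then print both numbers side by side.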