lexmata/llama-gguf — reverse-engineered prompt
Build me a fast, Rust-based local LLM tool called llama-gguf that feels like a clean alternative to llama.cpp. I want to be able to load GGUF models and ONNX exports, download models straight from Hugging Face, inspect model info, run a prompt from the terminal, and chat in an interactive mode. It should stream tokens as they are generated and work well on CPU by default, with optional GPU acceleration where available.
Please also include an HTTP server mode with an OpenAI-compatible API so other apps can talk to it, plus a simple client for remote inference. It should support the common model families mentioned in the docs, including newer mixture-of-experts models, and handle the quantized formats people actually use.
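For the streaming side of the server and client, here is a rough sketch of what I have in mind. It assumes the server follows the OpenAI streaming convention: server-sent events where each chunk arrives on a `data:` line and the stream ends with `data: [DONE]`. The names `StreamEvent`, `parse_sse_line`, and `collect_chunks` are hypothetical and only illustrate the shape; the JSON payload is deliberately left unparsed.

```rust
/// One parsed event from an OpenAI-style streaming response.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum StreamEvent<'a> {
    /// Raw JSON payload of one chunk; a real client would deserialize it.
    Chunk(&'a str),
    /// The `[DONE]` sentinel that marks the end of the stream.
    Done,
}

/// Parse a single line of a server-sent-events body.
/// Returns `None` for blank lines, comments, and non-`data` fields.
fn parse_sse_line(line: &str) -> Option<StreamEvent<'_>> {
    let payload = line.strip_prefix("data:")?.trim();
    if payload.is_empty() {
        return None;
    }
    if payload == "[DONE]" {
        Some(StreamEvent::Done)
    } else {
        Some(StreamEvent::Chunk(payload))
    }
}

/// Collect chunk payloads in order, stopping at the `[DONE]` sentinel.
fn collect_chunks(body: &str) -> Vec<&str> {
    let mut chunks = Vec::new();
    for line in body.lines() {
        match parse_sse_line(line) {
            Some(StreamEvent::Chunk(json)) => chunks.push(json),
            Some(StreamEvent::Done) => break,
            None => {}
        }
    }
    chunks
}
```

A real client would read these lines incrementally from the HTTP response and print each token as its chunk arrives, rather than collecting the whole body first.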
If it fits naturally, make it usable as both a CLI app and a Rust library. Optional extras, such as RAG with Postgres and pgvector or distributed inference across nodes, are welcome as long as they sit behind Cargo features and do not get in the way of the basic local experience. Look up current docs online if you need to.