LLM Serving Engine with OpenAI-Compatible API

Reverse engineered prompt

Build me an easy to use LLM serving engine that can run popular Hugging Face models and expose them through an OpenAI compatible API. I want it to focus on fast, cheap inference, with smart memory handling so it can serve lots of requests without wasting GPU memory.

It should support streaming responses, continuous batching, prefix caching, common decoding options like sampling and beam search, and a simple way to load different model types including text, multimodal, embedding, reward, and classification models where possible. Please include support hooks for quantized models and distributed serving, but keep the first version practical and clean.

Add clear install instructions, quickstart examples, basic tests, benchmark scripts, and documentation so someone can run a local model, start the API server, and send chat style requests right away. Look up current docs online if you need to match real model serving behavior.

Want more depth? Deep Reverse

MdSufiyan005/vllm — reverse-engineered prompt

Reverse engineered prompt