karimmarwanw/DistributedLLM — reverse-engineered prompt

Build me a Python demo system that shows how lots of people can send questions to an AI model at the same time, with the work spread across several local services instead of one server doing everything.

I want a simple load balancer, two controller services, and a few worker services that can answer requests either with a fast fake response for stress testing or with a real local Ollama model. Add a small RAG service so I can upload text or PDF notes, store them locally, and have the workers use the most relevant chunks when answering.
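To make the "most relevant chunks" step concrete, here is a minimal sketch of how the RAG service could split uploaded notes and rank chunks against a question. It assumes plain word-count cosine similarity from the standard library; a real build might use embeddings instead. The names `chunk_text` and `top_chunks` and the window sizes are hypothetical, not taken from the repository.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def _tokens(text):
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_chunks(question, chunks, k=2):
    """Return up to k chunks that share vocabulary with the question."""
    query = Counter(_tokens(question))
    scored = [(_cosine(query, Counter(_tokens(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]
```

A worker could then prepend the returned chunks to the question before calling the model, and skip the context entirely when the list comes back empty.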

Please include easy scripts to start everything, stop everything, run a 1,000-user simulation, test one real Ollama question, and switch between round-robin, least-connections, and smarter load-based routing. The scripts should show basic results like success count, failures, latency, throughput, which controller and worker handled each request, and GPU usage when available. Also add demos where a worker or controller goes down and the request still gets retried somewhere else. Look up current docs online if needed.
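The three routing modes and the retry-on-failure behavior could be sketched as follows. This is an in-memory illustration only: the `Worker` fields, the `Balancer` class, the strategy names, the load-score weighting, and `fake_send` are all assumptions made for the example, and a real system would send HTTP requests to the services instead of calling a function.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    active: int = 0       # requests currently in flight
    load: float = 0.0     # assumed utilization reading in [0, 1]
    healthy: bool = True  # used only by the fake transport below


class Balancer:
    def __init__(self, workers, strategy="round_robin"):
        self.workers = list(workers)
        self.strategy = strategy
        self._cursor = 0

    def pick(self, exclude=()):
        """Choose a worker by the current strategy, skipping excluded names."""
        candidates = [w for w in self.workers if w.name not in exclude]
        if not candidates:
            raise RuntimeError("no workers left to try")
        if self.strategy == "round_robin":
            worker = candidates[self._cursor % len(candidates)]
            self._cursor += 1
            return worker
        if self.strategy == "least_connections":
            return min(candidates, key=lambda w: w.active)
        if self.strategy == "load_aware":
            # Illustrative score: utilization plus a small in-flight penalty.
            return min(candidates, key=lambda w: w.load + 0.1 * w.active)
        raise ValueError(f"unknown strategy: {self.strategy}")

    def dispatch(self, question, send, max_attempts=3):
        """Send the question, retrying on a different worker after a failure."""
        tried = []
        for _ in range(max_attempts):
            worker = self.pick(exclude=tried)
            tried.append(worker.name)
            worker.active += 1
            try:
                return worker.name, send(worker, question)
            except ConnectionError:
                continue  # this worker is down; try another one
            finally:
                worker.active -= 1
        raise RuntimeError(f"all attempts failed: {tried}")


def fake_send(worker, question):
    """Stand-in transport: fails for unhealthy workers, echoes otherwise."""
    if not worker.healthy:
        raise ConnectionError(f"{worker.name} is down")
    return f"{worker.name} answered: {question}"
```

For example, marking one worker as `healthy=False` and calling `Balancer(workers).dispatch("hello", fake_send)` returns an answer from one of the remaining workers, which mirrors the failover demo described above.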
