AI Model Compression Toolkit for Efficient Deployment

Reverse-engineered prompt

Build me a Python toolkit that helps people make big AI models smaller and faster without having to hand-tune every detail.

I want someone to be able to point it at a Hugging Face or ModelScope model like Qwen, DeepSeek, Hunyuan, or GLM, pick a compression method, and run a simple command to produce a compressed version. It should support common quantization options such as FP8, INT8, INT4, and NVFP4, plus very-low-bit research methods where possible. It should also include speculative decoding support, distillation workflows, benchmarking, and simple examples for language models, vision-language models, audio models, and diffusion models.

Please include clear configs, scripts, and docs so users can reproduce examples, compare memory use and speed, and understand the tradeoffs in accuracy. Make the project feel practical for researchers and developers, with sensible defaults and room to add new algorithms later. Look up current docs online if you need to.
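
To make the memory and accuracy tradeoff concrete, here is a minimal sketch of the kind of comparison I have in mind. It assumes plain symmetric per-tensor INT8 quantization on a list of random weights; the function names (quantize_int8, dequantize_int8, mean_abs_error) are hypothetical placeholders for illustration, not the API of any existing library, and a real toolkit would work on model tensors rather than Python lists.

```python
import random


def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization (illustrative helper, not a library API).

    Returns the integer codes and the single scale shared by the whole tensor.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point weights from INT8 codes."""
    return [c * scale for c in codes]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally sized weight lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Assumed toy data: 4096 weights drawn from a narrow Gaussian.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

fp16_bytes = len(weights) * 2   # 2 bytes per weight if stored as FP16
int8_bytes = len(codes) * 1     # 1 byte per weight as INT8 (the scale adds a few bytes)
error = mean_abs_error(weights, restored)

print(f"FP16 size: {fp16_bytes} bytes, INT8 size: {int8_bytes} bytes")
print(f"scale: {scale:.6g}, mean absolute error: {error:.6g}")
```

The benchmarking scripts should report the same two things at model scale: how much memory a method saves, and how much error or accuracy loss it introduces in exchange.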

tencent/AngelSlim — reverse-engineered prompt
