Lightweight Kokoro Text-to-Speech Pipeline

Reverse engineered prompt

Build me a lightweight, local text-to-speech project around Kokoro-82M. I want to paste in text, pick a matching voice and language, and generate natural-sounding speech quickly without needing a huge model. It should work well for both short lines and longer passages, split longer text into sensible chunks, and let me inspect the original text, the phonemes, and the audio result for each chunk.

Please make the main experience simple, with a clean pipeline or script I can run right away, plus a small demo so I can hear the output easily. Save the generated audio as WAV files at 24 kHz, and make sure I can adjust the speaking speed and swap voices. Support the language options mentioned in the docs, such as American English, British English, Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin, each with the right voice and language pairing.

It should be easy to set up on normal machines, including Windows and Apple Silicon if possible. Look up current docs online if you need to.
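
As a rough illustration of what the prompt asks for, here is a minimal sketch in Python. It assumes the `kokoro` and `soundfile` packages are installed and that the `KPipeline` interface and single-letter language codes match the Kokoro README at the time of writing. The helper names (`LANG_CODES`, `voice_matches_language`, `split_text`, `synthesize`), the 300-character chunk limit, and the rule that a voice name starts with its language letter are illustrative assumptions, not part of the library.

```python
import re

SAMPLE_RATE = 24_000  # the prompt asks for 24 kHz WAV output

# Single-letter language codes, assumed to match the Kokoro README.
LANG_CODES = {
    "American English": "a",
    "British English": "b",
    "Spanish": "e",
    "French": "f",
    "Hindi": "h",
    "Italian": "i",
    "Japanese": "j",
    "Brazilian Portuguese": "p",
    "Mandarin": "z",
}


def voice_matches_language(voice: str, lang_code: str) -> bool:
    """Assumption: voice names start with the language letter (e.g. 'af_heart')."""
    return voice[:1] == lang_code


def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Hypothetical chunker: pack whole sentences into chunks of about max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def synthesize(text, language="American English", voice="af_heart",
               speed=1.0, out_prefix="chunk"):
    """Generate one WAV file per chunk and print the text and phonemes for each."""
    lang_code = LANG_CODES[language]
    if not voice_matches_language(voice, lang_code):
        raise ValueError(f"voice {voice!r} does not match language {language!r}")

    # Imported lazily so the helpers above work without the heavy dependencies.
    from kokoro import KPipeline
    import soundfile as sf

    pipeline = KPipeline(lang_code=lang_code)
    # One chunk per line; the pipeline is asked to split on newlines.
    prepared = "\n".join(split_text(text))
    paths = []
    results = pipeline(prepared, voice=voice, speed=speed, split_pattern=r"\n+")
    for i, (graphemes, phonemes, audio) in enumerate(results):
        print(f"[{i}] text: {graphemes}")
        print(f"[{i}] phonemes: {phonemes}")
        path = f"{out_prefix}_{i}.wav"
        sf.write(path, audio, SAMPLE_RATE)
        paths.append(path)
    return paths
```

Calling `synthesize("Hello there. This is a short test.", voice="af_heart", speed=1.1)` would, under these assumptions, write `chunk_0.wav` and print the text and phonemes for each chunk. The voice check runs before the model is loaded, so a mismatched pairing fails fast.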

hexgrad/kokoro — reverse-engineered prompt