hollygrimm/voice-dataset-creation — reverse-engineered prompt

Reverse-engineered prompt


Build me a local, community-controlled toolkit for creating voice datasets for Indigenous language preservation and speech AI. I want it to guide people through community agreement and dataset governance first, including consent, ownership tiers, withdrawal rights, and clear checks for what should not be digitized or used for training.
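
To make the governance step concrete, here is a minimal sketch of how a per-recording consent record could gate training use. The tier names, field names, and the `may_train_on` helper are all hypothetical and are not taken from the repository; a real community agreement would define its own tiers and rules.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical ownership tiers; the community decides the real ones.
    COMMUNITY_ONLY = "community_only"
    ARCHIVE = "archive"
    TRAINING = "training"


@dataclass
class Recording:
    clip_id: str
    consent_given: bool
    tier: Tier
    withdrawn: bool = False   # speaker or community has withdrawn consent
    restricted: bool = False  # flagged as not to be digitized or trained on


def may_train_on(rec: Recording) -> bool:
    """Allow training only with consent, no withdrawal, no restriction,
    and an ownership tier that explicitly permits training."""
    return (
        rec.consent_given
        and not rec.withdrawn
        and not rec.restricted
        and rec.tier is Tier.TRAINING
    )
```

Checking this gate before any splitting or transcription keeps withdrawn or restricted material out of every later step.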

After that, it should support two paths: recording new speech or working from existing recordings. It needs to split long audio into short clips, check audio quality with a signal-to-noise review, transcribe clips with Whisper when that fits, use MMS for more languages, and allow manual transcription when needed. It should then let people review transcripts, attach cultural and speaker metadata, optionally augment small consented datasets, and export everything in LJSpeech format for either training or archival preservation.
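
As an illustration of the quality check and export steps, the sketch below estimates a rough signal-to-noise ratio from frame energies and writes an LJSpeech-style `metadata.csv` (pipe-separated `id|transcription|normalized transcription`, with audio expected under `wavs/`). The SNR heuristic (loudest frames versus quietest frames), the 20 dB threshold, and all function names are assumptions made for illustration, not the repository's method. Samples are assumed to be floats already decoded from the clip.

```python
import math
from pathlib import Path


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Rough SNR estimate (a heuristic, not a calibrated measurement):
    mean power of the loudest frames over mean power of the quietest."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if len(frames) < 2:
        raise ValueError("need at least two full frames")
    powers = sorted(sum(s * s for s in f) / frame_len for f in frames)
    k = max(1, int(len(powers) * noise_fraction))
    noise = sum(powers[:k]) / k
    signal = sum(powers[-k:]) / k
    eps = 1e-12  # avoids division by zero on digital silence
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def ljspeech_line(clip_id, text, normalized=None):
    """Format one metadata.csv row: id|transcription|normalized transcription."""
    normalized = text if normalized is None else normalized
    for field in (clip_id, text, normalized):
        if "|" in field or "\n" in field:
            raise ValueError(f"field may not contain '|' or a newline: {field!r}")
    return f"{clip_id}|{text}|{normalized}"


def write_metadata(clips, out_dir, min_snr_db=20.0):
    """clips: iterable of (clip_id, text, samples). Writes metadata.csv and
    returns the ids that passed the (assumed) SNR threshold.
    Audio files themselves are not copied here."""
    out = Path(out_dir)
    (out / "wavs").mkdir(parents=True, exist_ok=True)  # LJSpeech keeps audio in wavs/
    kept = []
    with open(out / "metadata.csv", "w", encoding="utf-8", newline="\n") as f:
        for clip_id, text, samples in clips:
            if estimate_snr_db(samples) < min_snr_db:
                continue  # skip clips below the threshold
            f.write(ljspeech_line(clip_id, text) + "\n")
            kept.append(clip_id)
    return kept
```

Clips that fall below the threshold are only skipped, not deleted, so a reviewer can still listen to them and decide whether they belong in an archival export.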

Please keep everything running locally, with no dependence on cloud services, and make the workflow easy to follow through notebooks and simple scripts. Look up current docs online if you need to.
