TPC DS Toolkit for Apache Impala Benchmarking

Reverse engineered prompt

Build me a practical TPC DS toolkit for Apache Impala that helps someone generate benchmark data, load it, and run the standard queries without hunting through a bunch of scattered instructions.

I want it to include query templates that stay compliant with the TPC DS rules, a set of ready to run sample queries for a large scale factor like 10 TB, and scripts for creating external text tables, creating Parquet tables, loading the Parquet data, and computing stats in Impala. The scripts should be easy to edit for the source file location, target database names, and table paths.

Please include a data generation area that wraps the official dsdgen flow for Hadoop or MapReduce style generation, plus clear README instructions for installing the basic build tools, generating flat files, loading them into Impala, and using dsqgen if someone wants to make more queries. Keep it simple, command line focused, and documented enough that a data engineer can clone it and follow the steps.

Want more depth? Deep Reverse

cloudera/impala-tpcds-kit — reverse-engineered prompt

Reverse engineered prompt