Apodex Deep Research Benchmark Harness

Reverse-engineered prompt

Build me a Python evaluation harness for Apodex 1.0 that can reproduce its public deep research benchmark runs in a standard ReAct-style setup. I want to be able to point it at an Apodex model endpoint or any OpenAI-style model API, add the needed keys to a local env file, download the benchmark datasets, and then run either a quick smoke test or a full benchmark from the command line.
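As a rough illustration of the command-line surface this paragraph asks for, here is a minimal sketch using `argparse`. The flag names, benchmark identifiers, default URL, and model name are all assumptions made for the example, not the actual harness's interface.

```python
import argparse

# Hypothetical benchmark identifiers; the real harness may name them differently.
BENCHMARKS = ["browsecomp", "browsecomp_zh", "deepsearchqa", "hle_text"]


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with a smoke/full switch (all flags are illustrative)."""
    parser = argparse.ArgumentParser(description="Run a deep research benchmark (sketch).")
    parser.add_argument("--benchmark", choices=BENCHMARKS, required=True)
    parser.add_argument("--mode", choices=["smoke", "full"], default="smoke")
    # Assumed default: any OpenAI-style endpoint could be substituted here.
    parser.add_argument("--base-url", default="http://localhost:8000/v1")
    parser.add_argument("--model", default="apodex-1.0")
    parser.add_argument("--smoke-size", type=int, default=5)
    return parser


def select_questions(questions, mode, smoke_size):
    """Return the first few questions for a smoke test, or all of them for a full run."""
    if mode == "smoke":
        return list(questions[:smoke_size])
    return list(questions)
```

With this shape, a smoke run would be started with something like `--benchmark browsecomp --mode smoke`, and a full run by switching `--mode` to `full`.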

It should support the public benchmark suite mentioned in the repo, including BrowseComp, BrowseComp-ZH, DeepSearchQA, and the text-only HLE case, where the user has to provide the licensed file manually. Please isolate the runs per question so that failed or stuck samples are easier to debug and rerun, and add a simple way to check progress and aggregate accuracy after a run.
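One possible way to get per-question isolation and a simple progress check is to write each question's outcome to its own JSON file and aggregate afterwards. The sketch below assumes that layout; the file naming, the `status` values, and the function names are hypothetical.

```python
import json
from pathlib import Path


def save_result(run_dir, question_id, status, correct=None):
    """Write one question's outcome to its own file so it can be inspected or rerun alone."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"id": question_id, "status": status, "correct": correct}
    (run_dir / f"{question_id}.json").write_text(json.dumps(record))


def summarize(run_dir, total):
    """Report progress and accuracy over the questions that finished."""
    records = [json.loads(p.read_text()) for p in sorted(Path(run_dir).glob("*.json"))]
    done = [r for r in records if r["status"] == "done"]
    failed = [r["id"] for r in records if r["status"] != "done"]
    n_correct = sum(1 for r in done if r["correct"])
    return {
        "total": total,
        "done": len(done),
        "failed": failed,                  # candidates for a rerun
        "pending": total - len(records),   # questions with no result file yet
        "accuracy": n_correct / len(done) if done else None,
    }
```

Because every question has its own file, rerunning a failed or stuck sample only means deleting that one file and running the question again; the summary picks up the new result on the next call.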

Keep the setup straightforward, document the steps clearly in the README, and wire in the web search, web fetch, and code sandbox keys from the env file. Look up current docs online if you need to.
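To make the env file wiring concrete, here is a minimal standard-library sketch that parses `KEY=VALUE` lines and reports which required keys are missing. The key names below are placeholders chosen for the example; the actual harness may expect different names for its search, fetch, and sandbox providers.

```python
# Hypothetical key names; substitute whatever the chosen providers require.
REQUIRED_KEYS = ("MODEL_API_KEY", "SEARCH_API_KEY", "FETCH_API_KEY", "SANDBOX_API_KEY")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Checking for missing keys before a run starts lets the harness fail fast with a clear message instead of partway through a benchmark.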


ApodexAI/AgentHarness — reverse-engineered prompt
