Java 17 Web Crawler Project

Reverse engineered prompt

I want this turned into a working, easy-to-run web crawler in the style of Apache Nutch. It should be a Java 17 project that I can build with the existing setup and then point at a small list of starting URLs. From there it should crawl the web, keep track of what it has seen, store fetched content in an organized way, and support running follow-up indexing jobs.
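The crawl flow described above (start from seeds, remember what has been seen, keep going until a limit is reached) can be sketched roughly as follows. This is a minimal illustration, not Nutch's implementation: the class `CrawlSketch`, the method `crawl`, and the `fetchLinks` function are hypothetical names, and `fetchLinks` stands in for the real fetch and parse steps so that the sketch needs no network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a seed-driven crawl loop; not Nutch's actual classes. */
class CrawlSketch {

    /**
     * Breadth-first crawl from the seeds, visiting each URL at most once.
     * fetchLinks is an assumed stand-in for fetch + parse: it returns a page's outlinks.
     */
    static List<String> crawl(List<String> seeds,
                              Function<String, List<String>> fetchLinks,
                              int maxPages) {
        Set<String> seen = new LinkedHashSet<>(seeds);   // every URL ever queued
        Deque<String> frontier = new ArrayDeque<>(seen); // URLs waiting to be fetched
        List<String> fetched = new ArrayList<>();        // fetch order, for inspection

        while (!frontier.isEmpty() && fetched.size() < maxPages) {
            String url = frontier.removeFirst();
            fetched.add(url);
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {        // add() returns false for already-seen URLs
                    frontier.addLast(link);
                }
            }
        }
        return fetched;
    }
}
```

Calling `CrawlSketch.crawl` with a seed list and a function backed by an in-memory map of pages to outlinks is enough to try it out. A real crawler would persist the seen set and the fetched content between runs rather than holding them in memory.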

Please make it feel practical to use, with sensible default config, clear instructions, and the important settings already called out, especially the crawler user agent and where plugins are loaded from. I also want the crawler to be extensible, so the plugin-based approach should actually work rather than behavior being hard-coded.
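As one way to picture "sensible defaults with the important settings called out", here is a small sketch that validates the two settings named above. The keys `http.agent.name` and `plugin.folders` are the property names Nutch uses in its XML configuration; everything else is an assumption for illustration: the class `CrawlerSettings` is hypothetical, `java.util.Properties` is a simplification of Nutch's own configuration loading, and the fallback value `plugins` is chosen for this sketch.

```java
import java.util.Properties;

/** Hypothetical settings holder; a simplified stand-in for Nutch's XML-based config. */
final class CrawlerSettings {
    final String agentName;      // value of http.agent.name
    final String pluginFolders;  // value of plugin.folders

    private CrawlerSettings(String agentName, String pluginFolders) {
        this.agentName = agentName;
        this.pluginFolders = pluginFolders;
    }

    /** Builds settings, failing fast when no user agent is configured. */
    static CrawlerSettings from(Properties props) {
        String agent = props.getProperty("http.agent.name", "").trim();
        if (agent.isEmpty()) {
            // No default here on purpose: the operator should identify the crawler.
            throw new IllegalArgumentException("http.agent.name must be set");
        }
        // Assumed fallback for this sketch: a "plugins" directory.
        String folders = props.getProperty("plugin.folders", "plugins").trim();
        return new CrawlerSettings(agent, folders);
    }
}
```

Failing fast on a missing user agent keeps the one setting that has no safe default from being silently skipped, while the plugin location can fall back to a conventional directory.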

If Docker support is already intended here, make that usable too. Keep the developer experience reasonable in common IDEs, but focus first on having a real crawl flow that runs end to end. If there is already an API spec or existing docs in the repo, use those instead of guessing, and look up the current project tutorial online if you need to fill in gaps.

apache/nutch — reverse-engineered prompt