HTML Main Content Extraction Tool

Reverse-engineered prompt

Build me a Python tool that takes messy webpage HTML and extracts only the real main content, returning a clean HTML body with nav bars, ads, sidebars, footers, and extra metadata stripped out. I want it to feel like a practical extractor for research and RAG-style workflows, not just a scraper.

Please make it usable as a small library with a simple API that accepts either one HTML string or a batch of pages. It should support local model-based extraction with a good default setup, and also allow a remote OpenAI-compatible backend when local inference is not available. Include a fallback mode so extraction still returns something sensible if the model fails: run a classic extractor, return the raw HTML, or return empty output.
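
To make the fallback requirement concrete, here is a minimal sketch using only the standard library. Every name in it (`Fallback`, `classic_extract`, `extract`, `extract_batch`) is hypothetical and chosen for illustration; it is not the API of any existing project, and the regex-based `classic_extract` is only a crude stand-in for a real classic extractor.

```python
import re
from enum import Enum
from typing import Callable, Iterable, List


class Fallback(Enum):
    """Hypothetical fallback modes mirroring the three options in the prompt."""
    CLASSIC = "classic"  # run a non-model extractor
    RAW = "raw"          # return the input HTML unchanged
    EMPTY = "empty"      # return an empty string


# Elements that are almost never main content (assumption for this sketch).
_BOILERPLATE = re.compile(
    r"<(nav|header|footer|aside|script|style)\b.*?</\1>",
    re.DOTALL | re.IGNORECASE,
)


def classic_extract(html: str) -> str:
    """Crude stand-in for a classic extractor: drop obvious boilerplate tags."""
    return _BOILERPLATE.sub("", html).strip()


def extract(
    html: str,
    model_extract: Callable[[str], str],
    fallback: Fallback = Fallback.CLASSIC,
) -> str:
    """Try the model first; on an error or empty output, apply the fallback."""
    try:
        result = model_extract(html)
        if result and result.strip():
            return result
    except Exception:
        pass  # any model failure falls through to the fallback below
    if fallback is Fallback.CLASSIC:
        return classic_extract(html)
    if fallback is Fallback.RAW:
        return html
    return ""


def extract_batch(
    pages: Iterable[str],
    model_extract: Callable[[str], str],
    fallback: Fallback = Fallback.CLASSIC,
) -> List[str]:
    """Apply extract() to each page, preserving input order."""
    return [extract(page, model_extract, fallback) for page in pages]
```

Passing the model call in as a plain callable keeps the local and remote backends interchangeable: either one only has to map an HTML string to an HTML string.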

The pipeline should handle HTML cleanup, prompting, inference, parsing, and final extraction end to end. Add basic tests, sensible config options, and a quick example in the README. If anything is unclear, look up the current docs online and make reasonable choices.
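
The five stages named above could be wired together roughly as follows. This is a sketch under stated assumptions, not a description of how any particular project does it: the block-labelling approach, the prompt wording, the `index: label` response format, and all function names are hypothetical, and the regex splitter only copes with flat, non-nested markup.

```python
import re
from typing import Callable, Dict, List

# Block-level tags the toy splitter recognises (assumption for this sketch).
_BLOCK = re.compile(
    r"<(p|div|nav|header|footer|aside|section|article|h[1-6])\b.*?</\1>",
    re.DOTALL | re.IGNORECASE,
)


def clean_html(html: str) -> str:
    """Stage 1, cleanup: drop comments, scripts, and styles."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(
        r"<(script|style)\b.*?</\1>", "", html,
        flags=re.DOTALL | re.IGNORECASE,
    )


def split_blocks(html: str) -> List[str]:
    """Split cleaned HTML into candidate blocks (flat markup only)."""
    return [m.group(0) for m in _BLOCK.finditer(html)]


def build_prompt(blocks: List[str]) -> str:
    """Stage 2, prompting: number each block and ask for a label per block."""
    lines = [f"{i}: {block}" for i, block in enumerate(blocks)]
    header = "Label each block as 'main' or 'other', one 'index: label' per line."
    return header + "\n" + "\n".join(lines)


def parse_labels(response: str) -> Dict[int, str]:
    """Stage 4, parsing: read 'index: label' pairs out of the model reply."""
    pairs = re.findall(r"(\d+)\s*:\s*(main|other)", response)
    return {int(index): label for index, label in pairs}


def run_pipeline(html: str, infer: Callable[[str], str]) -> str:
    """Run cleanup, prompting, inference, parsing, and final extraction."""
    blocks = split_blocks(clean_html(html))
    labels = parse_labels(infer(build_prompt(blocks)))  # stage 3 is infer()
    # Stage 5, final extraction: keep only the blocks labelled 'main'.
    return "\n".join(b for i, b in enumerate(blocks) if labels.get(i) == "main")
```

Blocks the model does not mention are dropped here; a real tool might instead treat a missing or malformed label as a failure and hand the page to its fallback mode.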

opendatalab/MinerU-HTML — reverse-engineered prompt