kohlschutter/boilerpipe — reverse-engineered prompt
Reverse-engineered prompt
I want a Java project that can take an HTML page, either from a URL or from raw HTML I already have, and give me back the main readable article text. The goal is to strip out the usual junk like navigation, ads, sidebars, headers, footers, and other boilerplate so I get clean full text from web pages.
Please make it feel like a small library I could reuse in another app, and also include a simple demo I can run locally to try it on a few pages and see the extracted text. If the page has a title, keep that too if it is easy to return. Keep the project organized and easy to build, and add a few example tests or sample inputs so I can tell it is working.
This repo looks like a work-in-progress port, so please fill in any missing pieces carefully and look up current docs online if you need to.
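To make the request above concrete, here is a minimal JDK-only sketch of the kind of library entry point being asked for: raw HTML in, title plus main text out. It is an illustration only, not boilerpipe's API or algorithm. The class name `MainTextSketch`, the `Result` type, and both thresholds are hypothetical choices for this sketch. It uses regular expressions in place of a real HTML parser, does not decode entities such as `&amp;`, and keeps a text block only if it has enough words and a low share of link text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a main-text extractor; not the boilerpipe API. */
final class MainTextSketch {

    /** Page title (empty if none was found) plus the extracted main text. */
    static final class Result {
        final String title;
        final String text;
        Result(String title, String text) { this.title = title; this.text = text; }
    }

    private static final Pattern TITLE =
        Pattern.compile("(?is)<title[^>]*>(.*?)</title>");
    // Elements that are treated as boilerplate wholesale.
    private static final Pattern NON_CONTENT = Pattern.compile(
        "(?is)<(script|style|nav|header|footer|aside)\\b[^>]*>.*?</\\1\\s*>");
    // Tags that start or end a text block.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|h[1-6]|li|ul|ol|table|tr|td|section|article|br|body|html|head)\\b[^>]*>");
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    private static final int MIN_WORDS = 10;             // assumed threshold
    private static final double MAX_LINK_DENSITY = 0.33; // assumed threshold

    static Result extract(String html) {
        Matcher t = TITLE.matcher(html);
        String title = t.find() ? clean(t.group(1)) : "";

        // Remove the title element and the obviously non-content elements.
        String body = TITLE.matcher(html).replaceAll(" ");
        body = NON_CONTENT.matcher(body).replaceAll("\n");

        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(body)) {
            String text = clean(block);
            int words = countWords(text);
            if (words < MIN_WORDS) {
                continue; // short blocks are usually menus, labels, or captions
            }
            int linkWords = 0;
            Matcher a = ANCHOR.matcher(block);
            while (a.find()) {
                linkWords += countWords(clean(a.group(1)));
            }
            // Keep the block only if most of its words are not link text.
            if ((double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return new Result(title, String.join("\n", kept));
    }

    /** Strips remaining tags and collapses whitespace. */
    private static String clean(String fragment) {
        return TAG.matcher(fragment).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    private static int countWords(String text) {
        return text.isEmpty() ? 0 : text.split(" ").length;
    }
}
```

A caller would pass the page source to `MainTextSketch.extract(html)` and read the `title` and `text` fields of the result. Fetching from a URL, a proper HTML parser, and a sturdier block classifier are left to the actual project.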