Say I have X documents: what algorithm, library, Tika config, or NekoHTML filter would tell me which of those is an "article" and which is not, and, for those that are, give me the article text ...
I have a Solr index that contains millions of textual documents, all submitted by users. Quite a lot of these documents are potential spam. In my webapp, I show a "related ...