I want to know how I can crawl PDF files that are served on the internet over HTTP using Nutch-1.0.
I am able to do it on local file systems using file:// ...
I would like to index metadata from an RSS feed and combine it with the parsed content of the PDF file associated with each RSS item.
Does DIH support this in any way?
Or ...
I want to recrawl sites at a short interval and PDF files at a long interval, because sites can change within a few seconds. How can I do that in Nutch 1.3?
Thanks.