nutch « hadoop « Java Database Q&A

1. Number of connections to the host at the same time stackoverflow.com

How can I handle this?

2. Can't access hadoop web ui for job tracker stackoverflow.com

I'm trying to set up Hadoop and Nutch to run on EC2. To get started, I have followed the excellent NutchHadoopTutorial. Almost everything works as it should, except that ...

3. Hadoop to create an Index and Add() it to distributed SOLR... is this possible? Should I use Nutch? ..Cloudera? stackoverflow.com

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr? I have a burst of information (logfiles and documents) that will be transported over ...

4. How can I develop a web crawler using Nutch in Windows XP? stackoverflow.com

I'm totally new to Nutch. I've installed Tomcat and, using NetBeans, made a little Java project, which looks like this:

public class Main {

    /**
    ...

5. Writing MetaData inside HDFS stackoverflow.com

We are using Nutch to crawl our intranet site. We extract the metadata into an XML file during the indexing phase (we modified the code of indexer.java), and when run in local ...

6. Run Nutch on existing Hadoop cluster stackoverflow.com

We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files over HTTP into HDFS, but I couldn't get Nutch running on the cluster. I've ...

7. Increase Java heap space for language-identifier plugin in Nutch stackoverflow.com

I am trying to add a new language to Apache Tika's automatic language detection tool. Adding a new language requires building a language profile. So I am using ...

8. Setup Nutch 1.3 and Hadoop stackoverflow.com

I am a newbie to Nutch and Hadoop and am trying to follow the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial. So I started with the Nutch 1.3 release. Even though Hadoop is included in Nutch, ...

9. What does the symbol "#" mean in the following source of Nutch's HttpBase.java? stackoverflow.com

In the following source from Nutch's HttpBase.java, I don't know what the symbol "#" means in the author's description:

// get # of threads already accessing this addr
Integer ...
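
In a comment like the one quoted above, "#" is the usual shorthand for "number", so the line reads "get number of threads already accessing this addr". A minimal sketch of that kind of per-address counter follows. It is not the HttpBase.java code: the class name `ThreadsPerAddress` and its methods are hypothetical, chosen only to illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

public class ThreadsPerAddress {
    // Hypothetical map: address -> number (#) of threads currently accessing it.
    private final Map<String, Integer> threadsPerAddr = new HashMap<>();

    /** Registers one more thread for addr and returns the new count. */
    public synchronized int enter(String addr) {
        return threadsPerAddr.merge(addr, 1, Integer::sum);
    }

    /** Unregisters one thread for addr and returns the remaining count. */
    public synchronized int leave(String addr) {
        // Returning null from the remapping function removes the entry.
        Integer left = threadsPerAddr.computeIfPresent(addr, (k, n) -> n > 1 ? n - 1 : null);
        return left == null ? 0 : left;
    }

    /** Returns the # of threads already accessing addr (0 if none). */
    public synchronized int count(String addr) {
        return threadsPerAddr.getOrDefault(addr, 0);
    }

    public static void main(String[] args) {
        ThreadsPerAddress counter = new ThreadsPerAddress();
        counter.enter("10.0.0.1");
        counter.enter("10.0.0.1");
        System.out.println(counter.count("10.0.0.1"));
    }
}
```

A fetcher could compare `count(addr)` against a configured limit before opening another connection to the same host.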

10. Nutch Crawl error - Input path does not exist stackoverflow.com

I have Nutch/Hadoop with 2 datanode servers. I try to crawl some URLs, but Nutch fails with this error:

Fetcher: segment: crawl/segments
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate
    ...

11. Do the methods cancel() and interrupt() do duplicate work? stackoverflow.com

I read the source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content). Do these two method calls do the same thing?

Instruction 1:

t.interrupt();

Instruction 2:

task.cancel(true);

The source of the org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content) is:

ParseCallable pc = new ...
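
The relationship between the two calls can be checked with plain `java.util.concurrent`, independently of Nutch. The sketch below is not the ParseUtil code: the class `CancelVsInterrupt` and the sleeping task are hypothetical stand-ins. It submits a blocking task to an executor, calls `cancel(true)` on its `Future`, and records whether the worker thread received an interrupt.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelVsInterrupt {

    /** Returns true if cancel(true) both cancelled the task and interrupted its thread. */
    public static boolean workerInterruptedByCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        Future<?> task = pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stand-in for a long-running parse
            } catch (InterruptedException e) {
                interrupted.set(true); // the worker thread was interrupted
            } finally {
                finished.countDown();
            }
        });

        try {
            started.await();                      // make sure the task is running
            task.cancel(true);                    // true = may interrupt if running
            finished.await(5, TimeUnit.SECONDS);  // wait for the worker to react
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return interrupted.get() && task.isCancelled();
    }

    public static void main(String[] args) {
        System.out.println("interrupted by cancel(true): " + workerInterruptedByCancel());
    }
}
```

For a task that is already running, `cancel(true)` attempts to interrupt the thread executing it, so in that respect it overlaps with a direct `interrupt()` on the same thread. It additionally marks the `Future` as cancelled, which a bare `interrupt()` does not do.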

12. Exploring nutch over hadoop stackoverflow.com

What can I possibly do with Hadoop and Nutch used as a search engine? I know that Nutch is used to build a web crawler, but I'm not finding ...

13. Setting up Nutch 1.3 and Hadoop 0.20.2 stackoverflow.com

I have a multi-node cluster running on UEC (Ubuntu Enterprise Cloud), and I thought it would be a good idea to set up Nutch with it. However, I found this tutorial unhelpful ...