nutch « hadoop « Java Database Q&A

1. Number of connections to the host at the same time

How can I handle this?

2. Can't access Hadoop web UI for job tracker

I'm trying to set up Hadoop and Nutch to run on EC2. To get started, I have followed the excellent NutchHadoopTutorial. Almost everything works as it should, except that ...

3. Using Hadoop to create an index and Add() it to distributed Solr: is this possible? Should I use Nutch? Cloudera?

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr? I have a burst of information (logfiles and documents) that will be transported over ...

4. How can I develop a web crawler using Nutch in Windows XP?

I'm totally new to Nutch. I've installed Tomcat and, using NetBeans, I've made a little Java project, which looks like this:

public class Main {


5. Writing MetaData inside HDFS

We are using Nutch to crawl our intranet site. We are extracting the metadata into an XML file in the indexing phase (we modified the code of, and when run in local ...

6. Run Nutch on existing Hadoop cluster

We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files over HTTP into HDFS, but I couldn't get Nutch running on the cluster. I've ...

7. Increase Java heap space for language-identifier plugin in Nutch

I am trying to add a new language to Apache Tika's automatic language detection tool. Adding a new language requires building a language profile, so I am using ...

8. Setup Nutch 1.3 and Hadoop

I am a newbie to Nutch and Hadoop and am trying to follow the tutorial here at So I started with the Nutch 1.3 release. Even though Hadoop is included in Nutch, ...

9. What does the symbol "#" mean in the following Nutch source?

While reading the following Nutch source, I came across the symbol "#" in the author's description and don't know what it means:

// get # of threads already accessing this addr
Integer ...

10. Nutch Crawl error - Input path does not exist

I have Nutch/Hadoop with 2 datanode servers. I tried to crawl some URLs, but Nutch fails with this error:

Fetcher: segment: crawl/segments
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://devcluster01:9000/user/nutch/crawl/segments/crawl_generate

11. Do the methods cancel() and interrupt() do duplicate work?

I read the source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content). Do these two method calls do the same thing? Instruction 1:

Instruction 2:
The source of the org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content) is:
ParseCallable pc = new ...

12. Exploring Nutch over Hadoop

What can I possibly do with Hadoop and Nutch used as a search engine? I know that Nutch is used to build a web crawler, but I'm not finding ...

13. Setting up Nutch 1.3 and Hadoop 0.20.2

I have a multi-node cluster running on UEC (Ubuntu Enterprise Cloud), and I thought it would be a good idea to set up Nutch with it. However, I found this tutorial unhelpful ...