Running Apache Nutch 1.x on Hadoop

So far, we have run Nutch in standalone mode, without Hadoop. Although Nutch uses Hadoop's MapReduce mechanism for crawling and processing data even in standalone mode, it really shows its strength when run on a Hadoop cluster. In this section, we will learn how to run Nutch on Hadoop installed in pseudo-distributed mode. We assume you have Hadoop installed on your machine; if not, refer to this article to install and configure Hadoop in pseudo-distributed mode.

After you have Hadoop installed on your machine, start it by running the following commands one by one.

sbin/start-dfs.sh
sbin/start-yarn.sh
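To verify that the Hadoop daemons are up, you can list the running Java processes with jps (shipped with the JDK). A minimal check, assuming a standard pseudo-distributed setup:

# In pseudo-distributed mode the output should include NameNode, DataNode,
# SecondaryNameNode, ResourceManager and NodeManager (PIDs will differ)
jps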

Running Nutch on Hadoop

Before running a Nutch crawl with Hadoop, you need to add HADOOP_HOME/bin to the system PATH. Open /etc/environment in a text editor by running this command.

 sudo gedit /etc/environment

Add a colon and the full path to HADOOP_HOME/bin to the end of the PATH variable in that file.
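For reference, here is a sketch of what the PATH line in /etc/environment might look like. The Hadoop location /usr/local/hadoop is an assumption; adjust it to wherever Hadoop is installed on your machine, and note that your default PATH entries may differ.

# /etc/environment (example only)
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop/bin"

Log out and log back in for the new PATH to take effect.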

The following is the step-by-step procedure to run a Nutch crawl with Hadoop.

  • Create the resources for the crawl, i.e. the seed list. First, create the directory in HDFS into which the seed list will be copied. Run this command from the Hadoop home directory:

    hadoop fs -mkdir /user/yourUserName/urls

    Replace yourUserName with the user name you are logged in with. The command should look like this:

    hadoop fs -mkdir /user/asmat/urls

  • Now create a seed list (seed.txt) with a few URLs in it somewhere in your local file system, say in /home/asmat/dev. Run the following command to copy the file from the local file system to the newly created directory in HDFS:

    hadoop fs -copyFromLocal /home/asmat/dev/seed.txt /user/asmat/urls/seed.txt

  • Make sure the contents of seed.txt have been copied. Run the following command to print the file:

    hadoop fs -cat /user/asmat/urls/seed.txt

  • Finally, run the following command from the NUTCH_HOME/runtime/deploy directory (not from NUTCH_HOME/runtime/local) to run a crawl with Hadoop. Here -s names the seed directory in HDFS, crawl is the directory where crawl data will be stored, and 1 is the number of crawl rounds. The whole sequence is collected in the sketch after this list.

    bin/crawl -s urls crawl 1
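Putting the steps together, the following is a minimal end-to-end sketch. The user name asmat, the local directory /home/asmat/dev, and the seed URLs are example values; substitute your own.

# Create a seed list with a couple of starting URLs (examples; use your own)
mkdir -p /home/asmat/dev
echo "http://nutch.apache.org/" > /home/asmat/dev/seed.txt
echo "http://hadoop.apache.org/" >> /home/asmat/dev/seed.txt

# Create the HDFS directory and copy the seed list into it
hadoop fs -mkdir -p /user/asmat/urls
hadoop fs -copyFromLocal /home/asmat/dev/seed.txt /user/asmat/urls/seed.txt
hadoop fs -cat /user/asmat/urls/seed.txt

# Run a one-round crawl from the deploy directory
# (assumes NUTCH_HOME points to your Nutch installation)
cd $NUTCH_HOME/runtime/deploy
bin/crawl -s urls crawl 1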

You can view the running application in the Hadoop web UI by visiting localhost:8088 in your browser.

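If you prefer the command line, you can also list the applications YARN knows about with the yarn CLI:

# Show applications currently submitted to or running on YARN
yarn application -list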

Next Article: Integrating Nutch with Solr

 

