Installing Nutch 1.14 on Ubuntu

This section explains how to prepare your system so that Nutch can be installed and run on it, and then walks through the installation and configuration of Nutch itself.

Installing Java

To install and run Nutch on your machine, you need to have Java installed. If you don't have it yet, open a terminal and run the following commands.

 sudo apt-get update
 sudo apt-get install openjdk-8-jdk

Wait a couple of minutes for Java to install. Next, you need to set up JAVA_HOME. Open a terminal and run the following command:

sudo gedit /etc/environment

This opens the environment file. Add the following line at the end of the file and save it.

 JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

Please note that your version of Java, and therefore this path, may be different. Make sure the path you enter matches a directory that actually exists under /usr/lib/jvm.

Finally, run the following command to make sure JAVA_HOME is set correctly. Changes to /etc/environment take effect at the next login, so log out and back in first. The command should output "/usr/lib/jvm/java-8-openjdk-amd64".

 echo $JAVA_HOME
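Because the JDK directory differs between Java versions and architectures, one way to find the right value is to resolve the java binary that is actually on your PATH. The following is a minimal sketch rather than an official procedure: java_home_from_binary is a hypothetical helper name, and the stripping rules assume the usual Ubuntu OpenJDK layout (a path ending in /bin/java, or /jre/bin/java for Java 8).

```shell
#!/bin/sh
# Sketch: derive a JAVA_HOME candidate from the resolved path of a java binary.
# java_home_from_binary is a hypothetical helper, not part of Java or Nutch.
java_home_from_binary() {
    # $1: fully resolved path to the java executable
    # Strip the trailing /bin/java, then a trailing /jre (Java 8 layout).
    printf '%s\n' "$1" | sed -e 's:/bin/java$::' -e 's:/jre$::'
}

# Resolve the java found on PATH, following any symlinks to the real file.
if command -v java >/dev/null 2>&1; then
    java_bin=$(readlink -f "$(command -v java)")
    echo "Suggested JAVA_HOME: $(java_home_from_binary "$java_bin")"
else
    echo "java not found on PATH; install the JDK first"
fi
```

The printed path is only a suggestion. Check that the directory exists before copying it into /etc/environment.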

Installing Ant

To build Nutch, you need to have Ant installed. Run the following command to install it on your machine.

 sudo apt-get install ant

To make sure Ant is installed correctly, run the following command:

 ant -version

This should print a line beginning with "Apache Ant(TM) version", followed by the version number and build date.

Verify ant installation

Downloading and Installing Nutch

Download the source code of the latest version of Nutch 1.x from this link.

Download Nutch

After the download is complete, extract the downloaded archive to any folder you wish. Now it's time to configure Nutch.
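If you prefer the terminal to a graphical archive manager, the extraction step can be sketched as follows. The archive name apache-nutch-1.14-src.tar.gz and the locations ~/Downloads and ~/nutch are assumptions, so adjust them to the file you actually downloaded and the folder you want. extract_archive is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: unpack a downloaded .tar.gz source archive into a chosen folder.
# extract_archive is a hypothetical helper; the paths below are assumptions.
extract_archive() {
    archive=$1   # path to the downloaded .tar.gz file
    dest=$2      # folder to extract into (created if missing)
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

download="$HOME/Downloads/apache-nutch-1.14-src.tar.gz"   # assumed download location
if [ -f "$download" ]; then
    extract_archive "$download" "$HOME/nutch" && echo "Extracted to $HOME/nutch"
else
    echo "Archive not found: $download"
fi
```

The extracted folder is the "Nutch directory" that the remaining steps refer to.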

Configurations

There are various configuration files in Nutch's conf directory, and we need to edit some of them before we build Nutch with Ant. When running Nutch in standalone mode, these files can be changed at any time. When building Nutch to run in a Hadoop cluster, however, they must be configured properly before the build.

 

nutch-site.xml

Go to the conf subdirectory in the Nutch directory, find the nutch-site.xml file, and open it in a text editor. Add the following lines between the <configuration> and </configuration> tags. Keep the plugin.includes value on a single line, because line breaks and spaces inside it would become part of the expression.

<property>
  <name>http.agent.name</name>
  <value>MyNutch</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyNutch,*</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
</property>

The plugin.includes property is a regular expression that names the plugins Nutch should load. The plugins listed above are shipped with Nutch. If you do not want to include a specific plugin, just remove it from the list. Any custom plugin we develop is added to this list in the same way.
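To see how the plugin.includes expression selects plugins, the check can be approximated in the shell. This sketch assumes that a plugin is loaded when its ID matches the whole expression, and it uses grep's extended regular expressions as a stand-in for Java's regex engine, so it is only an approximation for simple patterns like the one above. plugin_included is a hypothetical helper, not a Nutch command, and the shortened expression is for illustration only.

```shell
#!/bin/sh
# Sketch: test whether a plugin ID matches a plugin.includes expression.
# plugin_included is a hypothetical helper; grep -E only approximates Java regex.
plugin_included() {
    includes=$1   # value of plugin.includes
    plugin=$2     # plugin ID, e.g. parse-tika
    # -x requires the whole ID to match; -q suppresses output.
    printf '%s\n' "$plugin" | grep -Eqx -- "$includes"
}

# Shortened expression, for illustration only.
sample='protocol-http|urlfilter-regex|parse-(html|tika|metatags)|indexer-solr'
for plugin in parse-tika indexer-solr parse-pdf; do
    if plugin_included "$sample" "$plugin"; then
        echo "$plugin: included"
    else
        echo "$plugin: not included"
    fi
done
```

A check like this can help catch a typo in the list before starting a long build.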

 

regex-urlfilter.txt

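This text file tells Nutch which URLs it should fetch and add to the index and which it should skip. It defines a set of rules that are evaluated against the URLs from the crawlDB. The first rule that matches a URL decides whether it is included (+) or excluded (-), and a URL that matches no rule is ignored. Each rule has a simple format: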

[+|-][regex]

For example, suppose we want to tell Nutch not to fetch the URL http://www.fake.com/xyz. This is done by adding the following line to the regex-urlfilter.txt file.

-^http://www\.fake\.com/xyz

By default, Nutch skips URLs that end in various file extensions, which are listed in the same file. If you do not want to skip one of them, just remove it from the list. You can also add other file types to the list.

-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|tgz|mov|exe|jpeg|bmp|js)$
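The rules are easier to reason about with a small simulation. The sketch below reads the rules top to bottom, lets the first matching rule decide, and rejects URLs that match no rule. It uses grep -E as a stand-in for Java's regex engine, so Java-only syntax such as (?i) is left out. url_allowed and the sample rules are hypothetical and only illustrate the idea; Nutch does not ship such a command.

```shell
#!/bin/sh
# Sketch: simulate +/- regex rules where the first matching rule decides.
# url_allowed is a hypothetical helper; grep -E only approximates Java regex.
url_allowed() {
    file=$1        # file with one [+|-][regex] rule per line
    candidate=$2   # URL to test
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        sign=${line%"${line#?}"}      # first character: + or -
        regex=${line#?}               # rest of the line: the pattern
        if printf '%s\n' "$candidate" | grep -Eq -- "$regex"; then
            if [ "$sign" = "+" ]; then return 0; else return 1; fi
        fi
    done < "$file"
    return 1                          # no rule matched: reject
}

rules_file=$(mktemp)
cat > "$rules_file" <<'EOF'
# sample rules, for illustration only
-^http://www\.fake\.com/xyz
-\.(gif|jpg|png)$
+.
EOF

for url in http://www.fake.com/xyz http://example.com/logo.png http://example.com/page.html; do
    if url_allowed "$rules_file" "$url"; then echo "fetch $url"; else echo "skip  $url"; fi
done
rm -f "$rules_file"
```

The final "+." rule accepts everything that the earlier rules did not reject. Without it, this sketch would skip every URL.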

Now it's time to build Nutch. Open a terminal in the Nutch directory and run the following command.

 ant runtime

This may take several minutes to complete, depending on your internet speed, because the build downloads Nutch's dependencies.

Build Nutch with Ant

This creates the directory "runtime" with two subdirectories, deploy and local. The deploy directory contains the Nutch job that is used to run Nutch in a Hadoop cluster, while the local directory contains a Nutch installation for running in local (standalone) mode. For now, we will focus on running Nutch in standalone mode.
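A quick way to confirm that the build produced both layouts is to look for the two subdirectories and the nutch launcher script. This is a minimal sketch: check_runtime is a hypothetical helper, and it assumes the launcher lives at runtime/local/bin/nutch inside the Nutch directory.

```shell
#!/bin/sh
# Sketch: verify that "ant runtime" produced the expected runtime layout.
# check_runtime is a hypothetical helper, not part of Nutch.
check_runtime() {
    nutch_dir=$1   # directory where Nutch was extracted and built
    [ -d "$nutch_dir/runtime/deploy" ] || { echo "missing runtime/deploy"; return 1; }
    [ -f "$nutch_dir/runtime/local/bin/nutch" ] || { echo "missing runtime/local/bin/nutch"; return 1; }
    echo "runtime looks complete"
}

# Run from the Nutch directory; "." is the current directory.
check_runtime . || echo "run 'ant runtime' first"
```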

Next Article:  Crawling with Apache Nutch 1.14.

 

Useful Resources
  1. Web Crawling and Data Mining with Apache Nutch
  2. Solr in Action
  3. Enterprise Lucene and Solr
  4. Apache Solr: A Beginner Guide
  5. Apache Hadoop YARN
  6. Hadoop Application Architectures: Designing Real-World Big Data Applications

 
