This section covers how to prepare your system for Nutch, and then how to install and configure Nutch itself.
Nutch needs Java to install and run. If Java is not installed on your machine, open a terminal and run the following commands.
sudo apt-get update
sudo apt-get install openjdk-8-jdk
Wait a couple of minutes for Java to install. Next, you need to set up JAVA_HOME. Open a terminal and run the following command.
sudo gedit /etc/environment
This will open the environment file. Add the following line at the end of the file and save it.
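Judging by the path that the verification step expects, the line to add is presumably the one below. The path is an assumption taken from that expected output, so it may differ on your machine.

```shell
# Assumed JDK path, taken from the expected output of the JAVA_HOME check;
# replace it with the JDK directory that actually exists on your machine
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
```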
Please note that your version of Java may be different. Make sure the path you enter matches the Java installation on your machine.
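If you are not sure which path applies, a small sketch like the following lists the JDK directories that are present. The helper name list_jdks is made up for this example, and /usr/lib/jvm is assumed as the default location because that is where apt-installed JDKs usually end up.

```shell
# List the JDK directories under a base directory.
# /usr/lib/jvm is an assumed default (the usual location for apt-installed JDKs).
list_jdks() {
  base=${1:-/usr/lib/jvm}
  ls -d "$base"/*/ 2>/dev/null || echo "no JDK directories found under $base"
}

list_jdks
```

Pick the directory that matches the JDK you installed and use it, without the trailing slash, as the value of JAVA_HOME.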
Finally, run the following command to make sure Java has been installed correctly. This should output “/usr/lib/jvm/java-8-openjdk-amd64”.
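The check described above can be sketched as follows. The environment file is normally read at login, so this sketch first loads it into the current shell; that extra step is an assumption about your session and is not needed after logging out and back in.

```shell
# /etc/environment is normally read at login; load it into the current shell
# so the new value is visible without logging out and back in
if [ -r /etc/environment ]; then
  . /etc/environment
fi

# Print the value, or a hint if the variable is still empty
echo "${JAVA_HOME:-JAVA_HOME is not set}"
```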
To build Nutch, you need to have ant installed. Run the following command to install ant on your machine.
sudo apt-get install ant
To make sure ant is installed correctly, run the following command.
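The usual way to check is ant's own version flag. The sketch below wraps it in a small helper (ant_version is a name made up here) so that a missing installation produces a readable hint instead of a shell error.

```shell
# Print the Ant version, or a hint if ant is not installed
ant_version() {
  if command -v ant >/dev/null 2>&1; then
    ant -version
  else
    echo "ant not found on PATH"
  fi
}

ant_version
```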
This should print a line that starts with “Apache Ant(TM) version”, followed by the version number and the build date.
Downloading and Installing Nutch
Download the source code of the latest version of Nutch 1.x from this link.
After the download is complete, open the downloaded file and extract it to any folder you wish. Now it’s time to configure Nutch.
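If you prefer the terminal over a graphical archive tool, the extraction can be sketched as below. The helper name extract_nutch, the archive name and the target folder in the commented call are all assumptions; use the name of the file you actually downloaded.

```shell
# Extract a downloaded source archive into a target folder.
# Both arguments are placeholders chosen by the caller.
extract_nutch() {
  archive=$1
  target=$2
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive"
    return 1
  fi
  mkdir -p "$target" && tar -xzf "$archive" -C "$target"
}

# Example call (assumed file name and folder):
# extract_nutch "$HOME/Downloads/apache-nutch-1.14-src.tar.gz" "$HOME/nutch"
```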
Nutch’s conf directory contains various configuration files, which we need to set up before building Nutch with ant. In standalone mode these files can be changed at any time, but a build meant to run in a Hadoop cluster must be configured properly before it is built.
Go to the conf subdirectory in the Nutch directory, find the nutch-site.xml file, and open it in a text editor. Add the following lines between the <configuration> and </configuration> tags.
<property>
  <name>http.agent.name</name>
  <value>MyNutch</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyNutch,*</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
</property>
The plugin.includes property lists the plugins, shipped with Nutch, that should be active. To leave a plugin out, just remove it from the list. Any custom plugin we develop is added to this list the same way.
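Since the value of plugin.includes is a regular expression, it can help to see which plugin names it actually selects. The following is only a rough sketch: matching_plugins is a made-up helper, it assumes that plugin ids equal their directory names, and it uses grep's extended regex syntax as a stand-in for the matching Nutch does internally.

```shell
# Print the plugin directories whose names fully match a plugin.includes regex.
# Assumes plugin ids equal their directory names under the given directory.
matching_plugins() {
  plugin_dir=$1
  includes=$2
  ls "$plugin_dir" | grep -Ex -- "$includes"
}

# Example call from the Nutch source directory, where plugins live under src/plugin:
# matching_plugins src/plugin 'protocol-http|urlfilter-regex|parse-(html|tika|metatags)'
```

The -x flag makes grep match whole names only, so in this sketch a pattern such as protocol-http does not also select a directory named protocol-httpclient.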
The regex-urlfilter.txt file, also in the conf directory, tells Nutch which URLs it should not fetch and add to the index. It defines a set of rules that decide which URLs from the crawlDB will be fetched and indexed. It has a simple format: each rule is a single line that starts with + (include) or - (exclude), followed by a regular expression that is matched against the URL.
For example, suppose we want to tell Nutch not to fetch the URL http://www.fake.com/xyz. This is done by adding the following line to the regex-urlfilter.txt file.
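A rule for this example could look like the first line of the sample rules below. The sketch then emulates how such rules are applied, with the first matching rule deciding the outcome, which is also why an exclusion has to come before a catch-all +. line. The check_url helper is made up for illustration and uses grep's extended regex syntax, which is only an approximation of the Java regex matching Nutch uses.

```shell
# Sample rules in the +/- prefix format: the exclusion first, then a catch-all accept
rules='-^http://www\.fake\.com/xyz
+.'

# Print "accept" or "reject" according to the first rule whose pattern matches the URL
check_url() {
  url=$1
  printf '%s\n' "$rules" | while IFS= read -r rule; do
    pattern=${rule#?}            # the rule without its leading + or -
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      case $rule in
        +*) echo accept ;;
        -*) echo reject ;;
      esac
      break
    fi
  done
}

check_url 'http://www.fake.com/xyz'    # prints "reject"
check_url 'http://www.example.org/'    # prints "accept"
```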
By default, Nutch skips various file types, which are listed in the same file. If you do not want one of them to be skipped, just remove it from the list. You can also add other file types to the list.
Now it’s time to build Nutch. Open a terminal in the Nutch directory and run the following command.
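Given that the build produces a “runtime” directory, the command is presumably ant's runtime target. The sketch below wraps it in a helper (build_nutch is a made-up name) that first checks that ant is available and that build.xml is in the current directory.

```shell
# Run the build only when ant is available and build.xml is in the current directory
build_nutch() {
  if ! command -v ant >/dev/null 2>&1; then
    echo "ant not found on PATH"
    return 1
  fi
  if [ ! -f build.xml ]; then
    echo "build.xml not found; run this from the Nutch source directory"
    return 1
  fi
  ant runtime
}
```

Call build_nutch from the Nutch directory, or simply run ant runtime there directly.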
This may take several minutes to complete depending upon your internet speed.
This will create the directory “runtime” with two subdirectories, deploy and local. The deploy directory contains the “Nutch Job” that is used to run Nutch in a Hadoop cluster, while the local directory contains the Nutch installation for running in local (i.e. standalone) mode. For now, we will focus on running Nutch in standalone mode.
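To confirm that the build produced both subdirectories, a quick check like the following can be run from the Nutch directory. The helper name check_runtime is an assumption made for this sketch.

```shell
# Report whether the two expected subdirectories exist under the runtime directory
check_runtime() {
  root=${1:-runtime}
  for sub in deploy local; do
    if [ -d "$root/$sub" ]; then
      echo "$sub: present"
    else
      echo "$sub: missing"
    fi
  done
}

check_runtime
```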
Next Article: Crawling with Apache Nutch 1.14.