Nutch Plugins – Introduction and Development

The Nutch plugin system lets us customize Nutch to our needs in a flexible and maintainable way. To get real benefit from crawling with Nutch, we need to be able to write custom plugins for tasks such as parsing, indexing and searching. All of the parsing, indexing and searching that Nutch does is performed by plugins; even the extension points that we implement to develop plugins are themselves defined in a plugin.

Nutch's Plugin System

Nutch’s Extension Points

Let’s have a look at the extension points that Nutch provides for us to implement.

  1. Parser: We implement this extension point to read through the fetched documents and extract data from them. It lets us parse specific content types and pull data out of any parseable content.
  2. HtmlParseFilter: This is the extension point for DOM-based HTML parsers. It allows us to add more metadata to HTML parses.
  3. IndexWriter: This extension point allows us to write the crawled data to an indexing backend, most commonly Solr or Elasticsearch.
  4. IndexingFilter: This extension point allows us to run custom analysis on the parsed web pages and add metadata to the indexed fields.
  5. ScoringFilter: This extension point defines the behavior of scoring plugins, which assign scores to pages during a crawl. The idea is similar to Google’s PageRank.
  6. URLFilter: We implement this extension point to write custom rules for filtering URLs beyond those Nutch already defines, i.e. to limit the URLs that Nutch will attempt to fetch.
  7. URLNormalizer: This extension point is used to normalize URLs and perform substitutions on them.
  8. Protocol: Implementing this interface allows Nutch to use different protocols to fetch documents.
  9. SegmentMergeFilter: This extension point is used to filter segments during the segment merging process. It allows filtering based on the metadata that Nutch collects while parsing pages.
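
To make the idea of an extension point more concrete, here is a minimal, self-contained sketch of how a chain of URL filters behaves. The interface and both filter classes are hypothetical stand-ins written for this illustration, not Nutch's real classes. They follow the convention used by Nutch's URLFilter, where a filter returns the URL if it is accepted and null if it should be dropped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterChainSketch {

    /** Simplified stand-in for a URL-filter extension point (not Nutch's real interface). */
    interface UrlFilter {
        /** Returns the URL if it is accepted, or null if it should be dropped. */
        String filter(String url);
    }

    /** Hypothetical filter that drops URLs ending in common image extensions. */
    static final class SkipImagesFilter implements UrlFilter {
        private static final Pattern IMAGE =
            Pattern.compile(".*\\.(gif|jpe?g|png)$", Pattern.CASE_INSENSITIVE);

        @Override
        public String filter(String url) {
            return IMAGE.matcher(url).matches() ? null : url;
        }
    }

    /** Hypothetical filter that keeps only URLs starting with a given prefix. */
    static final class PrefixFilter implements UrlFilter {
        private final String prefix;

        PrefixFilter(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String filter(String url) {
            return url.startsWith(prefix) ? url : null;
        }
    }

    /** Runs the URL through every filter in order and stops at the first rejection. */
    static String apply(List<UrlFilter> filters, String url) {
        String current = url;
        for (UrlFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilter> chain = Arrays.asList(
            new SkipImagesFilter(), new PrefixFilter("https://example.org/"));
        System.out.println(apply(chain, "https://example.org/docs/index.html"));
        System.out.println(apply(chain, "https://example.org/logo.PNG"));
    }
}
```

The first URL passes both filters and is printed unchanged; the second is rejected by the image filter, so null is printed. A real plugin would implement Nutch's own interface instead of the stand-in used here.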

Developing Nutch Plugins

One way to develop a Nutch plugin is to create a new package in the Nutch/trunk project loaded into Eclipse (we covered this in detail in earlier sections). This gives us an environment in which we can write and debug code. Actually running the plugin requires adding it to Nutch and rebuilding Nutch with the proper configuration, which we discuss in detail below.

Nutch Plugin Example

Imagine a use case in which we want to add a new field to the index that contains the length of a document's text.

Files

A Nutch plugin is not just a single Java file but a collection of files: the Java class that implements a specific Nutch extension point, plus some XML files. If we name our plugin textLength, the directory structure and files would look something like this.

textLength/
      plugin.xml
      build.xml
      ivy.xml
      src/java/org/apache/nutch/indexer/
                                AddNewField.java

AddNewField.java

This class implements the IndexingFilter interface, which is one of Nutch’s extension points. The class takes a parsed page and finds the length of the text on the page. This length is added as a field to the NutchDocument so that it is indexed later.

package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;

public class AddNewField implements IndexingFilter
{
    private static final Log log = LogFactory.getLog(AddNewField.class);
    private Configuration conf;

    // Called for each parsed page before it is sent to the index
    public NutchDocument filter(NutchDocument nutchDoc, Parse parse,
                                Text url, CrawlDatum datum, Inlinks inLinks)
    {
        // Get the text of the document and determine its length
        int textContentLength = parse.getText().length();
        // Add the new field; the name must match the "textLength" field
        // declared in schema.xml and solrindex-mapping.xml
        nutchDoc.add("textLength", textContentLength);
        return nutchDoc;
    }

    // Nutch passes its configuration to the plugin through these two methods
    public Configuration getConf()
    {
        return conf;
    }

    public void setConf(Configuration conf)
    {
        this.conf = conf;
    }
}
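
The length calculation itself can be tried out without a Nutch build. The following is a minimal, self-contained sketch under a few assumptions: the Doc class is a hypothetical stand-in for NutchDocument, the field is named textLength to match the Solr schema entry used later in this article, and the null check is an extra safeguard that the plugin above does not have.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TextLengthSketch {

    /** Hypothetical stand-in for an index document: field name -> list of values. */
    static final class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        void add(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Adds a textLength field; missing text is treated as length 0. */
    static Doc addTextLength(Doc doc, String parsedText) {
        int length = (parsedText == null) ? 0 : parsedText.length();
        doc.add("textLength", length);
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = addTextLength(new Doc(), "Hello Nutch");
        System.out.println(doc.fields); // prints {textLength=[11]}
    }
}
```

Keeping the calculation in a small method like this makes it easy to check the logic on its own before wiring it into the filter() method.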

 

plugin.xml

This file contains the necessary configuration for the plugin: it names the plugin, the jar that holds its code, and the extension point it implements. The plugin.xml file for the plugin we are building would look something like the following. Create a new XML file and add the following code.

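<!-- The id must be the name listed under plugin.includes in nutch-site.xml -->
<plugin id="textLength" name="Add New Field"
  version="1.0.0" provider-name="TechSphot.com">
   <runtime>
      <library name="textLength.jar">
        <export name="*"/>
      </library>
   </runtime>
   <!-- The plugins bundled with Nutch declare this dependency on the extension points -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.indexer.textLength"
         name="Add Field to Index"
         point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="textLength"
          class="org.apache.nutch.indexer.AddNewField"/>
   </extension>
</plugin>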
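
Before rebuilding Nutch, it can be worth checking that the descriptor is well-formed XML and names the class and extension point we expect. The following is a minimal sketch that uses only the JDK's built-in XML parser. The SAMPLE string is a shortened copy of the descriptor, and its id is set to textLength here on the assumption that the plugin id should match the name listed under plugin.includes; the class name PluginXmlCheck is made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PluginXmlCheck {

    // Shortened copy of the descriptor; the id value is an assumption (see lead-in).
    static final String SAMPLE =
        "<plugin id=\"textLength\" name=\"Add New Field\" version=\"1.0.0\">"
      + "<extension id=\"org.apache.nutch.indexer.textLength\""
      + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
      + "<implementation id=\"textLength\""
      + " class=\"org.apache.nutch.indexer.AddNewField\"/>"
      + "</extension></plugin>";

    /** Returns {plugin id, extension point, implementation class} from a descriptor. */
    static String[] describe(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element plugin = dom.getDocumentElement();
            Element extension =
                (Element) plugin.getElementsByTagName("extension").item(0);
            Element impl =
                (Element) extension.getElementsByTagName("implementation").item(0);
            return new String[] {
                plugin.getAttribute("id"),
                extension.getAttribute("point"),
                impl.getAttribute("class")
            };
        } catch (Exception e) {
            // Malformed XML or an unexpected structure ends up here
            throw new IllegalArgumentException("Invalid plugin descriptor", e);
        }
    }

    public static void main(String[] args) {
        for (String part : describe(SAMPLE)) {
            System.out.println(part);
        }
    }
}
```

Running it prints the plugin id, the extension point and the implementation class on separate lines, which makes a mistyped class name or extension point easy to spot.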

build.xml

Next, you need to edit your build.xml file so that it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<project name="textLength" default="jar">
      <import file="../build-plugin.xml"/>
</project>

ivy.xml

An ivy.xml file is required to build the project. This file is usually the same for every plugin and can be left unchanged. Just copy the following code into the XML file you created.

<?xml version="1.0" encoding="UTF-8"?> 

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>
  <dependencies>
  </dependencies>
</ivy-module>

Configurations

After we have developed the plugin along with the required XML files, copy the directory containing the plugin files to the NUTCH_HOME/src/plugin/ directory.

Next, we need to tell Nutch about two things:

  • Open the file NUTCH_HOME/src/plugin/build.xml in a text editor and add this line at the end of the “Build & Deploy” section:

      <ant dir="textLength" target="deploy" />

    This tells Nutch to build our custom plugin when we build Nutch with ant.

  • Next, we need to tell Nutch to load the newly developed plugin. To do so, open the nutch-site.xml file in the NUTCH_HOME/conf directory and add your plugin to the value of “plugin.includes” by appending a ‘|’ symbol followed by your plugin's id (the id attribute in plugin.xml). The property would look something like this:

      <property>
        <name>plugin.includes</name>
        <value>urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|textLength</value>
        <description>Plugins Nutch will consider</description>
      </property>
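
The value of plugin.includes is a regular expression, which is why the entries are separated by ‘|’. The following is a minimal sketch of how such a value selects plugins, assuming that Nutch requires the whole plugin id to match the expression. The class and method names are made up for this illustration, and index-more stands for any plugin id that is not in the list.

```java
import java.util.regex.Pattern;

class PluginIncludesSketch {

    // The same value as in the plugin.includes property
    static final String INCLUDES =
        "urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|textLength";

    /** True if the whole plugin id matches the includes expression. */
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"textLength", "parse-html", "index-more"}) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }
    }
}
```

With this value, textLength and parse-html are selected while index-more is not, so a plugin that is missing from the expression is simply never loaded.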

Some plugins need extra configuration, such as arguments supplied via the nutch-site.xml file or additions to Solr’s schema.xml. In our case we have to:

  • Add the following line to NUTCH_HOME/conf/schema.xml inside the <fields> section. For your plugin to work properly, use this edited schema.xml file when integrating Nutch with Solr, which we discussed earlier.

      <field name="textLength" type="long" stored="true" indexed="true"/>

  • Next, add the following line to NUTCH_HOME/conf/solrindex-mapping.xml:

      <field dest="textLength" source="textLength"/>

That’s it. Now build Nutch by running ant clean runtime, run a crawl and see how your plugin performs.

 

