Running Nutch 1.x in Eclipse

The Nutch project can be loaded into Eclipse, which is useful both for running crawls from within the IDE and for development. In this article, we will install and set up Eclipse for Nutch, then load and run Nutch inside it.

Setting up Nutch Trunk

To load Nutch into Eclipse, we need the development source of Nutch, known as Nutch trunk, rather than a binary release. Check it out by running the following command in a terminal.

svn co

You may need to install Subversion first, which can be done by running the following command.

sudo apt-get install subversion
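
Before going further, it can help to confirm that the command-line tools used in this article are actually installed. The snippet below is only a sketch: the helper name missing_tools is made up for illustration, and checking for java in addition to svn and ant is an assumption that goes slightly beyond the steps listed here.

```shell
# Print each of the given commands that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# svn fetches the source; ant (and a Java installation) are needed for the build.
missing=$(missing_tools svn ant java)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "svn, ant and java are all available"
fi
```

Anything reported as missing can be installed with apt-get in the same way as Subversion above.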

After the download is complete, move the Nutch trunk to any folder you wish. Then edit the trunk/conf/nutch-site.xml file and add a plugin.folders property between the configuration tags, carefully entering the path to your trunk’s plugins folder as its value.
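
As a rough sketch of what that entry might look like, the snippet below prints a property block for a given trunk location. The default path in NUTCH_TRUNK, the helper name plugin_property and the build/plugins subdirectory are all assumptions made for illustration; substitute the plugins folder that exists in your own checkout.

```shell
# Assumption: the checkout lives here; change this to your own trunk location.
NUTCH_TRUNK="${NUTCH_TRUNK:-$HOME/nutch/trunk}"

# Assumption: plugins are read from build/plugins inside the trunk.
plugin_property() {
  cat <<EOF
<property>
  <name>plugin.folders</name>
  <value>$1/build/plugins</value>
</property>
EOF
}

# Print the fragment to paste between <configuration> and </configuration>.
plugin_property "$NUTCH_TRUNK"
```

Copy the printed block into nutch-site.xml by hand, so the rest of the file stays untouched.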


Now it’s time to build Nutch trunk. It is done by running the following command in the Nutch trunk directory.

ant eclipse
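
Once the build finishes, a quick way to check that it produced something Eclipse can import is to look for the project files in the trunk directory. This is a sketch under the assumption that the build writes the usual .project and .classpath files there; the helper name has_eclipse_files is made up for illustration.

```shell
# Succeeds only if the directory holds both Eclipse project files.
has_eclipse_files() {
  [ -f "$1/.project" ] && [ -f "$1/.classpath" ]
}

# Run from the trunk directory after the ant build has finished.
if has_eclipse_files .; then
  echo "Eclipse project files found"
else
  echo "No .project/.classpath here; did the build complete?"
fi
```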


Setting up Eclipse

Download the latest version of Eclipse from the Eclipse website. Select Linux from the menu on the right and download the 64-bit version.

Extract the downloaded archive, then find and run “eclipse-inst” inside the eclipse-installer directory. Select Eclipse IDE for Java Developers.

Specify the installation directory and click Install. Downloading and installing Eclipse will take several minutes.

Installing Subclipse

After installing Eclipse, we need to install Subclipse, the Eclipse SVN provider. Open Eclipse, go to Help -> Install New Software and enter the Subclipse update site address in the “Work with” text box.

Then select Subclipse from the items shown and follow the steps to complete the installation.

If Eclipse reports an error after installing Subclipse, you may also need to install JavaHL on your machine.

Installing Apache IvyDE

IvyDE integrates Apache Ivy dependency management into Eclipse. To install it, open its page on the Eclipse Marketplace and drag and drop the Install button into your running Eclipse.

The plugin installation dialog will appear; follow the steps to complete the installation. If you run into any issue, you can instead install the plugin through Help -> Install New Software by entering the IvyDE update site address.

Installing m2e

The m2e plugin provides Maven integration for Eclipse. Recent versions of Eclipse come with m2e pre-installed; if yours does not, install it through Help -> Install New Software by entering the m2e update site address.

Loading Nutch to Eclipse

Now it’s time to load the Nutch project into Eclipse.

  • Go to File -> Import, then select General -> Existing Projects into Workspace.

  • Next, browse to and select trunk as the root directory of the project to import, then click Finish.

Eclipse will take a while to load the project and build its workspace.

  • Right-click on the Nutch/trunk project and go to Build Path -> Configure Build Path. In the window that appears, open the Order and Export tab, look for trunk/conf, select it and click the Top button. Apply the changes and wait a while for Eclipse to finish rebuilding its workspace.

That’s it: we have successfully loaded Nutch into Eclipse. We can now use it for debugging, development and running a crawl from within Eclipse.

Running Nutch in Eclipse

The following procedure runs the inject phase of the crawl from within Eclipse. The remaining phases can be run in the same way by providing the appropriate main class and arguments.

  • Right-click on the trunk project and go to Run As -> Run Configurations… In the window that appears, create a new configuration under Java Application. Name it Inject and set the main class to org.apache.nutch.crawl.Injector by searching for it in the list of available classes.

  • Next, in the Arguments tab, provide the arguments Nutch needs to perform the inject process: the crawlDB directory and the directory where the seed list resides as program arguments, plus the VM arguments. Create a crawlDB directory and a directory for seed URLs in the trunk root directory and enter their paths in the order <crawl dir> <urls dir>, separated by a space. Enter “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log” in the VM arguments.

  • Now click Run. The output of the inject job should appear in the Eclipse console.
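
The two directories passed as program arguments can be prepared from a terminal before running the configuration. The snippet below is a minimal sketch: the directory names crawl/crawldb and urls, the file name seed.txt and the example seed URL are all placeholders chosen for illustration, and TRUNK_DIR is assumed to point at the trunk root.

```shell
# Assumption: run this from the trunk root, or set TRUNK_DIR to point at it.
TRUNK_DIR="${TRUNK_DIR:-$PWD}"

# Hypothetical directory names; any names work as long as the arguments match.
CRAWLDB_DIR="$TRUNK_DIR/crawl/crawldb"
URLS_DIR="$TRUNK_DIR/urls"

mkdir -p "$CRAWLDB_DIR" "$URLS_DIR"

# One seed URL per line; replace the example with the sites you want to crawl.
printf '%s\n' 'https://nutch.apache.org/' > "$URLS_DIR/seed.txt"

# The string to paste into the Program arguments box: <crawl dir> <urls dir>.
echo "Program arguments: $CRAWLDB_DIR $URLS_DIR"
```

Printing the argument string at the end keeps the order of the two paths the same as the one entered in the Arguments tab.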

If Eclipse reports a “multiple SLF4J bindings” error, more than one SLF4J binding is on the project’s classpath, and the extra one needs to be removed from the build path to resolve the issue.
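
To see which SLF4J jars are present, one option is to list them from a terminal. This is only a sketch: the helper name list_slf4j_jars is made up for illustration, and it assumes the jars sit somewhere under the trunk directory. Note that slf4j-api is the API itself rather than a binding; the duplicates to look for are binding jars such as slf4j-log4j12 or slf4j-simple.

```shell
# List SLF4J-related jars under a directory so duplicate bindings stand out.
list_slf4j_jars() {
  find "$1" -type f -name 'slf4j-*.jar' | sort
}

# Assumption: run from the trunk root, where the build places its libraries.
list_slf4j_jars .
```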
