A Nutch project can be loaded into Eclipse, which is useful for development and for running a crawl from within the IDE. In this article, we will learn how to install and set up Eclipse for Nutch, and how to load and run Nutch within it.
Setting up Nutch Trunk
To load Nutch into Eclipse, we need the development version of the Nutch source, known as Nutch trunk. Get it by running the following command in a terminal.
svn co https://svn.apache.org/repos/asf/nutch/trunk
You may need to install Subversion first, which can be done by running the following command.
sudo apt-get install subversion
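Before going further, it can help to confirm that the command-line tools this setup relies on are available. The following is a minimal sketch, assuming a POSIX shell. The helper name check_tool is hypothetical, and the tool list (svn, ant, java) is an assumption based on the steps in this article.

```shell
#!/bin/sh
# check_tool is a hypothetical helper, not part of Nutch or Eclipse.
# It reports whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumed tool list: svn for the checkout, ant for the build, java for Eclipse.
for tool in svn ant java; do
    check_tool "$tool" || true
done
```

Any tool reported as missing can be installed with apt-get in the same way as Subversion above.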
After the download is complete, move the Nutch trunk to any folder you wish. Then edit the trunk/conf/nutch-site.xml file and add the following lines between the configuration tags, carefully entering the path to your trunk’s plugins folder.
<property>
  <name>plugin.folders</name>
  <value>/home/asmat/dev/trunk/build/plugins</value>
</property>
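Since the path in the property must match your own checkout, one way to avoid typing mistakes is to generate the snippet. This is a minimal sketch, not part of Nutch: NUTCH_TRUNK and plugin_property are hypothetical names, and the default location is only an assumption.

```shell
#!/bin/sh
# NUTCH_TRUNK is a hypothetical variable for this sketch; point it at your checkout.
NUTCH_TRUNK="${NUTCH_TRUNK:-$HOME/dev/trunk}"

# Print the property with the plugins path filled in, ready to paste
# between the <configuration> tags of conf/nutch-site.xml.
plugin_property() {
    cat <<EOF
<property>
  <name>plugin.folders</name>
  <value>$1/build/plugins</value>
</property>
EOF
}

plugin_property "$NUTCH_TRUNK"
```

The script only prints the snippet; it does not modify nutch-site.xml, so the file still has to be edited by hand.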
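Now it’s time to build Nutch trunk. From the Nutch trunk directory, run the Ant target that builds the project and generates the Eclipse project files (Nutch's build.xml provides an eclipse target for this).
ant eclipse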
Setting up Eclipse
Download the latest version of Eclipse from the Eclipse downloads page. Select Linux from the menu on the right and download the 64-bit version.
This downloads the Eclipse installer. Extract the files and, inside the eclipse-installer directory, find and run “eclipse-inst”. Select Eclipse IDE for Java Developers.
Specify the installation directory and click Install. It will take several minutes to download and install Eclipse.
After installing Eclipse, we need to install Subclipse, the SVN provider for Eclipse. Open Eclipse, go to Help -> Install New Software and enter the Subclipse update site address in the “Work with” text box.
Then select Subclipse from the items shown and follow the steps to complete the installation.
If Eclipse reports an error after installing Subclipse, you may need to install JavaHL on your machine.
Installing Apache IvyDE
IvyDE integrates Apache Ivy dependency management into Eclipse. To install it, open the IvyDE page on the Eclipse Marketplace, then drag and drop the Install button into your running Eclipse.
The plugin installation dialog will appear; follow the steps to complete the installation. If you run into any issue, you can instead install the plugin through Help -> Install New Software by entering this address: http://www.apache.org/dist/ant/ivyde/updatesite
The m2e plugin provides Maven integration for Eclipse. Recent versions of Eclipse come with m2e pre-installed. If yours does not, install it through Help -> Install New Software by entering the m2e update site address.
Loading Nutch to Eclipse
Now it’s time to load the Nutch project into Eclipse.
- Go to File -> Import, then select General -> Existing Projects into Workspace.
- Next, browse to and select trunk as the root directory of the project to import, then click Finish.
Eclipse will take a while to load the project and build its workspace.
- Right click on the Nutch/trunk project and go to Build Path -> Configure Build Path. In the window that appears, open the Order and Export tab, look for trunk/conf, select it and click the Top button. Apply the changes and wait for Eclipse to finish rebuilding its workspace.
That’s it, we have successfully loaded Nutch into Eclipse. We can now use it for debugging, development and running a crawl from within Eclipse.
Running Nutch in Eclipse
The following procedure runs the inject phase of the crawl from within Eclipse. The remaining phases can be run the same way by providing the appropriate main class and arguments.
- Right click on the trunk project and go to Run As -> Run Configurations… In the window that appears, create a new configuration under Java Application. Name it Inject and set the main class to org.apache.nutch.crawl.Injector by searching for it in the list of available classes.
- Next, in the Arguments tab, provide the arguments Nutch needs for the inject process: the crawlDB directory and the directory where the seed list resides as program arguments, plus the VM arguments. Create a crawlDB directory and a directory for seed URLs in the trunk root directory and enter their paths in the Program arguments box in the order <crawl dir> <urls dir>, separated by a space. Enter “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log” in the VM arguments box.
- Now click Run. The Injector's log output should appear in the Eclipse console.
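The directories used as arguments in the steps above can be prepared from a terminal. The following is a minimal sketch under stated assumptions: TRUNK_DIR, the directory names crawldb and urls, the file name seed.txt, and the seed URL are all illustrative choices, not values required by Nutch.

```shell
#!/bin/sh
# TRUNK_DIR, the directory names, and the seed URL are assumptions for this sketch.
TRUNK_DIR="${TRUNK_DIR:-./trunk}"
CRAWLDB_DIR="$TRUNK_DIR/crawldb"
URLS_DIR="$TRUNK_DIR/urls"

# Create the crawlDB directory and the seed directory in the trunk root.
mkdir -p "$CRAWLDB_DIR" "$URLS_DIR"

# The seed list is a plain text file with one URL per line.
printf '%s\n' 'https://nutch.apache.org/' > "$URLS_DIR/seed.txt"

# Print the values to paste into the Arguments tab of the run configuration.
echo "Program arguments: $CRAWLDB_DIR $URLS_DIR"
echo "VM arguments:      -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log"
```

The script prints the two argument strings in the same order the Arguments tab expects, crawlDB directory first and seed directory second.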
If Eclipse reports a “multiple SLF4J bindings” error, more than one SLF4J binding JAR is on the project's classpath. Keep one binding on the build path and remove the others.