Integrating Apache Nutch 1.x with Solr

Solr is a search platform built on top of Lucene and used to build search applications. It is a scalable search engine designed for large volumes of text-centric documents and data. It can also be used for storage, much like a NoSQL database.

According to the Nutch wiki, the Solr version suited to Nutch 1.14 is Solr 6.6.0, so we will use that version rather than the latest release. Go to the Solr download page, follow the link to past versions, find 6.6.0, and download its binary release.
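For readers who prefer the command line, the download can be sketched as below. This is only a sketch: the archive URL follows the usual Apache archive layout and is an assumption that should be checked against the download page, and DO_DOWNLOAD is a hypothetical switch introduced here so the script can be dry-run without touching the network.

```shell
#!/bin/sh
# Version taken from the article; the URL layout is an assumption to verify.
SOLR_VERSION="6.6.0"
SOLR_ARCHIVE="solr-${SOLR_VERSION}.tgz"
SOLR_URL="https://archive.apache.org/dist/lucene/solr/${SOLR_VERSION}/${SOLR_ARCHIVE}"

echo "Solr download URL: ${SOLR_URL}"

# DO_DOWNLOAD is a hypothetical guard: nothing is fetched unless it is "yes".
if [ "${DO_DOWNLOAD:-no}" = "yes" ]; then
    # -f fails on HTTP errors, -L follows redirects, -O keeps the remote file name.
    curl -fLO "${SOLR_URL}"
fi
```

Running it as `DO_DOWNLOAD=yes sh download-solr.sh` (the file name is arbitrary) would fetch the archive into the current directory.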


Next, extract the downloaded .tgz file to a directory of your choice. That is enough to get started: you can start Solr by running the following command in the Solr root directory.

bin/solr start


You can see the admin console in your browser by entering the following URL.

http://localhost:8983/solr/
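The same check can be made from a terminal. The sketch below assumes Solr is listening on its default port 8983 on localhost and that curl is installed; solr_is_up is a helper name introduced here for illustration.

```shell
#!/bin/sh
# Assumption: Solr runs on localhost with the default port 8983.
SOLR_BASE_URL="http://localhost:8983/solr"

# Returns 0 when Solr's system-info endpoint answers, non-zero otherwise.
solr_is_up() {
    curl -sf "${SOLR_BASE_URL}/admin/info/system?wt=json" > /dev/null 2>&1
}

if solr_is_up; then
    SOLR_STATE="running"
else
    SOLR_STATE="not reachable"
fi
echo "Solr at ${SOLR_BASE_URL} is ${SOLR_STATE}"
```

Running bin/solr status from the Solr root directory gives similar information without curl.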

Integrating Nutch with Solr

In this section, we will integrate Nutch with Solr so that web pages crawled by Nutch can be indexed in Solr.

First of all, you need to create a Solr core. Follow these steps.

  1. Create resources for the core by creating a directory under SOLR_HOME/server/solr/configsets/. Let’s call it nutchCoreConfigs. This directory will be used for all cores we create for working with Nutch.
  2. Copy the conf folder from SOLR_HOME/server/solr/configsets/basic_configs/ into the nutchCoreConfigs directory.
  3. Remove the file “managed-schema” from the copied conf directory and replace it with the schema.xml supplied by Nutch in the NUTCH_HOME/conf directory.
  4. Start Solr, if it is not already running, with the command bin/solr start.
  5. Finally, create the Solr core from the Solr root directory, passing the newly created nutchCoreConfigs/conf directory as a parameter:
     bin/solr create -c nutchCore -d server/solr/configsets/nutchCoreConfigs/conf
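The steps above can be sketched as one script. The default locations for SOLR_HOME and NUTCH_HOME are hypothetical and depend on where you extracted each archive. The sketch also assumes that the schema.xml to copy is the one shipped in Nutch's conf directory and that Solr is already running, as started earlier.

```shell
#!/bin/sh
# Hypothetical install locations; override them in the environment if needed.
SOLR_HOME="${SOLR_HOME:-$HOME/solr-6.6.0}"
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.14}"
CONFIGSETS="${SOLR_HOME}/server/solr/configsets"
CORE_CONF="${CONFIGSETS}/nutchCoreConfigs/conf"

create_nutch_core() {
    # Steps 1 and 2: create the config directory and copy the basic conf folder.
    mkdir -p "${CONFIGSETS}/nutchCoreConfigs" &&
    cp -r "${CONFIGSETS}/basic_configs/conf" "${CONFIGSETS}/nutchCoreConfigs/" &&
    # Step 3: swap the managed schema for the schema.xml from Nutch (assumed path).
    rm -f "${CORE_CONF}/managed-schema" &&
    cp "${NUTCH_HOME}/conf/schema.xml" "${CORE_CONF}/" &&
    # Step 5: create the core from the new configuration (Solr must be running).
    "${SOLR_HOME}/bin/solr" create -c nutchCore -d "${CORE_CONF}"
}

# Only act when both installations are where we expect them.
if [ -d "${CONFIGSETS}/basic_configs" ] && [ -f "${NUTCH_HOME}/conf/schema.xml" ]; then
    create_nutch_core
else
    echo "Solr or Nutch not found; set SOLR_HOME and NUTCH_HOME first." >&2
fi
```

Chaining the commands with && stops the sequence at the first failure, so the core is never created from a half-prepared configuration.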

Indexing in Solr

Now it’s time to run a crawl with Nutch and index the data in Solr. Using the same seed list as before, run the following command from the NUTCH_HOME/runtime/local/ directory.

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutchCore -s urls crawl 2

After the crawl finishes, the crawled data is indexed in Solr and can be viewed and searched through the Solr admin UI. Go to http://localhost:8983/solr and select nutchCore in the core selector. In the panel on the left, click Query to search the indexed documents. The query *:* returns everything indexed by Solr.
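The same query can be sent from a terminal through the core's select handler. This is a minimal sketch assuming the core is named nutchCore as above and Solr runs on the default port; the rows=5 limit is an arbitrary choice to keep the output short.

```shell
#!/bin/sh
# Assumption: the core created earlier is reachable at this URL.
SOLR_CORE_URL="http://localhost:8983/solr/nutchCore"
# q=*:* matches every document; rows limits how many are returned.
QUERY_URL="${SOLR_CORE_URL}/select?q=*:*&rows=5&wt=json"

echo "Query URL: ${QUERY_URL}"
# Print the JSON response, or a hint when Solr cannot be reached.
curl -sf "${QUERY_URL}" || echo "Could not reach ${SOLR_CORE_URL}; is Solr running?" >&2
```

The numFound value in the response shows how many documents the crawl has indexed.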

