Web Crawling with Apache Nutch 1.x

To run a crawl with Nutch, we need a list of seed URLs; it may contain anything from a single URL to several thousand. Let's start with the following two URLs.

http://www.techsphot.com/

http://www.alshamssociety.com/

  • Create a directory “urls” in NUTCH_HOME/runtime/local/. Save the above URLs in a text file and place that file in the urls directory. This directory will be passed to Nutch as an argument so it knows where to pick up the seed list.
  • Create a directory “crawl” in NUTCH_HOME/runtime/local/. Nutch will use this directory to store its crawldb, linkdb and segments.
  • Now run the following command from the NUTCH_HOME/runtime/local/ directory to start the crawl (the whole sequence is sketched below this list).
  • bin/crawl -s urls crawl 2
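
Put together, the setup looks like this. This is a sketch assuming a Unix shell and that NUTCH_HOME points at your Nutch installation; the seed file name seed.txt is arbitrary, since Nutch reads every file in the urls directory.

cd $NUTCH_HOME/runtime/local

# Create the seed directory and write the two seed URLs into a text file
mkdir -p urls
printf 'http://www.techsphot.com/\nhttp://www.alshamssociety.com/\n' > urls/seed.txt

# Create the directory that will hold the crawldb, linkdb and segments
mkdir -p crawl

# Crawl for two rounds
bin/crawl -s urls crawl 2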

Here 2 tells Nutch to run two rounds. We can increase the number of rounds if we want a deeper crawl; with three rounds, for example, Nutch would repeat its generate/fetch/parse/update cycle a third time, fetching the new URLs that earlier rounds added to the crawldb. The output of the above command should look something like this.

...
LinkDb: starting at 2017-12-28 16:35:03
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20171228163435
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2017-12-28 16:35:06, elapsed: 00:00:03
Dedup on crawldb
/home/asmat/dev/nutch1.14/runtime/local/bin/nutch dedup crawl/crawldb
DeduplicationJob: starting at 2017-12-28 16:35:08
Deduplication: 0 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Deduplication finished at 2017-12-28 16:35:12, elapsed: 00:00:03
Skipping indexing …
Thursday December 28 16:35:12 PKT 2017 : Finished loop with 2 iterations
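
The crawl script is a wrapper: each of those log lines comes from an individual Nutch job that bin/crawl chains together. Roughly, and leaving out the script's housekeeping, one invocation breaks down into the commands below (paths as above; the segment-selection line is a simplification, since the script keeps track of the segment it just generated).

# Before the rounds: inject the seed URLs into the crawldb
bin/nutch inject crawl/crawldb urls

# Each round then repeats this cycle:
bin/nutch generate crawl/crawldb crawl/segments          # select URLs due for fetching
s=$(ls -d crawl/segments/2* | tail -1)                   # newest segment
bin/nutch fetch "$s"                                     # download the pages
bin/nutch parse "$s"                                     # extract text and outlinks
bin/nutch updatedb crawl/crawldb "$s"                    # merge discovered URLs into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # update the link graph
bin/nutch dedup crawl/crawldb                            # mark duplicate documents

The final “Skipping indexing …” line appears because we did not pass the -i flag to bin/crawl, which would send the parsed documents to an indexer such as Solr after deduplication.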

The crawl directory will contain three subdirectories: crawldb, linkdb and segments. The crawldb holds the URLs Nutch knows about, together with their fetch status; Nutch adds the new URLs it discovers on fetched pages to this repository and crawls them in later rounds. The linkdb stores link-graph information used for page scoring, while the segments directory stores the content Nutch retrieves from the web pages.
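
Nutch ships read tools for inspecting each of these structures. A few examples, assuming the layout above (linkdb_dump is an arbitrary output directory name):

# Summary statistics for the crawldb: URL counts by fetch status, score stats
bin/nutch readdb crawl/crawldb -stats

# Dump the inlinks stored in the linkdb to a plain-text directory
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

# List the segments together with their generated/fetched/parsed counts
bin/nutch readseg -list -dir crawl/segments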
