Set up Hadoop 2.9.0 in Pseudo Distributed Mode on Ubuntu

In our previous article, we covered the installation of Hadoop in standalone mode on Ubuntu. In this article, we will learn how to install Hadoop in pseudo-distributed mode. So let's start, assuming you already have Java installed and configured on your machine.

Downloading Hadoop

Go to the Apache Hadoop download page and grab the binary release of the latest stable version (2.9.0 at the time of writing).


Open the archive and extract the files to a directory of your choice.

Configuring SSH

Install ssh and rsync on your machine by running the following commands in a terminal.

sudo apt-get install ssh
sudo apt-get install rsync

You should be able to ssh to localhost without a passphrase. To achieve this, run the following set of commands.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
chmod 0600 ~/.ssh/authorized_keys

Now run the following command to make sure you can ssh to localhost without a password.

ssh localhost

Setting up Hadoop Environment

To configure Hadoop, first edit ~/.bashrc by running the following command (no sudo is needed, since the file belongs to your user).

gedit ~/.bashrc

Now add the following lines to the file (change the paths according to your Hadoop home directory).

export HADOOP_HOME=/home/asmat/dev/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
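The exports above only take effect in new shells. To apply them to your current session and check that the Hadoop binaries are on your PATH, you can run the following (a quick sanity check, assuming the paths above match your installation):

```shell
# Reload ~/.bashrc in the current shell
source ~/.bashrc

# If the PATH is set correctly, this prints the Hadoop version banner
hadoop version
```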

Next, edit the HADOOP_HOME/etc/hadoop/hadoop-env.sh file, then find and change the following line,

export JAVA_HOME=${JAVA_HOME}

to

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

You can find your JAVA_HOME by running the command "echo $JAVA_HOME".
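If "echo $JAVA_HOME" prints nothing, one way to locate the JDK path (a sketch, assuming OpenJDK is installed under /usr/lib/jvm as on stock Ubuntu) is:

```shell
# Resolve the real path of the java binary, then strip the trailing /bin/java
readlink -f "$(which java)" | sed 's:/bin/java::'
```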

Setting up Hadoop cluster

Now we are ready to set up Hadoop in pseudo-distributed mode.

Navigate to the HADOOP_HOME/etc/hadoop/ directory and edit the following files, adding the given piece of code between the <configuration> and </configuration> tags. Don't forget to change the paths according to your HADOOP_HOME.

hdfs-site.xml

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.name.dir</name>
    <value>file:///home/asmat/hadoop/data/hdfs/namenode</value>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>file:///home/asmat/hadoop/data/hdfs/datanode</value>
</property>

Note that dfs.name.dir and dfs.data.dir still work in Hadoop 2.9 but are deprecated in favor of dfs.namenode.name.dir and dfs.datanode.data.dir.

mapred-site.xml

Rename mapred-site.xml.template to mapred-site.xml and add the following code

 <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
</property>
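The rename mentioned above can be done from the configuration directory, for example:

```shell
cd $HADOOP_HOME/etc/hadoop
mv mapred-site.xml.template mapred-site.xml
```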

yarn-site.xml

 <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
</property>


core-site.xml

<property>
       <name>fs.defaultFS</name>
       <value>hdfs://localhost:9000</value>
</property>
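For reference, each property block goes inside the file's existing <configuration> element. A complete core-site.xml would look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```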

Formatting the file system

Now it’s time to format the namenode. Open terminal in the Hadoop home directory and run the following command.

 bin/hdfs namenode -format

You should see an output like this.

17/12/29 23:11:07 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
17/12/29 23:11:07 INFO util.GSet: capacity      = 2^15 = 32768 entries
17/12/29 23:11:08 INFO namenode.FSImage: Allocated new BlockPoolId: BP-130860601-127.0.1.1-1514571067998
17/12/29 23:11:08 INFO common.Storage: Storage directory /home/asmat/hadoop/data/hdfs/namenode has been successfully formatted.
17/12/29 23:11:08 INFO namenode.FSImageFormatProtobuf: Saving image file /home/asmat/hadoop/data/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
17/12/29 23:11:08 INFO namenode.FSImageFormatProtobuf: Image file /home/asmat/hadoop/data/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 322 bytes saved in 0 seconds.
17/12/29 23:11:08 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/12/29 23:11:08 INFO util.ExitUtil: Exiting with status 0
17/12/29 23:11:08 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at asmat-VirtualBox/127.0.1.1
************************************************************/


Running Hadoop

Finally, enter the following commands one by one in the Hadoop home directory to start the Hadoop daemons.

sbin/start-dfs.sh
sbin/start-yarn.sh

To confirm Hadoop is working, run this command in a terminal.

 jps

You should see an output like this. If not, there is a problem with your Java path or your Hadoop configuration files.

3808 Jps
3509 ResourceManager
3126 DataNode
2986 NameNode
3356 SecondaryNameNode
3646 NodeManager
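As an extra check (a sketch, assuming you are still in the Hadoop home directory), you can create a home directory in HDFS and list the file system root:

```shell
# Create a home directory for the current user in HDFS
bin/hdfs dfs -mkdir -p /user/$(whoami)

# List the root of the new file system; /user should appear
bin/hdfs dfs -ls /
```

You can also browse the NameNode web UI at http://localhost:50070 and the ResourceManager UI at http://localhost:8088.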

That's it: you have successfully installed Hadoop in pseudo-distributed mode on your machine.


Useful Resources
  1. Apache Hadoop YARN
  2. Learning Hadoop 2
  3. Hadoop 2: Quick Start Guide
  4. Hadoop for Dummies
  5. Hadoop Application Architectures: Designing Real-World Big Data Applications
  6. Web Crawling and Data Mining with Apache Nutch
