How to Setup Hadoop Cluster (Multi Node on Ubuntu)

How to Setup Hadoop Cluster (Multi Node on Ubuntu). In this post, we discuss what is Hadoop Cluster, its Architecture, and the advantages. After that, we navigate to section about how to set up Hadoop multi node cluster on Ubuntu.

Whether they are your customers or employees, a huge volume of data is being generated on a daily basis. So, tracking it is not an easy task. As a result, most companies have started relying on new technologies to collect, process, and measure data metrics. All in all, big data is an advanced form of data that provides users access to complex and extensive data sets within it as well as huge amounts of information collected from various sources. There are various tools that one relies on to store and analyse big data, but in this blog, we discuss Hadoop in detail.

All in all, Hadoop is an open source big data tool with a distributed processing framework that helps companies manage, process, and analyse their big data. Also, it offers massive data storage for customers.

Using its Hadoop Distributed File System (HDFS), Hadoop keeps all of your data, including images, videos, as well as text files. This system helps speed up data processing and storage operations.

Shall we start with How to Setup Hadoop Cluster (Multi Node on Ubuntu).

What is Hadoop Cluster?

Primarily speaking, the Hadoop Cluster is a collection of nodes designed for storing, processing, and analysing large amounts of unstructured data. They have a unique structure and their components are arranged in a Master Slave architecture. Basically, in a cluster, one machine is designated as the NameNode and another as the JobTracker. Both these machines are the masters and the remaining machines work as DataNode and TaskTracker (slaves). These master nodes coordinate and manage the overall operations of the cluster. Whereas, the slave node is responsible for streamlining data processing and storage processes.

Hadoop Architecture

Basically, Hadoop sits on a Java based software framework, that uses algorithms like MapReduce to break big data analytics processing tasks into smaller operations. Then, they are carried out in parallel and distributed across a Hadoop cluster.

So, Hadoop clusters have master and slave nodes that coordinate and carry out the different tasks across the Hadoop distributed file system. In nutshell, Hadoop’s distributed file system is the main storage system that distributes large files across different nodes in a cluster.

The master nodes include the three main components, i.e., the name node, secondary name node, and job tracker which usually run on separate machines and use high quality hardware.

Well, the Slave nodes include DataNode and TaskTracker, which use virtual machines and run on commodity hardware. They are under the control of the master nodes and perform the task of storing and processing data.

Lastly, it has the client nodes that are in charge of loading the data and obtaining the results.

Advantages of Hadoop

Besides, Hadoop has a powerful framework, making it one of the best tools for storing, processing, and managing large data sets. Here are a few more advantages of using Hadoop:

  • Scalability – Hadoop has the capability to easily manage large amounts of data, i.e., it is highly scalable. Add more nodes to its cluster and expand it, thus making it an ideal option for most organizations that want to process huge volumes of data.
  • Fault Tolerance – Considering that Hadoop is fault tolerant by design, it keeps running, even if some servers in the Cluster malfunction or fail to perform. It has numerous nodes in the cluster that replicate data and ensure that operations runs smoothly.
  • Data Processing – No matter whether you want to analyse unstructured or semi structured big data sets, Hadoop handle it all. The ability to process a significant amount of data from various sources is another feature that makes it an ideal option for businesses.
  • Supports Parallel Processing – Unlike a few other tools in the market, Hadoop uses a parallel processing technique. Simply put, it helps split large data sets into small tasks and these are later processed concurrently on various Cluster nodes. The conventional data processing technique used by Hadoop helps process big datasets more quickly and saves time.
  • Cost Effective – Hadoop has an open source architecture and operates on standard hardware. As a result, many companies create a Hadoop Cluster using cheap hardware and handle or process large data sets.
  • Highly Flexible – Data processing and analytics are just a few of the many use cases that are handled by Hadoop. It’s flexible nature makes it one of the best tools in the market for businesses that need to handle process data in a variety of ways.

How to Setup Hadoop Cluster (Multi Node on Ubuntu)

Follow this section, to show you steps how to set up Hadoop multi node cluster on Ubuntu. We use the following setup for the Hadoop cluster.

Prerequisites

  • Three servers running the Ubuntu operating system.
  • A root user or a user with sudo privileges.

Configure Hostname Resolution

Before starting, set up hostname resolution on all nodes. So, each node communicates with the other by name.

To do so, edit the /etc/hosts file on all the nodes.

				
					nano /etc/hosts
				
			

Add the following lines.

				
					69.28.88.209 masternode
69.87.219.236 datanode1
209.23.11.99 datanode2

				
			

Save and close the file when you are done.

Install Java JDK

Importantly, Hadoop is based on Java. So the Java JDK and JRE must be installed on all nodes. If not installed, you install them with the following command.

				
					apt install default-jdk default-jre -y
				
			

After installing Java, please check the Java version using the following command.

				
					java --version
				
			

See the Java version on the following output.

				
					openjdk 11.0.18 2023-01-17
OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

				
			

Next, you will need to find the Java install location on your server. Find it with the following command.

				
					dirname $(dirname $(readlink -f $(which java)))
				
			

This shows you the Java installation path in the following output.

				
					/usr/lib/jvm/java-11-openjdk-amd64
				
			

Configure SSH Password Less Authentication

Next, you need to set up SSH password less authentication for all nodes. So the Master node connects to data nodes without a password.

Firstly, create a dedicated user for Hadoop on all nodes with the following command.

				
					adduser hadoop
				
			

Set your user’s password as shown below.

On the master node, switch the user to Hadoop and generate an SSH key pair with the following command.

				
					su - hadoop
ssh-keygen -t rsa
				
			

You should see the generated key pair on the following screen.

Next, append the generated public key to the master node authorized_keys file.

				
					cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
				
			

Then, copy the SSH public key to both data nodes with the following command.

				
					ssh-copy-id hadoop@datanode1
ssh-copy-id hadoop@datanode2
				
			

Install Hadoop on All Nodes

Perform the following steps on all nodes to install the Hadoop.

First, switch the user to Hadoop with the following command.

				
					su - hadoop
				
			

Next, go to the Hadoop official download page, pick the latest available version, and download it with the following command.

				
					wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
				
			

After downloading Hadoop, extract the downloaded file using the following command.

				
					tar xzf hadoop-3.3.4.tar.gz
				
			

Next, rename the extracted directory to Hadoop.

				
					mv hadoop-3.3.4 hadoop
				
			

Now, edit the ~/.bashrc file and add the Hadoop environment variables.

				
					nano ~/.bashrc
				
			

Add the following lines.

				
					export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
				
			

Save and close the file when you are done. Then, activate the Hadoop environment variable using the following command.

				
					source ~/.bashrc
				
			

Here, in this step,verify the Hadoop installation using the following command.

				
					hadoop version
				
			

You should see the Hadoop version on the following screen.

Configure Apache Hadoop Cluster

Now, at this point, edit the Hadoop environment variable file on all nodes and define your Java installation path:

				
					nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
				
			

Change the following line.

				
					export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
				
			

On the master node, create a namenode and datanode directory using the following command.

				
					mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
				
			

Again edit the core-site.xml file on the master node and define your system hostname.

				
					nano $HADOOP_HOME/etc/hadoop/core-site.xml
				
			

Change the following lines.

				
					<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://masternode:9000</value>
</property>
</configuration>
				
			

Then, edit the hdfs-site.xml file on the master node and define your namenode and datanode directory.

				
					nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
				
			

Change the following lines that match your directory path.

				
					
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

				
			

Now, edit the workers file on the master node and define your data nodes.

				
					nano $HADOOP_HOME/etc/hadoop/workers
				
			

Add the following lines.

				
					datanode1
datanode2
				
			

Save and close the file. Then, copy all configurations from the master node to both data nodes.

				
					scp $HADOOP_HOME/etc/hadoop/* hadoop@datanode1:$HADOOP_HOME/etc/hadoop/
scp $HADOOP_HOME/etc/hadoop/* hadoop@datanode2:$HADOOP_HOME/etc/hadoop/
				
			

Format Namenode and Start Hadoop

At this point, Hadoop is installed and configured on all nodes. Now, you need to format the namenode with hdfc file system.

On the master node, format the Hadoop namenode using the following command.

				
					hdfs namenode -format
				
			

Once the namenode is formatted, you should see the following screen.

Next, start Hadoop DFS service using the following command.

				
					start-dfs.sh
				
			

If everything is fine, you should see the following output.

				
					Starting namenodes on [masternode]
Starting datanodes
Starting secondary namenodes [masternode]
				
			

Next, run the following command on the master node to verify the status of the Hadoop cluster.

				
					jps
				
			

You should see the following screen.

Now, run the following command on both data nodes to verify the cluster.

				
					jps
				
			

You should see the following screen.

Next, edit the yarn-site.xml file on both datanode.

				
					nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
				
			

Define your master node hostname.

				
					<property>
<name>yarn.resourcemanager.hostname</name>
<value>masternode</value>
</property>

				
			

Save and close the file. Then, run the following command on the master node.

				
					start-yarn.sh
				
			

Access Hadoop Namenode

Now, open your web browser and access the Hadoop namenode using the URL http://hadoop-master-ip:9870. You should see the following screen.

Click on the datanode, you will see both data nodes on the following screen.

Also access the Hadoop cluster page using the URL http://hadoop-master-ip:8088/cluster. See the following screen.

Thank you for reading How to Setup Hadoop Cluster (Multi Node on Ubuntu). We shall conclude the article now. 

How to Setup Hadoop Cluster (Multi Node on Ubuntu) Conclusion

Summarizing, Hadoop is an open source software framework that has completely transformed the way how large data sets are handled, processed, or analysed. One of the highly scalable, flexible, and cost effective solutions with excellent data security features. 

Finally, it supports various features that make it a great option for businesses looking for data processing software. Further, using Hadoop, you gain quick insights and make informed decisions as well as stay ahead of your competitors.

Should you want to explore more about Hadoop, please navigate to our blog over here

Avatar for Hitesh Jethva
Hitesh Jethva

I am a fan of open source technology and have more than 10 years of experience working with Linux and Open Source technologies. I am one of the Linux technical writers for Cloud Infrastructure Services.

5 1 vote
Article Rating
Subscribe
Notify of
0 Comments
Most Voted
Newest Oldest
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x