How to Install Hadoop on Ubuntu 22.04 / 20.04

How to Install Hadoop on Ubuntu 22.04 / 20.04. Do you need to store and process large datasets (gigabytes to petabytes) without buying ever larger machines to hold them? Here is the solution. Hadoop allows you to cluster multiple computers to analyse large datasets in parallel quickly. This guide introduces what Hadoop is and its major advantages, then shows you step by step how to install it on Ubuntu 22.04 / 20.04.

What is Hadoop?

Hadoop is open source software designed and created to handle and process enormous amounts of data. Many businesses, including governmental organisations and academic institutions, employ Hadoop. It is used for fraud detection, customer segmentation, and recommendation engines across various industries, including finance, healthcare, and retail. Moreover, Hadoop provides many features that benefit individuals and corporations, which we discuss in this blog.

Advantages of Hadoop

The powerful and adaptable Hadoop platform offers several benefits to individuals and organisations seeking to process and analyse massive amounts of data. Here are a few benefits of Hadoop:

Scalability

By distributing data and processing across a cluster of nodes, Hadoop is highly scalable and capable of handling petabytes of data. As your data grows, you simply add more nodes to the cluster, which offers nearly linear scalability.

Flexibility

Hadoop handles structured, semi-structured, and unstructured data alike. Moreover, you may integrate it with many different data sources, including data warehouses, relational databases, and NoSQL databases.

Speed

What is more, Hadoop offers quicker processing times than conventional systems since it processes data in parallel across several nodes. It also stores data in a distributed file system, enabling faster access to the data.

Cost effective

Additionally, Hadoop is open source software that anyone can download and use for free. It is a cost-effective solution for businesses since it runs on commodity hardware that is less expensive than standard enterprise hardware.

Resilience

Hadoop is an extremely resilient solution. It automatically replicates data across numerous nodes, ensuring that no data is lost in the case of a hardware failure. Moreover, it detects and manages node failures in real time.

Analytics

Hadoop supports several analytics use cases, including data warehousing, data mining, and machine learning. Furthermore, it offers many tools for data analysis, including Apache Mahout (machine learning), Apache Hive (SQL-like queries), and Apache Pig (data processing).
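For instance, once you add Apache Hive on top of a Hadoop cluster, you can query data in HDFS with SQL-like statements. The sketch below is hypothetical: it assumes Hive is installed and that a page_views table already exists.

    # Hypothetical Hive query: count page views per user over data stored in HDFS
    hive -e "SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id;"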

How Powerful Are Hadoop Security Systems?

To protect data processed in the Hadoop ecosystem and stored in the Hadoop Distributed File System (HDFS), Hadoop offers several security mechanisms. Here are a few security aspects offered by Hadoop:

Authentication

Hadoop provides authentication to guarantee that only authorised users access data stored in HDFS or processed in the Hadoop environment. It supports several authentication methods, including Kerberos and LDAP.
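For example, switching from the default simple authentication to Kerberos is configured in core-site.xml. This is only a minimal sketch of the relevant properties; a working setup also requires a Kerberos KDC and principals, which are outside the scope of this guide.

    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>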

Authorization

To enable granular access control to data stored in HDFS or processed within the Hadoop environment, Hadoop supports Access Control Lists (ACLs) and Role-Based Access Control (RBAC). This guarantees that users only access the information to which they have been granted access.
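As a quick sketch, HDFS ACLs are managed with the setfacl and getfacl subcommands. The path and username below are hypothetical, and ACL support must be enabled via the dfs.namenode.acls.enabled property.

    # Grant the user alice read and execute access on /data
    hdfs dfs -setfacl -m user:alice:r-x /data

    # Review the ACL entries on /data
    hdfs dfs -getfacl /data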

Encryption

Hadoop supports data encryption both in transit and at rest. Hadoop Transparent Data Encryption (TDE) encrypts data at rest on disk using commercially accepted encryption techniques, while data in transit is encrypted with Transport Layer Security (TLS) or Secure Sockets Layer (SSL).
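In practice, TDE is applied through HDFS encryption zones. The sketch below assumes a Hadoop Key Management Server (KMS) is already configured; the key name and path are hypothetical.

    # Create an encryption key in the configured KMS
    hadoop key create mykey

    # Create a directory and turn it into an encryption zone backed by that key
    hdfs dfs -mkdir /secure
    hdfs crypto -createZone -keyName mykey -path /secure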

Auditing

Hadoop provides auditing tools to monitor user activity and modifications to data stored in HDFS or processed inside the Hadoop environment. This enables businesses to find security issues and respond appropriately.
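As a rough sketch, once HDFS audit logging is enabled through the log4j configuration, every file system operation is appended to an audit log that you can search with standard tools. The log path below assumes the default layout under $HADOOP_HOME/logs.

    # Look for delete operations in the HDFS audit log (path is an assumption)
    grep "cmd=delete" $HADOOP_HOME/logs/hdfs-audit.log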

Adaptation to different security instruments

To provide a layered security approach, Hadoop can be integrated with additional security solutions like firewalls, intrusion detection systems, and security information and event management (SIEM) systems.

We have arrived at the main part of the article: How to Install Hadoop on Ubuntu 22.04 / 20.04.

How to Install Hadoop on Ubuntu 22.04 / 20.04

In this section, we show you the installation steps for Hadoop on Ubuntu 22.04 / 20.04.

Prerequisites

  • An Ubuntu 22.04 or 20.04 server installed on your system.
  • A root user or a user with sudo privileges.

Install Java JDK

Hadoop is Java-based software, so you need to install the Java JDK and JRE on your server. First refresh the package index, then install them with the following commands.

				
    apt update
    apt install default-jdk default-jre -y

After installing Java, you can check the Java version using the following command.

				
    java --version

You should see the Java version in the following output.

				
    openjdk 11.0.17 2022-10-18
    OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu222.04)
    OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu222.04, mixed mode, sharing)

Next, you need to find the Java installation path on your server. Find it with the following command.

				
    dirname $(dirname $(readlink -f $(which java)))

This shows you the Java installation path in the following output.

				
    /usr/lib/jvm/java-11-openjdk-amd64

Create a Hadoop User and Generate SSH Key Pair

It is recommended to create a dedicated user to run Hadoop. Create it with the following command.

				
    adduser hadoop

Set your user’s password as shown below.

Next, add the Hadoop user to the sudo group using the following command.

				
    usermod -aG sudo hadoop

Now, switch to the hadoop user and generate an SSH key pair with the following commands.

				
    su - hadoop
    ssh-keygen -t rsa

You should see the following screen.
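If you prefer to skip the interactive prompts, a non-interactive variant such as the following also works. It writes the key to the default path with an empty passphrase, which is an assumption you should weigh against your own security requirements.

    # Generate an RSA key pair without prompting (empty passphrase)
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa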

Next, add the generated public key to the authorized_keys file and set the proper permissions with the following commands.

				
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 640 ~/.ssh/authorized_keys

Now, verify the passwordless SSH connection using the following command.

				
    ssh localhost

You should see the following screen.

Download and Install Hadoop

First, go to the Hadoop official download page, pick the latest available version, and download it with the following command.

				
    wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

Once Hadoop is downloaded, it is a good idea to verify the integrity of the archive before extracting it.
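The sketch below assumes the matching .sha512 checksum file is still published alongside the release; compare the two hashes manually.

    # Download the published checksum, then compare it against the local file
    wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512
    cat hadoop-3.3.4.tar.gz.sha512
    sha512sum hadoop-3.3.4.tar.gz

After confirming the hashes match, extract the downloaded file using the following command.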

				
    tar xzf hadoop-3.3.4.tar.gz

Next, rename the extracted directory to hadoop.

				
    mv hadoop-3.3.4 hadoop

Now, edit the ~/.bashrc file and add the Hadoop environment variables.

				
    nano ~/.bashrc

Add the following lines.

				
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    export HADOOP_HOME=/home/hadoop/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save and close the file when you are done. Then, activate the Hadoop environment variables using the following command.

				
    source ~/.bashrc

Now verify the Hadoop installation using the following command.

				
    hadoop version

This shows you the Hadoop version on the following screen.

Configure Hadoop

Firstly, edit the Hadoop environment variable file and define your Java installation path:

				
    nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment and set the JAVA_HOME line as follows.

				
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Save and close the file. Then, create the namenode and datanode directories using the following command.

				
    mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}

Next, edit the core-site.xml file and define the default HDFS file system URI.

				
    nano $HADOOP_HOME/etc/hadoop/core-site.xml

Change the following lines.

				
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

Then, edit the hdfs-site.xml file and define your namenode and datanode directories.

				
    nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the following lines to match your directory paths. Note that dfs.namenode.name.dir and dfs.datanode.data.dir are the current property names; the older dfs.name.dir and dfs.data.dir aliases still work but are deprecated.

				
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>

      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
      </property>

      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
      </property>
    </configuration>

Next, edit the file mapred-site.xml.

				
    nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes.

				
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

Next, edit the yarn-site.xml file.

				
    nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Change the following lines.

				
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>

Format Namenode and Start Hadoop

At this point, Hadoop is installed and configured. Now, you need to format the namenode with the HDFS file system. Format the Hadoop namenode using the following command.

				
    hdfs namenode -format

Once the namenode is formatted, you should see the following screen.

Next, start all Hadoop services using the following command.

				
    start-all.sh

You should see the following screen.
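You can also confirm that the Hadoop daemons are running with the jps tool that ships with the JDK. On a single-node setup like this one, you would expect processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager in the list.

    # List running Java processes; the Hadoop daemons should appear here
    jps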

To check the Hadoop listening ports, run the following command.

				
    ss -antpl | grep java

You should see all listening ports in the following screen.

Access Hadoop Web UI

At this point, the Hadoop namenode and datanode are installed and configured. Now, you can access the Hadoop namenode using the URL http://your-server-ip:9870. You should see the following screen.

To access the Hadoop datanode, visit the URL http://your-server-ip:9864. You should see the datanode web interface on the following screen.

To access the Hadoop application page, use the URL http://your-server-ip:8088. You should see the application page on the following screen.
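To go one step further, you can run a quick smoke test from the command line. The directory name below is arbitrary, and the example jar path assumes the Hadoop 3.3.4 layout used throughout this guide.

    # Create a directory in HDFS and list the root to confirm the file system works
    hdfs dfs -mkdir /test
    hdfs dfs -ls /

    # Run the bundled example MapReduce job that estimates pi on YARN
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 5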

Thank you for reading How to Install Hadoop on Ubuntu 22.04 / 20.04. We shall conclude this article now.

How to Install Hadoop on Ubuntu 22.04 / 20.04 Conclusion

In this post, we showed you how to install Hadoop on Ubuntu 22.04 / 20.04. We also explained how to start Hadoop and then access it via the web browser. Individuals and organisations benefit from Hadoop's scalability, affordability, flexibility, speed, and more. It is important to remember that the security of Hadoop depends on how the cluster is configured and how the security elements described above are deployed. Businesses must ensure they adhere to established practices for protecting their Hadoop cluster, such as using strong passwords, limiting access, and promptly applying security patches.

Hitesh Jethva

I am a fan of open source technology and have more than 10 years of experience working with Linux and Open Source technologies. I am one of the Linux technical writers for Cloud Infrastructure Services.
