How to Install Apache Spark on Ubuntu 20.04 (Step by Step)

How to Install Apache Spark on Ubuntu 20.04 (Step by Step). In this guide, we introduce Apache Spark, its features and advantages, then explain how to install Apache Spark on an Ubuntu 20.04 server.

What Is Apache Spark?

Apache Spark is an open source, distributed processing system you can use for big data workloads. It optimizes query execution and uses in memory caching for data of all sizes. As a result, Apache Spark is one of the fastest general purpose engines for large scale data processing.

Apache Spark is considered fast because it processes data in memory rather than reading from disk for every operation. Moreover, you can use it for many tasks, such as running distributed SQL, ingesting data into a database, creating data pipelines, working with graphs or data streams, and running machine learning algorithms, among others.

Features of Apache Spark

The features of Apache Spark are:

  • Swift Processing – Apache Spark provides high data processing speed by reducing the number of reads and writes to disk.
  • Dynamic – Spark enables you to develop parallel applications using more than 80 high level operators.
  • In Memory Computation – In memory processing increases processing speed effortlessly. Since Spark caches data, it does not have to fetch data from disk every time, which saves tremendous time. Its DAG execution engine enables in memory computation and acyclic data flow, which results in high speed.
  • Reusability – You can reuse Spark code for batch processing, join streams against historical data, or run ad hoc queries on stream state.
  • Real Time Stream Processing – Spark provides real time stream processing, so you can handle and process data as it arrives as well as data that is already present.
  • Lazy Evaluation in Apache Spark – Transformations on a Spark RDD are evaluated lazily. They do not compute a result immediately; instead they form a new RDD from the existing one, and the computation only runs when an action is called. Hence, it increases the efficiency of the system (see the short sketch after this list).
  • Cost Efficient – Spark is a cost effective solution for big data problems, as it does not require a large amount of storage or a dedicated data center.
  • Spark GraphX – GraphX is a component for graphs and graph parallel computation. With the help of GraphX, graph analytics tasks are simplified by its built-in graph algorithms and builders.
  • Support For Sophisticated Analysis – Spark includes dedicated tools for streaming data, interactive queries, and machine learning, which go beyond simple map and reduce operations.
  • Multiple Language Support – Spark supports multiple languages such as Java, R, Scala, and Python. This delivers flexibility and addresses one of the limitations of Hadoop.
  • Integrated With Hadoop – Spark can run standalone or on the Hadoop YARN cluster manager, and it can read existing Hadoop data, which makes the system highly flexible.
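
To illustrate the lazy evaluation described above, here is a minimal Scala sketch you could paste into Spark's interactive spark-shell (included in the bin directory of the Spark download installed later in this guide). The dataset and numbers are purely illustrative.

// Transformations such as map and filter are lazy: nothing is computed here,
// each call only records a new RDD derived from the previous one.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n)
val evenSquares = squares.filter(_ % 2 == 0)

// Only when an action such as count is called does Spark build the DAG
// and actually execute the computation.
println(evenSquares.count())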

Advantages of Apache Spark

The advantages of Apache Spark are:

  • Ease Of Use – Apache Spark offers easy to use APIs for operating on large datasets. Since it provides more than 80 high level operators, building parallel apps becomes effortless.
  • Advanced Analytics – Spark supports ‘Map’ and ‘Reduce’ operations. It also supports machine learning, graph algorithms, streaming data, and SQL queries, among others.
  • Multilingual – As discussed above, Apache Spark supports several languages, including Python, Java, Scala, etc.
  • Increased Access To Big Data – Apache Spark is opening up big data to a wider audience. Large organizations such as IBM have announced initiatives to train more than one million data engineers and data scientists on Apache Spark.
  • Demand For Spark Developers – Apache Spark benefits not only organizations but developers as well. Spark developers are in such demand that businesses are ready to invest time and money in hiring experts skilled in Apache Spark.

Now we move to the main part of the article, How to Install Apache Spark on Ubuntu 20.04 (Step by Step).

How to Install Apache Spark on Ubuntu 20.04 (Step by Step)

In this section, we will show you how to install Apache Spark on Ubuntu 20.04.

Step 1 - Update the System

Before starting, you need to update and upgrade all system packages to the latest versions. You can update all of them by running the following commands:

apt update -y
apt upgrade -y

Once all the packages are updated, you can proceed to the next step.

Step 2 - Install Java JDK

Apache Spark is a Java based application, so the next step is to install the Java JDK on your server. You can install it by running the following command:

apt-get install default-jdk -y

Once the Java JDK is installed, you can verify the Java version using the following command:

java --version

Now you should see the Java version in the following output:

openjdk 11.0.17 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)
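
Optionally, you can also set the JAVA_HOME environment variable so other tools can find the JDK. The path below assumes the default-jdk package placed OpenJDK under /usr/lib/jvm/default-java, the usual location on Ubuntu 20.04; adjust it if your installation differs.

# Set JAVA_HOME for the current shell session
export JAVA_HOME=/usr/lib/jvm/default-java

# Persist JAVA_HOME for future logins
echo "JAVA_HOME=/usr/lib/jvm/default-java" >> /etc/environment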

Step 3 - Install Scala

Then, you also need to install Scala on your server. Install it by running the following command:

apt-get install scala -y

After the installation, you can verify the Scala version with the following command:

scala -version

You will get the following output:

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Now start the Scala shell with the following command:

scala

At this stage you should see the Scala shell in the following output:

Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.17).
Type in expressions for evaluation. Or try :help.

scala>

To test Scala, run the following command:

scala> println("Testing Scala")

You should get the following output:

Testing Scala

Press CTRL+D to exit from the Scala shell.

Step 4 - Install Apache Spark

First, download Apache Spark 3.3.1 from the official Apache archive using the following command (check the Apache Spark download page for newer releases):

wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

Once Apache Spark is downloaded, extract the archive with the following command:

tar -xvzf spark-3.3.1-bin-hadoop3.tgz

Next, move the extracted directory to the /mnt directory with the following command:

mv spark-3.3.1-bin-hadoop3 /mnt/spark

Step 5 - Start Apache Spark

In this step, you will need to edit the .bashrc file and define the Apache Spark path. Edit it with the following command:

nano ~/.bashrc

Add the following lines:

export SPARK_HOME=/mnt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file, then reload the environment variables with the following command:

source ~/.bashrc
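
At this point, you can confirm that the Spark binaries are on your PATH by printing the Spark version; the exact output depends on the release you downloaded.

# Print the Spark version to confirm the PATH is set correctly
spark-submit --version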

Next, start the Apache Spark master with the following command:

start-master.sh

You should see the following output:

starting org.apache.spark.deploy.master.Master, logging to /mnt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-spark.out

If you want to stop the Apache Spark master, run the following command:

stop-master.sh

Step 6 - Access Apache Spark Master

By default, the Apache Spark master web UI listens on port 8080. Verify the listening port using the following command:

ss -tpln | grep 8080

You should get the following output:

LISTEN   0        1                            *:8080                  *:*       users:(("java",pid=40443,fd=267))

Now, open your web browser and access the Apache Spark master node using the URL http://your-server-ip:8080. You should see the Apache Spark master dashboard.
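
Besides the web UI on port 8080, the master also listens on port 7077, which is the port workers use to register with it in the next step. If you like, you can verify it the same way:

# Confirm the master is accepting worker connections on port 7077
ss -tpln | grep 7077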

Step 7 - Access Apache Spark Worker Node

First, start the Apache Spark worker service and point it at the master. Start it with the following command:

start-worker.sh spark://your-server-ip:7077

You will get the following output:

starting org.apache.spark.deploy.worker.Worker, logging to /mnt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-spark.out

Now, refresh the Apache Spark master dashboard. You should see the newly added worker node listed under Workers.
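
With both the master and the worker running, you can optionally verify the cluster end to end by submitting the SparkPi example that ships with the Spark download. This is a minimal sketch: the wildcard matches the examples jar bundled under /mnt/spark/examples, and the final argument (10) is just the number of partitions used for the estimate.

# Submit the bundled SparkPi example to the standalone master
spark-submit \
  --master spark://your-server-ip:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10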

If you want to stop the Worker, run the following command:

stop-worker.sh

You can also follow the Apache Spark worker logs using the following command:

tail -f /mnt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-spark.out

Check out the following output:

22/11/24 11:21:24 INFO Worker: Running Spark version 3.3.1
22/11/24 11:21:24 INFO Worker: Spark home: /mnt/spark
22/11/24 11:21:24 INFO ResourceUtils: ==============================================================
22/11/24 11:21:24 INFO ResourceUtils: No custom resources configured for spark.worker.
22/11/24 11:21:24 INFO ResourceUtils: ==============================================================
22/11/24 11:21:24 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
22/11/24 11:21:24 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://65.20.72.235:8081
22/11/24 11:21:24 INFO Worker: Connecting to master spark:7077...
22/11/24 11:21:24 INFO TransportClientFactory: Successfully created connection to spark/127.0.1.1:7077 after 69 ms (0 ms spent in bootstraps)
22/11/24 11:21:25 INFO Worker: Successfully registered with master spark://spark:7077

Step 8 - Create a Systemd Service File for Apache Spark

It is a good idea to create systemd service files to manage Apache Spark via the systemctl command, so that Apache Spark starts automatically on every system reboot and you do not need to start it manually.

Create a systemd service file for Apache Spark Master using the following command:

nano /etc/systemd/system/spark-master.service

Add the following configurations:

[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=root
Group=root
ExecStart=/mnt/spark/sbin/start-master.sh
ExecStop=/mnt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target

Save and close the file when you are finished. Then, create a systemd service file for the Apache Spark worker using the following command:

nano /etc/systemd/system/spark-worker.service

Add the following configurations:

[Unit]
Description=Apache Spark Worker
After=network.target

[Service]
Type=forking
User=root
Group=root
ExecStart=/mnt/spark/sbin/start-worker.sh spark://your-server-ip:7077
ExecStop=/mnt/spark/sbin/stop-worker.sh

[Install]
WantedBy=multi-user.target

Save and close the file. Then, reload the systemd daemon to apply the changes.

systemctl daemon-reload

Next, start both the Spark master and worker services and enable them to start at system reboot:

systemctl start spark-master
systemctl enable spark-master
systemctl start spark-worker
systemctl enable spark-worker

Now, you can easily manage both services using the systemctl command.
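
For example, the commands below (using the unit names defined above) check the status of the services, restart the master, and stop the worker:

# Check whether the master and worker services are running
systemctl status spark-master
systemctl status spark-worker

# Restart the master, for example after changing its configuration
systemctl restart spark-master

# Stop the worker service
systemctl stop spark-worker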

Thank you for reading How to Install Apache Spark on Ubuntu 20.04 (Step by Step). We shall now conclude the article.

How to Install Apache Spark on Ubuntu 20.04 Conclusion

In this guide, you learned how to install Apache Spark on an Ubuntu 20.04 server. You also learned how to create systemd service files to manage Apache Spark. Apache Spark is capable of handling a wide range of analytics challenges thanks to its low latency in memory data processing capabilities and its well built libraries for graph analytics and machine learning.

What is more, it provides programmers with in memory cluster computing that can be used for a variety of tasks, such as building data pipelines, streaming data, running machine learning algorithms, and working with graphs. All in all, Apache Spark can be up to a hundred times faster than MapReduce for in memory workloads and offers several benefits over other Apache Hadoop components. Review the benefits listed above and use them to analyse large data sets with ease.

To read more of our Apache content, please navigate to our blog here.

Hitesh Jethva

I am a fan of open source technology and have more than 10 years of experience working with Linux and Open Source technologies. I am one of the Linux technical writers for Cloud Infrastructure Services.
