Create Apache Spark Docker Container using Docker-Compose

In this post, we introduce Apache Spark and its advantages, then show you how to install Apache Spark using Docker Compose.

Many industries use Hadoop to analyse large data sets because it is based on a simple programming model. Apache Spark focuses on processing massive datasets quickly, in terms of both query response times and programme execution times. It has its own cluster management and typically uses Hadoop only for storage.

Let’s start creating an Apache Spark Docker container using Docker Compose.

What is Apache Spark?

Apache Spark is an open source, distributed computing framework used for large data workloads, batch processing, and machine learning. It is a fast, developer friendly data processing tool from the Apache Software Foundation, designed to improve the performance of data intensive applications. It provides high level APIs in Java, Scala, Python, and R that make it easy to build applications across a wide spectrum of use cases.

Spark also takes some of the programming burden for these tasks off the developer’s shoulders, thanks to a simple to use API that abstracts away much of the grunt work in distributed computing and large scale data processing.

Many banks, gaming companies, governments, tech giants, and telecommunication companies use Apache Spark because it supports SQL, streaming data, batch processing, and machine learning.

Spark Features

The Spark architecture is based on resilient distributed datasets (RDDs), which are immutable collections of data distributed across a cluster of machines. Users can apply Apache Spark to perform massive scale data transformations and analytics, followed by machine learning algorithms and graph computing applications.

Spark is also an in memory processing engine. It keeps intermediate data in memory where possible and can cache datasets for reuse. This lets Spark execute repeated or multi stage operations much faster than an engine that has to read the data from disk each time a query is executed.

Spark also works with both structured and unstructured data, making it more versatile than most SQL only engines.

Spark offers language bindings for Python, R, Java, and Scala, so app developers and data scientists alike get a tool that scales, with reliable performance and speed, without having to dig through lots of low level details.

Advantages of Apache Spark

Apache Spark has several advantages over other components of the Hadoop ecosystem. Have a look at some of its top benefits:

  • Speed and Performance: Thanks to in memory computing, Apache Spark can run multi stage jobs up to 100 times faster than Hadoop’s MapReduce, which is a large part of its popularity.
  • Supports Multiple Programming Languages: Apache Spark supports several commonly used programming languages (Python, Java, Scala, and R), and runs on anything from a laptop to a cluster of thousands of servers.
  • Used for Generic Purposes: You can use it for a variety of things, such as running distributed SQL, building data pipelines, ingesting data into a database, and more.
  • Performs memory based computations: A core feature of Spark is in memory cluster computing, which improves application throughput. It performs memory based computations in a fault tolerant fashion on large clusters using RDDs, which are immutable (read only) collections of objects of various types, computed across the different nodes of a cluster.
  • Offers Advanced Analytics: Apart from map and reduce operations, the developer friendly Spark also supports graph algorithms, SQL queries, machine learning, and streaming data.

How to Create Apache Spark Docker Container using Docker-Compose

In this section, we will show you how to create an Apache Spark Docker container using Docker Compose.

Step 1 - Install Required Dependencies

Before starting, it is always a good idea to update all system packages to the latest version. Update all of them by running the following commands:

				
					apt update -y
apt upgrade -y

				
			

After updating all system packages, run the following command to install other required dependencies:

				
					apt install apt-transport-https ca-certificates curl software-properties-common -y
				
			

Once you are done, proceed to install Docker and Docker Compose.

Step 2 - Install Docker and Docker Compose

The latest version of Docker is not available in the default Ubuntu repository, so you will need to install it from Docker’s official repository.

First, import the Docker GPG key using the following command:

				
					curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
				
			

Next, add the Docker repository using the following command:

				
					add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
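
The apt-key command used above is deprecated on recent Debian and Ubuntu releases. As a hedged alternative, the sketch below stores the key in a keyring file and references it with signed-by. The keyring path /etc/apt/keyrings/docker.asc and the helper names docker_repo_line and setup_docker_repo are assumptions for illustration, not part of the original steps.

```shell
# Build the apt source line for Docker's repository.
# Arguments: architecture, Ubuntu codename, keyring path.
docker_repo_line() {
    printf 'deb [arch=%s signed-by=%s] https://download.docker.com/linux/ubuntu %s stable\n' \
        "$1" "$3" "$2"
}

# Hypothetical helper: download the key and register the repository.
# It must run as root. It is only defined here, not called.
setup_docker_repo() {
    keyring=/etc/apt/keyrings/docker.asc   # assumed location
    install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o "$keyring"
    chmod a+r "$keyring"
    # Detect the architecture and release codename instead of hard-coding them.
    arch=$(dpkg --print-architecture)
    codename=$(. /etc/os-release && printf '%s' "$VERSION_CODENAME")
    docker_repo_line "$arch" "$codename" "$keyring" > /etc/apt/sources.list.d/docker.list
    apt update
}
```

Calling setup_docker_repo as root would take the place of the apt-key and add-apt-repository commands. The remaining steps stay the same.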
				
			

Once the repository is added, you can install Docker and Docker Compose using the following command:

				
					apt install docker-ce docker-compose -y
				
			

After the successful installation, verify the Docker version using the following command:

				
					docker --version
				
			

You should see the Docker version in output similar to the following:

				
					Docker version 20.10.21, build baeda1f
				
			

Next, start the Docker service and enable it to start at system reboot:

				
					systemctl start docker
systemctl enable docker
				
			

At this point, both Docker and Docker Compose are installed on your system. You can now proceed to the next step.
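
As a quick scripted check, the sketch below verifies that the docker and docker-compose commands are on the PATH. The function name require_cmds is illustrative only.

```shell
# Report any missing commands on stderr.
# Returns non-zero if at least one of the given commands is absent.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_cmds docker docker-compose && echo "ready"` prints ready only when both tools are installed.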

Step 3 - Create Dockerfile for Apache Spark

Here, you will need to create a Dockerfile that defines the Apache Spark image used for both the master and worker nodes. You can create it with the following command:

				
					nano Dockerfile
				
			

Add the following code:

				
					# builder step used to download and configure spark environment
FROM openjdk:11.0.11-jre-slim-buster as builder

# Add Dependencies for PySpark
RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates python3 python3-pip python3-numpy python3-matplotlib python3-scipy python3-pandas python3-simpy

RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)" 1

# Fix the value of PYTHONHASHSEED
# Note: this is needed when you use Python 3.3 or greater
ENV SPARK_VERSION=3.0.2 \
HADOOP_VERSION=3.2 \
SPARK_HOME=/opt/spark \
PYTHONHASHSEED=1

# Download and uncompress spark from the apache archive
RUN wget --no-verbose -O apache-spark.tgz "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
&& mkdir -p /opt/spark \
&& tar -xf apache-spark.tgz -C /opt/spark --strip-components=1 \
&& rm apache-spark.tgz


# Apache spark environment
FROM builder as apache-spark

WORKDIR /opt/spark

ENV SPARK_MASTER_PORT=7077 \
SPARK_MASTER_WEBUI_PORT=8080 \
SPARK_LOG_DIR=/opt/spark/logs \
SPARK_MASTER_LOG=/opt/spark/logs/spark-master.out \
SPARK_WORKER_LOG=/opt/spark/logs/spark-worker.out \
SPARK_WORKER_WEBUI_PORT=8080 \
SPARK_WORKER_PORT=7000 \
SPARK_MASTER="spark://spark-master:7077" \
SPARK_WORKLOAD="master"

EXPOSE 8080 7077 6066

RUN mkdir -p $SPARK_LOG_DIR && \
touch $SPARK_MASTER_LOG && \
touch $SPARK_WORKER_LOG && \
ln -sf /dev/stdout $SPARK_MASTER_LOG && \
ln -sf /dev/stdout $SPARK_WORKER_LOG

COPY start-spark.sh /

CMD ["/bin/bash", "/start-spark.sh"]

				
			

Save and close the file when you are finished.

The above configuration will do the following:

  • Start from an OpenJDK 11 base image and install Python and other dependencies.
  • Download Apache Spark 3.0.2 and extract it to the /opt/spark directory.
  • Set environment variables for the Apache Spark master and worker nodes.
  • Expose the ports 8080, 7077 and 6066.
  • Run the start-spark.sh script.
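
To make the download step concrete, here is a small sketch of how the archive URL in the Dockerfile is assembled from SPARK_VERSION and HADOOP_VERSION. The function name spark_archive_url is made up for illustration. The URL pattern is the one used in the wget line of the Dockerfile.

```shell
# Print the Apache archive URL for a Spark binary release.
# Arguments: Spark version, Hadoop profile version.
spark_archive_url() {
    spark_version=$1
    hadoop_version=$2
    printf 'https://archive.apache.org/dist/spark/spark-%s/spark-%s-bin-hadoop%s.tgz\n' \
        "$spark_version" "$spark_version" "$hadoop_version"
}

# Values taken from the ENV instruction in the Dockerfile.
# prints https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop3.2.tgz
spark_archive_url 3.0.2 3.2
```

Changing the two ENV values changes which archive is downloaded, so it is worth confirming that the resulting URL exists before rebuilding the image.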

The next step is to create the start-spark.sh file referenced in the Dockerfile above:

				
					nano start-spark.sh
				
			

Add the following script:

				
					#!/bin/bash
. "/opt/spark/bin/load-spark-env.sh"
# When the spark work_load is master run class org.apache.spark.deploy.master.Master
if [ "$SPARK_WORKLOAD" == "master" ];
then

export SPARK_MASTER_HOST=`hostname`

cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.master.Master --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG

elif [ "$SPARK_WORKLOAD" == "worker" ];
then
# When the spark work_load is worker run class org.apache.spark.deploy.worker.Worker
cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.worker.Worker --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER >> $SPARK_WORKER_LOG

elif [ "$SPARK_WORKLOAD" == "submit" ];
then
    echo "SPARK SUBMIT"
else
    echo "Undefined Workload Type $SPARK_WORKLOAD, must specify: master, worker, submit"
fi
				
			

Save and close the file when you are done. Next, run the following command to build the Apache Spark image.

				
					docker build -t cluster-apache-spark:3.0.2 .
				
			

You should see the following output:

				
					 ---> Running in 643f406b9920
Removing intermediate container 643f406b9920
 ---> 47097ac88c1a
Step 9/12 : EXPOSE 8080 7077 6066
 ---> Running in 4987c51e99af
Removing intermediate container 4987c51e99af
 ---> 5164f30db28b
Step 10/12 : RUN mkdir -p $SPARK_LOG_DIR && touch $SPARK_MASTER_LOG && touch $SPARK_WORKER_LOG && ln -sf /dev/stdout $SPARK_MASTER_LOG && ln -sf /dev/stdout $SPARK_WORKER_LOG
 ---> Running in 636c94c3035d
Removing intermediate container 636c94c3035d
 ---> e1e057d85b0a
Step 11/12 : COPY start-spark.sh /
 ---> b04caed7b9d7
Step 12/12 : CMD ["/bin/bash", "/start-spark.sh"]
 ---> Running in 3677d0d8abdb
Removing intermediate container 3677d0d8abdb
 ---> e9f89d95c667
Successfully built e9f89d95c667
Successfully tagged cluster-apache-spark:3.0.2

				
			

Once you are finished you can proceed to the next step.
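
Before moving on, the image can be smoke tested. The sketch below runs a throwaway container with SPARK_WORKLOAD=submit, the branch of start-spark.sh that only echoes SPARK SUBMIT, and checks for that text. The helper name smoke_test_image is hypothetical, and the check assumes the image's CMD runs /start-spark.sh.

```shell
# Run the image once with the "submit" workload and look for the echo
# from start-spark.sh. Returns 0 when the expected text appears.
smoke_test_image() {
    image=$1
    output=$(docker run --rm -e SPARK_WORKLOAD=submit "$image" 2>&1) || return 1
    printf '%s\n' "$output" | grep -q 'SPARK SUBMIT'
}
```

For example, `smoke_test_image cluster-apache-spark:3.0.2 && echo "image OK"` reports success only if the container started and reached the script.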

Step 4 - Create a Docker Compose File for Apache Spark

In this step, you will create a Docker Compose file that defines and runs the Apache Spark containers. Create it with the following command:

				
					nano docker-compose.yml
				
			

Now add the following lines:

				
					version: "3.3"
services:
  spark-master:
    image: docker.io/bitnami/spark:3.3
    ports:
      - "9090:8080"
      - "7077:7077"
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    environment:
      - SPARK_LOCAL_IP=spark-master
      - SPARK_WORKLOAD=master
  spark-worker-a:
    image: docker.io/bitnami/spark:3.3
    ports:
      - "9091:8080"
      - "7000:7000"
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=1G
      - SPARK_DRIVER_MEMORY=1G
      - SPARK_EXECUTOR_MEMORY=1G
      - SPARK_WORKLOAD=worker
      - SPARK_LOCAL_IP=spark-worker-a
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
  spark-worker-b:
    image: docker.io/bitnami/spark:3.3
    ports:
      - "9092:8080"
      - "7001:7000"
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=1G
      - SPARK_DRIVER_MEMORY=1G
      - SPARK_EXECUTOR_MEMORY=1G
      - SPARK_WORKLOAD=worker
      - SPARK_LOCAL_IP=spark-worker-b
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
  demo-database:
    image: postgres:11.7-alpine
    ports: 
      - "5432:5432"
    environment: 
      - POSTGRES_PASSWORD=securepassword

				
			

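Save and close the file when you are done. Note that the Spark services in this file use the prebuilt docker.io/bitnami/spark:3.3 image, not the cluster-apache-spark:3.0.2 image built in Step 3. To run your own build, change their image value to cluster-apache-spark:3.0.2. Then run the following command to launch the Apache Spark containers: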

				
					docker-compose up -d
				
			

You should see the following output:

				
					Pulling spark-master (docker.io/bitnami/spark:3.3)...
3.3: Pulling from bitnami/spark
9dce2fae8330: Pull complete
2f16a53695ed: Pull complete
Digest: sha256:fb8ff4a361bbf6eb1d213f4eca862a33d6d2506b138f4ec2ba106e968cde2118
Status: Downloaded newer image for bitnami/spark:3.3
Pulling demo-database (postgres:11.7-alpine)...
11.7-alpine: Pulling from library/postgres
cbdbe7a5bc2a: Pull complete
b52a8a2ca21a: Pull complete
e36a19831e31: Pull complete
f1aa26821845: Pull complete
412d098142b4: Pull complete
75d5ef10726d: Pull complete
ae3b5a8bbf62: Pull complete
e2f290791a5c: Pull complete
187b81308ed8: Pull complete
Digest: sha256:77fcd2c7fceea2e3b77e7a06dfc231e70d45cad447e6022346b377aab441069f
Status: Downloaded newer image for postgres:11.7-alpine
Creating root_demo-database_1 ... done
Creating root_spark-master_1  ... done
Creating root_spark-worker-b_1 ... done
Creating root_spark-worker-a_1 ... done

				
			

To verify that all containers are running, run the following command:

				
					docker ps
				
			

You should see the following output:

				
					CONTAINER ID   IMAGE                  COMMAND                  CREATED              STATUS              PORTS                                                                                  NAMES
a29b1cf3062f   bitnami/spark:3.3      "/opt/bitnami/script…"   About a minute ago   Up About a minute   0.0.0.0:7000->7000/tcp, :::7000->7000/tcp, 0.0.0.0:9091->8080/tcp, :::9091->8080/tcp   root_spark-worker-a_1
8d5e6efb9c44   bitnami/spark:3.3      "/opt/bitnami/script…"   About a minute ago   Up About a minute   0.0.0.0:7001->7000/tcp, :::7001->7000/tcp, 0.0.0.0:9092->8080/tcp, :::9092->8080/tcp   root_spark-worker-b_1
bd1bd28315ea   bitnami/spark:3.3      "/opt/bitnami/script…"   About a minute ago   Up About a minute   0.0.0.0:7077->7077/tcp, :::7077->7077/tcp, 0.0.0.0:9090->8080/tcp, :::9090->8080/tcp   root_spark-master_1
ebf78d5fed73   postgres:11.7-alpine   "docker-entrypoint.s…"   About a minute ago   Up About a minute   0.0.0.0:5432->5432/tcp, :::5432->5432/tcp                                              root_demo-database_1
				
			

To verify the downloaded images, run the following command:

				
					docker images
				
			

You should see the following output:

				
					REPOSITORY             TAG                       IMAGE ID       CREATED         SIZE
cluster-apache-spark   3.0.2                     e9f89d95c667   3 minutes ago   1.16GB
bitnami/spark          3.3                       a5187599fe89   2 days ago      1.23GB
openjdk                11.0.11-jre-slim-buster   f1d5c8a9bc51   17 months ago   220MB
postgres               11.7-alpine               36ff18d21807   2 years ago     150MB
				
			

Once you are finished, you can proceed to the next step.
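
The manual check with docker ps can also be scripted. This sketch counts running containers whose name contains a given substring, using the --filter and --format options of docker ps. The function name count_running is an assumption, and the expected total of three Spark containers comes from the Compose file in this step.

```shell
# Count running containers whose name contains the given substring.
# Prints the count; "|| true" keeps the exit status 0 when the count is 0.
count_running() {
    docker ps --filter "status=running" --filter "name=$1" --format '{{.Names}}' \
        | grep -c . || true
}

# Example use (not executed here):
#   [ "$(count_running spark)" -eq 3 ] && echo "all Spark containers are up"
```

The substring spark matches the master and both workers but not the demo-database container.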

Step 5 - Access Apache Spark

At this point, Apache Spark is installed and running. Now, open your web browser and access the Apache Spark master using the URL http://your-server-ip:9090. You should see the Apache Spark master dashboard. The web UIs of the two worker containers are published on ports 9091 and 9092.
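
The same check can be made from the command line. The sketch below queries the JSON summary that the standalone master web UI serves under /json/ and looks for an ALIVE status. The function name master_is_alive is hypothetical, and the exact JSON layout is an assumption, so the pattern is kept loose.

```shell
# Return 0 when the Spark master UI at the given base URL reports ALIVE.
master_is_alive() {
    base_url=${1%/}    # drop one trailing slash, if any
    curl -fsS "$base_url/json/" \
        | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ALIVE"'
}
```

For example, `master_is_alive http://your-server-ip:9090 && echo "master is up"` succeeds only when the UI answers and reports an active master.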

If you have any problem accessing any of the nodes, you can check the container logs using the following command:

				
					docker-compose logs
				
			

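You should see output similar to the following. Here the spark-worker-b container reports that it started in master mode, because the bitnami/spark image selects its role from the SPARK_MODE variable rather than the SPARK_WORKLOAD variable set in the Compose file: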

				
					spark-worker-b_1  |  11:32:01.78 INFO  ==> ** Starting Spark in master mode **
spark-worker-b_1  | rsync from spark://spark-master:7077
spark-worker-b_1  | /opt/bitnami/spark/sbin/spark-daemon.sh: line 177: rsync: command not found
spark-worker-b_1  | starting org.apache.spark.deploy.master.Master, logging to /opt/bitnami/spark/logs/spark--org.apache.spark.deploy.master.Master-1-8d5e6efb9c44.out
spark-worker-b_1  | Spark Command: /opt/bitnami/java/bin/java -cp /opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 8d5e6efb9c44 --port 7077 --webui-port 8080
spark-worker-b_1  | ========================================
spark-worker-b_1  | Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
spark-worker-b_1  | 22/11/24 11:32:08 INFO Master: Started daemon with process name: 44@8d5e6efb9c44
spark-worker-b_1  | 22/11/24 11:32:08 INFO SignalUtils: Registering signal handler for TERM
spark-worker-b_1  | 22/11/24 11:32:08 INFO SignalUtils: Registering signal handler for HUP
spark-worker-b_1  | 22/11/24 11:32:08 INFO SignalUtils: Registering signal handler for INT
spark-worker-b_1  | 22/11/24 11:32:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
spark-worker-b_1  | 22/11/24 11:32:11 INFO SecurityManager: Changing view acls to: spark
spark-worker-b_1  | 22/11/24 11:32:11 INFO SecurityManager: Changing modify acls to: spark
spark-worker-b_1  | 22/11/24 11:32:11 INFO SecurityManager: Changing view acls groups to: 
spark-worker-b_1  | 22/11/24 11:32:11 INFO SecurityManager: Changing modify acls groups to: 
spark-worker-b_1  | 22/11/24 11:32:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-b_1  | 22/11/24 11:32:13 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
spark-worker-b_1  | 22/11/24 11:32:13 INFO Master: Starting Spark master at spark://8d5e6efb9c44:7077
spark-worker-b_1  | 22/11/24 11:32:13 INFO Master: Running Spark version 3.3.1
spark-worker-b_1  | 22/11/24 11:32:14 INFO Utils: Successfully started service 'MasterUI' on port 8080.
spark-worker-b_1  | 22/11/24 11:32:14 INFO MasterWebUI: Bound MasterWebUI to spark-worker-b, and started at http://8d5e6efb9c44:8080
spark-worker-b_1  | 22/11/24 11:32:15 INFO Master: I have been elected leader! New state: ALIVE

				
			

To stop Apache Spark, run the following command:

				
					docker-compose down
				
			

This will stop and remove all containers, as shown below:

				
					Stopping root_spark-worker-a_1 ... done
Stopping root_spark-worker-b_1 ... done
Stopping root_spark-master_1   ... done
Stopping root_demo-database_1  ... done
Removing root_spark-worker-a_1 ... done
Removing root_spark-worker-b_1 ... done
Removing root_spark-master_1   ... done
Removing root_demo-database_1  ... done
Removing network root_default

				
			

Thank you for reading how to create an Apache Spark Docker container using Docker Compose. We shall conclude.

Create Apache Spark Docker Container using Docker-Compose Conclusion

In this post, we explained how to create an Apache Spark Docker container using Docker Compose. Apache Spark is a data processing framework designed to make it easier to work with large datasets. It is not just a SQL only engine; it is a lot more than that.

The developer friendly Apache Spark supports multiple programming languages and APIs. By providing bindings for popular data analytics languages such as Python and R, Apache Spark allows everyone from app developers to data scientists to take advantage of its scale and speed in a cost effective way.

Hitesh Jethva

I am a fan of open source technology and have more than 10 years of experience working with Linux and Open Source technologies. I am one of the Linux technical writers for Cloud Infrastructure Services.
