Setup Apache Spark on Ubuntu in Azure/AWS/GCP

Set up and install Apache Spark on Ubuntu 24.04 / 22.04 in Azure, AWS, or Google GCP. Deploy using the marketplace images below for a two-click deployment of a fully configured Apache Spark server. Apache Spark is an open-source, unified analytics engine designed for large-scale data processing, enabling fast, in-memory computation across batch, streaming, and machine learning workloads.


Apache Spark Azure


Deploy Apache Spark on Ubuntu 24.04 in Azure

 


Deploy Apache Spark on Ubuntu 22.04 in Azure

Apache Spark AWS

Coming soon…

Apache Spark GCP

Coming soon…

Getting Started with Apache Spark

Once your Apache Spark server has been deployed, the following links explain how to connect to a Linux VM:

 

 

Once connected and logged in, the following section explains how to start using Spark.

Start Apache Spark Shell

The Spark shell provides an interactive environment for working with Spark in Scala (default) or Python (via pyspark):

 

  • Scala Shell:

    /opt/spark/bin/spark-shell

  • Python Shell (PySpark):

    /opt/spark/bin/pyspark

Use these shells to explore Spark’s capabilities by running commands interactively.
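For a quick smoke test inside the PySpark shell (which pre-creates a SparkSession as `spark`), you might run something like the following; the values here are illustrative:

```
>>> df = spark.range(5)          # DataFrame with one "id" column, values 0-4
>>> df.count()
5
>>> spark.sparkContext.parallelize([1, 2, 3]).sum()
6
```

Type exit() (or press Ctrl+D) to leave the shell.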

Start Spark Master and Worker Instances

Run the following command to start both the driver (master) and worker instances:

    /opt/spark/sbin/start-all.sh

Enter your password when prompted. Apache Spark is now running.

Access Spark Web UI

When running jobs, Spark provides a Web UI to monitor job status, available at:

 

  • Standalone Mode: http://ServerIPAddress:8080
  • Cluster Mode: Typically accessible from the master node, with ports depending on the cluster manager.
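Once the firewall allows it, you can verify that the master Web UI responds with a quick curl from your workstation; a minimal sketch, where the IP address is a placeholder for your server:

```shell
# -sS: silent but show errors; --max-time caps the wait.
# curl exits 0 only if the UI answered.
MASTER_IP="${MASTER_IP:-127.0.0.1}"
if curl -sS --max-time 5 -o /dev/null "http://$MASTER_IP:8080"; then
  echo "Spark master Web UI reachable"
else
  echo "Web UI not reachable (check the service and firewall rules)"
fi
```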

Apache Spark Environment Variables

By default the following environment variables have already been configured:

 

  • $SPARK_HOME = "/opt/spark"
  • $PYSPARK_PYTHON = "/usr/bin/python3"
  • $PATH = "$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
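These are typically persisted in the login shell profile (e.g. ~/.bashrc) or in /opt/spark/conf/spark-env.sh; a minimal sketch, assuming the /opt/spark layout used throughout this guide:

```shell
# Append to ~/.bashrc so the variables survive new login sessions
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```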

Basic Apache Spark Commands

The following table lists the basic commands for starting and stopping the Apache Spark (driver) master server and workers in a single-machine setup. 

 

These scripts are to be run from the following directory: /opt/spark/sbin

Command                                    | Description
start-master.sh                            | Start the driver (master) server instance on the current machine.
stop-master.sh                             | Stop the driver (master) server instance on the current machine.
start-worker.sh spark://master_server:port | Start a worker process and connect it to the master server (use the master’s IP or hostname).
stop-worker.sh                             | Stop a running worker process.
start-all.sh                               | Start both the driver (master) and worker instances.
stop-all.sh                                | Stop all driver (master) and worker instances.

The start-all.sh and stop-all.sh commands work for single-node setups, but in multi-node clusters, you must configure passwordless SSH login on each node. This allows the master server to control the worker nodes remotely.
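A minimal sketch of that key setup, run on the master node; `ubuntu@worker1` is a placeholder hostname, and the ssh-copy-id step would be repeated for every worker:

```shell
# Generate a passphrase-less key pair (skipped if one already exists)
KEY="${KEY:-$HOME/.ssh/id_spark}"
mkdir -p "$(dirname "$KEY")"
[ -f "$KEY" ] || ssh-keygen -t ed25519 -N '' -f "$KEY" -q
# Copy the public key to each worker so the master can log in without a password
# ssh-copy-id -i "$KEY.pub" ubuntu@worker1
```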

Firewall Ports

Service                             | Default Port | Configuration Option
Spark Application Web UI            | 4040         | spark.ui.port
Spark Master Web UI                 | 8080         | spark.master.ui.port
Spark Worker Web UI                 | 8081         | spark.worker.ui.port
Block Manager                       | 5000–5005    | spark.blockManager.port
Driver Communication                | Random       | spark.driver.port
Executor Communication              | Random       | spark.executor.port
Spark History Server                | 18080        | spark.history.ui.port
Spark REST Server (Standalone Mode) | 6066         | spark.master.rest.port

 

These ports can be adjusted in Spark’s configuration files (spark-defaults.conf or spark-env.sh) as needed. 
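For example, to pin the otherwise random driver port so a firewall rule can match it, you might add entries like these to /opt/spark/conf/spark-defaults.conf (the port values are illustrative, not required):

```
spark.ui.port            4040
spark.master.ui.port     8080
spark.blockManager.port  5000
spark.driver.port        40000
```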

The links below explain how to modify / create firewall rules depending on which cloud platform you are using.

 

To setup AWS firewall rules refer to – AWS Security Groups

To setup Azure firewall rules refer to – Azure Network Security Groups

To setup Google GCP firewall rules refer to – Creating GCP Firewalls

Support / Documentation

 

If you have any issues getting Apache Spark up and running from our image, please contact us.

Disclaimer: Apache Spark™ is a trademark of the Apache Software Foundation (ASF) and is licensed under Apache License 2.0. This image is provided & maintained by Cloud Infrastructure Services & is not affiliated with, endorsed by, or sponsored by any company. Any trademarks, service marks, product names, or named features are assumed to be the property of their respective owners. The use of these trademarks does not imply any relationship or endorsement unless explicitly stated.

Andrew Fitzgerald

Cloud Solution Architect. Helping customers transform their business to the cloud. 20 years experience working in complex infrastructure environments and a Microsoft Certified Solutions Expert on everything Cloud.
