Set up and install Apache Spark on Ubuntu 24.04 / 22.04 on Azure, AWS, or Google Cloud (GCP). Deploy using the marketplace images below for a two-click deployment of a fully configured Apache Spark installation. Apache Spark is an open-source, unified analytics engine designed for large-scale data processing, enabling fast, in-memory computation across batch, streaming, and machine learning workloads.
Cloud Apache Spark Self-Hosted IaaS
Apache Spark Azure
Deploy Apache Spark on Ubuntu 24.04 in Azure
Deploy Apache Spark on Ubuntu 22.04 in Azure
Apache Spark AWS
Coming soon…
Apache Spark GCP
Coming soon…
Getting Started with Apache Spark
Once your Apache Spark server has been deployed, the following links explain how to connect to a Linux VM:
Cluster Mode: Typically accessible from the master node, with ports depending on the cluster manager.
Apache Spark Environment Variables
By default, the following environment variables have already been configured:
$SPARK_HOME = "/opt/spark"
$PYSPARK_PYTHON = "/usr/bin/python3"
$PATH = "$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
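As a quick sanity check, the sketch below echoes these variables and confirms the Spark directories are on the PATH (values assume the image defaults listed above; adjust if your installation differs):

```shell
# Sanity-check the preconfigured environment (assumed defaults: /opt/spark
# install path and system Python 3; adjust for your own setup).
SPARK_HOME="/opt/spark"
PYSPARK_PYTHON="/usr/bin/python3"
PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
echo "SPARK_HOME=$SPARK_HOME"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "Spark bin directory is on PATH" ;;
  *)                     echo "Spark bin directory is missing from PATH" ;;
esac
```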
Basic Apache Spark Commands
The following table lists the basic commands for starting and stopping the Apache Spark (driver) master server and workers in a single-machine setup.
These scripts are to be run from the following directory: /opt/spark/sbin
| Command | Description |
| --- | --- |
| start-master.sh | Start the driver (master) server instance on the current machine. |
| stop-master.sh | Stop the driver (master) server instance on the current machine. |
| start-worker.sh spark://master_server:port | Start a worker process and connect it to the master server (use the master's IP or hostname). |
| stop-worker.sh | Stop a running worker process. |
| start-all.sh | Start both the driver (master) and worker instances. |
| stop-all.sh | Stop all driver (master) and worker instances. |
The start-all.sh and stop-all.sh commands work for single-node setups, but in multi-node clusters, you must configure passwordless SSH login on each node. This allows the master server to control the worker nodes remotely.
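As a sketch, a single-node bring-up using these scripts might look like the following (it assumes the default /opt/spark install path and the standalone master's default port 7077, and falls back to a message if the scripts are not present):

```shell
# Sketch: start a single-node standalone cluster. Assumes the default
# /opt/spark install and the standalone master's default port 7077.
SBIN="/opt/spark/sbin"
MASTER_URL="spark://$(hostname):7077"
echo "Workers will connect to: $MASTER_URL"
if [ -x "$SBIN/start-master.sh" ]; then
  "$SBIN/start-master.sh"                # start the master on this machine
  "$SBIN/start-worker.sh" "$MASTER_URL"  # attach a worker to it
else
  echo "Spark sbin scripts not found at $SBIN"
fi
```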
Firewall Ports
| Service | Default Port | Configuration Option |
| --- | --- | --- |
| Spark Application Web UI | 4040 | spark.ui.port |
| Spark Master Web UI | 8080 | spark.master.ui.port |
| Spark Worker Web UI | 8081 | spark.worker.ui.port |
| Block Manager | 5000–5005 | spark.blockManager.port |
| Driver Communication | Random | spark.driver.port |
| Executor Communication | Random | spark.executor.port |
| Spark History Server | 18080 | spark.history.ui.port |
| Spark REST Server (Standalone Mode) | 6066 | spark.master.rest.port |
These ports can be adjusted in Spark’s configuration files (spark-defaults.conf or spark-env.sh) as needed.
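For example, a hypothetical spark-defaults.conf fragment that pins the randomly assigned ports to fixed values (so firewall rules can reference them) might look like this; the port numbers are illustrative, not defaults:

```properties
# /opt/spark/conf/spark-defaults.conf (illustrative values; choose ports
# that suit your own firewall policy)
spark.ui.port            4040
spark.driver.port        40000
spark.blockManager.port  40010
spark.history.ui.port    18080
```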
The links below explain how to modify / create firewall rules depending on which cloud platform you are using.
Example Applications: Try running the example applications in $SPARK_HOME/examples for more complex use cases.
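A minimal sketch of running one of those bundled examples (SparkPi) with spark-submit is shown below; it assumes the default /opt/spark install path, and the examples jar name varies by Spark release, so it is located with a glob:

```shell
# Sketch: run the bundled SparkPi example via spark-submit. Assumes the
# default /opt/spark install; the examples jar name varies by release.
SPARK_HOME="/opt/spark"
EXAMPLES_JAR="$(ls "$SPARK_HOME"/examples/jars/spark-examples_*.jar 2>/dev/null | head -n 1)"
if [ -n "$EXAMPLES_JAR" ]; then
  "$SPARK_HOME/bin/spark-submit" \
    --class org.apache.spark.examples.SparkPi \
    --master "local[2]" \
    "$EXAMPLES_JAR" 100
else
  echo "No spark-examples jar found under $SPARK_HOME/examples/jars"
fi
```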
If you have any issues getting Apache Spark up and running from our image please contact us.
Disclaimer: Apache Spark™ is a trademark of the Apache Software Foundation (ASF) and is licensed under Apache License 2.0. This image is provided & maintained by Cloud Infrastructure Services & is not affiliated with, endorsed by, or sponsored by any company. Any trademarks, service marks, product names, or named features are assumed to be the property of their respective owners. The use of these trademarks does not imply any relationship or endorsement unless explicitly stated.
Cloud Solution Architect, helping customers transform their business to the cloud. 20 years' experience working in complex infrastructure environments, and a Microsoft Certified Solutions Expert in all things cloud.