Apache Spark Architecture – Components & Applications Explained

Apache Spark is an open source framework for running processing tasks on large scale datasets and powering big data analytics tools. It handles both batch data processing and real time analytics workloads. The framework was created by researchers at the University of California, Berkeley, in 2009 to speed up processing tasks in Hadoop systems.

Spark Core sits at the heart of this data processing framework. It provides I/O, scheduling, and distributed task dispatching functionality. Thanks to this, it gives programmers a more flexible and potentially faster alternative to MapReduce, the software framework to which early versions of Hadoop were tied.

Additionally, Apache Spark is known for its speed and flexibility. Both are crucial in machine learning and big data, where vast amounts of computing power are needed to process data. In a nutshell, Apache Spark also offers an easy-to-use API that reduces the programming workload for developers.

Since its inception, Spark has become one of the world’s most important large scale distributed processing systems. You can use it in several ways, including for machine learning, SQL, graph processing, and streaming data. It also provides native bindings for the Scala, Java, R, and Python languages.

Shall we start with Apache Spark Architecture – Components & Applications Explained?

How Does Apache Spark Work?

Image Source: Apache.org

Firstly, Apache Spark processes data from various repositories, including NoSQL databases, the Hadoop Distributed File System (HDFS), and relational data stores like Apache Hive. Secondly, Spark supports in-memory processing to improve the performance of big data analytics applications. It also falls back to standard disk-based processing when data sets are larger than the available system memory.

The Spark Core engine uses the Resilient Distributed Dataset (RDD) as its primary data type. The RDD shields users from most of the computational complexity: it collects data and partitions it across multiple servers, where it is either processed and transferred to a different data repository or passed through an analytic model.

Users don’t have to designate the storage location of particular files or the computational resources used to retrieve or store them. The Spark Core engine therefore operates partly as an Application Programming Interface (API) layer and underpins the related tools for analysing and managing data. Apart from the Spark Core processing engine, Apache Spark’s API environment contains several code libraries for data analytics applications.
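
To make this concrete, here is a minimal PySpark sketch of the RDD abstraction described above: a local collection is partitioned across the cluster, transformed, and then reduced with an action. The app name, master URL, and numbers are purely illustrative.

```python
from pyspark.sql import SparkSession

# Build a SparkSession; the SparkContext underneath drives the Spark Core engine.
spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Parallelize a local collection into an RDD split across 4 partitions.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Transformations (filter, map) describe the computation; the action (sum) runs it.
total = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n).sum()
print(total)

spark.stop()
```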

Components of Apache Spark

Image Source: Javatpoint.com

Here are the five major components of Apache Spark:

Spark Core

The first component of Apache Spark Architecture – Components & Applications Explained is Spark Core, Apache Spark’s fundamental module for structured and distributed data processing. In essence, Spark Core forms the main abstraction layer for Apache Spark and provides the major functionalities such as task scheduling, in-memory computing, fault tolerance, and I/O operations.

Spark SQL

The next component is Spark SQL, one of the most commonly used libraries. It allows users to examine data stored in different applications using the SQL language. The library’s uniform data access lets developers query structured data inside Spark programs using SQL or the familiar DataFrame API, which is usable in Scala, Java, R, and Python. SQL and DataFrames offer a common way to connect to various data sources, such as ORC, JSON, Avro, Hive, Parquet, and JDBC.

Spark SQL also provides Hive integration: it supports the HiveQL syntax, Hive UDFs, and Hive SerDes, enabling users to run HiveQL or SQL queries against existing Hive warehouses.

Spark SQL also features code generation, a cost-based optimizer, and columnar storage to speed up queries. It scales to multi-hour queries and thousands of nodes using the Spark engine, which offers full mid-query fault tolerance, so users don’t have to switch to a different engine for historical data.
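
As a small illustration of that uniform access, the hedged PySpark sketch below queries the same data through the DataFrame API and through plain SQL. The people.json path and the name and age columns are assumptions for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Read structured data; the JSON path and columns here are only an illustration.
people = spark.read.json("people.json")

# The DataFrame API and plain SQL are interchangeable views over the same engine.
people.filter(people.age > 21).select("name", "age").show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```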

Spark Streaming

The next component is Spark Streaming, a lightweight API that enables seamless data streaming and batch processing in the same application. It leverages Spark Core’s fast scheduling capabilities to perform streaming analytics, ingesting data as mini-batches and applying RDD transformations to them.

The Spark Streaming library is designed so that applications written for batch processing can be reused for data streaming with little modification. Data in Spark Streaming is ingested from various sources and live streams, including IoT sensors, Apache Kafka, and Amazon Kinesis. Spark Streaming is ideal for applications that require real time analytics and rapid response.
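
The sketch below uses PySpark’s classic DStream API (newer applications typically use Structured Streaming) to count words arriving on a TCP socket in one-second mini-batches; the host and port are placeholders, e.g. a terminal running `nc -lk 9999`.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Listen on a TCP socket; host and port are placeholders for a real stream source.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each mini-batch's word counts

ssc.start()
ssc.awaitTermination()
```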

MLlib

This library contains machine learning code that allows users to carry out complex statistical operations on data inside their Spark clusters and to build applications around these analyses. It is easy to use, interoperating with NumPy in Python and with R libraries, and it can read from any Hadoop data source, making it simple to plug into Hadoop workflows.

MLlib also delivers high performance thanks to Spark’s efficient iterative computing. It contains algorithms that leverage iteration and can yield better results than the one-pass approximations sometimes used on MapReduce.
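
As a brief, hedged example of an iterative MLlib algorithm running on a Spark DataFrame, the PySpark sketch below clusters a tiny hand-made set of feature vectors with k-means; the data and parameters are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny in-memory dataset of feature vectors (purely illustrative values).
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

# Iterative algorithms like k-means benefit from Spark keeping data in memory.
model = KMeans(k=2, seed=42).fit(data)
print(model.clusterCenters())
```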

GraphX

GraphX is a built-in library containing algorithms for graph-parallel computation. It’s a flexible library that combines iterative graph computation, exploratory analysis, and ETL in a single system. Users can view the same data as both collections and graphs, write iterative graph algorithms using the Pregel API, and join and transform graphs with RDDs.

Therefore, GraphX is a high performance graph processor that’s also easy to use, flexible, and fault tolerant. In addition to a flexible API, GraphX provides users with multiple graph algorithms.

Now for the main part of this blog article, Apache Spark Architecture – Components & Applications Explained.

Apache Spark Architecture

Image Source: Edureka.com

Spark has a well-layered architecture in which the layers and components are loosely coupled. The main runtime components are the driver, the cluster manager, and the executors.

Every worker node runs one or more Executors that execute the tasks. The Executors register themselves with the Spark Driver, so the Driver has access to information about all Executors at all times. This working combination of the Spark Driver and the worker nodes is called a Spark application, and it launches with the help of a cluster manager.

Spark Driver

The Driver is a Java (JVM) process in which the main method of a Java, Scala, or Python program runs. It executes the user code and builds a SparkContext or SparkSession, which in turn is responsible for creating Datasets, DataFrames, and RDDs and for executing SQL, among other functions.

Functions of Spark Driver

  • Creates the SparkContext or SparkSession.
  • Converts user code into tasks through transformations and actions.
  • Helps create the lineage, logical plan, and physical plan.
  • After generating the physical plan, the Driver coordinates with the cluster manager to schedule the execution of tasks.
  • Tracks metadata for data persisted/cached in Executor memory.
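
The hedged PySpark sketch below illustrates the driver-side entry point described above: building a SparkSession creates the SparkContext, and an action triggers execution of the plan the Driver has assembled. The app name and master URL are placeholders.

```python
from pyspark.sql import SparkSession

# The driver program starts here: building a SparkSession creates the
# SparkContext, which connects to the cluster manager and tracks executors.
spark = (SparkSession.builder
         .appName("driver-demo")
         .master("local[*]")   # placeholder; a real cluster URL or YARN in production
         .getOrCreate())

df = spark.range(1_000_000)              # the driver builds the logical plan...
print(df.selectExpr("sum(id)").first())  # ...and an action triggers physical execution
```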

Cluster Manager

Spark relies on a cluster manager to launch the Driver and the executors. Spark ships with a “spark-submit” script that helps users connect to different kinds of cluster managers and control the resources the application will get: how many executors to launch and how much CPU and memory each Executor uses.
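
The same resource settings that spark-submit flags control can also be expressed as configuration properties. The hedged PySpark sketch below shows the equivalent keys set programmatically; the values are illustrative, and spark.executor.instances only takes effect on managers such as YARN or Kubernetes.

```python
from pyspark.sql import SparkSession

# Resource settings normally passed to spark-submit, expressed as config
# properties; the values below are illustrative only.
spark = (SparkSession.builder
         .appName("resource-demo")
         .config("spark.executor.instances", "4")  # how many executors to launch
         .config("spark.executor.cores", "2")      # CPU cores per executor
         .config("spark.executor.memory", "4g")    # heap memory per executor
         .getOrCreate())
```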

The cluster manager in Apache Spark allocates resources across applications. In essence, it maintains the cluster of machines that run the Spark applications and has its own ‘master’ and ‘worker’ abstractions. Currently, Apache Spark supports the following cluster managers:

YARN – Hadoop’s general-purpose resource manager. It is a distributed computing framework that handles resource management and job scheduling.

Kubernetes – A system for deploying, scaling, and managing containerized applications. Spark on Kubernetes employs the native Kubernetes scheduler.

Spark Standalone Mode – A simple, easy-to-set-up cluster manager included with Spark that can access HDFS. It is designed to be resilient, handles failures efficiently, and manages resources according to the applications’ requirements.

Mesos – A general-purpose cluster manager that can run Hadoop MapReduce, Spark jobs, and other service applications on the same cluster.

Executor

Executors reside on the worker nodes and typically launch at the start of a Spark application with the help of the cluster manager. An Executor runs individual tasks and returns the results to the Driver; it can also persist/cache data on the worker node.

Apache Spark Languages

Spark is written in Scala, largely for its speed: Scala is statically typed and compiles to the Java Virtual Machine (JVM). Although Spark has API bindings for Java, R, Python, and Scala, most developers prefer Python and Scala.

Java is less convenient for data science or data engineering because it does not offer a Read-Evaluate-Print Loop (REPL). Although R is a domain-specific language, users can still install R packages and run them with Spark.

Apache Spark Modes of Execution

It is also worth noting, with Apache Spark Architecture – Components & Applications Explained, that Apache Spark has two main modes of execution: cluster mode and client mode.

Cluster Mode

In this mode, the Spark Driver, which acts as the application master, launches on one of the worker nodes. The mode follows a fire-and-forget principle: a user submits the application and can then disconnect or move on to other work. Clients use this mode to reduce the network latency between the executors and the Driver, making it ideal when an application is submitted from a machine far away from the workers.

Client Mode

In this mode, the machine from which the user submits the application launches the Driver, which maintains the Spark context. The Driver remains in charge of the job until execution completes, so the user must stay online and maintain contact with the cluster until the job finishes.

In client mode, the entire application depends on the local machine because the Driver resides there. If anything goes wrong with the local machine, the Driver and, consequently, the whole application will fail. This mode is not suitable for production use cases, but it works well for testing and debugging.

Features of Apache Spark

Image Source: Knoldus.com

Apache Spark has a variety of features that make it a handy framework for big data processing and streaming. These include:

Fault Tolerance

Thanks to its design, Apache Spark copes with worker node failures. It does this by using RDDs (Resilient Distributed Datasets) and the DAG (Directed Acyclic Graph). The DAG holds the lineage of all the transformations and actions required to complete a task, so if a worker node fails, the results can still be recovered by rerunning the lost work from the existing lineage.

Dynamic Nature

Apache Spark provides more than 80 high-level operators that simplify building parallel applications.

Lazy Evaluation

Spark does not evaluate transformations immediately; it evaluates them lazily. Transformations are added to the DAG, and the final computation runs only when the Driver requests data through an action. This enables Spark to make optimized decisions, as the engine can see all the transformations before performing any work.
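
A minimal PySpark sketch of lazy evaluation: the map and filter calls only extend the DAG, and nothing executes until the collect action at the end. The data and app name are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").master("local[*]").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))

# These transformations only extend the DAG; nothing has run yet.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action below is what finally triggers execution of the whole lineage.
print(evens.collect())
```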

Real-Time Stream

Spark Streaming extends Spark’s API to stream processing, allowing users to write streaming jobs the same way they write batch jobs.

High Speed

Spark enables Hadoop applications to run up to ten times faster on disk and up to one hundred times faster in memory.

Reusability

Spark can join streaming data against historical data, run batch processes, and run ad-hoc queries on the streaming state.

In-Memory Computing

Image Source: Oreilly.com

Apache Spark differs from Hadoop MapReduce in that it can process tasks in memory and does not need to write intermediate results back to disk. This greatly increases processing speed. In addition, Spark can persist intermediate results so they can be reused in the next iteration.
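
The hedged PySpark sketch below shows the idea: an intermediate DataFrame is persisted in executor memory so that later actions reuse it instead of recomputing it. The dataset is synthetic and the storage level is just one possible choice.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 100 AS bucket")

# Keep the intermediate result in executor memory so repeated actions
# (e.g. iterations of an algorithm) skip recomputation.
df.persist(StorageLevel.MEMORY_ONLY)

print(df.count())                            # first action materializes and caches the data
print(df.groupBy("bucket").count().count())  # later actions reuse the cached partitions

df.unpersist()
```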

Support for Multiple Languages

Apache Spark makes most of its APIs available in Scala, Java, R, and Python. Spark also comes with Spark SQL, which makes it very easy for SQL developers to use.

Integration with Hadoop

Apache Spark integrates seamlessly with Hadoop’s HDFS file system and supports multiple file formats such as Avro, Parquet, JSON, ORC, and many others.

Apache Spark Use Cases

Here are the main use cases of Apache Spark:

Machine Learning

MLlib is Spark’s scalable machine learning library. It offers several supervised and unsupervised algorithms, which can scale out on a cluster, for collaborative filtering, regression, clustering, and classification.

MLlib interoperates with both R libraries and Python’s NumPy library, and it allows users to apply some of these algorithms to streaming data. MLlib helps Apache Spark deliver customer segmentation, sentiment analysis, and predictive intelligence.

A prevalent ML application is text classification, for example categorizing emails. Users can train an ML pipeline to categorize emails as they arrive in the inbox, as sketched below.
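
As a hedged sketch of such a pipeline, the PySpark example below chains a tokenizer, a hashing term-frequency transformer, and logistic regression to label a tiny, made-up set of email subjects; the data and labels are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("email-classifier").getOrCreate()

# Tiny hand-made training set (1.0 = promotional, 0.0 = personal); illustrative only.
train = spark.createDataFrame([
    ("win a free prize now", 1.0),
    ("limited offer just for you", 1.0),
    ("lunch tomorrow with the team", 0.0),
    ("notes from today's meeting", 0.0),
], ["text", "label"])

# A pipeline chains feature extraction and the classifier into one estimator.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])

model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```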

Data Streaming

Image Source: Dimensionless.in

Data streaming in Apache Spark Architecture – Components & Applications Explained enables users to process and analyse streaming data from multiple sources simultaneously. Streaming data is usually unbounded and is processed as it is obtained from the data source. Leveraging these analytics, companies can identify opportunities, make crucial decisions, and detect threats, among other capabilities.

Fog Computing

To better understand fog computing, it helps to understand how IoT works. IoT connects internet-enabled devices and enables them to communicate with each other and offer solutions to device users. The volume of data this generates means that centralized cloud computing may not be enough to accommodate all the data processing and data transfer.

Fog computing addresses this by pushing processing out to devices at the network’s edge. Doing so requires low latency, advanced graph analysis algorithms, and parallel ML processing, all of which are readily available in Apache Spark, and these elements can be customized to suit the user’s processing requirements.

Event Detection

Apache Spark helps with real time event detection. You can use it to identify instances of money laundering and credit card fraud. Spark Streaming, Apache Kafka, and MLlib form the foundation of many modern financial fraud detection systems.

Such a system records cardholders’ transactions to build a profile of each user’s spending habits. Models can then be trained to detect anomalies in card transactions and, in conjunction with Kafka and Spark Streaming, applied in real time.
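
A hedged sketch of that idea in PySpark Structured Streaming is shown below. It assumes the spark-sql-kafka connector is on the classpath and that a "transactions" topic carries JSON records with card_id and amount fields; the simple amount threshold stands in for a real model trained offline with MLlib.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-sketch").getOrCreate()

# The transaction schema, broker address, and topic name are assumptions for this sketch.
schema = (StructType()
          .add("card_id", StringType())
          .add("amount", DoubleType()))

transactions = (spark.readStream
                .format("kafka")  # requires the spark-sql-kafka connector package
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "transactions")
                .load()
                .select(from_json(col("value").cast("string"), schema).alias("t"))
                .select("t.*"))

# A stand-in rule: flag unusually large amounts; a real system would score
# each transaction with a model trained offline (e.g. with MLlib).
suspicious = transactions.filter(col("amount") > 10_000)

query = suspicious.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```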

Interactive Analytics

One of Spark’s most popular features is its ability to offer users interactive analytics. The MapReduce ecosystem offers tools like Hive and Pig for interactive analysis, but they are too slow. Spark, on the other hand, is very fast and efficient, which is why it continues to gain ground in interactive analysis.

Spark introduced Structured Streaming in version 2.0, which can be used for interactive analysis and for joining live data with batch data to better understand it. Structured Streaming can improve web analytics by enabling users to query live web sessions, and machine learning can be applied to live session data to gain further insight.
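
The hedged PySpark sketch below uses Structured Streaming’s built-in rate source as a stand-in for a live web-session feed and counts events in ten-second windows; in a real deployment the source, schema, and windowing would match the site’s actual event stream.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# The built-in "rate" source stands in for a live event feed (e.g. web sessions).
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events in 10-second windows; queries like this can be joined with batch
# data or explored interactively while the stream is running.
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```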

Data Warehousing

With data volumes increasing daily, traditional ETL tools such as RDBMS-based systems and Informatica struggle to meet SLAs because they are not designed to scale horizontally. Many companies therefore use Apache Spark and Spark SQL to shift to big data based warehouses, which can scale horizontally as the data volume grows.

With Apache Spark, users scale processing horizontally by adding machines to the Spark cluster. The migrated applications embed the Spark Core engine and often provide a web user interface so users can design, test, run, and deploy jobs interactively, with jobs written mainly in native SQL or other SQL dialects. Such Spark clusters, ranging from hundreds to thousands of nodes, can scale to process terabytes of data daily.
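
As a hedged sketch of such a warehouse job, the PySpark example below reads raw orders, aggregates them with Spark SQL, and writes a partitioned Parquet table; the storage paths, columns, and table layout are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-etl").getOrCreate()

# Source and target locations are placeholders for whatever the warehouse uses.
orders = spark.read.json("s3a://landing/orders/")
orders.createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, region, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date, region
""")

# Write a partitioned, columnar table; adding worker nodes scales this job out.
(daily_revenue.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3a://warehouse/daily_revenue/"))
```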

Real World Applications of Apache Spark

Up next in Apache Spark Architecture – Components & Applications Explained are some of the industries that use Apache Spark:

Banking and Finance

Financial institutions use Spark to access and analyse complaint logs, call recordings, emails, and social media content to improve customer experience. This enables them to make informed business decisions through targeted advertisement, customer segmentation, and credit risk assessment.

Most banks rely on Apache Spark to provide an integrated view of an organization or individual to target their products based on their usage and needs. Apache Spark’s machine learning capabilities help analyze spending habits and give predictions based on customer behavior.

Healthcare

The healthcare industry is adopting modern technologies such as machine learning and big data to provide state-of-the-art facilities to their clients. Spark is quickly gaining traction and becoming the backbone of the newest healthcare applications. Hospitals use Spark-enabled healthcare tools to analyze patient medical history to detect possible health issues based on learning and history.

Retail

Big retailers face the challenge of optimizing their supply chain to cut costs and wastage and gain insight into consumers’ shopping habits. Also, most retail companies are looking for ways to improve customer service and optimize their profits.

To solve this problem, most companies use Spark and MLlib to record sales and invoice data, ingest it, and work out the inventory. Users can also apply this technology to monitor the order’s shipment and delivery status in real time. Apache Spark MLlib’s predictive and analytics models help predict sales during sale seasons and promotions to match the inventory and prepare for the event. Companies also get to use consumers’ historical shopping habits data to offer personalized suggestions and improve the consumer experience.

Media

Media companies and content streaming services use Apache Spark in their technology engine to power their business. When users turn on a content streaming service, they can view their favorite content automatically. This is possible through recommendation engines built using Spark MLlib and machine learning algorithms. Content streaming services utilize historical data from consumers’ content selection, train their machine learning algorithms, and deploy them live after testing them offline.

Energy

The energy sector is a major beneficiary of Apache Spark technology. Energy companies find it hard to predict energy usage patterns, and they read millions of gas and electricity meters every hour. The data they collect includes:

  • Real time energy consumption data.
  • Electricity and gas meter readings.
  • Thermostat temperature data.
  • Connected boiler data.

Spark MLlib applies machine learning to this data for similar-home comparison, disaggregation, and indirect algorithms that use smart meter data to estimate usage for customers without smart meters.

Thank you for reading Apache Spark Architecture – Components & Applications Explained. We shall now conclude.

Apache Spark Architecture – Components & Applications Explained – Conclusion

Summing up, Apache Spark is a state-of-the-art data processing and streaming framework. Whether you are running batch analytics or real time streaming workloads, Apache Spark is very handy. It integrates seamlessly with other architectural components and is famous for its in-memory computation; configure it well and it will run smoothly. Besides, you can program Apache Spark in multiple languages such as Scala, R, Python, and Java, which makes it easy to use and deploy.

Please take a look at our Apache content in our blog over here.

Dennis Muvaa

Dennis is an expert content writer and SEO strategist in cloud technologies such as AWS, Azure, and GCP. He's also experienced in cybersecurity, big data, and AI.
